Natural language processing (NLP) is a wide area of research where the worlds of artificial intelligence, computer science, and linguistics collide. It includes a bevy of interesting topics with cool real-world applications, like named entity recognition, machine translation, or question answering. Each of these topics has its own way of dealing with textual data.
But before diving into the deep end and looking at these more complex applications, we need to wade in the shallow end and understand how simpler tasks such as text classification are performed.
Text classification offers a good framework for getting familiar with textual data processing. There are many interesting applications for text classification such as spam detection and sentiment analysis. In this post, we will see some NLP techniques for text classification.
The basics include:
Structure extraction – identifying fields and blocks of content based on tagging
Boundary detection – identifying and marking sentence, phrase, and paragraph boundaries. These markers are important for entity extraction and NLP because they serve as useful breaks within which analysis occurs.
Language identification – detects the human language of the entire document and of each paragraph or sentence. Language detectors are critical for determining which linguistic algorithms and dictionaries to apply to the text.
Tokenization – divides character streams into tokens that can be used for further processing and understanding. Tokens can be words, numbers, identifiers, or punctuation, depending on the use case.
Acronym normalization and tagging – acronyms can be written as “I.B.M.” or “IBM”, so they should be tagged and normalized to a single form.
Lemmatization / Stemming – reduces word variations to simpler forms that may help increase the coverage of NLP utilities.
Decompounding – for some languages (typically Germanic languages such as German, Dutch, and the Scandinavian languages), compound words will need to be split into smaller parts to allow for accurate NLP.
Entity extraction – identifying and extracting entities (people, places, companies, etc.) is a necessary step to simplify downstream processing. There are several different methods:
Regex extraction – good for phone numbers, ID numbers (e.g. SSN, driver’s licenses, etc.), e-mail addresses, numbers, URLs, hashtags, credit card numbers, and similar entities that follow a predictable pattern.
Dictionary extraction – uses a dictionary of token sequences and identifies when those sequences occur in the text. This is good for known entities, such as colors, units, sizes, employees, business groups, drug names, products, brands, and so on.
Complex pattern-based extraction – good for person names (made of known components), business names (made of known components), and context-based extraction scenarios (e.g. extracting an item based on its context) that are fairly regular in nature, and for cases where high precision is preferred over high recall.
Phrase extraction – extracts sequences of tokens (phrases) that have a strong meaning which is independent of the words when treated separately. These sequences should be treated as a single unit when doing NLP. For example, “Big Data” has a strong meaning which is independent of the words “big” and “data” when used separately. All companies have these sorts of phrases that are in common usage throughout the organization and are better treated as a unit rather than separately. Techniques to extract phrases include:
Part-of-speech tagging – identifies phrases from noun or verb groups found in the tagged text.
Statistical phrase extraction – identifies token sequences which occur more frequently than expected by chance.
Hybrid – uses both techniques together and tends to be the most accurate method.
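To make a few of the basics above more concrete, here is a minimal sketch of tokenization, acronym normalization, regex extraction, and dictionary extraction using only Python's standard library. The function names, the regular expressions, and the small dictionary are assumptions made up for this illustration; real pipelines use stricter patterns and much larger dictionaries.

```python
import re

# Illustrative patterns only (assumptions); production patterns are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HASHTAG_RE = re.compile(r"#\w+")
ACRONYM_RE = re.compile(r"\b(?:[A-Z]\.){2,}")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def normalize_acronyms(text):
    """Rewrite dotted acronyms such as 'I.B.M.' as 'IBM'."""
    return ACRONYM_RE.sub(lambda m: m.group(0).replace(".", ""), text)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def extract_regex_entities(text):
    """Find entities that follow a predictable surface pattern."""
    return {
        "email": EMAIL_RE.findall(text),
        "hashtag": HASHTAG_RE.findall(text),
    }


def extract_dictionary_entities(tokens, dictionary):
    """Return (start_index, label) for each dictionary token sequence found.

    `dictionary` maps tuples of lowercase tokens to a label.
    """
    lowered = [t.lower() for t in tokens]
    hits = []
    for phrase, label in dictionary.items():
        n = len(phrase)
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == phrase:
                hits.append((i, label))
    return sorted(hits)


# Hypothetical dictionary of known entities.
KNOWN = {("ibm",): "COMPANY", ("big", "data"): "TOPIC"}

text = normalize_acronyms("I.B.M. likes big data")
tokens = tokenize(text)
print(tokens)
print(extract_dictionary_entities(tokens, KNOWN))
print(extract_regex_entities("Mail ana@example.com about #BigData."))
```

Note that the acronym pattern also swallows a sentence-final period that directly follows a dotted acronym, which is one reason sentence boundaries are usually marked before this step.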
Some Text Classification Algorithms:
1. Naive Bayes
Naive Bayes is a family of statistical algorithms we can make use of when doing text classification. One of the members of that family is Multinomial Naive Bayes (MNB). One of its main advantages is that it can give good results even when little data is available (around a couple of thousand tagged samples) and computational resources are scarce.
All you need to know is that Naive Bayes is based on Bayes’s Theorem, which lets us compute the conditional probability of one event given another from the reverse conditional probability and the probabilities of each individual event. It is called “naive” because it assumes that the words of a text occur independently of each other given the category. This means that any vector that represents a text has to contain information about how probable the words of the text are within the texts of a given category, so that the algorithm can compute the likelihood of that text belonging to the category.
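The idea can be sketched in a few lines of plain Python. This is a minimal Multinomial Naive Bayes with add-one (Laplace) smoothing, assuming whitespace tokenization; the function names `train_mnb` and `predict_mnb` and the four-sentence training set are invented for this illustration and are far too small for real use.

```python
import math
from collections import Counter, defaultdict


def train_mnb(labeled_texts):
    """Collect per-category document counts and word counts."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        tokens = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_mnb(model, text):
    """Return the category with the highest log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # log P(category)
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip words never seen during training
            # log P(word | category) with add-one smoothing
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (an assumption for this example).
train = [
    ("great product love it", "positive"),
    ("really great support", "positive"),
    ("terrible product hate it", "negative"),
    ("really terrible support", "negative"),
]
model = train_mnb(train)
print(predict_mnb(model, "great support"))
print(predict_mnb(model, "terrible product"))
```

Working with log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer texts.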
2. Support Vector Machines (SVMs):
Support Vector Machines (SVM) is just one out of many algorithms we can choose from when doing text classification. Like Naive Bayes, SVM doesn’t need much training data to start providing accurate results. Although it needs more computational resources than Naive Bayes, SVM can often achieve more accurate results.
In short, SVM takes care of drawing a “line” or hyperplane that divides a space into two subspaces: one subspace that contains vectors that belong to a group and another subspace that contains vectors that do not belong to that group. Those vectors are representations of your training texts, and a group corresponds to a tag you have assigned to your texts.
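As a rough illustration of the hyperplane idea, the sketch below trains a linear SVM on binary bag-of-words vectors by sub-gradient descent on the hinge loss with an L2 penalty. This is one simple way to fit a linear SVM, not necessarily what any particular library does; the function names, the hyperparameters, and the toy spam data are all assumptions made for this example.

```python
def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    tokens = set(text.lower().split())
    return [1.0 if word in tokens else 0.0 for word in vocab]


def train_linear_svm(vectors, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b; labels must be +1 or -1."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # The example is inside the margin or misclassified:
                # move the hyperplane towards classifying it correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def svm_predict(w, b, x):
    """Report which side of the hyperplane the vector x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy data (an assumption): +1 means "spam", -1 means "not spam".
train = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("meeting agenda for monday", -1),
    ("project status meeting notes", -1),
]
vocab = sorted({tok for text, _ in train for tok in text.split()})
vectors = [featurize(text, vocab) for text, _ in train]
labels = [label for _, label in train]
w, b = train_linear_svm(vectors, labels)

print(svm_predict(w, b, featurize("free money", vocab)))
print(svm_predict(w, b, featurize("monday meeting", vocab)))
```

For more than two tags, a common approach is to train one such classifier per tag (the tag versus everything else) and pick the tag with the highest score.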
3. Deep Learning:
Deep learning is a set of algorithms and techniques inspired by how the human brain works. Text classification has benefited from the recent resurgence of deep learning architectures due to their potential to reach high accuracy with less need for engineered features. The two main deep learning architectures used in text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
On the one hand, deep learning algorithms require much more training data than traditional machine learning algorithms, often on the order of millions of tagged examples. On the other hand, traditional machine learning algorithms such as SVM and NB reach a certain threshold where adding more training data doesn’t improve their accuracy. In contrast, deep learning classifiers tend to keep getting better the more data you feed them.
Applications and Examples of Text Classification:
Text classification can be used in a broad range of contexts such as classifying short texts (e.g. tweets or headlines) or organizing much larger documents (e.g. customer reviews, media articles, or legal contracts). Some of the most well-known examples of text classification include sentiment analysis, topic labeling, language detection, and intent detection.
Probably the most common example of text classification is sentiment analysis: the automated process of determining whether a text is positive, negative, or neutral. Companies are using sentiment classifiers on a wide range of applications, such as product analytics, brand monitoring, customer support, market research, workforce analytics, and much more.
Another common example of text classification is topic labeling, that is, understanding what a given text is talking about. It’s often used for structuring and organizing data, for example grouping customer feedback by topic or news articles by subject.
Language detection is another great example of text classification: the process of classifying incoming text according to its language.
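As a toy illustration of language detection as a classification task, the sketch below scores a text against a few very small stopword lists and picks the language with the most matches. The stopword lists and the `detect_language` name are assumptions for this example; real detectors typically rely on much larger word or character n-gram profiles.

```python
# Tiny hypothetical stopword lists; real profiles are far larger.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "spanish": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}


def detect_language(text):
    """Return the language whose stopwords match most tokens, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no stopword match at all, refuse to guess.
    return best if scores[best] > 0 else None


print(detect_language("the cat is in the garden"))
print(detect_language("el gato es de la casa"))
print(detect_language("das ist nicht der Hund"))
```

Short texts with few or no stopwords are the weak spot of this approach, which is why the function returns `None` rather than guessing.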
Some real-life use cases are mentioned below.
Social media monitoring.
Brand monitoring.
Call Center service.
Voice of the customer.