Text Classification Using NLP

Codersarts AI
Sep 4, 2020

Unstructured text is everywhere: emails, chat conversations, websites, and social media. It is hard to extract value from this data unless it is organized in some way. Doing so used to be difficult and expensive, since it meant spending time and resources to sort the data manually or to write handcrafted rules that are hard to maintain. Text classifiers built with NLP have proven to be a good alternative for structuring textual data in a fast, cost-effective, and scalable way.

Text classification, also known as text tagging or text categorization, is the process of sorting text into organized groups. Using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of pre-defined tags or categories based on its content.

Text classification is one of the most widely adopted natural language processing tasks, not just in the IT industry but across a variety of businesses. Its main aim is to automate the classification of text documents into one or more predefined categories. Some examples of text classification are:

Sentiment Analysis: the process of understanding if a given text is talking positively or negatively about a given subject (e.g. for brand monitoring purposes).
Topic Detection: the task of identifying the theme or topic of a piece of text (e.g. know if a product review is about Ease of Use, Customer Support, or Pricing when analyzing customer feedback).
Language Detection: the procedure of detecting the language of a given text (e.g. know if an incoming support ticket is written in English or Spanish for automatically routing tickets to the appropriate team).

Environment Setup:

The project is set up in an Anaconda environment and run in a Jupyter notebook.

Dependencies/Libraries Required:

pandas
sklearn
pickle
nltk
matplotlib
wordcloud
seaborn
spacy
collections
en_core_web_sm
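
A quick way to confirm that the third-party packages above are importable before running the notebook is sketched below. The helper name `missing_packages` is hypothetical, and the import names in `REQUIRED` are an assumption based on the list above (the code in this article imports scikit-learn as `sklearn` and the word cloud package as `wordcloud`).

```python
import importlib.util

# Import names for the third-party dependencies listed above (assumed names).
# pickle and collections ship with Python, so they are not checked.
REQUIRED = ["pandas", "sklearn", "nltk", "matplotlib",
            "wordcloud", "seaborn", "spacy", "en_core_web_sm"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Anything reported as missing would need to be installed into the Anaconda environment before continuing.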

1. Loading The Libraries:

%matplotlib inline
from sklearn import metrics
import seaborn as sn 
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer
import pickle
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score,accuracy_score
from wordcloud import WordCloud
import matplotlib.pyplot as plt 
from sklearn import model_selection, preprocessing,svm

In this step, we import all the required libraries, such as pandas (for loading and preprocessing the data), nltk (for text processing), and seaborn and matplotlib (for plotting).

Data Exploration:

Once the environment is set up and the dependencies are installed, it is time to explore the dataset. For this article, I have used a dataset of more than 60,000 sentences along with their respective target categories.

data = pd.read_csv(dataset,engine='python')
data.head()

The code above loads the dataset, which contains more than 60,000 rows, and shows its first few rows.

Here is what the dataset looks like:

Let's check the unique values of the predicted_category column:

data['predicted_category'].unique()

O/p: array(['affection', 'exercise', 'bonding', 'leisure', 'achievement', 'enjoy_the_moment', 'nature'], dtype=object)
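
Beyond listing the categories, it is often useful to see how many examples each one has, since a heavily skewed distribution affects how accuracy should be read. The following is a minimal sketch using only the standard library; the `labels` list is made up for illustration, and in the notebook it would presumably be replaced by `data['predicted_category'].tolist()`.

```python
from collections import Counter

# Illustrative labels only; in the notebook this would be
# data['predicted_category'].tolist().
labels = ["affection", "achievement", "affection", "bonding",
          "achievement", "affection", "nature"]

counts = Counter(labels)
total = sum(counts.values())

# Print categories from most to least frequent, with their share of the data.
for category, n in counts.most_common():
    print(f"{category:<12} {n:>3}  {n / total:.1%}")
```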

WordCloud:

all_words = ' '.join([text for text in data['SentimentText']])
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In this section, a word cloud is built from the SentimentText column.

First, all the texts are joined into a single string. Then a word cloud of width 800 and height 500, with a maximum font size of 110, is generated and drawn on a figure 10 wide and 7 high, using bilinear interpolation.
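
A word cloud draws frequent words larger, so it is essentially a picture of word counts. The sketch below shows the kind of counting involved, using made-up sentences and a simple regular expression as the word splitter (both are assumptions for illustration; the WordCloud class does its own processing and, by default, also drops common stopwords).

```python
import re
from collections import Counter

# Illustrative sentences; in the notebook these would come from the text column.
texts = [
    "I went for a walk with my dog",
    "My dog made me happy today",
    "I was happy to finish my project",
]

# Join the texts, lowercase them, and pull out alphabetic words.
words = re.findall(r"[a-z']+", " ".join(texts).lower())

# Count occurrences; a word cloud scales each word by a count like this.
freq = Counter(words)
print(freq.most_common(3))
```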

Data Preparation And Feature Engineering:

Data exploration is followed by data preparation and feature engineering. In this step, we encode the target variable and vectorize the textual data in our dataset. Vectorization can be done in several ways, such as:

1: TF-IDF vectorizer

2: Count vectorizer

3: word2vec, etc.

If the data had been messier, this step would also include removing noise, i.e. more data preprocessing, but since our data is already processed we can skip that part. We also need to split the data into training and validation sets, which will come in handy during model evaluation.
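
To make the idea of a train/validation split concrete, here is a minimal standard-library sketch. The function name `simple_split` and the sample sentences are hypothetical; scikit-learn's `train_test_split`, used later in this article, does the same job with more options. The held-out count is rounded up here, which is consistent with the 6033 test rows reported for the 10% split further down.

```python
import math
import random

def simple_split(items, test_size=0.1, seed=42):
    """Shuffle indices and hold out a fraction of the items for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))  # round the held-out count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

docs = [f"sentence {i}" for i in range(25)]
train_docs, test_docs = simple_split(docs, test_size=0.1)
print(len(train_docs), len(test_docs))
```

Fixing the seed makes the split reproducible, so that results from different runs can be compared.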

One widely used option is the TF-IDF vectorizer (Term Frequency-Inverse Document Frequency), which extends plain term counts by weighting each term according to how important it is to a document.

While most vectorizers have their own advantages, it is not always clear which one to use. TF-IDF is attractive for its simplicity and efficiency on short documents such as text messages. Note, however, that the code in this article uses CountVectorizer, i.e. raw term counts; TF-IDF is one of the alternatives that could be swapped in.

TF-IDF vectorizes documents by calculating a TF-IDF statistic between the document and each term in the vocabulary. The document vector is constructed by using each statistic as an element in the vector.
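
As a rough illustration of how such a document vector is built, the sketch below computes TF-IDF by hand on a tiny made-up corpus. It assumes one common textbook variant, tf(t, d) * log(N / df(t)), where tf is the term's share of the document and df is the number of documents containing the term. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each vector by default, so its numbers will differ.

```python
import math
from collections import Counter

# Tiny illustrative corpus, already tokenized.
docs = [
    ["happy", "dog", "walk"],
    ["happy", "project", "finished"],
    ["dog", "park", "walk", "dog"],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """One TF-IDF value per vocabulary term, using tf * log(N / df)."""
    counts = Counter(doc)
    return [
        (counts[term] / len(doc)) * math.log(n_docs / df[term])
        for term in vocab
    ]

vectors = [tfidf_vector(doc) for doc in docs]
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Terms that appear in only one document (such as "park") receive a higher weight than terms shared across documents, and terms absent from a document get zero.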

After settling on a vectorizer, we must decide on its granularity. A popular alternative to treating each word as its own term is to use a tokenizer. A tokenizer splits documents into tokens, based on white space and special characters, and assigns each token to its own term.
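
The sketch below shows what a simple tokenizer of this kind might look like. The function name `tokenize` is hypothetical; the regular expression is the same pattern scikit-learn's CountVectorizer uses by default (runs of two or more word characters), applied here after lowercasing.

```python
import re

# Same pattern scikit-learn's CountVectorizer uses by default:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("I went to the park, and it was GREAT!")
print(tokens)
```

With this pattern, punctuation is discarded and single-character tokens such as "I" are dropped.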

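# Remove words of 1 to 4 characters (applied to every string column)
data.replace(r'\b\w{1,4}\b', '', regex=True, inplace=True)
# Encode the category names as integers
encoder = preprocessing.LabelEncoder()
data['Target'] = encoder.fit_transform(data['predicted_category'])
# Build the vocabulary, then turn each text into a sparse vector of term counts
vectorizer = CountVectorizer()
vectorizer.fit(data['cleaned_hm'])
vec = vectorizer.transform(data['cleaned_hm'])  # sparse matrix, kept outside the DataFrame
# Hold out 10% of the rows for testing
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(vec, data['predicted_category'], test_size=0.1)
data.head()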

Let's have a look at the shape of the data.

Train_X.shape,Test_X.shape

O/p: ((54288, 17021), (6033, 17021))
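
The two numbers in each shape are the number of documents and the size of the vocabulary. The following sketch builds a small document-term count matrix by hand, on made-up tokenized documents, to show where those two dimensions come from; it is an illustration of the idea, not of CountVectorizer's internals.

```python
from collections import Counter

# Illustrative tokenized documents.
docs = [
    ["happy", "dog", "walk"],
    ["happy", "project"],
    ["dog", "park", "dog"],
]

# One column per distinct term, in alphabetical order.
vocab = sorted({term for doc in docs for term in doc})

# One row per document, one count per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
print("shape:", (len(matrix), len(vocab)))
```

Most entries are zero, which is why the real matrix is stored in a sparse format.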

Model Training:

This step involves selecting an algorithm and training a model with it. Several algorithms are suitable for text classification, e.g. Naive Bayes, SVM, neural networks, and so on.

SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X , Train_Y)
predictions_SVM = SVM.predict(Test_X)

Here we use a Support Vector Machine (SVM) with a linear kernel: the model is fitted on the training set and then used to predict the categories of the test set.
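
One possible refinement, not part of the original notebook, is to chain the vectorizer and the classifier into a single scikit-learn pipeline, so that raw strings can be passed directly at prediction time. The sketch below assumes scikit-learn is installed and uses a handful of made-up sentences; the labels reuse category names from the dataset, but the data itself is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Illustrative training data; the labels reuse category names from the dataset.
texts = [
    "I hugged my daughter this morning",
    "My wife cooked dinner for me",
    "I finished my project at work",
    "I got a promotion today",
    "I went for a long run",
    "I lifted weights at the gym",
]
labels = ["affection", "affection", "achievement",
          "achievement", "exercise", "exercise"]

# Chain vectorizer and classifier so raw strings go in and labels come out.
model = make_pipeline(CountVectorizer(), SVC(C=1.0, kernel="linear"))
model.fit(texts, labels)

predictions = model.predict(["I ran five miles", "I passed my exam"])
print(predictions)
```

With so few examples the predictions are not meaningful; the point is only that the fitted pipeline applies the same vectorization to new text before classifying it.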

Accuracy:

The SVM achieves an accuracy of 79.62% with an F1-score of 0.79, which is a reasonable starting point. To improve accuracy and the other evaluation measures, we can tune this model and try different features, such as POS tags or word embeddings, in place of the count vectors.

print("SVM Accuracy Score -> ", accuracy_score(Test_Y, predictions_SVM)*100)
print(classification_report(Test_Y,predictions_SVM))
print(f1_score(Test_Y,predictions_SVM, average='weighted'))
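
To show what these two numbers measure, here is a minimal hand-written sketch of accuracy and support-weighted F1. The function names and the sample labels are made up for illustration; in practice the scikit-learn functions used above are the ones to rely on.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

y_true = ["affection", "affection", "exercise", "nature", "nature", "nature"]
y_pred = ["affection", "exercise", "exercise", "nature", "nature", "affection"]
print(round(accuracy(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support means that frequent categories influence the overall F1-score more than rare ones.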