Natural Language Processing (NLP) is the task of making computers understand and interpret human language. An NLP pipeline is an automated workflow that takes human language in its various forms (speech, text, etc.) and transforms it through a series of stages to produce the desired output. Our model takes tweets matching a specific keyword, hashtag, or mention and produces a tag cloud together with the polarity of those tweets, represented with a visual index. Our workflow consists of the following steps: data collection, preprocessing, feature extraction, partitioning, and training. The testing and deployment phases are described in Sect. 5.
4.1 Data Collection
Data collection is the process of gathering information on specific variables from an established system. The primary objective of this step is to obtain data that is reliable and feature-rich. If we fail to find content-rich data, our conclusions would be unreliable and our data-driven decision-making process would yield low-accuracy predictions. That is why we have created our own dataset instead of using an existing one. Twitter language changes with trends, which is why it is crucial to train our model with recent data.
In this model, we have collected two sets of data from Twitter's public database: 20K tweets for model development (training and testing) and 150K tweets for deployment. Collecting the dataset requires a Twitter Developer Account, which can be created for free on dev.twitter.com, after which we can connect to the Twitter API. The Twitter API gives us access to the whole Twitter public database, which we can filter based on our requirements. Twitter provides 4 keys (API Key, API Secret Key, Access Token Key, and Access Secret Token Key) which authenticate our use of the Twitter API. We have used Knime's Twitter API Connector node for this authentication. The next step is to apply appropriate filters to obtain reliable and content-rich data; here, we have used keywords as our primary filter mechanism. The training data does not need to be topic-specific, so we collected random but distinct tweets. For the deployment phase, we used the keywords Covid-19, Vaccine, and Stay home as filters to collect only relevant data. These tweets also contain other relevant fields such as country, time, retweet count, favorite count, and retweeted from, which can be used for further analysis if desired. This is done via the Twitter Search node.
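For readers who prefer a programmatic route, the sketch below shows roughly how the same collection step could be done in Python with the tweepy library instead of Knime's Twitter nodes. It is only an illustration: the keyword query, tweet count, and credential names are placeholders rather than our exact configuration, and tweepy >= 4.x is assumed (where the search method is named search_tweets).

```python
import tweepy

# Placeholder credentials: the four keys issued by the Twitter Developer portal.
API_KEY = "YOUR_API_KEY"
API_SECRET_KEY = "YOUR_API_SECRET_KEY"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate with the four keys (the role of Knime's Twitter API Connector node).
auth = tweepy.OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Keyword-filtered search (the role of Knime's Twitter Search node).
# The query and count are illustrative, not the exact values used in our workflow.
query = 'Covid-19 OR Vaccine OR "Stay home" -filter:retweets'
tweets = [
    (status.full_text, status.user.location, status.created_at,
     status.retweet_count, status.favorite_count)
    for status in tweepy.Cursor(api.search_tweets, q=query,
                                lang="en", tweet_mode="extended").items(500)
]
print(f"Collected {len(tweets)} tweets")
```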
Before we could train our model with those 20K distinct tweets, we needed to label them, so we checked each tweet and labelled it manually with positive, negative, or neutral polarity. Positive tweets express opinions of hope, joy, etc., negative tweets express fear, sadness, pessimism, etc., and neutral tweets are not opinions but simply facts. Figure 2 shows sample tweets for the negative, positive, and neutral classes. The data is now ready for preprocessing.
4.2 Pre-processing
As discussed earlier, this step is crucial in an NLP pipeline, especially for Twitter. When data comes from an unstructured source, much of it is redundant, i.e., it has little effect on the conclusion. We only need relevant data that our model can use to provide valuable insights leading to data-driven decisions; keeping the less relevant bits makes the model more complex and increases its time complexity. Our goal is to minimize time and space complexity, so we transform the data into a form that keeps the model simple yet effective. Preprocessing consists of the following steps (a combined sketch of the whole chain is given after the list).
1. Normalization: This step filters out irrelevant information from the data and makes it simpler. This stage consists of 4 nodes in its workflow.
- Punctuation Eraser: This node takes the input document and removes all punctuation from it.
Example: Ah! What happened? => Ah What happened.
- Case Converter: This node takes the document from the previous node and converts everything to lowercase letters.
Example: John is STUPID => john is stupid.
- N-Char Filter: This node takes the document and removes any special characters in it.
Example: i gave him $10 => i gave him 10.
- Number Filter: This node removes the numerical values present in the input document.
Example: i donated 10 million => i donated million.
2. Stop-words filter: This stage takes the document and filters out common words that do not provide much information. We have used the standard stop-word list provided by OpenNLP.
Example: I was 10 inches wider before. => i inch wider before.
3. Tokenization: This is the process of splitting a text into smaller units called 'tokens'. If we take a sentence as input, we get individual words as tokens. Our model uses the default tokenizer node provided by Knime for this.
4. Stemming & Lemmatization: Stemming reduces individual words to their root form or 'stem'.
Example: run/runs/running => run.
Lemmatization uses a pre-defined dictionary to reduce each token to its root form or 'lemma'.
Example: is/was/were => be.
Our model uses the Stanford Lemmatizer in this step.
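The following is a minimal Python sketch of the preprocessing chain described above. It uses NLTK in place of the Knime nodes, the OpenNLP stop-word list, and the Stanford lemmatizer, so the exact token output may differ slightly from our workflow; it is meant only to illustrate the sequence of steps.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(tweet: str) -> list[str]:
    # Normalization: punctuation eraser, case converter, n-char and number filters.
    text = tweet.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)

    # Tokenization: a plain whitespace split here (Knime's default tokenizer node
    # plays this role in the actual workflow).
    tokens = text.split()

    # Stop-word filtering (NLTK's English list standing in for OpenNLP's).
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Lemmatization (WordNet lemmatizer standing in for the Stanford lemmatizer).
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(preprocess("Ah! What happened? I donated 10 million $ while staying home."))
```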
4.3 Feature Extraction
Unlike numbers, human language is complex for computers. Machines can process and manipulate numbers quickly, and a vector representation is what lets them perform calculations and derive relevant conclusions. So, we need to feed text to the machine in numeric form so that it can be understood and transformed into meaningful insights for data-driven decisions. In this stage of our NLP pipeline, we convert our textual data into n-dimensional vectors. Images and videos already consist of pixel values, which are numbers, but human language is much more complex. Feature extraction is the process of extracting a feature representation suited to our specific NLP task; features are the attributes that help us understand the data better. In our model, we have used TF-IDF, which is discussed as follows.
TF-IDF
In the Bag-of-Words model, each word is treated equally, which is a weakness: some words carry more relevance and information than others, and we want to exploit that. That is where TF-IDF comes in. It stands for Term Frequency-Inverse Document Frequency [32–35]. TF-IDF assigns each word a weighting factor that reflects its relevance in a document. A word's weight increases in proportion to how often it appears in that document, but is offset by how many documents in the collection contain the word: the fewer documents a word appears in, the more relevant it is.
Term frequency tells us how frequently a word appears within a document. It is defined as tf(x, y) = Nb(x, y), where Nb(x, y) is the number of times the word x appears in document y. Inverse document frequency is defined as

idf(x) = log[N / df(x)]    (1)

where N is the total number of documents and df(x) is the number of documents that contain the word x. The combined weight is

tf-idf(x, y) = tf(x, y) * idf(x)    (2)

When a term appears in more of the documents, the log value approaches 0, making the term less relevant. This step ends the feature extraction part; now it is time to actually train the model.
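As an illustration, the short Python sketch below implements Eqs. (1) and (2) directly on a toy corpus. It is not part of our Knime workflow (which computes TF-IDF with its built-in node); a base-10 logarithm is assumed, and libraries such as scikit-learn use slightly smoothed variants of these formulas.

```python
import math
from collections import Counter

# Toy corpus: each "document" is a preprocessed tweet (list of tokens).
docs = [
    ["vaccine", "gives", "hope"],
    ["stay", "home", "stay", "safe"],
    ["vaccine", "rollout", "slow"],
]

N = len(docs)
# df(x): number of documents containing word x.
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(word: str, doc: list[str]) -> float:
    tf = doc.count(word)            # tf(x, y) = Nb(x, y): raw count in the document
    idf = math.log10(N / df[word])  # idf(x) = log[N / df(x)], Eq. (1)
    return tf * idf                 # Eq. (2)

for doc in docs:
    print({w: round(tf_idf(w, doc), 3) for w in set(doc)})
```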
4.4 Partitioning
To build an effective model that can predict new data with high accuracy, the model should be trained on a diverse dataset. Partitioning is the process of splitting the data into 2 sets: training data and test data. Typically, most of the data is used for training, but the test data should be diverse enough to properly validate the model. That is why we used an 85-15 split on our dataset: 85% for training and 15% for testing. Our dataset consists of 20K tweets manually labelled by us, and we found that 3K tweets are diverse enough to test the validity. We have assigned distinct colors to the sentiments for better visual representation: green for positive, red for negative, and orange for neutral.
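A minimal sketch of the same 85-15 split in Python is shown below (our workflow uses Knime's Partitioning node); the placeholder data, variable names, and the stratified, seeded split are illustrative assumptions.

```python
import random
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 20K labelled tweets and their TF-IDF vectors.
random.seed(0)
classes = ["positive", "negative", "neutral"]
labels = [classes[i % 3] for i in range(60)]
tfidf_vectors = [[random.random(), random.random()] for _ in range(60)]

# 85-15 split, stratified so the class proportions stay similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    tfidf_vectors, labels, test_size=0.15, stratify=labels, random_state=42)

print(len(X_train), "training samples,", len(X_test), "test samples")
```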
4.5 Training
We have a labelled dataset of 20K tweets, of which 17K are used to train our model. The tweets are classified into 3 polarities: positive, negative, and neutral. Our model compares the results of 3 types of machine learning models [17–31]: Decision Tree, SVM, and Logistic Regression. Under this setup, SVM performed best with 88.67% accuracy. The machine learning models are discussed as follows.
Decision Tree
Decision Tree (DT) is a supervised machine learning algorithm represented as a tree-based model in which each internal node represents a test on an attribute and each branch represents an outcome of that test. A Decision Tree learns by splitting the input dataset into subsets based on attribute value tests; this step is repeated until all data is classified or further splitting adds no value. This type of classification does not require any domain-specific knowledge. In other words, the tree starts at the root node and asks a question each time a data point passes through; the answer decides which branch it follows next, and the process continues until the data point reaches a terminal (leaf) node. In our model, we have used the Gini index as the quality measure and no pruning is applied. Tables 1 and 2 show the results for DT; an illustrative sketch of a comparable setup is given after the tables.
Table 1. Accuracy of Decision Tree for Positive, Negative and Neutral

|                   | Positive [predicted] | Negative [predicted] | Neutral [predicted] |
| Positive [actual] | 1750                 | 10                   | 50                  |
| Negative [actual] | 60                   | 350                  | 50                  |
| Neutral [actual]  | 240                  | 30                   | 460                 |
| Accuracy          | 85.37%               | 89.74%               | 82.14%              |
Table 2. Accuracy, error and Cohen's kappa of Decision Tree

| Accuracy | Error  | Cohen's kappa | Correct | Incorrect |
| 85.33%   | 14.67% | 0.719         | 2560    | 440       |
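For illustration, the sketch below shows a comparable Decision Tree setup in Python with scikit-learn, continuing from the placeholder partitions in the earlier sketch and computing the metrics reported in Tables 1 and 2. Our actual results come from Knime's Decision Tree nodes, so everything beyond the Gini criterion and the absence of pruning is an assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# X_train, X_test, y_train, y_test are the 85-15 partitions from the partitioning sketch.
dt = DecisionTreeClassifier(criterion="gini")  # Gini index as quality measure; defaults apply no pruning
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=["positive", "negative", "neutral"]))
```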
SVM
Support Vector Machine (SVM) is a machine learning algorithm in which we try to find a hyperplane that best separates our n-dimensional dataset into classes. The best hyperplane is the one that separates the data with the maximum margin. The dimension of the hyperplane depends on the number of features: for two features it is a line, and for three features it is a plane. A kernel is a mathematical function used to transform the input data into the desired form. Here, we have tried the model with polynomial, hyperbolic tangent, and RBF (Radial Basis Function) kernels and found that the polynomial kernel performed best, with bias, gamma, and power each set to 1.0 and an overlapping penalty of 1.2. Tables 3 and 4 show the results for SVM; an illustrative sketch of a comparable configuration follows the tables.
Table 3. Accuracy of SVM for Positive, Negative and Neutral

|                   | Positive [predicted] | Negative [predicted] | Neutral [predicted] |
| Positive [actual] | 1740                 | 10                   | 60                  |
| Negative [actual] | 50                   | 360                  | 50                  |
| Neutral [actual]  | 140                  | 30                   | 560                 |
| Accuracy          | 90.16%               | 90.00%               | 83.53%              |
Table 4. Accuracy, error and Cohen's kappa of SVM

| Accuracy | Error  | Cohen's kappa | Correct | Incorrect |
| 88.67%   | 11.33% | 0.789         | 2660    | 340       |
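The sketch below shows how a comparable polynomial-kernel SVM could be configured in scikit-learn, again continuing from the partitioning sketch. The mapping of Knime's bias/power/overlapping-penalty parameters onto scikit-learn's coef0/degree/C is our own assumption; our reported results come from Knime's SVM Learner node.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Polynomial kernel: (gamma * <x, x'> + coef0) ** degree.
# Assumed mapping: bias -> coef0, power -> degree, gamma -> gamma (all 1.0);
# overlapping penalty -> C (1.2).
svm = SVC(kernel="poly", degree=1, gamma=1.0, coef0=1.0, C=1.2)
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
```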
Logistic Regression
Logistic Regression (LR) is a classification algorithm originally used to classify binary dependent variables (pass/fail), but it can also handle more than two classes (multinomial). With the stochastic gradient solver, the model makes a prediction for a training instance and is updated based on the error on that instance; this is repeated until the error is minimized or a specified number of epochs is reached. Here, we have used 100 epochs with a fixed learning-rate strategy and a step size of 0.1. For regularization, a Gauss prior with variance 0.1 is used. Tables 5 and 6 show the results for LR; an illustrative sketch of a comparable configuration follows the tables.
Table 5. Accuracy of Logistic Regression for Positive, Negative and Neutral

|                   | Positive [predicted] | Negative [predicted] | Neutral [predicted] |
| Positive [actual] | 1565                 | 0                    | 245                 |
| Negative [actual] | 20                   | 320                  | 120                 |
| Neutral [actual]  | 45                   | 0                    | 685                 |
| Accuracy          | 96.01%               | 100.00%              | 65.23%              |
Table 6. Accuracy, error and Cohen's kappa of Logistic Regression

| Accuracy | Error  | Cohen's kappa | Correct | Incorrect |
| 85.67%   | 14.33% | 0.695         | 2570    | 430       |
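Continuing from the same partitions, the sketch below approximates this setup with scikit-learn's SGDClassifier (logistic loss, constant learning rate of 0.1, 100 epochs, L2 penalty standing in for the Gauss prior). The alpha value and the loss name ("log_loss", scikit-learn >= 1.1) are assumptions; our results come from Knime's Logistic Regression Learner node.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Logistic regression trained by stochastic gradient descent:
# constant learning rate 0.1, 100 epochs, L2 regularization.
# alpha is a rough stand-in for the Gauss-prior variance of 0.1; the exact mapping differs.
lr = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.1,
                   max_iter=100, penalty="l2", alpha=0.1, random_state=42)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
```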
From Fig. 3 it is observed that SVM achieves higher accuracy than DT and LR. From Fig. 4 it is observed that SVM has a lower error than DT and LR. Figure 5 shows that SVM has a higher Cohen's kappa than DT and LR. From these results, it is concluded that SVM is the preferred classifier for our sentiment analysis model to analyze the Covid-19 vaccination tweets.