Collecting big data
Anglophone tweets were collected using the Twitter application programming interface (Twitter API) [28]. The application to use the Twitter API was approved by Twitter for the academic research purposes of this paper. Over six million tweets (n=6,795,462) posted from March 2007 to June 2021 were collected by searching for the six basic emotions [23], the window opening in March 2007 since the first tweet was posted on 21 March 2007 [29].
Table 2 lists the six basic emotions, covered by six representative (e.g., #joy) [13] and 18 synonymous emotion hashtags (e.g., #fun, #joyful, and #enjoy) [11, 12, 14]. This paper adopted the representative emotion hashtags (see Table 1) suggested by Mohammad [13] and the synonymous emotion hashtags (e.g., #worried, #pissed, and #eww) suggested by Saravia et al. [14].
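For illustration, hashtag-based collection through the Twitter API can be sketched with the tweepy client as below; the bearer token, the example query, and the requested fields are illustrative assumptions, not the exact collection script used for this paper.

```python
# Minimal sketch of hashtag-based collection via the Twitter API v2
# (academic research track). The token, query, and fields are
# illustrative assumptions, not the paper's exact collection script.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # hypothetical token

# Search English tweets containing one emotion hashtag, excluding retweets.
query = "#joy lang:en -is:retweet"
for page in tweepy.Paginator(
    client.search_all_tweets,            # full-archive search (academic access)
    query=query,
    tweet_fields=["created_at", "text"],
    max_results=500,
):
    for tweet in page.data or []:
        print(tweet.created_at, tweet.text)
```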
Table 1

| Tweet after basic pre-processing | Hashtag |
|---|---|
| Seriously You act like you love me then ignore me Oh yea man that s how we do it | #anger |
| It makes me sick how many school lock downs have had to happen in the past few days all the press about CT just eggs on copycats | #disgust |
| I can take a lotta scary stuff but spiders cross the line | #fear |
| Our family would like to wish you and your beautiful families a very happy prosperous holidayseason | #joy |
| Sometimes I wish stopping was easier But addiction is just too strong problems addiction selfdepreciative help pls | #sadness |
| How could a laptop under a blanket even be comfortable cat | #surprise |
Table 2
The frequency statistics and length of tweets

| Emotions | Representative | Synonymous emotion hashtags | Frequency | Percent (%) | Length (mean characters) |
|---|---|---|---|---|---|
| Anger | #anger | #angry #mad #pissed | 1,159,456 | 20.54 | 84.96 |
| Disgust | #disgust | #awful #disgusted #eww | 780,674 | 13.83 | 79.17 |
| Fear | #fear | #feared #fearful #worried | 514,452 | 9.11 | 88.07 |
| Joy | #joy | #enjoy #fun #joyful | 1,102,663 | 19.53 | 91.07 |
| Sadness | #sadness | #depressed #grief #sad | 1,200,969 | 21.27 | 90.16 |
| Surprise | #surprise | #strange #surprised #surprising | 886,925 | 15.71 | 79.48 |
| Total | 6 hashtags | 18 hashtags | 5,645,139 | 100.00 | 85.49 |
To clean the dataset, emotion hashtags, website addresses, and special characters were removed from tweets as the basic pre-processing step (see Table 3), leaving only English words. Then, all duplicate tweets were removed so that every tweet was unique [12]. Tweets containing fewer than three English words, as well as re-tweets, were also excluded [13]. This resulted in over five million tweets (n=5,645,139).
Table 3

| Process | Pre-processing | Tweet text |
|---|---|---|
| Raw | None | that terrible moment when you end an essay with "so... yeah" and then forget to change and ediT IT BEFORE YOU HAND IT IN #anger |
| Basic | Only English words | that terrible moment when you end an essay with so yeah and then forget to change and ediT IT BEFORE YOU HAND IT IN |
| Moderate | Lowercased English words | that terrible moment when you end an essay with so yeah and then forget to change and edit it before you hand it in |
| Rigorous | Stop-words removed | terrible moment end essay yeah forget change edit hand |
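As an illustration of the basic pre-processing and filtering steps described above, a minimal Python sketch follows; the regular expressions and helper names are assumptions about a typical implementation, not the paper's actual code.

```python
import re

def basic_preprocess(text: str) -> str:
    """Basic pre-processing: keep only English words (see Table 3).

    The exact patterns are illustrative assumptions."""
    text = re.sub(r"#\w+", " ", text)           # drop hashtags
    text = re.sub(r"https?://\S+", " ", text)   # drop website addresses
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop special characters/digits
    return re.sub(r"\s+", " ", text).strip()    # normalise whitespace

def filter_tweets(tweets):
    """Remove duplicates and tweets with fewer than three English words."""
    seen, kept = set(), []
    for raw in tweets:
        cleaned = basic_preprocess(raw)
        if len(cleaned.split()) < 3 or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept

print(filter_tweets(["I am so happy today!!! #joy http://t.co/x", "ok #joy"]))
```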
The TEC dataset [13] consisted of 21,047 tweets posted from November 2011 to December 2011. After the basic pre-processing, there were 558 duplicate cases (2.65%) within the TEC dataset and 731 duplicate cases (3.47%) between our dataset and the TEC dataset. Since these proportions are small, the duplicates were not excluded, preserving the original dataset.
The collected dataset consisted of publicly available information and did not store any personally identifiable information. It offered two key pieces of information: the tweet text as X values and the six emotion hashtags as y values (see Table 1). Exploratory data analysis follows in the next section.
Exploratory data analysis
Understanding a dataset plays a vital role in improving model performance and in finding missing, incorrect, or biased data. Therefore, the unique characteristics and distributions of the dataset should be identified.
Table 2 shows the frequency statistics and length of tweets (i.e., the number of characters per tweet). The most frequent emotion was 'sadness' (21%), followed by 'joy' (20%) and 'anger' (20%), whereas 'fear' (9%) was the least prevalent. Each emotion consisted of one representative emotion hashtag (e.g., #anger) and three additional synonymous emotion hashtags (e.g., #angry, #mad, and #pissed).
Two additional features can be extracted from tweets: (i) the number of characters per tweet (Mean = 85.88, SD = 51.42, Median = 77, Min = 5, Max = 336, Q1 = 49, Q3 = 107) and (ii) the number of words per tweet (Mean = 15.41, SD = 8.84, Median = 14, Min = 3, Max = 95, Q1 = 9, Q3 = 20). Table 2 shows that most tweets were shorter than the 140-character limit (raised to 280 characters in 2017), and that 'joy' (length = 91.07) had the highest mean number of characters, whereas 'disgust' (length = 79.17) had the lowest.
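For reference, such descriptive statistics can be computed with a short pandas summary; the toy data and the column name text are illustrative assumptions.

```python
import pandas as pd

# df is assumed to hold one cleaned tweet per row in a 'text' column (toy data).
df = pd.DataFrame({"text": ["terrible moment end essay", "so happy right now"]})

df["n_chars"] = df["text"].str.len()              # characters per tweet
df["n_words"] = df["text"].str.split().str.len()  # words per tweet

# Mean, SD, median, min, max, and quartiles, as reported in the text.
print(df[["n_chars", "n_words"]].describe(percentiles=[0.25, 0.5, 0.75]))
```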
Figure 1 shows a word cloud of the most frequently occurring words after the rigorous pre-processing (see Table 3). Interestingly, the figure shows that 'love' (n=328,589) was one of the most mentioned emotion words; it implies positive valence but does not map to any specific emotion.
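A word cloud in the style of Figure 1 can be generated, for instance, with the wordcloud package; the toy frequencies below merely echo counts mentioned in this section.

```python
# Sketch of generating a Figure-1-style word cloud from word frequencies.
# The counts here are examples from the text, not the full frequency table.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

frequencies = {"love": 328589, "now": 233949, "don": 217213, "one": 213669}
cloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```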
A few tweets included multiple synonymous emotion hashtags (e.g., 'I am #anger #fear'; 2.06%; n=116,398), and very few tweets included multiple different emotion hashtags (e.g., 'I am #anger #joy'; 0.11%; n=6,422). Also, about one in ten tweets included synonymous emotion words (e.g., 'I am anger #fear'; 13.05%; n=737,550), whereas very few tweets included different emotion words (e.g., 'I am anger #joy'; 0.30%; n=17,233).
In summary, these results show that predicting emotions from emotion words alone, or from other frequent words such as 'now' (n=233,949), 'don' (n=217,213), and 'one' (n=213,669), might be difficult. They also suggest applying deep learning algorithms, which can focus on the contextual meaning of a sentence, rather than traditional ML algorithms, which focus on individual words.
[Fig. 1 about here.]
Dataset preparation process
This section introduces the preparation of the emotion-labelled dataset (n = 5,626,219) collected in the preceding sections. The dataset preparation process consists of three steps: 1) pre-processing, 2) dataset selection, and 3) a dataset splitting strategy. The purpose of pre-processing is to organise the input text uniformly, enhancing the recognition, performance, and efficiency of the model. Two pre-processing processes (the moderate and the rigorous process) were applied to the text datasets. As shown in Table 3, the moderate process lowercased the English words after the basic process. This process was applied to deep learning models, which recognise the contextual meaning of the entire sentence. Stop-words, i.e., frequently used words without special meaning, such as 'we', 'are', 'the', 'a', 'only', and 'in', were not removed at this level, as they can play a significant role in conveying meaning for deep learning models. The rigorous process removed stop-words after the moderate process. This process was applied to traditional ML models, which focus on words rather than sentences.
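A minimal sketch of the moderate and rigorous levels follows; using NLTK's English stop-word list is an assumption, as the paper's exact stop-word list is not specified.

```python
# Sketch of the moderate and rigorous pre-processing levels (Table 3).
# NLTK's English stop-word list is an assumption, not the paper's exact list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def moderate_preprocess(text: str) -> str:
    """Lowercase the basically pre-processed text (for deep learning models)."""
    return text.lower()

def rigorous_preprocess(text: str) -> str:
    """Additionally remove stop-words (for traditional ML models)."""
    words = moderate_preprocess(text).split()
    return " ".join(w for w in words if w not in STOP_WORDS)

example = "that terrible moment when you end an essay with so yeah"
print(moderate_preprocess(example))
print(rigorous_preprocess(example))
```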
Secondly, the pre-processed dataset was divided into six datasets according to two dataset selection criteria: the type of emotion hashtags (representative or synonymous) and their position (any, last quarter, or last position). As examples of hashtag position, 'I am john #joy' represents the last position, 'I am #joy john' represents the last quarter, and '#joy I am john' represents any position (a position-classification sketch follows Table 4). Table 4 provides detailed information about the six datasets, whose abbreviations mean: 24H: 24 hashtags; 6H: 6 hashtags; AP: any position; LQ: last quarter; and LP: last position.
Table 4

| Id | Type | Hashtags | Position | N |
|---|---|---|---|---|
| 24H-AP | synonymous | 24 | AP | 5,645,139 |
| 24H-LQ | synonymous | 24 | LQ | 4,023,748 |
| 24H-LP | synonymous | 24 | LP | 2,183,452 |
| 6H-AP | representative | 6 | AP | 1,478,116 |
| 6H-LQ | representative | 6 | LQ | 903,002 |
| 6H-LP | representative | 6 | LP | 390,630 |
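As referenced above, the position criterion can be sketched as follows; defining the last quarter over whitespace tokens is an assumption, since the exact boundary rule is not stated. Note that a last-position tweet also satisfies the last-quarter and any-position criteria, which is consistent with the decreasing dataset sizes in Table 4.

```python
def hashtag_position(text: str, hashtag: str) -> str:
    """Classify where an emotion hashtag appears in a tweet.

    Returns 'LP' (last position), 'LQ' (last quarter), or 'AP' (any position).
    Measuring the last quarter over whitespace tokens is an assumption."""
    tokens = text.split()
    idx = tokens.index(hashtag)
    if idx == len(tokens) - 1:
        return "LP"
    if idx >= 3 * len(tokens) / 4:
        return "LQ"
    return "AP"

print(hashtag_position("i am so very tired #sadness", "#sadness"))                   # LP
print(hashtag_position("so tired of all this endless drama #sadness today", "#sadness"))  # LQ
print(hashtag_position("#joy what a day it has been", "#joy"))                        # AP
```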
Finally, an 80/20 dataset splitting strategy for model training and validation was applied to the six datasets. This split ratio is commonly used for NLP-related and other ML tasks [1]. For example, the 24H-AP dataset (n=5,645,139) was split into 80% for training (n=4,516,111) and 20% for testing (n=1,129,028).
To evaluate the generalisability of the models' performance, the TEC dataset was used as an external test dataset. As noted in the related work section, the TEC dataset provides emotion-labelled tweets with representative emotion hashtags at the last position. The TEC dataset (n=21,047) was also split into 80% for training (n=16,837) and 20% for testing (n=4,210). Traditional and deep learning ML algorithms trained on the six different datasets, as well as on the TEC dataset, were then evaluated.
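The split itself is straightforward with scikit-learn; the toy data and the fixed random seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweets (X) and emotion labels (y).
X = ["i am so happy", "this is scary", "what a surprise", "so sad today"]
y = ["joy", "fear", "surprise", "sadness"]

# 80/20 split; the fixed seed is an illustrative assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(len(X_train), len(X_test))  # 3 1
```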
Traditional machine learning algorithms
Five ML algorithms (i.e., 1) k-nearest neighbours, 2) decision tree, 3) naive Bayes, 4) support vector machine, and 5) logistic regression) [30], trained on the six different datasets, were applied to propose the most effective dataset. These algorithms are briefly introduced below.
First, the K-Nearest Neighbours (KNN) algorithm is one of the most widely used instance-based learning algorithms. For classification, a new data point is assigned the majority label among its k nearest neighbours in the training data, where proximity is measured by a distance metric such as the Euclidean distance [31]. Because KNN stores the training data and defers computation until prediction, it is often described as a lazy learner.
Second, the Decision Tree (DT) algorithm is widely used in data science due to its advantages, such as strong predictive accuracy, an intuitive description of the model, and the selection of informative attributes during model design. The result of a DT analysis can be drawn as a tree diagram that splits the data entries by information gain (the informative attributes). DT is primarily used for classification purposes [31].
Third, the Naive Bayes (NB) algorithm, based on Bayes' rule, is mainly used for clustering and classification. Its underlying architecture rests on conditional probability, with the 'naive' assumption that features are conditionally independent given the class; it can be viewed as a simple form of Bayesian network in which the class is the parent of every feature [31]. NB is widely used for text classification, such as sentiment analysis [32–34].
Fourth, the Support Vector Machine (SVM) algorithm is one of the most widely used ML algorithms, mainly for classification. SVM works on the principle of margin maximisation: it draws as wide a margin as possible between the classes of data points [31]. The purpose of the SVM algorithm is to find the hyperplane in an N-dimensional space that most clearly separates the data points [30].
Fifth, the Logistic Regression (Logit) algorithm uses a simple model (a sigmoid function), in a manner analogous to ordinary least squares (OLS) regression. The relationship between the dependent and independent variables is expressed as a mathematical function and used in prediction models. The main difference from OLS is that the outputs are divided into specific categories [35]. However, this single-layer perceptron may not be suitable for high-dimensional big-data problems with many observations and an indefinite number of variables.
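All five algorithms are available in scikit-learn, as sketched below on toy data; the hyperparameters are library defaults or illustrative choices, not necessarily the paper's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for pre-processed tweets and emotion labels.
texts = ["so happy today", "this is terrifying",
         "so happy and glad", "really scary stuff"]
labels = ["joy", "fear", "joy", "fear"]
X = CountVectorizer().fit_transform(texts)  # count features (see below)

# The five traditional ML classifiers; settings are illustrative defaults.
classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "DT": DecisionTreeClassifier(),
    "NB": MultinomialNB(),               # suits count-vectorised text
    "SVM": LinearSVC(),                  # linear SVM scales to sparse data
    "Logit": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    clf.fit(X, labels)
    print(name, clf.predict(X))
```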
Deep Learning algorithms
Three representative deep learning algorithms (i.e., 1) artificial neural network, 2) recurrent neural network, and 3) convolutional neural network) [36] were applied to further evaluate the proposed effective dataset.
First, the Artificial Neural Network (ANN) with multi-layer perceptron (MLP) technology has become one of the most advanced ML technologies available today. It has been particularly successful in areas such as voice recognition, image analysis, and natural language processing [37]. An ANN is divided into three kinds of layers (i.e., an input layer, hidden layers, and an output layer). Hidden layers improve the accuracy of predictions by enabling the classification of the complex structures that may be found in big data.
Second, the Recurrent Neural Network (RNN) algorithm finds patterns in word sequences and classifies the result when given sequence data such as a sentence [38]. It extracts a pattern from a sentence by feeding in the text sequentially, rather than relying on a given word-feature vector as the traditional ML algorithms and the ANN discussed above do. In an RNN, previously input information gradually accumulates in the hidden state and is transmitted to the current input step, enabling predictive modelling of sequence data.
Third, the Convolutional Neural Network (CNN) algorithm can stack multiple convolutional layers. It is generally used to classify images by learning to extract the best features through various filters applied to the input image [39]. Yoon [40] demonstrates that CNN can be applied not only to image data but also to text data, with outstanding classification performance. While an RNN reflects the input order of words during training, a CNN classifies sentences by learning from the occurrence patterns of words in each sentence.
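As a minimal illustration of a text CNN in the spirit described here, a Keras sketch follows; the vocabulary size, sequence length, embedding dimension, and filter settings are illustrative assumptions, not the paper's configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 40, 6  # illustrative assumptions

model = keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),              # integer-encoded word ids
    layers.Embedding(VOCAB_SIZE, 128),           # word embeddings
    layers.Conv1D(128, 5, activation="relu"),    # convolution over word windows
    layers.GlobalMaxPooling1D(),                 # strongest feature per filter
    layers.Dense(64, activation="relu"),         # hidden layer
    layers.Dense(NUM_CLASSES, activation="softmax"),  # six emotion labels
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```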
Count vectorising to extract features from text
Since words cannot be provided directly as input data and only numerical data can be used to train ML models, it is necessary to convert words or sentences into numeric values through feature extraction. This paper adopts the count vectoriser: a method of constructing a word vector by counting how many times each word or word sequence appears.
Specifically, n-grams (sequences of n consecutive words) [36] of up to three words (3-grams) were included in the word vector. For example, given two input texts, 'John is happy' and 'John is angry', the feature names ('angry', 'happy', 'is', 'is angry', 'is happy', 'john', 'john is', 'john is angry', 'john is happy') are assigned. Accordingly, the array value of 'John is happy' is (0, 1, 1, 0, 1, 1, 1, 0, 1). The six different datasets and the TEC dataset were count vectorised to train the ML models.
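This example is reproducible with scikit-learn's CountVectorizer, where ngram_range=(1, 3) yields exactly the feature names and array value listed above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams, bigrams, and trigrams, as in the example above.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(["John is happy", "John is angry"])

print(vectorizer.get_feature_names_out())
# ['angry' 'happy' 'is' 'is angry' 'is happy' 'john' 'john is'
#  'john is angry' 'john is happy']
print(X.toarray()[0])  # vector for 'John is happy': [0 1 1 0 1 1 1 0 1]
```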
F1 score for multi-class classification
The purpose of the ML models trained with the word vector is to perform multi-class classification, in which the output indicates the likelihood of the input sentence belonging to each of the six emotion labels (i.e., 'fear', 'anger', 'sadness', 'joy', 'surprise', and 'disgust'). The ML models were evaluated using the F1 score. Although classification accuracy is widely applied because it is easy to measure, it has been criticised as unsuitable for real-world problems, since its simplicity prevents meaningful measurement on imbalanced data [30]. Vinodhini and Chandrasekaran [41] point out that the F1 score is a suitable measure for ML models tested on imbalanced data; it is the harmonic mean of precision and recall, both derived from the confusion matrix. The formula for the F1 score of each label (class) can be expressed as:
$$\text{F}_1\,\text{score}(l) = \frac{2 \times \text{precision}(l) \times \text{recall}(l)}{\text{precision}(l) + \text{recall}(l)},$$

where $l$ is the label (i.e., anger, disgust, fear, joy, sadness, and surprise), $\text{precision}(l) = \frac{\text{true positive}(l)}{\text{true positive}(l) + \text{false positive}(l)}$, and $\text{recall}(l) = \frac{\text{true positive}(l)}{\text{true positive}(l) + \text{false negative}(l)}$.
The F1 score calculated for each label was then weighted by the number of data points in that label to derive the weighted F1 score [42], which was applied to the proposed model evaluation. Overall, a total of eight ML algorithms (i.e., five traditional ML and three deep learning algorithms) were evaluated using the highest weighted F1 score (see Fig. 2).
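The weighted F1 score corresponds, for example, to scikit-learn's f1_score with average='weighted'; the labels below are toy values.

```python
from sklearn.metrics import f1_score

# Toy true and predicted emotion labels.
y_true = ["joy", "joy", "anger", "fear", "sadness", "joy"]
y_pred = ["joy", "anger", "anger", "fear", "sadness", "joy"]

# Per-label F1 scores are averaged, weighted by each label's support.
print(f1_score(y_true, y_pred, average="weighted"))
```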
[Fig. 2 about here.]