The significant findings summarized at the close of the previous chapter were gathered from the literature on handling class imbalance and from an in-depth investigation of the fake news problem from a data quality perspective. Based on those observations, three strategies for sampling fake news data are analyzed:
i) Sampling using Clustering of News Vectors
ii) Data Augmentation using Partial Back Translation
iii) Data Augmentation using Synonym Word Replacement
3.1 Sampling using Clustering of News Vectors
The first approach presented in this work samples news vectors based on similar clusters. The vectors of news articles are generated using a word embedding technique. This section therefore provides an in-depth analysis of embedding techniques, existing sampling techniques, and clustering. A diagrammatic representation of this workflow is presented in Fig. 3.1 below.
3.1.1 Word Embedding Techniques
A comparative analysis of existing techniques is presented in Table 3.1 below. From the table, GloVe stands out as the most suitable static embedding for this case; BERT was also considered, but it is computationally intensive and, from a broader perspective, the resulting vectors do not differ substantially for this task. Hence, the GloVe technique is chosen for this work [26].
Table 3.1: Comparative analysis of Word Embedding Techniques

| Parameters / Model | TF | TF-IDF | Word2Vec (Continuous bag-of-words) | Word2Vec (Skip-gram) | N-Gram | GloVe |
| --- | --- | --- | --- | --- | --- | --- |
| Concept | Word occurrence frequencies | Word occurrence frequencies | Context words predict the pivot word | Center word predicts context words | Word occurrence frequencies computed and grouped based on the value of 'N' | Co-occurrence probability ratio between words |
| Size of data for training | More data | More data | More data | Less data | More data | Less data |
| Memory consumption | Linear | Linear | Linear | Linear | Linear | Quadratic |
| Effect of change in dimensionality | Training starts from scratch | Training starts from scratch | Training starts from scratch | Training starts from scratch | Can re-use N-grams | Can re-use co-occurrence matrix |
If little training data is available, Word2Vec proves to be a good choice. On the contrary, if sufficient training data is available, and considering the availability of memory, GloVe gives better accuracy. For this work, the GloVe embedding technique is preferred from the list above. The algorithm to generate news vectors by word embedding using GloVe is presented below.
To create the word embeddings of the articles, the real and fake arrays are initialized to empty. Then, for every document in the news dataset, the document vector is computed as the average of the vectors of all the words within the document. Once the vector for a document has been calculated, it is appended to the real or fake array according to the label associated with it.
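The sketch below illustrates this procedure, assuming pre-trained GloVe vectors in the standard text format (e.g., glove.6B.100d.txt); the names load_glove, embed_document, and dataset are illustrative rather than part of the original implementation.

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors from the standard text format."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_document(tokens, glove, dim=100):
    """Average the GloVe vectors of all in-vocabulary words in a document."""
    word_vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(word_vecs, axis=0) if word_vecs else np.zeros(dim, dtype=np.float32)

glove = load_glove("glove.6B.100d.txt")
real, fake = [], []                       # initialized to empty, as described above
for tokens, label in dataset:             # dataset: iterable of (token list, label) pairs (assumed)
    vec = embed_document(tokens, glove)
    (real if label == "real" else fake).append(vec)
```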
3.1.2 Optimal number of clusters
To choose the optimal number of clusters for oversampling, Silhouette Analysis, the Elbow Method, and the Davies-Bouldin Index were considered. Among these, silhouette analysis provides the best visualization and understanding of the optimal number. For the silhouette analysis, the clusteval library has been utilized and fitted to the minority-class vectors.
Based on Fig. 3.2, it can be concluded that for the FNH dataset, 100 is the optimal number of clusters for oversampling purposes.
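For reference, a minimal sketch of this selection step, using scikit-learn's KMeans and silhouette_score in place of the clusteval wrapper; the grid of candidate values and the variable names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, candidates):
    """Return the candidate cluster count with the highest mean silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

X = np.array(fake)                                  # minority-class vectors from 3.1.1 (illustrative)
k_opt = best_k(X, candidates=range(10, 201, 10))    # ~100 for the FNH dataset
```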
3.1.3 Clustering strategies
These vectors are subjected to clustering algorithms to group similar articles, and sampling is performed by selecting a single article as the representative of each cluster. There are two fundamental sampling strategies: i) undersampling and ii) oversampling. Oversampling gives better output than undersampling because it incurs no information loss; the oversampling strategy is therefore chosen here. A hybrid technique is also possible and is left for future exploration. The representative of a cluster can be selected i) randomly or ii) as the vector closest to the cluster mean. Both strategies are experimented with here.
For oversampling using random selection from clusters, the vectors for the news articles are first obtained as discussed in section 3.1.1. The K-Means algorithm is then applied to the minority-class vectors, where K is chosen optimally using silhouette analysis on the minority class. Once the clusters are created, a vector is randomly chosen from each cluster to be its representative; in this manner, K vectors are chosen for oversampling. An empty dataset is initialized, each of the K vectors is duplicated N times, and the duplicates are stored in it. In the final step, the sampled data is shuffled together with the original minority data, yielding the final dataset.
Oversampling using closest-to-mean selection follows the same steps, except that from each cluster the vector closest to the cluster mean is chosen as the representative. A sketch covering both selection strategies is given below.
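The following is a minimal sketch of both variants, assuming scikit-learn's KMeans; the function name cluster_oversample and its parameters are illustrative, not the original implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(minority, k, n_dup, strategy="random", seed=0):
    """Duplicate one representative per cluster n_dup times.

    strategy='random' picks any member of the cluster;
    strategy='mean' picks the member closest to the cluster centroid.
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(minority)
    sampled = []
    for c in range(k):
        members = minority[km.labels_ == c]
        if strategy == "random":
            rep = members[rng.integers(len(members))]
        else:
            dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
            rep = members[np.argmin(dists)]
        sampled.extend([rep] * n_dup)          # duplicate the representative N times
    out = np.vstack([minority, np.array(sampled)])
    rng.shuffle(out)                           # shuffle duplicates with original minority data
    return out
```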
This sampled data is then used to train an AdaBoost classifier; the results are presented in the next section.
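A minimal training sketch with scikit-learn's AdaBoostClassifier; the train/test split and the hyperparameters shown are assumptions, not reported settings:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# X, y: balanced vectors (oversampled minority stacked with majority) and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```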
3.2 Data Augmentation using Partial Back Translation
In the previous approach of sampling using clustering, K representative vectors were chosen and oversampling was applied directly to these vectors to remove the imbalance. An intuitive improvement is to incorporate data augmentation techniques in place of direct duplication.
Data augmentation exploits the meaning of a sentence to produce a new sentence with different syntax but the same semantics. It can be approached using back translation, random shuffling of words within sentences, synonym word replacement, and other techniques.
This section employs the back translation technique. However, this approach is computationally expensive, as it involves translating a large amount of content to another language and back again, which increases the load on the model and makes the process time-consuming. Hence, a threshold is selected to decide the number of samples to be given for back translation. The diagram below is a pictorial view of the proposed process.
Back translation of the entire content is computationally expensive and infeasible for the task at hand; hence, a partial back translation approach is introduced in this section. The algorithm above explains the steps taken to balance the minority class by increasing its instances using a combination of oversampling and back translation.
After creating the word embedding/vector representation of the content of each instance, K-Means clustering is applied, where K is decided based on silhouette analysis of the minority class. After the clusters have been created, a random vector is chosen from each cluster to be subjected to the partial back-translation technique. These representatives are first split into odd and even instances. Among the odd instances, T of them are subjected to back translation, while the remaining size(odd) - T are subjected to direct duplication Q times. The even instances are subjected to direct duplication R times.
The back-translate function takes as parameters the content and a sentence count S, and applies back translation to the first S sentences of the content. The translated content is translated back, and this round trip is repeated P times. P, Q, and R depend upon the imbalance ratio.
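The sketch below captures this procedure under stated assumptions: translate() is a hypothetical placeholder for any machine-translation service (no specific library is implied), sentences are split naively on '. ', and the odd/even split and the roles of T, P, Q, and R follow the description above.

```python
def translate(text, source, target):
    """Identity placeholder for a machine-translation call; swap in a real
    MT service here (hypothetical, no specific library implied)."""
    return text

def back_translate(content, s, p, pivot="de"):
    """Round-trip the first s sentences through a pivot language, p times."""
    sentences = content.split(". ")            # naive sentence split
    head, tail = ". ".join(sentences[:s]), ". ".join(sentences[s:])
    for _ in range(p):
        head = translate(translate(head, "en", pivot), pivot, "en")
    return head + (". " + tail if tail else "")

def partial_bt_oversample(reps, s, t, p, q, r):
    """Augment cluster-representative texts as in section 3.2: the first t odd
    instances are partially back-translated, the rest are duplicated q times;
    even instances are duplicated r times."""
    even, odd = reps[0::2], reps[1::2]
    out = []
    for i, doc in enumerate(odd):
        if i < t:
            out.append(back_translate(doc, s, p))
        else:
            out.extend([doc] * q)
    for doc in even:
        out.extend([doc] * r)
    return out
```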
3.3 Data Augmentation using Synonym Word Replacement
In the previous approach of sampling using clustering, K representative vectors were chosen and oversampling was applied directly to these vectors to remove the imbalance. An intuitive improvement is to incorporate data augmentation techniques in place of direct duplication. Section 3.2 explained data augmentation using partial back translation; this section employs the synonym word replacement technique, where random words in the content are replaced with their synonyms.
After creating the word embedding/vector representation of the content of each instance, K-Means clustering is applied, where K is decided based on silhouette analysis of the minority class. After the clusters have been created, a random vector is chosen from each cluster to be subjected to the synonym word replacement technique. In this technique, 20 random word indexes within the content are first generated, and the synonyms of the words at the chosen indexes are collected. From each list of synonyms, the first synonym is chosen and replaces the corresponding word in the content. The new content is then fed back into this function, and the process repeats N times.
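A minimal sketch of this step, assuming NLTK's WordNet as the synonym source (the original source of synonyms is not specified); the function names are illustrative:

```python
import random
from nltk.corpus import wordnet        # requires nltk.download('wordnet') once

def synonym_replace(content, n_words=20, seed=None):
    """Replace the words at n_words random positions with their first
    WordNet synonym, where one exists."""
    rng = random.Random(seed)
    words = content.split()
    for idx in rng.sample(range(len(words)), min(n_words, len(words))):
        synsets = wordnet.synsets(words[idx])
        if synsets:
            words[idx] = synsets[0].lemmas()[0].name().replace("_", " ")
    return " ".join(words)

def augment(content, n):
    """Feed each output back in, producing n variants of one representative."""
    variants = []
    for _ in range(n):
        content = synonym_replace(content)
        variants.append(content)
    return variants
```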
In this manner, N new contents are produced for each cluster representative using synonym word replacement; repeated across the K clusters, this gives K*N new instances. These newly created instances are added to the original minority instances, thus restoring the balance.