As stated earlier, two models are proposed in this work: one for supervised learning on labeled data and the other for unsupervised learning on unlabeled data. The labels in a dataset are used to predict one attribute of the data from the others. Labeling, for example, is the process of determining the sentiment of a data sample (positive, negative, or neutral) based on the words in the text or other information. Figure 1 illustrates the difference between labeled and unlabeled data.
Supervised learning is a predictive machine learning approach that works with annotated data and forecasts the label of incoming data samples based on previously seen examples. Unsupervised learning, on the other hand, works with unlabeled data, i.e., a dataset that has features but no prediction target. A dataset is a logically organized collection of data, usually tied to a particular piece of research; numeric, bivariate [28], multimodal [29], and qualitative [30] data sets are some examples. A dataset may, for instance, record each student's quarterly grades in a specific curriculum. The datasets used in this research are listed in Table 3.
Table 3 – Details of datasets used

| Dataset | Name | Labeled/Unlabeled | Balanced/Imbalanced | Total count of data | Class of data | Specifications |
|---|---|---|---|---|---|---|
| Dataset 1 [31] | Sentiment140 | Labeled | Balanced | 1,600,000 | Binary | Positive – 50,000; Negative – 50,000 |
| Dataset 2 [32] | IMDB review | Labeled | Balanced | 50,000 | Binary | Positive – 12,500; Negative – 12,500 |
| Dataset 3 [33] | Clothing | Labeled | Imbalanced | 22,640 | Binary | Positive – 18,539; Negative – 4,101 |
| Dataset 4 [34] | Amazon MP3 dataset | Labeled | Imbalanced | 28,469 | Multi-class | Positive – 21,987; Negative – 6,482; Neutral – 2,531 |
| Dataset 5 | IMDB review | Unlabeled | Balanced | 50,000 | Binary | NA |
| Dataset 6 | Clothing | Unlabeled | Imbalanced | 22,640 | Binary | NA |
| Dataset 7 [35] | Unlabeled dataset | Unlabeled | Imbalanced | 20,500 | Multi-class | NA |
The following subsections elaborate on the methods suggested for the above two scenarios in detail.
a) Supervised Sentiment Analysis on labeled datasets
The main aim of this first part of the research is to unearth the subtleties of the BERT model [9] and to reduce its computational cost and time without degrading the performance of the original model. For researchers with limited computational resources, BERT's [9] lengthy training and inference times are a significant impediment [36]. Beyond merely cutting training time, improving training speed allows experiments to be repeated more quickly and, eventually, leads to faster solutions. Figure 2 depicts a comprehensive flowchart of the proposed methodology.
i) Basic data cleaning
The first stage of the proposed model entails gathering relevant datasets and preparing them for the subsequent modules. Given the substantial amount of unstructured information available on the web today, extracting precise sentiments from unclean data is a difficult undertaking. Natural Language Processing plays a critical role here by analyzing and interpreting large volumes of raw text. The first step in preparing the datasets is to remove noise from the data. Noise is defined here as special characters, brackets, square brackets, extra white space, URLs, and punctuation. BeautifulSoup [37], a Python module, has been used here to remove such noise. The next step is to use the NLTK Tokenizer package to split the text into tokens. The third step focuses on making all of the words consistent: every word is converted to lowercase. After this, stopwords are removed from the corpus to further reduce the dimensionality of the data. Stopwords are English words that do not add meaningful content to a sentence [38]; they can be dropped without jeopardizing the meaning of the statement. Words like "are", "he", and "the", for example, are treated as stopwords in datasets intended for text processing. Following the removal of stopwords, the remaining pre-processing options are stemming and lemmatization. Stemming removes suffixes from words in order to reduce them to their root form, whereas lemmatization identifies the inflected form of a word and maps it to its base form. "Moving," for example, is stemmed to "mov," while "worst" is lemmatized to "bad." In this work, the lemmatized version of the words has been used for further processing.
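A minimal sketch of this cleaning pipeline is given below, assuming BeautifulSoup and NLTK are installed and the relevant NLTK resources (punkt, stopwords, wordnet) have been downloaded; function names such as clean_text are illustrative and not taken from the authors' implementation.

```python
# Illustrative cleaning pipeline: noise removal, tokenization, lowercasing,
# stopword removal, and lemmatization, as described in the text.
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(raw: str) -> list[str]:
    """Strip markup and noise, tokenize, lowercase, drop stopwords, lemmatize."""
    text = BeautifulSoup(raw, "html.parser").get_text()   # remove HTML markup
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                # drop punctuation, brackets, specials
    tokens = word_tokenize(text.lower())                    # tokenize and lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]      # remove stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]        # lemmatize remaining words

print(clean_text("<p>He is moving to the WORST place! http://example.com</p>"))
```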
ii) Fine-tuning of the BERT-base model
Once the basic pre-processing is completed, the inputs are passed to the BERT model [9] for classification. BERT [9] has been acclaimed as one of the most groundbreaking innovations in the NLP arena since its introduction in 2018. It is a deep learning model based on the transformer architecture. The transformer-based design enables BERT [9] to read input bidirectionally, attending to both the left and right context simultaneously. Every output element is connected to every input element in the BERT model [9], and the weightings between them are computed dynamically based on their association. This bidirectionality is primarily due to the incorporation of Transformers, which can process information in any order. Transformers contribute to the design and effectiveness of the BERT model [9] by relying on the self-attention mechanism and permitting training on massive amounts of data. The BERT model uses special tokens and inputs such as [SEP], [CLS], token IDs, mask IDs, segment IDs, and position embeddings. BERT is designed to accept one or two sentences as input; it uses the [SEP] token to separate two sentences and the [CLS] token to classify them according to the task at hand. BERT [9] is effective for task-oriented problems, can produce excellent outcomes in several languages, and fine-tuning BERT [9] yields a significant boost in performance. The BERT-base architecture [9] has 12 encoder layers and roughly 110 million trainable parameters, while the larger BERT-large architecture has 24 encoder layers and about 340 million parameters. However, despite its consistently excellent results, BERT [9] has a number of disadvantages, most of which are related to its magnitude. BERT's massive size stems from its structure and volume: there are a great many weights to update and the computation is complicated, so training is costly. Although fine-tuning BERT [9] is always an option, fine-tuned models raise issues of their own, such as non-convergence of the outputs and over-fitting. Traditional strategies for shrinking BERT [9] include changes to the basic design, modifying layers based on their priority, effective optimization policies, shrinking or thresholding models, and model distillation. As researchers started to explore BERT in a myriad of tasks, new models such as ALBERT, RoBERTa, and ELECTRA [39] emerged to address specific challenges and reduce the complexity of the basic BERT model [9]. In order to improve the training speed of the proposed method, the authors suggest freezing the first layers of BERT as a means of optimizing BERT [9]. Freezing the early layers of BERT [9] reduces the number of parameters to be updated, which yields a speed boost and lower memory usage during training. During the initial phases of learning, the fundamental layers are frozen and are only unfrozen once the model has stabilized. Choosing which layers to freeze during training is a delicate task: it should be done in a manner that minimizes over-fitting and prevents large gradient changes from destroying the pre-trained properties. Another key consideration during freezing is to keep an eye on the dataset size and learning efficiency. The objective of this work is to freeze the initial encoder layers of the BERT model [9], because the early layers of deep learning models are assumed to capture generic properties while the upper layers are more task-oriented.
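To make the role of these special inputs concrete, the short sketch below encodes a sentence pair and prints the resulting token IDs, mask IDs, and segment IDs. The Hugging Face transformers library is assumed here purely for illustration; the paper does not specify which BERT implementation was used.

```python
# Illustrative only: shows how a sentence pair is packed as
# "[CLS] sentence A [SEP] sentence B [SEP]" with the associated IDs.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "the movie was great",
    "the acting felt flat",
    padding="max_length",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)

print(encoded["input_ids"])       # token IDs, including [CLS], [SEP], and [PAD]
print(encoded["attention_mask"])  # mask IDs: 1 for real tokens, 0 for padding
print(encoded["token_type_ids"])  # segment IDs: 0 for sentence A, 1 for sentence B
```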
The BERT-base model [9] has been used in this experiment; its specifications and its 12-layer encoder architecture are illustrated in Table 4 and Fig. 3, respectively.
Table 4 – BERT-base [9] specifications

| Model | Parameters | Layers | Hidden units |
|---|---|---|---|
| BERT-base | 110 million | 12 | 768 |
The input to the BERT-base model [9] consists of a sequence of tokens that are mapped to feature vectors before being processed by the model. The output is a sequence of vectors, each of which corresponds to the input token at the same position. The output of each encoder layer can be regarded as a sequence of context-specific embeddings, so in the forward pass each layer receives as input the sequence of contextual embeddings produced by the layer below it. As the input progresses through the stack, different aspects of the source are extracted, and each subsequent layer builds on the patterns revealed by the preceding one. The motivation for this work lies in the fact that stacking encoder layers in this manner can easily lead to over-fitting. Hence, the first n encoder layers are frozen, but not the embedding layer. Any information lost by freezing these layers is then compensated for by channelling the resulting projections through the fuzzy logic module.
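A minimal sketch of this freezing scheme is shown below, again assuming the Hugging Face transformers implementation of BERT-base; the value of n is an illustrative placeholder rather than the setting used in the experiments.

```python
# Freeze the first n encoder layers of BERT-base while leaving the embedding
# layer (and the remaining encoder layers) trainable.
from transformers import BertModel

N_FROZEN = 4  # illustrative choice of n, not the paper's tuned value

model = BertModel.from_pretrained("bert-base-uncased")

for layer in model.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False   # frozen: excluded from gradient updates

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```

Freezing in this way excludes the selected layers from gradient updates, while the embedding layer and the upper, task-oriented encoder layers continue to be fine-tuned.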
iii) Type-2 fuzzy module
Fuzzifier, rules, inference engine, and defuzzifier are the components of a fuzzy logic system (FLS), also referred to as a fuzzy inference system or a fuzzy controller [40]. Zadeh introduced the notion of Type-2 fuzzy sets as an extension of the conventional, Type-1, fuzzy set. Type-2 fuzzy sets have membership grades that are themselves fuzzy [41]. A Type-2 fuzzy membership grade can be any subset of the primary membership, and a secondary membership corresponding to each primary membership weighs the possibilities for that primary membership. For instance, consider the quandary of "neutral" being defined as either 10 meters (m) or 1,000 centimeters (cm) for a certain distance. Both phrases signify the same thing, but depending on this threshold of 10 meters, different people may have different interpretations of neutral. Similarly, how close a measurement must be to the threshold to count as neutral may also vary from person to person: 10.1 m, which just crosses the threshold, or 9.8 m, which is approaching it, may both be rounded off to the neutral value. If the opinions of many people are taken into account, the neutral distance eventually becomes increasingly fuzzy, and every such interpretation of neutral changes the fuzzy set, which then resembles a three-dimensional function. This idea, that concepts mean different things to different people, is what motivates Type-2 fuzzy logic. A Type-2 fuzzy set is simply the aggregate of all the points that make up the set along three axes.
Figure 4 represents the membership function of a general Type-2 fuzzy set along three axes. The uncertainty in the primary memberships of a Type-2 fuzzy set is represented by the footprint of uncertainty (FOU), the region between the upper and lower membership functions. The membership degree is thus a fuzzy set rather than a precise number, and the value of the membership function at each point of the FOU, which is a two-dimensional region, forms the third dimension. While classical Type-1 fuzzy logic struggles to handle such degrees of uncertainty, Type-2 fuzzy logic systems (T2FLSs) handle them efficiently. In comparison to Type-1 fuzzy sets, this added dimension provides more freedom for representing ambiguity. To retain the theoretical benefits of general Type-2 fuzzy sets while keeping processing tractable, the authors use interval Type-2 fuzzy sets in this work. The working of an interval Type-2 fuzzy logic system comprises the following steps:
a) The initial phase in the operation of an interval Type-2 fuzzy logic system is fuzzification, in which the crisp inputs are transformed into interval Type-2 fuzzy input sets.
b) Subsequently, the membership functions are generated. Both the primary and secondary membership functions have in this instance been implemented using the triangular membership function [43], as shown in Fig. 5.
c) As with the membership functions, the rule base is the same as in a Type-1 fuzzy logic system. A set of seven rules has been considered in this case, as mentioned in [12]; the rules are selected depending on the categorization of the datasets (binary or multi-class).
d) The interval Type-2 fuzzy input sets, together with the rule base, are fed into the inference engine, which produces output Type-2 fuzzy sets.
e) The inference engine then combines all the fired rules to produce the output of the interval Type-2 fuzzy logic system.
f) The Type-2 fuzzy outputs are subsequently type-reduced to Type-1 fuzzy sets.
g) Finally, defuzzification is performed to produce crisp output sets.
The intermediate extremities method has been utilized to demonstrate the affinity for a label, as stated later in the findings section.
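For illustration, the toy sketch below shows how a crisp input is fuzzified against a pair of triangular upper and lower membership functions and how the resulting interval grade can be collapsed to a single value. The membership parameters, the single fuzzy set, and the simple averaging type-reduction are assumptions made here for brevity; they are not the seven-rule base or the intermediate extremities method used in this work.

```python
# Toy interval Type-2 fuzzification with triangular membership functions.
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with corners a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def interval_t2_membership(x: float) -> tuple[float, float]:
    """Return the (lower, upper) membership of x for an illustrative fuzzy set.

    The region between the two triangles is the footprint of uncertainty (FOU).
    """
    upper = triangular(x, 0.2, 0.6, 1.0)          # upper membership function
    lower = 0.8 * triangular(x, 0.3, 0.6, 0.9)    # lower membership function
    return min(lower, upper), upper

def type_reduce(lower: float, upper: float) -> float:
    """Collapse the interval grade to a single (Type-1) value by averaging."""
    return (lower + upper) / 2.0

score = 0.55                                      # crisp input, e.g. a sentiment score
lo, hi = interval_t2_membership(score)            # fuzzification -> interval grade
print(f"FOU interval: [{lo:.3f}, {hi:.3f}], type-reduced: {type_reduce(lo, hi):.3f}")
```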
b) Unsupervised Sentiment Analysis on unlabeled datasets
The second part of this study focuses on extracting sentiments from unlabeled datasets. Because most real-world datasets are not labeled, building a model that effectively uncovers sentiments from such datasets is a sensible endeavor. The key challenge in these tasks is determining the best label for accurate model prediction. The successful use of BERT [9] for computing text embeddings has been demonstrated in this case as well. As previously noted, BERT's size [9] poses a problem in terms of both time and resources, necessitating fine-tuning in order to lower the model's computational complexity. Here, only the first n layers of the BERT model [9] have been used, so that the time and space costs incurred by the original BERT architecture [9] do not affect the prediction model. However, as the datasets in this scenario are unlabeled, considering only a subset of the layers in the BERT model [9] affects the effectiveness of the original BERT [9] to some extent. As a result, a neural network with one hidden neuron has been placed on top of BERT [9] during classification to help achieve good results in identifying sentiments from unlabeled data, reducing the loss caused by the deliberate layer selection. The approach for extracting sentiments from unlabeled datasets is depicted in Fig. 6.
The first step in identifying sentiments from unlabeled data is to clean or pre-process the data so that it is suitable to be passed as input to the BERT model [9] for determining word embeddings; pre-processing covers the basic procedures for transforming raw data into cleaner datasets. As previously indicated, the BERT-base model [9] has been employed for both experiments, with 12 transformer blocks, a hidden size of 768, and 12 attention heads. Furthermore, any input to BERT [9] must correspond to the basic tokens it already uses for its training tasks. After the embedding is completed, the target variables are chosen for the clusters: the desired labels for a binary classification task would be 'Positive' and 'Negative', while the intended labels in a multi-class task could be 'Positive', 'Negative', and 'Neutral'. Finally, the affinities between the vectorized forms of the words are calculated, and each sample is assigned to the cluster with the closest proximity. Once the predictions have been obtained from the model, their quality is evaluated using several measures.
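The sketch below illustrates this labelling idea under several simplifying assumptions: the Hugging Face transformers library is used, the layer index and the seed words for the target labels are arbitrary placeholders, and the single-hidden-neuron network described above is omitted for brevity.

```python
# Assign a label to an unlabeled review by comparing its embedding, taken from
# an early encoder layer of BERT-base, with embeddings of the target labels.
import torch
from transformers import BertModel, BertTokenizer

N_LAYERS = 4  # use only the first n encoder layers (illustrative value)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding taken from the n-th encoder layer."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden_states = model(**enc).hidden_states  # embedding output + 12 layers
    return hidden_states[N_LAYERS].mean(dim=1).squeeze(0)

# Prototype embeddings for the target labels (seed words are placeholders).
targets = {"Positive": embed("good excellent wonderful"),
           "Negative": embed("bad terrible awful")}

review = "the plot was dull and the acting was painful to watch"
review_vec = embed(review)
label = max(targets, key=lambda k: torch.cosine_similarity(
    review_vec, targets[k], dim=0).item())
print(label)
```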