All experiments in this paper were conducted on a computer with a 1.80 GHz Intel Core i5 processor, 4 GB of RAM, and 64-bit Windows 11. Data preprocessing and all sentiment classification algorithms were implemented in Python with the TensorFlow framework.
A. Datasets and Data Pre-Processing
This paper selects two public datasets, a Chinese takeaway review dataset and a Chinese hotel review dataset, and additionally crawls 12,663 reviews about masks from more than a dozen stores in Jingdong Mall. The crawled dataset requires a manually added label column: all reviews with a three-star rating are deleted, reviews rated below three stars are labelled 0, and reviews rated above three stars are labelled 1. The three datasets were then merged into the original corpus for the experiments in this paper. After data preprocessing, the size and sentiment distribution of each dataset are shown in Table I.
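The star-rating labelling rule described above can be sketched in a few lines of Python; the function name and the toy data are illustrative, not from the paper:

```python
# Hypothetical sketch of the labelling rule: 3-star reviews are discarded,
# ratings below three stars become label 0 (negative), above three stars
# become label 1 (positive).
def label_from_stars(reviews):
    """reviews: iterable of (text, stars) pairs; returns (text, label) pairs."""
    labelled = []
    for text, stars in reviews:
        if stars == 3:  # neutral 3-star reviews are removed entirely
            continue
        labelled.append((text, 0 if stars < 3 else 1))
    return labelled

sample = [("poor fit", 1), ("okay", 3), ("great mask", 5)]
print(label_from_stars(sample))  # [('poor fit', 0), ('great mask', 1)]
```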
Table 1. Experimental dataset

| polarity | Hotel reviews | Takeaway reviews | Mask reviews | Total |
|----------|---------------|------------------|--------------|-------|
| negative | 2437          | 7836             | 2867         | 13140 |
| positive | 5314          | 3843             | 9316         | 18473 |
| total    | 7751          | 11678            | 12183        | 31613 |

B. Evaluation
Accuracy is the most commonly used evaluation metric in sentiment analysis research, but this paper addresses a binary sentiment classification problem on imbalanced data, so G_Mean and F_measure, two metrics commonly used for imbalanced sentiment classification, are selected as well.
The evaluation indicators are calculated by the following formula:
$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN} \tag{1}$$

$$Sensitivity=\frac{TP}{TP+FN} \tag{2}$$

$$Specificity=\frac{TN}{TN+FP} \tag{3}$$

$$G\_Mean=\sqrt{Specificity \times Sensitivity} \tag{4}$$

$$F\_measure=\frac{2 \times Recall \times Precision}{Recall+Precision} \tag{5}$$
where TP denotes the number of positive comments correctly classified as positive, FP the number of negative comments incorrectly classified as positive, FN the number of positive comments incorrectly classified as negative, and TN the number of negative comments correctly classified as negative. Recall is identical to Sensitivity, and Precision is defined as TP/(TP + FP).
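As a concrete illustration, the five metrics in Eqs. (1)–(5) can be computed directly from raw confusion counts; the counts in this Python sketch are toy values, not results from the paper:

```python
from math import sqrt

# Metrics of Eqs. (1)-(5), computed from the four confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    g_mean = sqrt(specificity * sensitivity)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    return accuracy, g_mean, f_measure

# Toy confusion counts for illustration only:
acc, g, f = metrics(tp=80, fp=10, tn=60, fn=20)
print(round(acc, 3), round(g, 3), round(f, 3))  # 0.824 0.828 0.842
```

G_Mean penalises a classifier that sacrifices the minority class, since a low recall on either class drags the geometric mean down, which is why it is preferred over plain Accuracy here.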
C. Experimental Design
Two levels of class imbalance, low and high, are studied: before the sentiment classification experiments, negative and positive comments are sampled at ratios of 3:1 and 7:1, respectively. For each ratio, this paper samples the corresponding numbers of positive and negative comments from the original corpus and divides them 6:2:2 into the training, validation, and test sets required for the experiments.
Because results vary considerably with the choice of training and test samples, each experiment is repeated ten times and the results are averaged to reduce experimental error. To verify the performance of the model proposed in this paper based on the BiLSTM framework, two families of algorithms, machine learning and deep learning, are compared on G_mean and F_measure under different degrees of class imbalance. The machine learning algorithms are SVM (Support Vector Machine), Random Forest, Naive Bayes, and Logistic Regression; the deep learning algorithms compared with BiLSTM are CNN, LSTM, and GRU.
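One run of the sampling-and-splitting procedure above can be sketched as follows; the function name, pool sizes, and the 3:1 sample counts are illustrative assumptions:

```python
import random

# Hypothetical sketch of one experimental split: draw negatives and
# positives at a 3:1 ratio, then divide the pooled data 6:2:2 into
# training, validation, and test sets.
def make_split(neg_pool, pos_pool, n_neg=3000, n_pos=1000, seed=0):
    rng = random.Random(seed)
    data = [(x, 0) for x in rng.sample(neg_pool, n_neg)] + \
           [(x, 1) for x in rng.sample(pos_pool, n_pos)]
    rng.shuffle(data)
    n = len(data)
    train = data[:int(0.6 * n)]
    val = data[int(0.6 * n):int(0.8 * n)]
    test = data[int(0.8 * n):]
    return train, val, test

# Ten repetitions with different seeds; metrics are then averaged over runs.
splits = [make_split(range(10000), range(10000, 15000), seed=s) for s in range(10)]
```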
D. Experimental Parameter Design
After tuning, the hyperparameters of the CNN and Word2Vec models are shown in Table II and Table III, respectively. The machine learning algorithms use the default parameter values in scikit-learn. During model training, to ensure comparability between models, all neural networks use a two-hidden-layer structure, and the Keras callbacks ReduceLROnPlateau and EarlyStopping set the learning-rate optimisation scheme; the specific parameter settings are shown in Table IV.
Table 2. CNN hyperparameter settings

| Neural network layer            | Tunable parameters       | Value |
|---------------------------------|--------------------------|-------|
| The first layer of convolution  | neurons                  | 256   |
|                                 | convolution kernel size  | 5     |
|                                 | activation function      | relu  |
| Dropout                         | convolution kernel size  | 5     |
| The second layer of convolution | neurons                  | 128   |
|                                 | convolution kernel size  | 5     |
|                                 | activation function      | relu  |
| The third layer of convolution  | neurons                  | 32    |
|                                 | convolution kernel size  | 3     |
|                                 | activation function      | tanh  |

Table 3. Word2Vec model parameter settings

| Tunable parameters | Value     |
|--------------------|-----------|
| algorithm          | Skip-gram |
| vector size        | 200       |
| min_count          | 3         |
| window             | 3         |

Table 4. Deep learning model parameter settings

| Tunable parameters | Value               |
|--------------------|---------------------|
| number of neurons  | 32                  |
| optimizer          | adam                |
| loss function      | binary_crossentropy |
| dropout            | 0.4                 |
| batch size         | 32                  |
| epochs             | 10                  |
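A two-hidden-layer BiLSTM with the Table 4 settings and the callbacks mentioned above can be sketched in Keras as follows; the vocabulary size, sequence length, and callback patience values are illustrative assumptions, not from the paper:

```python
# Sketch of a two-hidden-layer BiLSTM under the Table 4 settings.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

model = Sequential([
    Embedding(input_dim=20000, output_dim=200),      # Word2Vec-sized embeddings
    Bidirectional(LSTM(32, return_sequences=True)),  # first hidden layer
    Dropout(0.4),
    Bidirectional(LSTM(32)),                         # second hidden layer
    Dropout(0.4),
    Dense(1, activation="sigmoid"),                  # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Learning-rate schedule and early stopping, as described in the text.
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=32, epochs=10, callbacks=callbacks)
```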

E. Experimental Results
To ensure a fair comparison, all classification algorithms in the experiments adopt the same imbalanced-data processing scheme: when the degree of imbalance is low, Adaptive Synthetic Sampling is applied to the minority-class samples; when it is high, the minority-class samples are kept unchanged, the majority-class samples are sampled repeatedly into multiple groups, and each group is balanced against the minority class with Adaptive Synthetic Sampling.
1) Comparison of small proportion unbalanced sentiment analysis methods
According to the experimental design above, the different classification algorithms were run on the small-proportion imbalanced comment dataset; the comparison results are shown in Table V.
Table 5. Experimental Performance Comparison of Different Classification Algorithms for Small Proportion Unbalanced Datasets

| model               | G_Mean | F_measure | Accuracy |
|---------------------|--------|-----------|----------|
| CNN-BiLSTM          | 0.8529 | 0.8343    | 0.8816   |
| CNN                 | 0.8314 | 0.8072    | 0.8646   |
| BiLSTM              | 0.8474 | 0.8037    | 0.8666   |
| LSTM                | 0.8416 | 0.8027    | 0.8650   |
| GRU                 | 0.8444 | 0.8074    | 0.8676   |
| SVM                 | 0.7537 | 0.7085    | 0.8223   |
| Naive Bayes         | 0.5832 | 0.5122    | 0.6384   |
| Logistic regression | 0.7492 | 0.7012    | 0.8189   |
| Random Forest       | 0.8013 | 0.6864    | 0.8511   |

The experimental results show that the CNN-BiLSTM model proposed in this paper performs best. The four commonly used deep learning algorithms outperform the four machine learning algorithms overall, with average gains of 11.93, 15.32, and 8.33 percentage points in G_mean, F_measure, and Accuracy, respectively. Among the deep learning algorithms, CNN, BiLSTM, LSTM, and GRU are close in Accuracy and F_measure, but G_mean shows that CNN performs worst and BiLSTM best.
Specifically, Naive Bayes has the lowest Accuracy, at 0.6384, and also the lowest G_mean and F_measure, at 0.5832 and 0.5122; the CNN-BiLSTM model proposed in this paper has the highest Accuracy, at 0.8816, and its G_mean and F_measure, at 0.8529 and 0.8343, are also better than those of the other algorithms. Compared with BiLSTM, the proposed CNN-BiLSTM model improves G_mean, F_measure, and Accuracy by 0.55 (0.8529 vs. 0.8474), 3.06 (0.8343 vs. 0.8037), and 1.50 (0.8816 vs. 0.8666) percentage points, respectively. Compared with CNN, the gains are 2.15 (0.8529 vs. 0.8314), 2.71 (0.8343 vs. 0.8072), and 1.70 (0.8816 vs. 0.8646) percentage points.
2) Comparison of large proportion unbalanced sentiment analysis methods
Under a high degree of imbalance, the ensemble performance of the deep learning algorithms is compared according to the experimental design of this paper, as shown in Table VI.
Table 6. Experimental Performance Comparison of Deep Learning Ensembles for Large Proportion Unbalanced Datasets

| model  | G_Mean | F_measure | Accuracy |
|--------|--------|-----------|----------|
| BiLSTM | 0.8401 | 0.8209    | 0.8570   |
| LSTM   | 0.8389 | 0.8217    | 0.8566   |
| CNN    | 0.8204 | 0.8100    | 0.8485   |
| GRU    | 0.8091 | 0.8028    | 0.8386   |

Table 7. Experimental Performance Comparison of Machine Learning Algorithms for Large Proportion Unbalanced Datasets

| model               | G_Mean | F_measure | Accuracy |
|---------------------|--------|-----------|----------|
| SVM                 | 0.7721 | 0.7310    | 0.8151   |
| Naive Bayes         | 0.6344 | 0.5874    | 0.6506   |
| Logistic regression | 0.7657 | 0.7258    | 0.8083   |
| Random Forest       | 0.7916 | 0.7726    | 0.8302   |

The results show that the multiple-BiLSTM ensemble performs best, with a G_mean of 0.8401 and an F_measure of 0.8209, while GRU performs worst, with a G_mean of 0.8091 and an F_measure of 0.8028. BiLSTM and LSTM differ little in F_measure, possibly because the dataset used in this paper consists of short texts, which cannot fully exploit BiLSTM's ability to consider context; in terms of G_mean, BiLSTM improves on LSTM by only 0.14% ((0.8401 − 0.8389)/0.8389), but this still suggests that the BiLSTM model has potential for short-text sentiment classification. Table VII shows that the machine learning algorithms are indeed inferior to the deep learning ensemble model proposed in this paper on the imbalanced comment-text sentiment classification problem.
3) Comparison of imbalance treatment methods under the BiLSTM framework
To verify the experimental performance of the Adaptive Synthetic Sampling imbalance processing method under the BiLSTM framework for different imbalance situations, four sentiment classification methods with different imbalance handling are designed:
The first is full training + BiLSTM: the training set is not balanced, all training data are used, and the BiLSTM framework performs deep learning training and classification prediction.
The second is random oversampling + BiLSTM: the minority-class samples in the unbalanced training set are randomly oversampled and combined with the majority-class samples to form a balanced training set, and the BiLSTM framework performs training and prediction.
The third is random undersampling + BiLSTM: the majority-class samples in the unbalanced training set are randomly undersampled and combined with the minority-class samples to form a balanced training set, and the BiLSTM framework performs training and prediction.
The fourth is the Adaptive Synthetic Sampling + BiLSTM framework of this paper: the imbalance processing and the training and prediction framework are selected according to the degree of imbalance of the training set.
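The two baseline balancing steps in the second and third methods can be sketched in pure Python; the function names are hypothetical and the counts match the 7:1 setting used later in this section:

```python
import random

# Minimal sketches of the two classical balancing baselines.
def random_oversample(minority, target_size, seed=0):
    """Duplicate minority samples at random until reaching target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, seed=0):
    """Keep a random subset of the majority class of size target_size."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

neg = list(range(7000))   # majority: negative comments (toy stand-ins)
pos = list(range(1000))   # minority: positive comments

balanced_over = neg + random_oversample(pos, len(neg))
balanced_under = random_undersample(neg, len(pos)) + pos
print(len(balanced_over), len(balanced_under))  # 14000 2000
```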
When the imbalance ratio is low, the four imbalance treatments above are all based on the CNN-BiLSTM model proposed in this paper. When the imbalance ratio is high, the first method is full training + BiLSTM, i.e., 7000 negative and 1000 positive comments are used for training and BiLSTM is applied directly, while the second to fourth methods are based on the multiple-BiLSTM ensemble framework proposed in this paper. To further verify the performance of the multiple-BiLSTM ensemble when the degree of imbalance is high, an additional sentiment classification method is added for comparison: Adaptive Synthetic Sampling + BiLSTM, i.e., the minority-class samples are processed with Adaptive Synthetic Sampling, combined with the majority-class samples to form a balanced training set, and BiLSTM is used directly for training and prediction.
Figures 2 and 3 compare the experimental results of different methods in the G_mean and F_measure indicators under the low degree of imbalance and high degree of imbalance, respectively.
The experimental results show that the Adaptive Synthetic Sampling sentiment analysis method based on the BiLSTM framework proposed in this paper, by adopting a learning strategy matched to the degree of imbalance, maintains a clear performance advantage and achieves the best overall performance. Specifically, Fig. 2 shows that every CNN-BiLSTM method with imbalance treatment outperforms full training without it. Compared with the fully trained CNN-BiLSTM method without balancing, the Adaptive Synthetic Sampling CNN-BiLSTM method proposed for the low imbalance ratio raises G_mean by 53.10 percentage points (0.8529 vs. 0.3219) and F_measure by 22.69 percentage points (0.8343 vs. 0.6074), and it also improves to a certain extent on the CNN-BiLSTM methods balanced by random undersampling or random oversampling.
In Fig. 3, the Adaptive Synthetic Sampling multiple-BiLSTM ensemble method proposed for the high imbalance ratio outperforms simple random oversampling and random undersampling in both G_mean (0.8401) and F_measure (0.8209). From the model perspective, the multiple-BiLSTM ensemble indeed outperforms using only a single BiLSTM model: its G_mean is at least 5.03% higher ((0.8401 − 0.7999)/0.7999) and its F_measure at least 5.55% higher ((0.8209 − 0.7777)/0.7777). Moreover, the G_mean and F_measure results show that when the training set is highly imbalanced, simple random undersampling and random oversampling may fail, and can even be inferior to the fully trained BiLSTM model.
F. Result Analysis
In the comparison of small-proportion imbalanced sentiment analysis methods, the machine learning methods are more affected by the imbalanced data distribution than the deep learning methods and perform worse. Among the commonly used deep learning methods selected in this paper, performance from low to high is CNN, LSTM, GRU, and BiLSTM. CNNs have often been used for sentiment analysis in recent years, but the experimental results here show that their performance is limited when used alone. LSTM uses a gate mechanism to alleviate the vanishing-gradient problem of traditional RNNs and can preserve and control long-term memory. GRU is a variant of LSTM with fewer parameters and faster convergence, and the experimental results show that GRU performs slightly better than LSTM. BiLSTM combines a forward and a backward LSTM, solving LSTM's inability to encode back-to-front information, and performs better than LSTM in this experiment. Therefore, a CNN is added on top of BiLSTM, yielding the better-performing CNN-BiLSTM model, which can both establish temporal relationships and characterise local spatial features.
In the comparison of large-proportion imbalanced sentiment analysis methods, the ensemble performance of the deep learning methods from low to high is GRU, CNN, LSTM, BiLSTM. The results show that, compared with the LSTM and CNN ensembles, GRU's advantages of few parameters and fast convergence are not fully exploited in ensemble learning. This illustrates that a model should not be chosen blindly on the basis of its general strengths and weaknesses; in an experiment, the most appropriate processing method and model must be chosen according to the actual situation.
Under the BiLSTM deep learning framework, simple random undersampling and random oversampling, compared with fully trained BiLSTM without imbalance treatment, each have advantages and disadvantages at different imbalance ratios. On the one hand, this shows that an imbalanced data distribution affects model performance to a certain extent; on the other hand, when the degree of imbalance is large, simple random undersampling and oversampling may be unsatisfactory, and more attention should be paid to improving performance while balancing the data distribution. After many experiments, Adaptive Synthetic Sampling was selected among the imbalance treatment methods, with different learning strategies adopted for different degrees of imbalance. The resulting Adaptive Synthetic Sampling sentiment analysis method based on the BiLSTM framework maintains a good performance advantage, showing that model performance depends not only on the classification method but also on the data distribution and data quality.