Self-Attention Networks and Adaptive Support Vector Machine for aspect-level sentiment classification

Aspect-level sentiment classification aims to integrate contextual information to predict the sentiment polarity of a specific aspect in a text. It has become quite useful and popular, e.g., for opinion surveys and product recommendation in e-commerce. Many recent studies exploit Long Short-Term Memory (LSTM) networks to perform aspect-level sentiment classification, but the limitation of long-term dependencies is not well solved, so the semantic correlations between every two words of the text are ignored. In addition, traditional classification models adopt the SoftMax function, based on probability statistics, as the classifier, ignoring the words' features in the semantic space. A Support Vector Machine (SVM) can make full use of the feature information and is appropriate for classification in high-dimensional space; however, it only considers the maximum distance between different classes and ignores the similarities between different features of the same class. To address these defects, we propose a novel two-stage architecture named Self-Attention Networks and Adaptive SVM (SAN-ASVM) for aspect-level sentiment classification. In the first stage, to overcome the long-term dependency problem, a Multi-Head Self-Attention (MHSA) mechanism is applied to extract the semantic relationships between every two words; furthermore, a 1-hop attention mechanism is designed to pay more attention to the important words related to the specific aspect. In the second stage, ASVM is designed to substitute for the SoftMax function in sentiment classification, which can effectively perform multi-class classification in high-dimensional space. Extensive experiments on the SemEval2014, SemEval2016 and Twitter datasets are conducted, and the comparative experiments show that the SAN-ASVM model obtains better performance.


Introduction
With the advancement of information technology, more and more people share their comments and experiences on the internet, which reflect their attitudes toward a product or service. How to mine these texts and extract the main information has therefore become one of the hot research topics in the field of Natural Language Processing (NLP). For example, in opinion surveys, sentiment classification can recognize people's attitudes toward a policy or service. Sentiment analysis is thus a very useful method to mine sentiment information (Pang and Lee 2008; Mohammad 2015).
In general, aspect-level sentiment analysis can excavate the semantic correlations within the context, but a text consists of several words and the same word may have different semantics in different contexts. For example, in the text ''The speed of computer is so fast that a lot of time is saved,'' ''fast'' is an attribute word with positive polarity. However, in another text, ''Time has passed so fast, I haven't had time to read this book,'' ''fast'' implies regret. In short, if the semantic environment and context of a word are ignored, the feature representation of the word will lack contextual information, which may lead to deviations in the sentiment orientation assigned to the specific aspect. Therefore, a method based on aspect-level sentiment analysis is needed.
LSTM and its variants have been prevalent and effective methods for large-scale sequential modeling; however, when the text is too long, the performance of LSTM on long-term dependencies decreases greatly (Tang et al. 2016a, b). Besides, LSTM models can process sequential information, but the interactive relationships between every two words are not well considered. Therefore, to effectively analyze the semantic correlations between every two words, attention mechanisms have been developed, which can analyze the correlations between every two words, pay more attention to the important words of the context, and decrease the influence of less-related words. Many previous works focus on the design of attention mechanisms to capture the more important words and generate context representations conditioned on different aspects. The Topic-Dependent Attention Model (TDAM) (Gabriele et al. 2019) employed a bidirectional Gated Recurrent Unit (GRU) to model the input sequence, with a hierarchical, multi-task learning architecture designed to calculate the weights; however, it ignored the relationships between topics and context, so important information could be lost. ATtention-based LSTM for AspEct-level sentiment classification (ATAE-LSTM) (Wang et al. 2016) was proposed to integrate the aspect embedding with each word's embedding and then analyze the mutual influences between the hidden-layer outputs obtained by the LSTM, yet this model did not learn the correlations of every word with the context in detail. To analyze the influence of aspects on every word of the context, Song et al. designed an attentional encoder layer and a target-specific attention layer to capture more valuable information between them.
In summary, to solve the long-term dependency problem and to consider the interactive relationships between every two words, LSTM networks or attention mechanisms are a good choice.
Because the SoftMax function is based on a probability model, it ignores some valuable information carried by the feature vectors, such as the differences and similarities among features. To make full use of the feature information, SVM is applied to map the samples' feature vectors into a high-dimensional space and generate a separating hyperplane by maximizing the margin between different samples (Alyami and Olatunji 2020; Han et al. 2020). However, the decision function of SVM only considers the differences between classes and ignores the valuable information about similarities within classes. Therefore, the Fisher method (Belhumeur et al. 1996) is introduced, which projects the samples of the high-dimensional space into a low-dimensional subspace and then takes the smallest intra-class differences and the largest inter-class differences into account to classify the projected samples. This method makes the classified samples possess the smallest intra-class distance and the largest inter-class distance. We therefore combine the principle of structural risk minimization with the discriminant principle of the Fisher classifier to design the decision function of the Adaptive Support Vector Machine (ASVM) classifier; the goal is to balance the differences and similarities of features to improve the classification accuracy.
Overall, to enhance the semantic correlations between every two words in the text and improve the classification accuracy, we put forward a novel two-stage architecture. In the first stage, a Self-Attention Network (SAN) is designed to generate feature representations. More concretely, Multi-Head Self-Attention (MHSA) is utilized to extract the semantic correlations between every two words, which solves the long-term dependency problem by mapping the representation of every word into the same contextual semantic space. Besides, a 1-hop attention mechanism is applied to calculate the correlations between the specific aspect and every word of the context, which assigns a bigger weight to words strongly correlated with the aspect and a smaller weight to weakly correlated words. After that, cosine similarity is further employed to evaluate the sentimental influences on the aspect and to generate the final representation integrating the semantics of the aspect and the context. The first stage thus addresses the problems of long-term dependencies and semantic correlations between every two words. In the second stage, we design the ASVM classifier to conduct sentiment polarity prediction while balancing the inter-class and intra-class distances. Specifically, we adopt the GloVe pre-trained model to train the word embeddings, and all experiments are conducted on the public SemEval2014 (Kiritchenko et al. 2014), SemEval2016 (Pontiki et al. 2016) and Twitter (Dong et al. 2014) datasets.
In comparison with previous approaches to aspect-level sentiment analysis, the main contributions of this paper can be described as follows:
• We introduce a novel two-stage architecture, SAN-ASVM, for aspect-level sentiment analysis, in which the sentence representation integrating semantic correlations is generated by the SAN and sentiment classification is conducted by the ASVM.
• In the first stage, the MHSA mechanism is designed to fully learn the semantic correlations of the context, and the 1-hop attention mechanism is utilized to capture the vital words with greater influence on sentiment orientation, which significantly enhances the representation ability for context feature extraction.
• In the second stage, an ASVM classifier integrating the Fisher regulation is designed, which balances the inter-class and intra-class distances of the feature vectors; our approach obtains satisfactory results on the four datasets.
This paper is organized as follows. We elaborate the related work in Sect. 2. The proposed two-stage architecture for aspect-level sentiment analysis is presented in detail in Sect. 3. The ten groups of comparative experiments and their results are demonstrated in Sect. 4. Finally, conclusions are drawn and an outlook on future work is given in Sect. 5.

Related work
In this section, we review recent research from four aspects: the first part introduces research approaches to aspect-level sentiment classification; the second part discusses methods for solving the long-term dependency problem; the third part elaborates on attention mechanisms; and the last part describes SVM classifiers.

Aspect-level sentiment classification
Aspect-level sentiment classification is a fine-grained classification task, which aims to extract the semantic features of the context corresponding to a specific aspect and predict the sentiment polarity of that aspect. Traditional methods (Shang et al. 2016) usually extract the context's features using n-grams, bag-of-words or word order (Atrio et al. 2019) and predict sentiment polarity using Naïve Bayes (Jamilu et al. 2019), Support Vector Machines (SVM) (Akshay et al. 2018) or decision trees (Chauhan and Sehgal 2018; Sridharan and Komarasamy 2020). These methods can achieve satisfactory classification results, but their dependence on feature engineering leads to substantial labor and time costs. Recent research focuses on Embeddings from Language Models (ELMo) (Ma et al. 2018), the Generative Pre-trained Transformer (GPT) (Xue and Li 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Yi et al. 2018) to generate word representations, and uses deep networks to conduct sentiment classification.
Tang et al. (2016) designed two LSTM networks to capture the features of the texts to the left and right of an aspect and concatenated the preceding and succeeding outputs as the final feature representation. Meanwhile, Zhang et al. employed Gated Neural Networks (GNN) to extract the sentiment influences on the target from the left and right texts. However, these methods lack semantic coherence and fluency for a context divided by the specific aspect; the context is no longer continuous, so the semantic information is not fluent. Noh et al. (2019) applied a CNN as an Aspect-Map Extraction Network (AMEN) to extract the aspect representation and constitute an aspect map, and then took the representations of the aspect and the context as input to conduct sentiment classification. The Attention-over-Attention mechanism was introduced to model the interactive relationships between the specific aspect and every word of the text, which can assign greater weights to the words important for sentiment classification. TNet, a Transformation Network architecture with a CNN layer, was put forward to combine the outputs of a bidirectional LSTM network and generate the next word representation through a convolution layer.
In summary, this part reviews recent works on aspect-level sentiment analysis, and we find that the effectiveness of the feature representation is very important for sentiment classification. However, the biggest defect of these works is that they do not effectively analyze the semantic correlations of the context, such as which words in the context are decisive for the polarity of the aspect.

Methods on solving long-term dependencies
Recently, LSTM networks have been widely applied to process sequential data (Bengio et al. 1994; Liu et al. 2018a, b; Rehman; Mohammad et al. 2019; Tai et al. 2015); however, when the text is very long, the long-term dependency problem is not solved efficiently, so the performance of LSTM and its variants is unsatisfactory.
Generally, researchers have adopted LSTM or its variants to address the long-term dependency problem. Liu et al. (2017) proposed a flexible forget mechanism to decide when and how to forget the state information in LSTM. But this method only focused on the processing units of LSTM, ignoring each key word's weight within the whole context and the correlations of information transmission between layers. Wang et al. integrated the idea of deep residual learning (He et al. 2016) into LSTM to propose a stacked residual LSTM model for predicting sentiment polarity, which effectively acquired further features between processing units and easily tackled the degradation problem. A novel architecture (Liu and Guo 2019) combining bidirectional LSTM, an attention mechanism and a convolutional layer was proposed to extract higher-level semantic features and set sentiment weights on the hidden-layer outputs, so that context representations could be obtained. However, when an LSTM model is adopted to generate the word embeddings, the long-term dependencies of the sequential text degrade gradually as the text length increases, so the correlations between every two words cannot be acquired accurately. All in all, it is very important to consider the semantic correlations between every two words without a limitation on text length; therefore, an attention mechanism should be introduced to calculate the semantic weights.

Methods for attention mechanism
Attention mechanisms can help draw more attention to significant information and excavate in-depth semantic correlations. Hence, attention mechanisms based on feature selection or feature representation have become more and more popular. Interactive Attention Networks (IAN) (Ma et al. 2017) were proposed to interactively learn attention between the context and the target. Although this proposal considered the interactive influences of target and context, it still used average vectors to obtain the attention scores. Based on that, a Feature-Enhanced Attention Network (Yang et al. 2018) was proposed, which developed a multi-view co-attention mechanism to generate the final representation integrating the context. Cheng et al. (2017a, b) came up with a hierarchical attention mechanism for aspect-level sentiment analysis. An alternating co-attention mechanism was proposed, which could focus on the key words to obtain the context representation according to the attention weights of the target over the context (Yang et al. 2019). Wang and Chang (2017) put forward a dependency-attention mechanism to solve the long-term dependency problem of LSTM and to make the LSTM networks capture the key information of targets. A multiple-attention mechanism was also proposed to solve long-term dependencies for sentiment analysis. All these mechanisms achieve better performance by analyzing the sentiment relevance between aspects and context (Bahdanau et al. 2019), but they do not consider the semantic correlations at the word level, so the word representations integrating the context are not fully exploited.
GloVe and BERT embeddings have also been used to acquire word representations. Rietzler et al. (2019) performed a cross-domain evaluation with an adapted BERT language model for aspect-level sentiment classification, which achieved state-of-the-art performance. Xu et al. explored a novel post-training method for the BERT language model to enhance performance on aspect sentiment classification. DA-BERT (Pei et al. 2019) adopted the transformer as the feature extractor to calculate, in parallel, the correlation values of each word with all other words in the context, and experimental results indicated that this model could achieve better performance. The key reason these BERT-based models achieve the best results on sentiment analysis is the bidirectional self-attention mechanism, which can effectively capture semantic correlations at the word level. Inspired by this idea, we design a Multi-Head Self-Attention mechanism to solve the long-term dependency problem and analyze the semantic correlations between every two words of the context.

Methods for SVM classifier
Compared with Naive Bayes, Random Forest and K-Nearest Neighbors, SVM has higher classification accuracy for small samples and high-dimensional spaces (Sujata and Parteek 2019; Sarkar et al. 2020). Kalarani et al. (2019) employed part-of-speech tags and sentiment features to extract context features and then utilized SVM and an Artificial Neural Network (ANN) for classification. Wang et al. designed Char Convolutional Neural Networks (Char-CNN) with SVM to mine customers' purchasing sentiment and showed that the SVM classifier outperformed the others. N-grams or TF-IDF (Term Frequency-Inverse Document Frequency) have been applied to produce sentiment features with SVM as the classifier (Laoh et al. 2019; de Godoi Brandao et al. 2019). However, a common shortcoming is that the decision function of SVM does not consider the inter-class and intra-class characteristics, so the classification accuracy can decrease greatly. Consequently, we design an adaptive mechanism for SVM to improve the accuracy of sentiment classification.

The proposed method
In this section, we first adopt the GloVe pre-trained model to train every word's representation and then introduce the Multi-Head Self-Attention (MHSA) network to focus on the semantic dependencies of every word with the context. After the semantic correlations between every word and the aspect are calculated by the 1-hop attention mechanism and the cosine similarity function, the representation of the context is obtained by weighted summation. In the second stage, we design the ASVM classifier to perform sentiment classification. The general flowchart of the model, named Self-Attention Network and Adaptive Support Vector Machine (SAN-ASVM), is shown in Fig. 1.

Self-attention network
In the first stage of SAN-ASVM, we design the Multi-Head Self-Attention network and the 1-hop attention mechanism to capture the semantic dependencies and correlations between every two words of the text, and the cosine similarity function is employed to pay more attention to the words related to the sentiment polarity of the specific aspect. Because MHSA does not need position information or part-of-speech tags, the computing process is simplified. Hence, by combining the MHSA network and the cosine similarity function, more contextual information related to the aspect can be captured. Next, we present the structure of the MHSA mechanism; the overall structure, containing the GloVe pre-trained model, the tanh function and the SoftMax function, is shown in Fig. 2.

Input layer
In the field of NLP, one-hot representations suffer from the curse of dimensionality and a semantic gap, whereas distributed word embeddings can capture more semantic and syntactic information, so we adopt distributed word embeddings. GloVe is selected in our proposed algorithm to train the contexts' feature vectors. The datasets include SemEval 2014 (covering the Restaurant and Laptop domains) and the Twitter dataset.
Assuming the size of the context is n, i.e., the context is composed of n words, the context can be described as c = (w_1, w_2, w_3, ..., w_n), in which an aspect with m words is contained; the aspect can be denoted as a = (w_i, w_{i+1}, w_{i+2}, ..., w_{i+m-1}). Here we only consider the situation in which one context contains a single aspect; in general, however, one context may contain several aspects. Therefore, if the context includes several aspects, we regard every distinct aspect paired with the same context as one unclassified example. The GloVe pre-trained model is utilized to obtain the word embeddings, so the context representation can be expressed as hc = (x_{c1}, x_{c2}, x_{c3}, ..., x_{cn}), while the aspect's representation is ha = (x_{ci}, x_{c(i+1)}, x_{c(i+2)}, ..., x_{c(i+m-1)}).
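The input-layer construction above can be sketched as follows. The tiny 4-dimensional `glove` table is a stand-in for the real 300-dimensional GloVe vectors, and the fallback to a zero vector for unknown words is our simplifying assumption.

```python
# Toy sketch of the input layer: look up GloVe-style vectors for a context
# c = (w_1, ..., w_n) and slice out the aspect span a = (w_i, ..., w_{i+m-1}).
glove = {
    "the":   [0.1, 0.0, 0.2, 0.1],
    "speed": [0.4, 0.3, 0.1, 0.0],
    "is":    [0.0, 0.1, 0.0, 0.2],
    "fast":  [0.7, 0.2, 0.5, 0.1],
}

def embed(tokens, table, dim=4):
    # Unknown words fall back to the zero vector (a common simplification).
    return [table.get(t, [0.0] * dim) for t in tokens]

context = ["the", "speed", "is", "fast"]
aspect_start, aspect_len = 1, 1                      # aspect a = ("speed",)

hc = embed(context, glove)                           # hc = (x_c1, ..., x_cn)
ha = hc[aspect_start:aspect_start + aspect_len]      # aspect slice of hc
```

A context with several aspects would simply yield one `(context, aspect)` pair per aspect, matching the "one unclassified example per aspect" convention above.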

Multi-Heads Self-Attention mechanism
In the MHSA mechanism, we introduce two learned matrices: a query matrix W_Q ∈ R^{(n·d_dim)×d_dim} and a uniform matrix W_U ∈ R^{2d_dim}, where d_dim denotes the dimension of the word embeddings.
First, in order to analyze the semantic correlation between every two words, we develop the uniform matrix to learn this dependency relationship. The weights between every two words, denoting their degree of semantic consistency, are then acquired through the tanh activation function.
where U(x_ci, x_cj) denotes the semantic correlation between x_ci and x_cj. When x_ci is selected as the central word's representation and the other words' influences on it are calculated, the representation of x_ci integrating the contextual semantics can be deduced by the following equation.
Therefore, the weighted representation of every word can be gained in the same way. The output of a single self-attention head is hc^1 = (x^1_1, x^1_2, x^1_3, ..., x^1_n). Moreover, the Multi-Head Self-Attention mechanism is adopted to calculate the feature representations with contextual semantics in parallel, learning different semantic information from various perspectives. Concatenating the outputs of the k heads yields the final output of the Multi-Head Self-Attention mechanism.
where W_Q is the query matrix, which projects the concatenated representation into a specific semantic space.
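The multi-head structure described above can be sketched as follows. Note one assumption: the paper scores word pairs with a tanh over the uniform matrix W_U, whereas this sketch uses the more common scaled dot-product scoring to illustrate the same "every word attends to every other word" structure; all weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(hc, k=2):
    """hc: (n, d) word embeddings; k heads attend in parallel and their
    outputs are concatenated, like hc^1 ... hc^k above."""
    n, d = hc.shape
    assert d % k == 0
    dh = d // k
    heads = []
    for _ in range(k):
        # Per-head projections (hypothetical random weights, normally learned).
        Wq, Wk, Wv = (rng.standard_normal((d, dh)) * 0.1 for _ in range(3))
        Q, K, V = hc @ Wq, hc @ Wk, hc @ Wv
        # Pairwise scores couple every word with every other word directly,
        # so there is no long-term-dependency decay with distance.
        weights = softmax(Q @ K.T / np.sqrt(dh), axis=-1)
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1)     # (n, d)

out = multi_head_self_attention(rng.standard_normal((5, 8)), k=2)
```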
In addition, Position-wise Convolution Transformation (PCT) is employed to transform contextual information between the outputs of MHSA, and the equation is shown as follows.
where σ is the ReLU activation function and * is the convolution operator; W_1, W_2 ∈ R^{d_dim×d_dim} are the convolution kernels, and b_1, b_2 ∈ R^{d_dim} are the biases of the kernel functions, respectively.
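Since the PCT applies the same two kernels at every position, it can be sketched as two 1×1 convolutions, i.e. the same linear map applied to each word independently; the random weights below are hypothetical stand-ins for the learned kernels.

```python
import numpy as np

rng = np.random.default_rng(1)

def pct(h, W1, b1, W2, b2):
    """Position-wise convolution transformation:
    PCT(h) = ReLU(h * W1 + b1) * W2 + b2, applied at every position."""
    hidden = np.maximum(0.0, h @ W1 + b1)     # sigma = ReLU
    return hidden @ W2 + b2

d = 6
h = rng.standard_normal((4, d))               # MHSA output for a 4-word context
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b1, b2 = np.zeros(d), np.zeros(d)
out = pct(h, W1, b1, W2, b2)
```

Because the map is position-wise, transforming a single row in isolation gives the same result as transforming it inside the full matrix.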

1-Hop attention mechanism
In this part, we introduce 1-hop attention mechanism to analyze the influences of aspects on the context from the global perspective.
More concretely, in order to analyze the aspect's sentimental influence on every word of the text, we simply take the average pooling of the aspect's feature vectors from the pre-trained model, while the context's embedding is the original output of the PCT module; the aspect's representation after the average pooling operation is thus obtained.
where m is the number of words in the aspect. In general, we regard the average pooling result ha_ave ∈ R^{d_dim} as a single entity so as to attend to the influence of the aspect on the other words. Therefore, we adopt a score function to denote this importance, which can be expressed by the following equation.
where the score is computed from each word's representation in the text together with a weight matrix and a bias. Then we adopt the SoftMax function to calculate the normalized weights of the 1-hop attention mechanism.
After the 1-hop attention mechanism, we obtain the context's representation considering the influence of the aspect, so the context's representation can be expressed as follows.
After the word embeddings fused with the contextual semantics are obtained, the cosine similarity between the specific aspect and every word in the same contextual semantic space can be calculated to evaluate the influence of every word on the aspect. The correlations are thereby obtained.
Consequently, the final context representation is deduced by the following equation.
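The whole 1-hop attention stage, average pooling, scoring, SoftMax normalization, weighting and cosine similarity, can be sketched as below. The additive tanh score over the concatenation [word; aspect] is our assumption about the (unstated) score function's form, and all weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_hop_attention(hc, ha, W, b):
    """hc: (n, d) context from the PCT; ha: (m, d) aspect embeddings."""
    ha_ave = ha.mean(axis=0)                        # average-pooled aspect vector
    # Hypothetical additive score: tanh over [word; aspect] with weights W, b.
    paired = np.concatenate([hc, np.tile(ha_ave, (len(hc), 1))], axis=1)
    scores = np.tanh(paired @ W + b)
    alpha = softmax(scores)                         # normalized 1-hop weights
    weighted = alpha[:, None] * hc                  # aspect-aware word vectors
    # Cosine similarity between the aspect and every weighted word.
    sim = weighted @ ha_ave / (
        np.linalg.norm(weighted, axis=1) * np.linalg.norm(ha_ave) + 1e-9)
    return (sim[:, None] * weighted).sum(axis=0)    # final context representation

d = 4
hc = rng.standard_normal((5, d))
ha = rng.standard_normal((2, d))
W = rng.standard_normal(2 * d) * 0.1
b = 0.0
rep = one_hop_attention(hc, ha, W, b)
```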

Traditional SoftMax function
Generally, the SoftMax function is used to predict the sentiment polarity of the context, so the probability distribution is given by the following formula.
where y is a 3-dimensional vector whose dimensions indicate the different sentiments, namely positive, neutral and negative; the predictive layer has its own weights and bias.
Moreover, the probability about the polarity of context is calculated by the followings.
in which each value denotes the probability of belonging to one of the three kinds of sentiment polarity, computed from the corresponding element of y.
Lastly, the cross-entropy function is utilized to evaluate the predictive performance, as shown in the following equation, where a coefficient weights the regularization term, which is computed over the model's parameters.
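The SoftMax prediction and the regularized cross-entropy loss described above amount to the following sketch (the regularization coefficient `lam` and the random weights are illustrative values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(rep, W, b):
    """Linear predictive layer over the context representation; the 3 outputs
    correspond to positive, neutral and negative, normalized by SoftMax."""
    return softmax(W @ rep + b)

def loss(probs, target, params, lam=1e-3):
    """Cross-entropy on the true class plus an L2 regulation term over params."""
    ce = -np.log(probs[target] + 1e-12)
    return ce + lam * sum(np.sum(p ** 2) for p in params)

rng = np.random.default_rng(3)
W, b = rng.standard_normal((3, 4)), np.zeros(3)
rep = rng.standard_normal(4)
p = predict(rep, W, b)
```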

Adaptive SVM Classifier
We adopt the first-stage structure to acquire the final context representations and gather these representations with their actual sentiment labels as the classifier's dataset. Given that SVM does not balance the inter-class and intra-class characteristics, we design an adaptive mechanism. In detail, we fuse the Fisher regulation into the classification decision function, whose benefit lies in balancing the intra-cluster and inter-cluster distances of the samples' features, to achieve better classification performance.
The classification decision function can be expressed as follows.
where Q is the number of contexts in the whole dataset; y*_i is the sentiment label of the i-th context's representation; K(h, hc^c) is the kernel function, for which the Gaussian function is selected here; a*_i denotes the constant coefficient of the support vector; and b* is the bias.
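The symbol list above pins down the pieces of the decision function; assuming the standard kernel-SVM form, it can be restored as

```latex
f(hc^{c}) \;=\; \operatorname{sign}\!\Big(\sum_{i=1}^{Q} a^{*}_{i}\, y^{*}_{i}\, K\big(hc^{c}_{i},\, hc^{c}\big) \;+\; b^{*}\Big),
\qquad
K(h, h') \;=\; \exp\!\Big(-\tfrac{\lVert h - h' \rVert^{2}}{2\sigma^{2}}\Big)
```

where K is the Gaussian kernel named above; the exact parameterization of the kernel width σ is our assumption.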
Based on the Fisher method and the regulation of structural risk minimization, we add a middle term to the original objective function of SVM. This term minimizes the ratio of the intra-class divergence to the inter-class divergence. In other words, the method requires the contexts' representations to have the largest inter-class distance and the smallest intra-class distance in the semantic space. The separating hyperplane can be adjusted adaptively by continuously calculating these distances, producing a more accurate classification result.
The objective function of ASVM is defined as follows.
where w = Σ_{i=1}^{Q} a*_i y*_i hc^c_i and ||w||^2 is the L2 regularization term; σ is a constant; C is the penalty parameter; ξ_i ≥ 0 is the slack (elastic) variable; S_w is the sum of the intra-class divergences of the samples with the same sentiment polarity; and S_b is the sum of the inter-class divergences of the samples with different sentiment polarities.
In our model, by calculating the intra-class and inter-class differences, respectively, S_w and S_b can be obtained, as shown in the following formulas, where T_j is the set of samples of one sentiment polarity classified by ASVM; hc^c_i is the i-th context representation corresponding to sentiment polarity p_j (j = 1, 2, 3); μ_j is the mean vector of T_j; and k_j is the size of T_j.
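Under the Fisher criterion described above, the scatter sums can be sketched as follows; we assume the usual squared-Euclidean divergences about the class means μ_j and the overall mean, with random toy representations standing in for real classifier outputs.

```python
import numpy as np

def scatter_sums(samples_by_class):
    """Intra-class scatter S_w and inter-class scatter S_b for classified
    context representations, grouped by sentiment polarity."""
    mus = {j: T.mean(axis=0) for j, T in samples_by_class.items()}     # mu_j
    mu = np.concatenate(list(samples_by_class.values())).mean(axis=0)  # overall mean
    # S_w: spread of each class T_j around its own mean mu_j.
    S_w = sum(np.sum((T - mus[j]) ** 2) for j, T in samples_by_class.items())
    # S_b: spread of the class means around the overall mean, weighted by k_j.
    S_b = sum(len(T) * np.sum((mus[j] - mu) ** 2)
              for j, T in samples_by_class.items())
    return S_w, S_b

rng = np.random.default_rng(4)
# Three sentiment polarities p_1, p_2, p_3, each with five 4-d representations.
classes = {j: rng.standard_normal((5, 4)) + 3.0 * j for j in range(3)}
S_w, S_b = scatter_sums(classes)
```

Minimizing the ratio S_w / S_b inside the ASVM objective then favors hyperplanes that keep classes tight and far apart.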
In summary, the architecture of the SAN-ASVM model is shown in Fig. 3. We utilize the GloVe pre-trained model to acquire word embeddings; then, more semantic and sentimental information is obtained by the SAN module; finally, the ASVM classifier is designed to predict the sentiment polarity of the context.

Experiments
In this section, a set of experiments and ablation studies on sentiment analysis is conducted on SemEval-2014 Task 4, the Restaurant domain of SemEval-2016 and the Twitter dataset to investigate the effectiveness of the proposed method, evaluated by Accuracy (Acc) and Macro F1 (MF1).
We design three sets of experiments to evaluate our proposal: (1) ten groups of sentiment analysis experiments are conducted to validate the effectiveness of our model; (2) multi-hop attention mechanisms are employed to demonstrate that the choice of 1-hop attention is more rational than multi-hop attention; (3) a comparative experiment on the classification function of ASVM is designed to show the improvement in accuracy.

Dataset and evaluation
SemEval-2014 Task 4 is selected as an experimental dataset, mainly including the Restaurant and Laptop domains. The Restaurant domain of SemEval-2016 is also selected because it provides fine-grained aspect terms, whereas the Laptop dataset uses coarse-grained terms. We also employ the Twitter dataset. All texts are in English; the four domains are composed of users' comments, and each comment contains several sentences which may have one or more aspects with sentiment labels, namely positive, negative and neutral. The concrete statistics of the experimental data are shown in Table 1.
For the sentiment classification indicators Acc and MF1: Acc denotes the ratio of correct predictions to all samples, which can be deduced from the predictive results, while MF1 is derived from precision and recall, which can also be calculated from the predictive results. In order to compute the two indicators, the definitions are listed below.
According to these definitions, Acc can be expressed by the formula Acc = (TP + TN)/(TP + FP + FN + TN). Precision refers to the ratio of samples that are actually positive and predicted to be positive to all samples predicted as positive, whether actually positive or negative; it can be deduced by the equation precision = TP/(TP + FP). Recall is the ratio of correct positive predictions (TP) to all actually positive samples, i.e., recall = TP/(TP + FN). Therefore, MF1 is expressed as MF1 = 2 × precision × recall/(precision + recall), averaged over the classes.
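The formulas above translate directly into code; the toy label lists below are illustrative, and MF1 is computed as the macro average of the per-class F1 scores over the three polarities.

```python
def confusion_counts(y_true, y_pred, label):
    """TP, FP, FN, TN treating `label` as the positive class."""
    TP = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    FP = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    FN = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    TN = sum(t != label and p != label for t, p in zip(y_true, y_pred))
    return TP, FP, FN, TN

def accuracy(y_true, y_pred):
    # Acc = (TP + TN) / (TP + FP + FN + TN), i.e. correct / total.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for(y_true, y_pred, label):
    TP, FP, FN, _ = confusion_counts(y_true, y_pred, label)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred, labels=("pos", "neu", "neg")):
    # Macro F1: average the per-class F1 over the three polarities.
    return sum(f1_for(y_true, y_pred, l) for l in labels) / len(labels)

y_true = ["pos", "pos", "neu", "neg", "neg", "pos"]
y_pred = ["pos", "neu", "neu", "neg", "pos", "pos"]
```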

Baseline models
Ten groups of models for aspect-level sentiment analysis are chosen for comparison with our method.
Feature-based SVM: This model applied traditional feature engineering to extract the context features and then employed SVM for sentiment classification (Kiritchenko et al. 2014).
MemNet: This model adopted a Memory Network to calculate the importance of every word and then generated the context's representation to infer the sentiment polarity of the specific aspect.
TD-LSTM: Tang et al. employed Target-Dependent LSTM networks to model the left and right contexts, respectively, and then concatenated the hidden-state outputs of the left and right contexts to produce the final representation (Tang et al. 2016a, b).
ATAE-LSTM: This model, ATtention-based LSTM for AspEct-level sentiment classification, mainly considered the influence of the specific aspect on every word of the text and then calculated the sum of the weighted outputs to form the final representation (Wang et al. 2016).
MGAN: Multi-Grained Attention Network was proposed to capture the word-level interaction between aspect and context, especially an aspect alignment loss was designed to fully use the interactions between aspect and context (Fan et al. 2018).
AOA-LSTM: This model introduced the Attention-over-Attention mechanism on LSTM to model the mutual influences between aspects and context, which could capture the most important words of the context as influenced by the aspect.
TNet: This model utilized Transformation Networks with a CNN layer to generate context representations; in every CPT module, the representations of the target and the context were combined to generate the next word representation.
RAM: The Recurrent Attention Network, based on MemNet, introduced a bidirectional LSTM and a gated recurrent unit to produce the context representation.
CABASC: Content Attention Based Aspect based Sentiment Classification model is proposed, which could capture the important information about given aspects from a global perspective, and considered the order of words and their correlations (Liu et al. 2018a, b).
IAN: Interactive Attention Network was proposed, which analyzed the influences between aspects and context interactively by calculating attention values, to generate the representation of context (Ma et al. 2017).

Experimental setting
In this section, we pretrain word embeddings for the SemEval-2014 Task 4 datasets (Restaurant and Laptop), the Restaurant dataset of SemEval-2016 and the Twitter dataset. For these four datasets, Stanford CoreNLP (Manning et al. 2014) is applied as the tokenizer to split the context. We utilize the GloVe pre-trained model to obtain the word vectors; here we adopt the public pre-trained word embeddings of 300 dimensions. The batch size of the training data is 32. During training, we initialize all matrices by sampling from a uniform distribution on [-0.1, 0.1]. In order to learn the hyperparameters effectively, the Adam optimizer is adopted to backpropagate the errors and update the parameters, with a learning rate of 5e-5. To avoid overfitting, the dropout rate is 0.1.
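The initialization scheme above can be sketched as follows. Only the hyperparameter values come from the text; the matrix shape and function names are illustrative assumptions:

```python
import numpy as np

# Hyperparameters stated in the text.
EMBED_DIM = 300      # dimension of the public GloVe embeddings
BATCH_SIZE = 32
LEARNING_RATE = 5e-5  # for the Adam optimizer
DROPOUT = 0.1

rng = np.random.default_rng(0)

def init_matrix(rows, cols):
    """Sample a weight matrix from the uniform distribution on [-0.1, 0.1]."""
    return rng.uniform(-0.1, 0.1, size=(rows, cols))

W = init_matrix(EMBED_DIM, EMBED_DIM)  # shape is an illustrative assumption
print(W.shape, W.min() >= -0.1, W.max() <= 0.1)  # → (300, 300) True True
```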

Analysis of experimental results
The results of the compared experiments are shown in Table 2; the statistics of the datasets are given in Table 1. Feature-based SVM, built on feature engineering, adopted an SVM to make sentiment classification and exhibited a good classification performance. The MemNet model ignored the latent semantics of the embeddings, and its performance was worse than Feature-based SVM on the Restaurant and Laptop datasets. In addition, among the RNN-based baseline models, TD-LSTM simply modeled the embeddings of the left and right context and utilized the SoftMax function to predict the sentiment polarity, so its performance was not good, since it did not analyze the influence of the target on every word of the context. ATAE-LSTM just adopts the aspect and every word of the context as input of the LSTM to calculate the weights, which only considers sequential influences. IAN took a further step to highlight the importance of the aspect by learning the interactions between aspect and context in order to generate a concatenated representation; compared with ATAE-LSTM, it improved by 1.4% on the Restaurant14 dataset and by 3.4% on the Laptop14 dataset. RAM, TNet and CABASC all have better performances, but on the Twitter dataset RAM is worse than TNet and CABASC, which illustrates that the interactive mechanism employed by MGAN and AOA-LSTM could effectively capture the correlations in short and ungrammatical text.

Table 1 The statistics of the datasets

Dataset       | Positive (Train/Test) | Neutral (Train/Test) | Negative (Train/Test)
Restaurant14  | 2164/341              | 807/196              | 637/196
Laptop14      | 994/728               | 870/128              | 464/169
Restaurant16  | 1620/597              | 88/38                | 709/190
Twitter       | 1561/173              | 173/346              | 1560/173
In Table 3, the GloVe-MHSA/SM model consists of the Multi-Heads Self-Attention mechanism and the SoftMax function, and the GloVe-SAN/SM model is composed of the whole Self-Attention Networks and the SoftMax function. The last one, GloVe-SAN-ASVM, utilizes the Adaptive Support Vector Machine as a substitute for the SoftMax function.
Above all, we aim at two issues, the long-term dependencies in sequential models and the interactive influences between aspect and context; therefore, the GloVe-SAN-ASVM sentiment analysis model was designed. According to the compared results, the indicators Acc and MF1 are all improved to various degrees on the four datasets. Concretely, selecting GloVe-MHSA/SM as the baseline: this model adopted MHSA to calculate the correlations between each two words in the context, and compared with IAN, the results prove that the problem of long-term dependencies can be solved; moreover, compared with TD-LSTM and ATAE-LSTM, the improvement was remarkable. This demonstrates that the MHSA mechanism is more effective at capturing the long-term dependencies of the context than LSTM networks. Besides that, GloVe-SAN/SM addressed both issues: compared with GloVe-MHSA/SM, a 1.63% improvement occurs on the Restaurant14 dataset, 1.81% on the Laptop14 dataset, 1.27% on the Restaurant16 dataset and 1.29% on the Twitter dataset. In addition, on all four datasets the performance of GloVe-SAN/SM is superior to AOA-LSTM, RAM and CABASC, respectively; however, compared with AOA-LSTM and MGAN, the accuracy on the Restaurant14 dataset is lower, probably because these two models analyze the interactive influences between aspect and context by building an interaction matrix at a fine-grained level. This is in line with the main idea of our model, which further illustrates the effectiveness of our SAN method. Moreover, when adopting ASVM as a substitute for SoftMax to make the classification, the experimental results are shown in Table 3. In comparison with GloVe-SAN/SM and GloVe-MHSA/SM, the classification accuracy was improved to different degrees on each dataset when applying the GloVe-SAN-ASVM model, which comprehensively demonstrates that ASVM is effective in improving the classification accuracy.
Even though the performance on the first three datasets was superior to the other models, on the Twitter dataset the performance was not better than TNet-AS; the reason is probably that the SAN model is less suited to the characteristics (short and ungrammatical) of the Twitter dataset.

Ablation studies
In this section, ablation experiments were performed to comprehensively analyze the importance of the different mechanisms. Multiple experiments are designed to verify the effectiveness of every structure.

Analysis of SAN
In order to pay more attention to the vital words related to the aspect, we design the 1-hop attention mechanism to capture the influences of these key words. By comparing the performances of GloVe-MHSA/SM and GloVe-SAN/SM in Table 2, we validate the effectiveness of the 1-hop attention mechanism. To further study the influence of the number of hops, we design 1-hop and 2-hops mechanisms based on the SAN model and conduct the experiment.
As Table 4 shows, SAN with the 1-hop attention mechanism performs better than with 2-hops attention. In detail, SAN/1 h achieves better results, outperforming SAN/2 h by 2.35%, 6.67%, 14.06% and 3.17% on the Restaurant14, Laptop14, Restaurant16 and Twitter datasets, respectively, in terms of Acc. We think the reason is probably that we adopt the pooling operation on the aspect to calculate its influences on the context, which discards some useful information to some extent. Moreover, the variances of SAN/1 h are smaller and smoother than those of SAN/2 h, which indicates that SAN/1 h is more robust.
In Table 4 SAN/1 h means this model adopts SAN with 1-hop attention mechanism, and SAN/2 h includes 2-hops attention mechanism.
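The 1-hop step discussed above can be sketched as a single pooled-aspect attention pass over the context. This is a minimal illustration, not the paper's exact formulation: the mean pooling and dot-product scoring are assumptions.

```python
import numpy as np

def one_hop_attention(context, aspect):
    """Hedged sketch of a 1-hop aspect attention step.

    context: (n_words, d) hidden states of the context words
    aspect:  (n_aspect_words, d) hidden states of the aspect phrase
    Returns the attention weights over context words and the attended vector.
    """
    query = aspect.mean(axis=0)                       # pool the aspect into one query
    scores = context @ query                          # dot-product relevance scores
    scores -= scores.max()                            # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over context words
    attended = weights @ context                      # weighted sum of context states
    return weights, attended

rng = np.random.default_rng(1)
ctx = rng.normal(size=(6, 8))   # 6 context words, hidden size 8 (illustrative)
asp = rng.normal(size=(2, 8))   # 2-word aspect phrase
w, rep = one_hop_attention(ctx, asp)
print(w.shape, rep.shape, round(float(w.sum()), 6))  # → (6,) (8,) 1.0
```

A 2-hops variant would feed `rep` back as the query for a second scoring pass, which is where the additional information loss discussed above could compound.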

Analysis of ASVM
In addition, based on the previous experiments proving the effectiveness of the 1-hop attention mechanism, we further conduct compared experiments to validate the necessity of the ASVM classifier (Table 5).
Regarding the SAN/SM model with the SoftMax function as the baseline, we also apply the SAN model with a standard SVM classifier to realize the three-class sentiment classification; the results show that SAN-SVM exceeds SAN/SM by 0.31% on the Restaurant14 dataset, 0.24% on Laptop14, 0.48% on Restaurant16 and 0.28% on Twitter in terms of Acc, while MF1 also exhibits steady increases. Further, we develop ASVM by integrating Fisher regularization to achieve sentiment classification, and compared with the SAN/SM model a steady increase in performance occurs on all four datasets in terms of both Acc and MF1, which is a strong argument for the effectiveness of ASVM. Therefore, the Fisher regularization inserted into the SVM objective is very useful for sentiment classification.
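Since the exact ASVM formulation is not reproduced in this section, the following is only a hedged sketch of the idea: a linear SVM hinge loss augmented with a Fisher-style within-class scatter penalty that pulls the decision scores of same-class samples together, trained by subgradient descent on toy data. The function names, the precise penalty form and all hyperparameters are assumptions, not the paper's implementation.

```python
import numpy as np

def train_svm_fisher(X, y, lam=0.01, mu=0.1, lr=0.05, epochs=200, seed=0):
    """Sketch: binary linear SVM (labels +1/-1) with a Fisher-style
    within-class scatter penalty on the decision scores f(x) = Xw + b."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        # Subgradient of the hinge loss + L2 margin term.
        active = y * (X @ w + b) < 1
        gw = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        gb = -y[active].sum() / n
        # Fisher-style penalty: within-class variance of the scores.
        f = X @ w + b
        for c in (-1, 1):
            dev = f[y == c] - f[y == c].mean()
            Xc = X[y == c] - X[y == c].mean(axis=0)
            gw += mu * 2 * (dev[:, None] * Xc).sum(axis=0) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy separable data: two Gaussian clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)
w, b = train_svm_fisher(X, y)
pred = np.sign(X @ w + b)
print(float((pred == y).mean()))
```

A multi-class version, as needed for the three sentiment polarities, would combine several such binary classifiers (e.g., one-vs-rest).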

Case study
In order to further understand the effectiveness of the 1-hop and 2-hops attention mechanisms, we pick some examples to visualize the attention weights. We design an ablation analysis on SAN with the 1-hop or 2-hops attention mechanism to demonstrate the effectiveness. We select one context from the test dataset, "Everything is so easy to use, Mac software is just so much simpler than Microsoft software," and analyze which word contributes the biggest influence on the aspects' sentiment polarities. We visualize the two groups of weights obtained by SAN with the 1-hop or 2-hops attention mechanism; the compared results are shown in Fig. 4, in which the color depth indicates the different importance of each word in the context. The first aspect is "use," whose weight distribution is depicted in the first two examples, and the second is "Mac software," whose weight distribution is drawn in the last two examples. As the figure shows, SAN with the 1-hop attention mechanism captures the key words "easy" and "simpler" more clearly and assigns relatively small weights to adverbs such as "so" and "so much," compared with 2-hops attention. Above all, we can conclude that SAN with the 1-hop attention mechanism can greatly strengthen the ability to extract the important sentimental words for aspect-specific sentiment classification.

Conclusions and future work
The goal of this paper was to develop an effective aspect-level sentiment analysis method. This goal has been achieved by proposing a novel two-stage architecture, namely GloVe-SAN-ASVM, for sentiment classification, which can effectively extract the context features related to the aspect-specific, while the classification performance is further improved by ASVM. The effectiveness of the proposed two-stage model was evaluated by comparing it with other methods in terms of Acc and MF1. The experimental results showed that the proposed method achieves higher classification accuracy than the other models on almost all datasets. This is because the MHSA mechanism can effectively mine the latent semantic dependencies between each two words of the context; moreover, the 1-hop attention mechanism can also excavate the semantic correlations between the aspect-specific and the vital words of the context. By combining these two mechanisms, more effective feature representations can be learned; besides that, the ASVM classifier designed in the second stage improves the classification performance. Finally, the compared experiments validate the proposal, which successfully boosts the Acc and MF1 of the GloVe-SAN-ASVM model on aspect-specific sentiment classification and achieves good results.
However, existing datasets are labeled by humans, which is limited and inefficient; therefore, in future work we aim to research semi-supervised networks and few-shot learning models combined with prior language knowledge for sentiment analysis problems.

Data Availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest We confirm that there are no known conflicts of interest associated with this publication.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

Fig. 4 Examples of attention distribution for sentiment classification on aspect-specific. The color bar denotes the sentimental importance of key words