SGPT: Semantic graph-based pre-training for aspect-based sentiment analysis

Previous studies show the effectiveness of pre-trained language models for sentiment analysis. However, most of these studies ignore the importance of sentiment information for pre-trained models. We therefore investigate sentiment information for pre-trained models and enhance pre-trained language models with semantic graphs for sentiment analysis. In particular, we introduce Semantic Graphs based Pre-training (SGPT), which uses semantic graphs to obtain synonym knowledge for aspect-sentiment pairs and similar aspect/sentiment terms. We then optimize the pre-trained language model with the semantic graphs. Empirical studies on several downstream tasks show that the proposed model outperforms strong pre-trained baselines, and also demonstrate the effectiveness of the proposed semantic graphs for the pre-trained model.


Introduction
Pre-trained language models learn contextualized word representations on large-scale text corpora through self-supervised learning; fine-tuned on downstream tasks, they obtain state-of-the-art (SOTA) performance and have made a tremendous impact in many NLP fields such as reading comprehension (Lai et al., 2017), question answering (Rajpurkar et al., 2016) and sentiment analysis (Zhang et al., 2018). Leveraging pre-trained language models has achieved promising results in sentiment analysis tasks, including aspect-level sentiment classification (Zeng et al., 2019a) and sentence-level sentiment classification (Zhang et al., 2018). These pre-trained models have shown their power in learning general semantic representations from large-scale unlabelled corpora via well-designed pre-training tasks.
Sentiment analysis involves a wide range of specific tasks (Liu, 2012), such as sentence-level sentiment classification, aspect-level sentiment classification, aspect term extraction, and so on. Sentiment analysis tasks usually depend on different types of sentiment knowledge, including sentiment words, word polarity, and aspect-sentiment pairs (Tian et al., 2020). Recently, knowledge has been shown to be very important for enhancing language representations, as in SentiLARE (Ke et al., 2020) and SKEP (Tian et al., 2020).
Sentiment analysis, especially on large-scale reviews, is still very challenging, since it is hard to capture the aspect terms and sentiment words in the review text. As shown in the example below, since there is more than one aspect in the sentence, a traditional pre-trained model cannot capture the sentiment information. In addition, the masked language model of a traditional pre-trained model ignores contiguous aspect-opinion phrases (e.g., great color). However, based on semantic graphs that encode the correlations between aspects and opinions, it is easy for the proposed model to capture sentiment information and solve the contiguous phrase masking problem.
The cloth is overall good, with great color, but bad material.
To address the above challenges, we develop a semantic graph-based pre-training model that employs semantic graphs with sentiment knowledge for the pre-trained model. In particular, we explore similar aspect and sentiment words and build a graph of similarity relations and aspect-sentiment pairs. We then employ aspect-sentiment pairs to construct the semantic graph. Thirdly, we feed the semantic graph into the pre-trained language model with sentiment masking. Finally, we jointly optimize the aspect-sentiment pair prediction objective and the masked language model. Empirical studies on several downstream tasks show that the proposed model outperforms strong pre-trained baselines, and also show the effectiveness of the proposed semantic graphs. In summary, our contributions are as follows: (1) We employ semantic graphs with aspect and sentiment terms to enhance pre-trained language models. The results outperform state-of-the-art models on several sentiment analysis tasks on both Chinese and English datasets.
(2) Our method significantly outperforms the strong pre-training method RoBERTa (Liu et al., 2019) on three typical sentiment tasks, achieving much better results on all datasets.

Overview of Proposed Model
In this study, we propose a semantic graph-based pre-trained model that constructs semantic graphs from aspect terms and sentiment words for the pre-trained language model. As shown in Figure 1, we first construct a semantic graph with pair-wise aspect-sentiment relations and similarity relations among aspect and sentiment terms. We then pre-train a language model to learn sentiment knowledge from the semantic graphs with three tasks: sentiment masking prediction, aspect-sentiment pair prediction and node similarity. Finally, we fine-tune the pre-trained model on three sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification and aspect/sentiment term extraction.

Semantic Graphs based Pre-training
As shown in Figure 2, we employ semantic graphs to capture sentiment knowledge for the pre-trained language model. Our paradigm contains three sentiment pre-training objectives: sentiment word masking prediction L_sw, aspect-sentiment pair prediction L_ap and aspect-based similarity score L_ns.
Given an input sentence, sentiment masking prediction attempts to recover the sentiment words masked based on the semantic graph. Aspect-sentiment pair similarity aims to calculate the matching rate of a sentiment description on an aspect along with some related words sampled from the semantic graph. We extend the related words to the aspect-based similarity score and further learn synonym knowledge from another perspective.
These three tasks are jointly learned to continue pre-training the language model:

L = L_sw + L_ap + L_ns (1)

Semantic Graphs Construction
We construct a semantic graph from large-scale unlabeled data. In particular, we extract aspect words, sentiment words, and aspect-sentiment pairs from the unlabeled data. Our work is mainly based on automatic methods along with a slight manual review. The result is a heterogeneous graph with aspect words and sentiment words as different node types. Two nodes are connected if they are semantically or literally similar, and each aspect-sentiment word pair is also connected. Our method aims to integrate this knowledge into the pre-trained language model.

Sentiment Word Prediction
Inspired by BERT (Devlin et al., 2018), which randomly replaces 15% of words with [MASK] and learns to recover them, we attempt to recover masked sentiment words so as to pay more attention to sentiment descriptions. For sentiment word prediction, each token in a masked sentence X is fed into RoBERTa to get a vector representation x_i, which is then normalized with a softmax layer to produce a probability vector ŷ_i over the entire vocabulary. In this way, the sentiment word prediction objective L_sw maximizes the probability of the original sentiment word x_i as follows:

L_sw = - Σ_i m_i · y_i log ŷ_i (2)

Here, W ∈ R^{d×v} and b ∈ R^{d×1} are the trainable parameters of the prediction layer, m_i = 1 when x_i is masked and 0 otherwise, and y_i is the one-hot representation of the original token x_i.
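The masked-word objective above can be sketched numerically. This is an illustrative toy version, not the paper's implementation: the logit rows stand in for W·x_i + b, the vocabulary has only three entries, and only positions with m_i = 1 contribute to L_sw.

```python
import math

def softmax(logits):
    # numerically stable softmax over one row of vocabulary logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sentiment_mask_loss(logit_rows, target_ids, mask):
    """L_sw: cross-entropy accumulated only at masked positions (m_i = 1)."""
    total = 0.0
    for logits, tgt, m in zip(logit_rows, target_ids, mask):
        if m:  # m_i selects masked sentiment words
            total += -math.log(softmax(logits)[tgt])
    return total

# Two tokens, but only the first is masked, so only it contributes.
loss = sentiment_mask_loss([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
                           [0, 1], [1, 0])
```
Because the second position has m_i = 0, its (perfectly predicted) logits are ignored, mirroring how unmasked tokens drop out of the sum in Equation (2).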

Aspect-Sentiment Pair Prediction for Pre-Training
We propose a new pre-training task to build the dependency between aspect and sentiment terms.
Compared with predicting pairs directly, we calculate a pair prediction over the aspect-sentiment pairs enhanced by their similar words. The prediction is a value between 0 and 1 representing the probability that a pair exists. For an aspect-sentiment pair, we extract a similar-aspect word set SA and a similar-sentiment word set SS from the semantic graph with Algorithm 1. We concatenate SA and SS to construct the input sequence:

X_pair = [CLS] SA [SEP] SS [SEP]

where [CLS] denotes the entire sequence representation and [SEP] is a separator between the two sequences.
After encoding each element with RoBERTa, we use u_cls, the embedding of [CLS], to calculate the pair prediction.
We expect every true aspect-sentiment pair to score 1. Thus, the pair prediction objective is a binary cross-entropy:

L_ap = - Σ_i [ p_i log p̂_i + (1 - p_i) log(1 - p̂_i) ]

where p̂_i is the pair probability computed from u_cls, and p_i equals 1 when the input sequence is a pair and 0 otherwise. The relation between aspect words and sentiment words is established in this way.
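A minimal numeric sketch of L_ap, under the assumption (not spelled out in the text) that p̂_i is obtained by squashing a scalar score from u_cls with a sigmoid:

```python
import math

def pair_prediction_loss(cls_scores, labels):
    """Binary cross-entropy over pair predictions.

    cls_scores: raw scalar scores, one per candidate pair (stand-ins for a
    linear projection of u_cls); labels: p_i = 1 for a true pair, 0 otherwise.
    """
    total = 0.0
    for s, p in zip(cls_scores, labels):
        q = 1.0 / (1.0 + math.exp(-s))  # p-hat via sigmoid (assumed)
        total += -(p * math.log(q) + (1 - p) * math.log(1 - q))
    return total / len(cls_scores)

# An uncertain score (0.0 -> p-hat = 0.5) on a true pair gives loss log 2;
# a confident score on a true pair gives a much smaller loss.
uncertain = pair_prediction_loss([0.0], [1])
confident = pair_prediction_loss([4.0], [1])
```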

Node Similarity for Pre-training
Semantic graph node similarity aims to capture the model's sensitivity to aspect/sentiment synonyms.
In particular, we sample synonyms from the semantic graph with the similar-node sampling procedure detailed in Algorithm 1 and obtain SA and SS. All aspect synonyms are fed into RoBERTa to get a representation U_SA, and likewise the sentiment synonyms give a representation U_SS. As contrastive learning has shown great success in many areas, especially in unsupervised methods, we apply contrastive learning to capture these synonym relations. The core idea of contrastive learning is to shorten the distance between positive samples and widen the distance between negative samples, which perfectly meets our requirements.
score(f(x), f(x⁺)) ≫ score(f(x), f(x⁻)) (7)

We want to cluster similar aspect words together, so all synonyms are positive samples and randomly sampled words are negative samples. We use a cosine function to measure the distance between samples and obtain the loss L_ns as follows:

L_ns = - log [ exp(cos(u, u⁺)) / ( exp(cos(u, u⁺)) + Σ_j exp(cos(u, u⁻_j)) ) ] (9)
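The contrastive objective can be sketched as follows. The text only states that a cosine function measures the distance between samples, so the InfoNCE-style form and the temperature parameter tau here are assumptions for illustration:

```python
import math

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def node_similarity_loss(anchor, positive, negatives, tau=1.0):
    """InfoNCE-style L_ns: pull the synonym (positive) toward the anchor
    and push randomly sampled words (negatives) away."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# A negative orthogonal to the anchor hurts less than one identical to it.
far_negative = node_similarity_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
near_negative = node_similarity_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
```
The ordering of the two losses reflects exactly the inequality in (7): a well-separated negative yields a lower loss.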

Sentiment Analysis with Pre-training Models
We verify the effectiveness of our language model on three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-based sentiment classification, and aspect/sentiment term extraction. We fine-tune some strong models on the same language model as baselines to evaluate the improvement.

Sentence-level Sentiment Classification
This task aims to classify the sentiment polarity of an input sentence. We use the final state vector of the classification token [CLS] as the overall representation of the input sentence, and a classification layer is added on top of the transformer encoder to calculate the sentiment probability.
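The classification head amounts to a linear layer plus softmax over the [CLS] vector. A toy sketch with hand-picked weights (the real W, b are learned during fine-tuning):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(cls_vec, W, b):
    """Linear layer + softmax over the final [CLS] representation;
    returns (predicted class index, class probabilities)."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + bi
              for row, bi in zip(W, b)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs

# 2-dim "[CLS]" vector, 2 classes (0 = positive, 1 = negative, for example).
label, probs = classify([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```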

Aspect-based Sentiment Classification
The purpose of this task is to analyze the fine-grained sentiment polarity of an aspect within a given context. Thus, the input has two parts: the contextual text and the aspect description. We combine these two parts with a separator [SEP] and feed them into the language model. The final state of [CLS] is also utilized as the representation for classification.
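The two-part input described above can be assembled like this (a trivial sketch at the token level; real tokenization is handled by the RoBERTa tokenizer):

```python
def build_absa_input(context_tokens, aspect_tokens):
    """[CLS] context [SEP] aspect [SEP] — the two-part input for
    aspect-based sentiment classification."""
    return ["[CLS]"] + context_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]

seq = build_absa_input(["great", "color"], ["color"])
```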

Aspect and Sentiment Term Extraction
This task extracts all aspect descriptions and sentiment statements. As in the other tasks, all tokens are fed into the language model to get representations. A CRF layer is then added on top of each token to predict whether it belongs to an aspect or sentiment term.
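Once the CRF has produced per-token labels, terms are read off as labeled spans. The CRF itself is omitted here; the BIO tag names (B-ASP/I-ASP for aspects, B-SEN/I-SEN for sentiments) are an assumed scheme, not taken from the paper:

```python
def extract_terms(tokens, tags):
    """Collect (type, phrase) spans from BIO-tagged tokens."""
    spans, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new span starts
            if cur:
                spans.append((cur_type, " ".join(cur)))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur.append(tok)               # span continues
        else:                             # O tag or inconsistent I- tag
            if cur:
                spans.append((cur_type, " ".join(cur)))
            cur, cur_type = [], None
    if cur:
        spans.append((cur_type, " ".join(cur)))
    return spans

terms = extract_terms(
    ["great", "color", "bad", "material"],
    ["B-SEN", "B-ASP", "B-SEN", "B-ASP"])
```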


Experimentation
In this section, we introduce our training/evaluation datasets and experimental settings. We then report the experimental results from different perspectives and analyze the effectiveness compared with different baseline models.

Data Collections
We mainly develop with Chinese datasets and also pre-train an English version to evaluate the effectiveness on public datasets. The Chinese data comes from product reviews on TaoBao.com, one of the top online shopping platforms. The Chinese pre-training dataset includes over 167 million sentences, and we evaluate on two domains, i.e., Furniture (Furn) and Kitchen (Kith). All three Chinese tasks are evaluated with the Macro-F1 score. The statistics of the Chinese evaluation datasets are shown in Table 1 and Table 2. We split all data 7:1:2 into train/valid/test sets.
The English dataset for pre-training is amazon-2 (Zhang et al., 2015); 3.2 million sentences of the original training data are reserved for development. We evaluate the performance of the English model on a variety of English sentiment analysis datasets. Table 3 summarizes the statistics of the English datasets used in the evaluations. Different tasks are evaluated on different datasets: (1) For sentence-level sentiment classification, the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and amazon-2 (Zhang et al., 2015) are used, and performance is evaluated in terms of accuracy. (2) Aspect-based sentiment classification is evaluated on SemEval 2014 Task 4 (Maria et al., 2014), which contains both a restaurant domain and a laptop domain; accuracy is evaluated separately on each. (3) For the extraction task, the MPQA 2.0 dataset (Wiebe et al., 2005; Wilson, 2008) is used, which aims to extract the aspects or the holders of the sentiments. We measure with the method in SRL4ORL (Marasovic and Frank, 2017), which is released and available online.

Data Pre-Processing
To build the semantic graph, we extract the following information. Aspect/Sentiment Term Extraction: aspect and sentiment descriptions are extracted by a BERT-CRF model, which is trained on labeled datasets and reaches an F1 score of 85.
Aspect-Sentiment Pair Extraction: we match aspect-sentiment pairs with simple constraints. An aspect-sentiment pair refers to the mention of an aspect and its corresponding sentiment words. Thus, a sentiment word and its nearest aspect have a high probability of forming a pair. More specifically, we require that an aspect-sentiment pair be contained in one sentence, and only one-to-one pairs are considered.
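A toy version of this nearest-aspect heuristic, assuming terms come with token positions (the exact matching rules beyond "same sentence, one-to-one, nearest" are not specified in the text):

```python
def match_pairs(aspects, sentiments):
    """Pair each sentiment word with its nearest unused aspect in the same
    sentence (one-to-one). Each item is a (term, token_position) tuple."""
    pairs, used = [], set()
    for s_term, s_pos in sentiments:
        candidates = [(abs(s_pos - a_pos), a_term, a_pos)
                      for a_term, a_pos in aspects if a_pos not in used]
        if not candidates:
            break
        _, a_term, a_pos = min(candidates)  # smallest token distance wins
        pairs.append((a_term, s_term))
        used.add(a_pos)
    return pairs

# "The cloth is overall good, with great color, but bad material."
aspects = [("cloth", 1), ("color", 8), ("material", 12)]
sentiments = [("good", 4), ("great", 7), ("bad", 11)]
pairs = match_pairs(aspects, sentiments)
```
On the paper's own example sentence this recovers the three intended pairs.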
Similar Words Extraction: as similar words can be seen as words of the same category, we employ the DBSCAN clustering algorithm to get coarse-grained synonyms, representing each word by the average pooling of its word pieces' Word2Vec embeddings. A recycling mechanism is applied to further split big clusters, which contain many irrelevant words, into small clusters by grid-searching different parameters. Finally, we review all similar words manually to get more accurate synonyms.
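For illustration, a minimal self-contained DBSCAN over Euclidean distance (the paper presumably uses a library implementation such as scikit-learn's; the 2-D points here stand in for averaged Word2Vec embeddings):

```python
def region(points, i, eps):
    # indices of all points within eps of points[i] (including i itself)
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

def dbscan(points, eps=0.5, min_pts=2):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)   # None = unvisited
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = region(points, i, eps)
        if len(neigh) < min_pts:
            labels[i] = -1          # too sparse: noise (may be rescued later)
            continue
        labels[i] = cid
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = region(points, j, eps)
            if len(jn) >= min_pts:
                queue.extend(jn)    # core point: keep expanding the cluster
        cid += 1
    return labels

# Two tight groups of "synonym" embeddings plus one isolated word.
labels = dbscan([(0, 0), (0, 0.2), (5, 5), (5, 5.2), (10, 10)])
```
Words sharing a cluster id would be treated as coarse-grained synonyms; the isolated point is marked as noise and excluded.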

Experiment Setting
We use RoBERTa as the base language model and continue pre-training it with our paradigm. We concatenate different sentences up to a sequence length of 512 and train with batch size 32. The Adam optimizer is applied with learning rate 1e-5 and warmup ratio 0.1, without weight decay. For sentiment masking, we mask sentiment words as far as possible, up to 20% of the tokens, and extract at most two aspect-sentiment pairs. At the fine-tuning stage for downstream tasks, we take the same parameters as SKEP (Tian et al., 2020).
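The "sentiment words first, up to 20%" masking rule can be sketched as below. This is a simplified illustration: it masks whole tokens greedily in order, whereas the real preprocessing works on subword pieces and samples positions.

```python
def sentiment_masking(tokens, sentiment_words, ratio=0.2):
    """Mask sentiment words first, up to `ratio` of the sequence length
    (20% in the setting above); remaining tokens are left untouched."""
    budget = max(1, int(len(tokens) * ratio))
    out, masked = [], 0
    for tok in tokens:
        if tok in sentiment_words and masked < budget:
            out.append("[MASK]")
            masked += 1
        else:
            out.append(tok)
    return out

toks = ["the", "cloth", "is", "good", "with", "great", "color"]
masked = sentiment_masking(toks, {"good", "great"})
```
With seven tokens the budget is one mask, so only the first sentiment word is replaced; "great" survives because the 20% budget is exhausted.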

Comparison with Baselines
We compare SGPT with two strong pre-training baselines: RoBERTa and a similar continued pre-training method, SKEP. The results on Chinese datasets are shown in Table 5. For classification tasks, there is no obvious improvement from RoBERTa to SKEP, while SGPT outperforms SKEP by 1.6 and 1.8 points at the aspect level, and by 1.0 and 1.3 points at the sentence level. Meanwhile, both pair prediction and node similarity perform better than SKEP. In more detail, SGPT mitigates the class imbalance problem, with an average 3-point improvement on negative samples. For the extraction task, the effectiveness of sentiment continued pre-training is also significant: SKEP has a 1-2 point improvement over RoBERTa, and SGPT a further 2-3 point improvement over SKEP.
Meanwhile, the English experiments are shown in Table 6. SGPT outperforms the other methods on all three tasks, as on the Chinese datasets, indicating that SGPT generalizes across languages and tasks.

Influence of Different Factors
We then analyze the influence of different factors, evaluating pair prediction and node similarity separately, as reported in Table 5.
Effectiveness of Aspect-Sentiment Pair Prediction For aspect-based sentiment analysis, we propose aspect-sentiment pair prediction to build the dependency between aspect words and sentiment words. The results reflect that continuing to pre-train the language model with pair prediction successfully absorbs pair information and adapts well to downstream tasks related to aspects or sentiment.
Effectiveness of Node Similarity Node similarity aims to learn synonym knowledge from the semantic graphs. Compared with aspect-sentiment pair prediction, the experimental results show that it outperforms pair prediction on all tasks except aspect term extraction. We think node similarity is an even more powerful mechanism for gaining sentiment information, and contrastive learning plays a great role in this.
Finally, we combine the two factors into one model. The results show that our composite method benefits from their different advantages and performs better than either of them separately.

Influence of Training Size
To verify how well SGPT solves the unbalanced-label and few-shot problems, we design an experiment that increases the data scale from 10% to 100% and compares the effectiveness of SGPT against direct fine-tuning. As shown in Figure 3, SGPT performs well even on 10% of the training data, with a more than 30-point improvement over the base model. The base model approaches SGPT as the training fraction grows to 60%, but a clear gap with SGPT remains.

Case Study
We give two aspect-based sentiment analysis examples in Table 7 and illustrate situations that RoBERTa or SKEP cannot solve but SGPT overcomes.
The first example concerns an unconventional expression. In most reviews, "cost" usually implies needing more, and SKEP makes its decision from this inertia. SGPT, in contrast, takes into consideration more words linked in the semantic graph and recognizes that "costs less" is a positive expression.
The second example has a fine-grained sentiment polarity contrary to the sentence-level polarity. The pair prediction task in SKEP cannot identify every aspect-sentiment pair when the sentiment descriptions are dense, so it assesses "price" from a holistic perspective. SGPT benefits from pair prediction without having to predict all pairs, and recognizes the real sentiment polarity.
Related Work

Sentiment Analysis with Knowledge
Various types of sentiment knowledge, including sentiment words, aspect-sentiment pairs and prior sentiment polarity from SentiWordNet (Ke et al., 2020), have been proved useful for a wide range of sentiment analysis tasks. Sentiment words with their polarity are widely used for sentiment analysis, including sentence-level sentiment classification (Taboada et al., 2011b; Shin et al., 2017; Zhang et al., 2018), aspect-level sentiment classification (Vo and Zhang, 2015; Zeng et al., 2019a), sentiment extraction (Li and Lam, 2017), emotion analysis (Gui et al., 2017; Fan et al., 2019) and so on. Lexicon-based methods (Turney, 2002; Taboada et al., 2011a) directly utilize the polarity of sentiment words for classification. Traditional feature-based approaches encode sentiment word information in manually designed features to improve supervised models (Bakshi et al., 2016; Agarwal et al., 2011). In contrast, deep learning approaches enhance embedding representations with the help of sentiment words (Shin et al., 2016), or absorb sentiment knowledge through linguistic regularization (Qian et al., 2016). Aspect-sentiment pair knowledge is also useful for aspect-level classification and sentiment extraction. Previous works often provide weak supervision with this type of knowledge, either for aspect-level classification (Zeng et al., 2019b) or for sentiment extraction (Yang et al., 2017; Ding et al., 2017). One related study is SKEP (Tian et al., 2020), which utilizes sentiment knowledge to embed sentiment information at the word, polarity and aspect levels into pre-trained sentiment representations. However, it is hard for SKEP to recover a masked aspect-sentiment pair, because aspect and sentiment words appear contiguously (with 85% probability in product reviews on TaoBao.com).
Most of these works adopt a "BERT + entity linking" paradigm, which is not suitable for e-commerce product reviews due to the lack of quality entity linkers and knowledge graphs in this domain.
SKEP conducts aspect-sentiment pair masking, sentiment word masking and common token masking, and utilizes three sentiment knowledge prediction objectives: sentiment word prediction, word polarity prediction and aspect-sentiment pair prediction, where aspect-sentiment pairs are converted into a multi-label classification. However, aspect and sentiment words appear contiguously in product reviews, which makes them difficult to predict when the pair is masked. Our work differs from SKEP in that we develop a novel pre-training paradigm that leverages semantic graphs to incorporate sentiment knowledge into pre-training; a detailed comparison between our model and SKEP can be found in §3.

Conclusion
In this study, we design three pre-training tasks to continue pre-training a language model for downstream sentiment analysis tasks. We propose a semi-automatic method to build a semantic graph and employ the language model to adopt the graph knowledge through these tasks: sentiment word masking to pay more attention to sentiment terms, aspect-sentiment pair prediction to build the dependency between aspects and sentiments, and node similarity to learn synonym knowledge. All of them yield clear improvements across different downstream tasks and languages.
In the future, we will apply SGPT to more sentiment analysis tasks to further examine its generalization, and we are also interested in exploring more efficient methods of building semantic knowledge.

Figure 1 :
Figure 1: Overview of the proposed model.

Figure 2 :
Figure 2: The framework of our pre-training paradigm.

Figure 3 :
Figure 3: Results under different training data scales.
Algorithm 1: Similar Nodes Sampling. Input: graph G, initial node h, max sampling depth K, max sampling length L, word frequency table T. Output: sampled similar nodes C_h. The algorithm collects the candidate node sets S_0, S_1, ..., S_K up to depth K and merges them into Ĉ_h = S_0 ∩ S_1 ∩ ... ∩ S_K; it then sorts the nodes in Ĉ_h by frequency according to table T in incremental order, chooses the top-ranked nodes up to length L and appends them to C_h, and finally returns C_h.
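A sketch of this sampling in Python. Because the extracted algorithm box is partly garbled, two details are assumptions: the per-depth sets S_0..S_K are merged here with a union (a breadth-first walk), and "incremental order" is read as ascending frequency:

```python
def sample_similar_nodes(graph, h, K, L, freq):
    """Gather neighbours of h up to depth K, sort candidates by corpus
    frequency in increasing order, and keep at most L of them."""
    frontier, seen = {h}, {h}
    for _ in range(K):  # expand one hop per iteration, up to depth K
        frontier = {n for u in frontier for n in graph.get(u, ())} - seen
        seen |= frontier
    candidates = sorted(seen - {h}, key=lambda n: freq.get(n, 0))
    return candidates[:L]

# Tiny similarity graph around the sentiment word "good" (hypothetical data).
graph = {"good": ["nice", "great"], "nice": ["fine"]}
freq = {"nice": 5, "great": 2, "fine": 1}
sampled = sample_similar_nodes(graph, "good", K=2, L=2, freq=freq)
```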

Table 1 :
Statistics of Chinese ABSA Evaluation Datasets. POS and NEG refer to positive and negative polarity. There is an obvious imbalance, more than 10:1, between positive and negative samples.

Table 2 :
Statistics of Chinese Evaluation Datasets on Extraction and Sentence-Level Classification. The extraction counts aggregate aspect words and sentiment words.

Table 3 :
Statistics of English Evaluation Datasets. Sem-R and Sem-L refer to the restaurant and laptop parts of SemEval 2014 Task 4.

Table 5 :
Results of Chinese Evaluation on Extraction and Sentence-Level Classification. BERT and RoBERTa are fine-tuned directly on the downstream tasks. Pair prediction refers to our pre-training method with sentiment masking and pair prediction; node similarity means sentiment masking with node similarity; Ours is the complete method we propose. M-F1 abbreviates the macro-F1 score.

Table 6 :
Results of English Evaluation

Table 7 :
Case Study. Italicized words are aspect terms and bold words are sentiment descriptions.