3.1 Framework of method
In this section, we propose a learnable dependency-enhanced deep learning framework that learns and fuses dependency-based features for ASC. Instead of directly using a dependency tree generated by a syntax parsing tool such as spaCy, our framework learns dependency-based features to improve ASC performance. Because a dependency tree generation tool may not capture the precise relationships in a sentence, it can introduce considerable noisy information that consequently hinders the improvement of ASC performance.
We illustrate the framework of our method in Fig. 2. The framework can be roughly divided into three stages: the multi-task learning stage, the graph feature fusion stage, and the representation learning and classification stage. In the first stage, we use the multi-task learning (MTL) technique on an encoder-decoder model for domain adaptive pretraining, where dependency parsing is trained to learn multi-feature information including the structure, the relations of edges, and linguistic features. The encoder is the word embedding model, which takes the tokens as input and outputs the token embedding Z. The masked language model (MLM) task is applied to the encoder for iterative training and for reducing the noisy information in the word embedding representation. The token embedding Z is then fed into the decoder. The decoder is a dependency parsing model, which takes the hidden state Z as input and outputs the predicted dependency graph \({G}({A},{X},{R})\), where \({A}\), \({X}\) and \({R}\) respectively denote the structure feature of the dependency tree, the feature of word nodes, and the features of dependency relations. After the new dependency graph is trained, the parameters of the encoder and decoder are saved for subsequent processing. The details of the MTL-based domain adaptive pretraining are discussed in Section 3.3.
In the graph feature fusion stage, the different graph features, including the structure and the relations of edges between words, are effectively fused for training the message passing neural network (MPNN)-based model. We propose a learnable dependency-based double graph structure to deeply fuse these graph features, through which the pretrained dependency graph \({G}({A},{X},{R})\) generated in the previous stage is transformed into a multi-feature fusion-based double graph structure \({G}(\tilde{{A}}, \tilde{{X}}, \tilde{{R}})\), where \(\tilde{{A}}\), \(\tilde{{X}}\) and \(\tilde{{R}}\) are respectively the structure feature, the node feature, and the relation features after multi-feature fusion. These fused graph features, together with the token embedding Z, are fed into the message passing neural network to train an ASC classification model. The details of double graph structure-based feature fusion and MPNN-based model training are discussed in Section 3.4.
3.2 Architecture of model
In this section, we discuss the architecture of the model in our method, which is shown in Fig. 3. The architecture mainly includes three parts: the domain adaptive pretraining, the fusion of structure and relations, and the multi-feature fusion-based MPNN for ASC.
As mentioned in the previous section, the multi-task learning (MTL) technique is applied for domain adaptive pretraining because it can effectively extract word semantic information and thereby reduce the noisy information in the dependency graph to be produced. We use the BERT model as the encoder for word representation. We further use the masked language model (MLM) task to pretrain the BERT model over the ABSA data sets, and the biaffine attention model (BAM) as the decoder for dependency parsing. MLM and BAM are trained together in an MTL task to learn the structure and relation features for word representation and dependency parsing. Through MTL-based domain adaptive pretraining, we obtain a pretrained dependency graph with less noisy information, which can be transformed into the sparse score matrices of the structure graph and the relation graph. The two matrices are fused to form the double graph structure. A double graph-based message passing neural network (MPNN) is then applied to incorporate the structure feature, the relation features of edges, and linguistic features to improve the performance of ASC. During MPNN training for ASC, our model continuously learns and enhances the structure, relation and linguistic features through the double graph structure, and therefore improves the feature representation for ASC. The related experiments for evaluating our method are discussed in Section 4.
3.3 MTL-based domain adaptive pretraining
The BERT model has shown excellent efficiency in embedding representation through pretraining. In order to further enhance the feature representation of the sentiment text and reduce the noisy information of dependency parsing, we use the masked language model (MLM) task (Devlin et al. 2021) to further train the BERT model on the sentiment text. The enhanced pretrained features are input to the biaffine attention parser to learn the structure feature and relation features, generating a dependency graph with less noise than a static dependency tree produced by an external tool. The dependency parsing task and the MLM task are combined into an MTL task. Unlike some previous studies such as Chen et al. (2022) that use a multi-channel technique to train different graph features independently, our method considers the interactions among the structure, relation and linguistic features. Multi-task learning contributes to deeply fusing these different kinds of features, which affect each other, in a unified framework of domain adaptive pretraining.
Biaffine attention (Dozat and Manning 2017; Bekoulis et al. 2018) is a popular graph-based dependency parsing method. It can handle multi-head selection by directly splicing the head and tail tokens of the entities without requiring additional information interaction. Biaffine attention has shown very promising potential in graph syntax analysis (Dozat and Manning 2017) compared with some other models such as a single affine model plus an MLP.
Suppose that the sentence representation is denoted as \({H}=\{{{h}}_{0},{{h}}_{1},\dots ,{{h}}_{n}\}\), where n is the number of words in the sentence. The dependency structure-based biaffine attention \({Biaffine}_{arch}\left({H}\right)\) is defined by the following equations.
$${{h}}_{i}^{\left(arch-dep\right)}={MLP}^{\left(arch-dep\right)}\left({{h}}_{i}\right)$$
1
$${{h}}_{j}^{(arch-head)}={MLP}^{\left(arch-head\right)}\left({{h}}_{j}\right)$$
2
$${{s}}_{i}^{\left(arch\right)}={{W}}^{(arch-head)}{U}{{h}}_{i}^{(arch-dep)}+{{W}}^{(arch-head)}{b}$$
3
where \({{h}}_{i}\in {\mathbb{R}}^{m\times 1}\), m is the dimension of word embeddings, and \({{W}}^{(arch-head)}\in {\mathbb{R}}^{n\times m}\) is a learnable transformation matrix. The notations \({U}\in {\mathbb{R}}^{m\times m}\) and \({b}\in {\mathbb{R}}^{m\times 1}\) are respectively the trainable weights and the biases. The resulting score vector \({{s}}_{i}\in {\mathbb{R}}^{n\times 1}\) is used as the input to the graph neural network.
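As a concrete illustration, the following PyTorch sketch implements the structure biaffine scorer of Eqs. (1)-(3) for a single sentence; the class name, the one-layer MLPs and the hidden sizes are illustrative assumptions rather than a prescribed implementation.

```python
import torch
import torch.nn as nn

class StructureBiaffine(nn.Module):
    """Minimal sketch of the structure (arch) biaffine scorer, Eqs. (1)-(3)."""
    def __init__(self, hidden_dim: int, mlp_dim: int = 256):
        super().__init__()
        self.mlp_dep = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.ReLU())   # MLP^(arch-dep)
        self.mlp_head = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.ReLU())  # MLP^(arch-head)
        self.U = nn.Parameter(torch.empty(mlp_dim, mlp_dim))  # trainable weights U
        self.b = nn.Parameter(torch.zeros(mlp_dim, 1))        # trainable bias b
        nn.init.xavier_uniform_(self.U)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (n, hidden_dim) word representations of one sentence
        h_dep = self.mlp_dep(H)    # Eq. (1): (n, mlp_dim)
        h_head = self.mlp_head(H)  # Eq. (2): (n, mlp_dim); its rows play the role of W^(arch-head)
        # Eq. (3): column i of the result is the score vector s_i over head candidates
        scores = h_head @ self.U @ h_dep.t() + h_head @ self.b  # (n, n)
        return scores
```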
As shown in the example of Fig. 1, the relations between words are important for high-quality ASC. The biaffine model in earlier research was mainly used for parsing the dependency structure rather than the dependency relations (Kipf and Welling 2017; Sun et al. 2019; Luo et al. 2019; Tang et al. 2020), which may generate a dependency tree with more noisy information and further impede downstream tasks such as sentiment classification. We argue that dependency parsing that fuses the structure feature and the relation features between words introduces less noise than a static dependency tree produced by other tools. In this paper, we extend dependency structure parsing to relation parsing. The biaffine attention \({Biaffine}_{rel}\left({H}\right)\) for dependency relation parsing is defined in Eq. (4).
$${Biaffine}_{rel}\left({H}\right)={\bigcup }_{r\in RS}\{{Biaffine}_{arch}^{r}\left({H}\right)\}$$
4
where RS is the set of dependency relation types of edges between words, such as conj, prep, nsubj, pobj, and so on. There are 45 dependency relation types in the set RS. Each \({Biaffine}_{arch}^{r}\left({H}\right)\) is a dependency-based biaffine attention for a given dependency relation type r.
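A minimal sketch of Eq. (4), reusing the StructureBiaffine class above, is shown below; instantiating one scorer per relation type is only one possible realization (a single biaffine with |RS| output channels would be an equivalent, more efficient choice), and the default value of 45 follows the size of RS stated above.

```python
import torch
import torch.nn as nn

class RelationBiaffine(nn.Module):
    """Sketch of Eq. (4): one structure biaffine per relation type r in RS,
    stacked into an n x n x |RS| relation score tensor S_rel."""
    def __init__(self, hidden_dim: int, num_relations: int = 45, mlp_dim: int = 256):
        super().__init__()
        self.per_relation = nn.ModuleList(
            [StructureBiaffine(hidden_dim, mlp_dim) for _ in range(num_relations)]
        )

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (n, hidden_dim) -> S_rel: (n, n, |RS|)
        return torch.stack([biaffine(H) for biaffine in self.per_relation], dim=-1)
```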
Given a sentence in the form of \(\{{w}_{0},{w}_{1},...,{w}_{n}\}\) as input, MLM is further applied to the BERT model for domain adaptation. Then, we obtain the word embedding representation \({H}=\{{{h}}_{0},{{h}}_{1},\dots ,{{h}}_{n}\}\). Taking the output \({H}\) of the MLM-based pretrained model as input, we use the structure biaffine and the relation biaffine to obtain the score matrices as follows.
$${{S}}_{arch}={Biaffine}_{arch}\left({H}\right)$$
5
$${{S}}_{rel}={Biaffine}_{rel}\left({H}\right)$$
6
where \({{S}}_{arch}\in {\mathbb{R}}^{n\times n}\) is the structure feature, and \({{S}}_{rel}\in {\mathbb{R}}^{n\times n\times \left|RS\right|}\) contains the relation features for all the edge relation types in \({{S}}_{arch}\).
During domain adaptive pretraining, we do not use a spanning tree algorithm to generate a tree structure. Instead, in order to simplify the training process of dependency parsing, once we obtain the score matrices, we directly feed the score matrices into a softmax normalization layer to yield the most likely head node for each node, which is shown in Eq. (7).
$${{P}}_{arch}=softmax\left({{S}}_{arch}\right)$$
7
We use the cross-entropy loss to directly obtain the tree structure from \({{S}}_{arch}\), where for all \({{p}}_{i}^{arch}\) in \({{P}}_{arch}\), the structure loss \({Loss}_{arch}\) is defined in Eq. (8).
$${Loss}_{arch}=-\sum_{i\in V}\log {{p}}_{i}^{arch}$$
8
where V denotes the set of tokens in a sentence (except for the ‘[ROOT]’ node).
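A minimal sketch of Eqs. (7)-(8) is shown below; gold_heads denotes whatever head supervision is used for the parsing task, and the orientation of S_arch (column i holding the head-candidate scores s_i of token i) follows the biaffine sketch above.

```python
import torch
import torch.nn.functional as F

def structure_loss(S_arch: torch.Tensor, gold_heads: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (7)-(8): softmax over head candidates for every token,
    followed by the negative log-likelihood of the supervising head."""
    # S_arch: (n, n), column i holds the head-candidate scores s_i of token i
    # gold_heads: (n,) index of the supervising head of each token (ROOT excluded)
    log_P_arch = F.log_softmax(S_arch, dim=0)                        # Eq. (7)
    return F.nll_loss(log_P_arch.t(), gold_heads, reduction="sum")   # Eq. (8)
```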
Similarly, we apply a softmax normalization layer and the cross-entropy loss to produce the relation features, as defined in Eqs. (9) and (10).
$${{P}}_{rel}=softmax\left({{S}}_{rel}\right)$$
9
$${Loss}_{rel}=-\sum_{i\in RS}\log {{p}}_{i}^{rel}$$
10
where \({{P}}_{rel}\) is a probability matrix of the relation features consisting of the feature \({{p}}_{i}^{rel}\) for each relation \(i\in RS\).
During dependency parsing based on domain adaptive pretraining, the overall loss is the combination of the structure loss, the relation loss and the MLM loss, as shown in Eq. (11).
$$Loss= {Loss}_{arch}+{Loss}_{rel}+{Loss}_{MLM}$$
11
where \({Loss}_{MLM}\) is the cross-entropy loss of the MLM task.
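The following sketch of one MTL pretraining step combines the previous sketches with the Hugging Face BertForMaskedLM encoder to realize Eqs. (5)-(6) and (9)-(11); the batch field names, the use of the last hidden layer as Z, and the unweighted sum of the three losses are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM

encoder = BertForMaskedLM.from_pretrained("bert-base-uncased")  # encoder with MLM head
arc_scorer = StructureBiaffine(hidden_dim=768)                  # decoder, structure part
rel_scorer = RelationBiaffine(hidden_dim=768)                   # decoder, relation part

def mtl_step(batch):
    out = encoder(input_ids=batch["masked_ids"],
                  attention_mask=batch["attention_mask"],
                  labels=batch["mlm_labels"],
                  output_hidden_states=True)
    loss_mlm = out.loss                            # cross-entropy of the MLM task
    Z = out.hidden_states[-1][0]                   # (n, 768) token embeddings of one sentence
    S_arch = arc_scorer(Z)                         # Eq. (5)
    S_rel = rel_scorer(Z)                          # Eq. (6)
    loss_arch = structure_loss(S_arch, batch["gold_heads"])        # Eqs. (7)-(8)
    log_P_rel = F.log_softmax(S_rel, dim=-1)                       # Eq. (9)
    dep_idx = torch.arange(Z.size(0))
    loss_rel = -log_P_rel[batch["gold_heads"], dep_idx,
                          batch["gold_rels"]].sum()                # Eq. (10)
    return loss_arch + loss_rel + loss_mlm                         # Eq. (11)
```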
3.4 Double graph fusion of structure and relations
After the MTL-based training process is finished, we obtain both the structure feature \({{S}}_{arch}\) and the relation features \({{S}}_{rel}\). They are deeply fused by a double graph feature fusion method. The fused double graph data are input into the message passing neural network (MPNN) for ASC.
For dependency feature fusion, the structure graph \({{S}}_{arch}\) can be regarded as an adjacency matrix \({{A}}_{arch}\), and the relation graph \({{S}}_{rel}\) can be converted into a relation graph matrix \({{A}}_{rel}\) by an MLP operation as follows.
$${{A}}_{rel}=MLP\left({{S}}_{rel}\right)$$
12
where \({{A}}_{rel}\in {\mathbb{R}}^{n\times n}\), and \({{S}}_{rel}\in {\mathbb{R}}^{n\times n\times \left|RS\right|}\).
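A minimal sketch of the MLP in Eq. (12) is shown below; a one-hidden-layer MLP applied over the relation dimension is assumed, since the depth of the MLP is not prescribed here.

```python
import torch
import torch.nn as nn

class RelationCollapse(nn.Module):
    """Sketch of Eq. (12): an MLP over the relation dimension that maps the
    n x n x |RS| relation scores S_rel to the n x n relation graph matrix A_rel."""
    def __init__(self, num_relations: int = 45, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_relations, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, S_rel: torch.Tensor) -> torch.Tensor:
        # S_rel: (n, n, |RS|) -> A_rel: (n, n)
        return self.mlp(S_rel).squeeze(-1)
```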
In dependency parsing, the dependency between two words is often directional. For example, an aspect word has a specific sentiment represented by an emotion word, and an emotion word can be related to an aspect word. In this paper, we use an adjacency matrix and its transpose to distinguish the two directions between head-dependent and dependent-head pairs in the dependency graph. The structure and relation features are respectively represented by the adjacency matrices \({{A}}_{arch}\) and \({{A}}_{rel}\), and their transposes \({{A}}_{arch}^{T}\) and \({{A}}_{rel}^{T}\). They are fused into a dependency-enhanced graph \({{A}}_{head}\) and its transpose \({{A}}_{dep}\), as shown in Eqs. (13)-(14).
$${{A}}_{head}={{A}}_{arch}+{{A}}_{rel}$$
13
$${{A}}_{dep}={{A}}_{arch}^{T}+{{A}}_{rel}^{T}$$
14
where all the matrices are \({\mathbb{R}}^{n\times n}\) matrices.
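A minimal sketch of Eqs. (13)-(14), together with the row normalization described next, is given below; treating the raw score matrices directly as adjacency matrices and the particular normalization used here are illustrative choices.

```python
import torch

def fuse_double_graph(A_arch: torch.Tensor, A_rel: torch.Tensor):
    """Sketch of Eqs. (13)-(14): fuse structure and relation graphs into the
    head-direction graph A_head and the dependent-direction graph A_dep."""
    A_head = A_arch + A_rel            # Eq. (13)
    A_dep = A_arch.t() + A_rel.t()     # Eq. (14)
    return A_head, A_dep

def row_normalize(A: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # One possible row normalization applied to both graphs before the MPNN
    return A / (A.sum(dim=-1, keepdim=True) + eps)
```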
The double graph data \({{A}}_{head}\) and \({{A}}_{dep}\) are row-normalized and then input into the MPNN together with the pretrained token embeddings for ASC. During MPNN training, the pretrained dependency graph is further trained and updated by learning the BAM, which continuously produces a better dependency graph with less noisy information.
Specifically, the token embeddings from the MLM pretraining model are denoted as \({{H}}^{0}\) and are directly fed into the MPNN graph neural network as node representations. We use bidirectional message passing (Kampffmeyer et al. 2019) as a layer to gather the final hidden states in order to learn the node representations over both \({{A}}_{head}\) and \({{A}}_{dep}\), as shown in Eqs. (15)-(16).
$${{H}}^{l+\frac{1}{2}}=LN(Relu\left({{A}}_{head}{{H}}^{l}{\theta }_{head}^{l}\right)+{{H}}^{0})$$
15
$${{H}}^{l+1}=LN(Relu\left({{A}}_{dep}{{H}}^{l+\frac{1}{2}}{\theta }_{dep}^{l}\right)+{{H}}^{0})$$
16
where \(LN\) represents layer normalization, \({A}\in {\mathbb{R}}^{n\times n}\) and \({H}\in {\mathbb{R}}^{n\times m}\). The notations \({{H}}^{l+\frac{1}{2}}\) and \({{H}}^{l+1}\) represent the results of the forward and backward message passing, respectively.
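A minimal sketch of one bidirectional message-passing layer, Eqs. (15)-(16), is given below; implementing \({\theta }_{head}^{l}\) and \({\theta }_{dep}^{l}\) as bias-free linear layers is an assumption.

```python
import torch
import torch.nn as nn

class DoubleGraphMPNNLayer(nn.Module):
    """Sketch of Eqs. (15)-(16): a head-direction step followed by a
    dependent-direction step, each with ReLU, a residual connection to the
    initial embeddings H^0, and layer normalization."""
    def __init__(self, dim: int):
        super().__init__()
        self.theta_head = nn.Linear(dim, dim, bias=False)   # theta_head^l
        self.theta_dep = nn.Linear(dim, dim, bias=False)    # theta_dep^l
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, H_l, H_0, A_head, A_dep):
        # A_head, A_dep: (n, n) row-normalized double graph; H_l, H_0: (n, m)
        H_half = self.ln1(torch.relu(A_head @ self.theta_head(H_l)) + H_0)   # Eq. (15)
        H_next = self.ln2(torch.relu(A_dep @ self.theta_dep(H_half)) + H_0)  # Eq. (16)
        return H_next
```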
Message passing in a graph convolutional network (GCN) allows the multi-layer architecture to propagate knowledge to distant nodes in the graph, but every subsequent GCN layer performs extensive Laplacian smoothing, which may dilute the knowledge and consequently decrease performance (Kampffmeyer et al. 2019). The use of our double graph structure in the MPNN can effectively alleviate the problem of knowledge dilution from distant nodes, which is validated in the experiments of Section 4.
3.5 Aspect-based sentiment classification
To produce the aspect representation \({\gamma }\), a pooling function is applied to the hidden states of the aspect tokens. We use the pooled aspect embedding as the input to the classification function. The probability distribution \({{p}}_{c}\) is then obtained by the softmax function.
$${{p}}_{{c}}=softmax({{W}}_{p}{\gamma }+{{b}}_{p})$$
17
where \({{W}}_{p}\) and \({{b}}_{p}\) are both trainable parameters. The loss \(\mathcal{l}\) for ASC is defined by the cross entropy as follows.
$$\mathcal{l}=-\sum_{d\in D}\sum_{c\in P}\log {{p}}_{{c}}$$
18
where D denotes the training dataset, and P denotes the set of all polarities in ASC.
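A minimal sketch of the classification head of Eqs. (17)-(18) is shown below; mean pooling over the aspect tokens and three polarity classes are assumptions, and the gold label tensor name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectClassifier(nn.Module):
    """Sketch of Eq. (17): pool the MPNN hidden states of the aspect tokens
    into the aspect representation, project with W_p and b_p, and apply softmax."""
    def __init__(self, dim: int, num_polarities: int = 3):
        super().__init__()
        self.proj = nn.Linear(dim, num_polarities)   # W_p and b_p

    def forward(self, H_final: torch.Tensor, aspect_mask: torch.Tensor) -> torch.Tensor:
        # H_final: (n, m) MPNN output; aspect_mask: (n,) 1.0 for aspect tokens, else 0.0
        gamma = (H_final * aspect_mask.unsqueeze(-1)).sum(0) / aspect_mask.sum()  # pooled aspect
        return F.log_softmax(self.proj(gamma), dim=-1)   # log of p_c, Eq. (17)

# Eq. (18) then reduces to the usual cross-entropy over gold polarity labels, e.g.
# loss = F.nll_loss(log_p_c.unsqueeze(0), gold_polarity.unsqueeze(0))
```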