Phrase-level attention network for few-shot inverse relation classification in knowledge graph

Relation classification aims to recognize the semantic relation between two given entities mentioned in a given text. Existing models perform well on inverse relation classification with large-scale datasets, but their performance drops significantly in few-shot settings. In this paper, we propose a phrase-level attention network, the function words adaptively enhanced attention framework (FAEA+), which attends to class-related function words through a designed hybrid attention mechanism for few-shot inverse relation classification in knowledge graphs. An instance-aware prototype network is then presented to adaptively capture relation information associated with query instances and to eliminate the intra-class redundancy introduced by function words. We theoretically prove that introducing function words increases intra-class differences and that the designed instance-aware prototype network is able to reduce the resulting redundancy. Experimental results show that FAEA+ significantly improves over strong baselines on two few-shot relation classification datasets. Moreover, our model has a distinct advantage in handling inverse relations, outperforming state-of-the-art results by 16.82% under the 1-shot setting on FewRel1.0.


Introduction
Relation classification (RC) is a basic task in knowledge graph research that aims to identify the correct relation between two entities in a given instance (i.e., a sentence in natural language containing a pair of head and tail entities) from a set of candidate relations. It is essential for many applications of natural language processing, such as knowledge base completion [1], dialogue systems [2], entity resolution [3], and drug discovery [4].
Most RC models proposed in the literature rely on supervised learning, and their training datasets are generally based on manually labeled data, which is costly in terms of labor and time. To address this issue, Han et al. [5] first investigated few-shot relation classification (FSRC) and proposed the FewRel1.0 dataset to provide a fair and uniform comparison for FSRC models. The basic idea is to train on existing relation types over a large-scale dataset and then to transfer the trained model to new relation types with sparse data [6]. Since then, several other FSRC models have been reported in the literature, which show remarkable performance on FewRel1.0 [7][8][9][10][11].
However, our experiments show that the performance of existing FSRC models drops significantly when a relation in the given relation set is the inverse of another, a setting we call few-shot inverse relation classification (FSIRC). As shown in Figure 1, the relation 'has part' is the inverse of the relation 'part of', which can be seen from the support instances of these two relations: both share the content word 'part' but differ in the function words 'its' and 'of'. Existing models [10, 12] for FSRC tend to focus on characterizing the differences between content words but ignore the differences between function words.
We have also observed that not all support instances contribute equally to the confidence of a relation; some support instances can even be useless. There are two causes for this. First, the intra-class variance may increase further when function words are taken into account. Second, the representations of support instances are not sufficiently informative to serve as prototypical representations of relations, because support instances may contain instance-specific information. For different query instances, we can detect the instances in the support set that are closely related to a specific query and ignore noisy instances when classifying specific relations (Table 1).

Table 1
The change in accuracy (%) from random few-shot settings to inverse few-shot settings on FewRel1.0 (columns: Method, 2-way-1-shot, 4-way-1-shot, 5-way-1-shot, 5-way-3-shot). Taking the 5-way setting as an example, both settings are trained with five randomly sampled instances. During testing, the random few-shot setting samples relations at random, whereas the inverse setting selects instances from two pairs of inverse relations and one other relation.

To address these two issues, we propose FAEA+, a phrase-level attention network that captures both function words and feature words in the instance encoder and obtains prototypical representations through an adaptive message passing mechanism associated with query instances. As shown in Figure 2, we not only consider the importance of keywords but also capture function words in the instance encoder; specifically, we adopt a hybrid attention mechanism to attend to function words. An instance-aware prototype network is then proposed to obtain prototype representations related to the query instance.
In short, our contributions are as follows:
• We propose FAEA+, which learns to capture the importance distribution of function words from two perspectives, relation-centric and entity-centric, to distinguish inverse relations.
• We present an instance-aware prototype network to adaptively obtain prototypical representations associated with query instances, so that the intra-class redundancy caused by function words is reduced.
• We theoretically prove, through the derivation of inequalities, that introducing function words increases intra-class differences and that the designed message passing mechanism effectively reduces the resulting redundancy.
• Experimental results on two benchmarks show that FAEA+ significantly outperforms strong baselines. In particular, it achieves new state-of-the-art results on the FSIRC task.
This paper is a significant extension of our IJCAI-2022 conference paper [13]. We have expanded on the original version in the following four aspects: 1. FAEA+ further improves the FAEA model proposed in [13] by introducing an entity-centric instance encoder. In addition, we propose an enhanced prototype computing mechanism that adaptively learns prototypical representations associated with query instances. 2. Full proofs of theoretical results are included. 3. Experiments are extended by comparing against more recent baseline models for FSRC. 4. More technical details are added; in particular, the related work is extended.

Given a support set S = {(x^i_k, h^i_k, t^i_k, r_i, y_i) | i = 1, ..., N; k = 1, ..., K} and a relation set R = {y_1, ..., y_N}, we define FSRC as the task of predicting the relation y_i between the entity pair (e_h, e_t) mentioned in the query instance x_q (i.e., a sentence containing head entity e_h and tail entity e_t). Here, (x^i_k, h^i_k, t^i_k, r_i, y_i) denotes that the relation between the head entity h^i_k and the tail entity t^i_k in the instance x^i_k is y_i, and r_i is the corresponding relation description provided as auxiliary evidence for relation classification. N stands for the number of relations, and each relation has only a small number K of labeled instances.
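To make the episode structure concrete, the following minimal Python sketch (not from the paper; the dataset layout and function names are hypothetical) samples one N-way-K-shot task with a support set and a query set:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=1, seed=None):
    """Sample one N-way K-shot episode.

    Hypothetical data layout: `dataset` maps a relation name to a list of
    labelled instances (sentences with marked head/tail entities)."""
    rng = random.Random(seed)
    relations = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, rel in enumerate(relations):
        instances = rng.sample(dataset[rel], k_shot + q_queries)
        support += [(x, label) for x in instances[:k_shot]]
        query += [(x, label) for x in instances[k_shot:]]
    return relations, support, query

# Toy usage with a pair of inverse relations plus one other relation.
toy = {
    "has part": ["sent_a1", "sent_a2"],
    "part of": ["sent_b1", "sent_b2"],
    "capital of": ["sent_c1", "sent_c2"],
}
rels, support, query = sample_episode(toy, n_way=2, k_shot=1, q_queries=1, seed=0)
print(rels, support, query)
```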
For the few-shot inverse relation classification task, the relation set R includes some pairs of inverse relations. For example, 'participant' and 'participant of' are inverse relations, and their descriptions are "person or organization (object) that actively takes part in the event (subject)" and "event a person or an organization was a participant in", respectively.

Entity-centric Instance Encoder
The entity-centric instance encoder aims to enhance the attention on function words associated with the two given entities through constituent attention and semantic-related attention. We apply BERT [14] as the encoder to obtain contextualized embeddings X^i_k = {w_1, w_2, ..., w_l} for an instance x^i_k = {w_1, w_2, ..., w_l} mentioning two entities, where l is the number of words, each word embedding w_i ∈ R^d, and d is the embedding dimension.
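As an illustration, contextualized embeddings of this kind can be obtained with the HuggingFace transformers library roughly as follows; this is a generic BERT-base sketch, not the authors' implementation, and it omits any entity markers the model may use:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = "The Hakkoda Mountains are three parts of the Towada-Hachimantai National Park."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = encoder(**inputs)

# Contextualized embeddings X of shape (1, l, d), with d = 768 for BERT-base.
X = outputs.last_hidden_state
print(X.shape)
```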
Constituent Attention

Considering that the importance of function words varies with different entities, we learn a constituent prior matrix C ∈ R^{l×l} to pay more attention to function words that are close to the entities.
For the instance x^i_k, the probability that w_i and w_j belong to the same phrase is calculated as follows, where [·]_n is the n-th row of a matrix. The correlation between w_i and its right neighbor w_{i+1} is given by the score s_{i,i+1}. Then, by applying a softmax function over the correlation scores of w_i in (3), we restrict w_i to link only to either its right neighbor or its left neighbor. Since p_{i,i+1} and p_{i+1,i} may have different values, we average the two attention links.
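Since the exact scoring formulas are not reproduced here, the sketch below only illustrates the described behavior under stated assumptions: bilinear neighbor scores, a softmax over the left/right link of each word, and averaging the two directed links into C.

```python
import torch
import torch.nn.functional as F

def constituent_prior(X, W):
    """Minimal sketch of a constituent prior C.

    Assumptions (not the paper's exact equations): neighbour scores are
    bilinear dot products, and C only links adjacent words.
    X: (l, d) word embeddings, W: (d, d) learnable weight."""
    l = X.size(0)
    proj = X @ W                                   # (l, d)
    s_right = (proj[:-1] * X[1:]).sum(-1)          # s_{i,i+1}, shape (l-1,)
    s_left = (proj[1:] * X[:-1]).sum(-1)           # s_{i+1,i}, shape (l-1,)
    # Each word links either to its left or right neighbour (softmax over the two).
    p = torch.zeros(l, l)
    for i in range(l):
        scores = []
        if i > 0:
            scores.append(s_left[i - 1])
        if i < l - 1:
            scores.append(s_right[i])
        probs = F.softmax(torch.stack(scores), dim=0)
        k = 0
        if i > 0:
            p[i, i - 1] = probs[k]; k += 1
        if i < l - 1:
            p[i, i + 1] = probs[k]
    # p_{i,i+1} and p_{i+1,i} may differ, so average the two directed links.
    return 0.5 * (p + p.t())

X = torch.randn(6, 16)
W = torch.randn(16, 16)
print(constituent_prior(X, W).shape)   # torch.Size([6, 6])
```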

Semantic-related Attention
We use a self-attention mechanism to obtain a semantic-related matrix S that attends to necessary function words far from the entities; it is calculated as follows. Next, we find the indexes of the entities and strengthen the related function words according to the constituent prior matrix C and the semantic-related matrix S.
where I^h_start and I^t_start refer to the start indexes of the head and tail entities, respectively. Then, we combine β^h_constituent and β^h_related to form the hybrid function-words attention vector related to the head entity, formalized as follows. Similarly, we can obtain the hybrid function-words attention vector β^t related to the tail entity.
where λ is a hyper-parameter. Finally, we obtain the entity-centric instance representation x^i_k by concatenating the two entity-centric vectors, formalized as follows. Compared with FAEA [13], the entity-centric instance encoder is newly added and aims to enhance the attention on function words related to the entities. FAEA only enhances function words for keywords related to relations, which greatly improves performance on inverse relations but still fails to correctly identify them when the positions of the head and tail entities are swapped. Therefore, we additionally design the above module to obtain a phrase-level representation of the entities and improve position awareness by enhancing the attention on function words associated with the two given entities.
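The following sketch ties these pieces together under explicit assumptions: S is taken as standard scaled dot-product self-attention, the rows of C and S at the entity start indexes serve as the constituent and semantic-related attention vectors, and the hybrid combination is assumed to be a convex mix weighted by λ. The paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def entity_centric_repr(X, C, idx_h, idx_t, lam=0.6, Wq=None, Wk=None):
    """Minimal sketch, not the exact formulation.

    X: (l, d) contextualized word embeddings; C: (l, l) constituent prior;
    idx_h / idx_t: start indexes of head and tail entities."""
    l, d = X.shape
    Wq = Wq if Wq is not None else torch.eye(d)
    Wk = Wk if Wk is not None else torch.eye(d)
    # Semantic-related matrix S from scaled dot-product self-attention.
    S = F.softmax((X @ Wq) @ (X @ Wk).t() / d ** 0.5, dim=-1)   # (l, l)

    def entity_vector(idx):
        beta_constituent = C[idx]                # attention on words near the entity
        beta_related = S[idx]                    # attention on words far from the entity
        beta = lam * beta_constituent + (1.0 - lam) * beta_related
        return beta @ X                          # weighted sum over words, shape (d,)

    # Concatenate the head- and tail-centric vectors into one representation.
    return torch.cat([entity_vector(idx_h), entity_vector(idx_t)], dim=-1)

X = torch.randn(6, 16)
C = torch.rand(6, 6)
print(entity_centric_repr(X, C, idx_h=1, idx_t=4).shape)   # torch.Size([32])
```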

Relation-centric instance encoder
To learn the importance distribution of general function words and class-specific function words, we combine class-general attention with class-specific attention, where the latter consists of constituent and semantic-related attention.
For the i-th relation r_i, we encode its name and corresponding description to acquire word embeddings R_i ∈ R^{l×d}, and employ the hidden state of the [CLS] token to obtain the relation feature r_i ∈ R^{2d}.

Class-general Attention
The major step is to acquire a generic function-words attention β_s ∈ R^l and a keyword attention α^i_k ∈ R^l, computed as follows, where the memory unit u_w ∈ R^d is a trainable parameter. With the help of the memory unit, we can select general keywords from instances. sum(·) is an operation that adds the elements of each row of a matrix.
Then, we obtain the general function-words importance β_general ∈ R^l by down-weighting the importance of words related to u_w and up-weighting the importance of words unrelated to u_w. It is computed as follows, where E ∈ R^l is an all-one vector.
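A possible reading of this step is sketched below; the keyword attention as a softmax over word-memory similarities and the flip against the all-one vector E are assumptions, not the paper's exact equations:

```python
import torch
import torch.nn.functional as F

def class_general_attention(X, u_w):
    """Minimal sketch under stated assumptions: keyword attention alpha is
    the softmax of word-memory similarities, and the general function-word
    importance is obtained by flipping it against an all-one vector E."""
    alpha = F.softmax(X @ u_w, dim=0)        # keyword attention, shape (l,)
    E = torch.ones_like(alpha)               # all-one vector
    beta_general = E - alpha                 # down-weight keyword-like words
    return alpha, beta_general

X = torch.randn(6, 16)                       # word embeddings of one instance
u_w = torch.randn(16, requires_grad=True)    # trainable memory unit
alpha, beta_general = class_general_attention(X, u_w)
print(alpha.shape, beta_general.shape)       # torch.Size([6]) torch.Size([6])
```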
Class-specific Attention

Class-specific attention consists of constituent and semantic-related attention, following the same idea as Section 3.1. Unlike that section, we need to obtain the keyword indexes I^i_k in x^i_k, calculated as follows, where max(·)_n returns the indexes of the top-n largest attention keywords and n is the number of keywords.
We utilize the keyword indexes to enhance the related function words following C and S, obtained by (1) and (4).
Next, the model uses β_general, β^r_constituent, and β^r_related to construct a hybrid function-words attention vector β^r. Then, we obtain the relation-centric instance representation x̃^i_k by aggregating the keyword attention α^i_k and the hybrid function-words attention β^i_k, formalized as follows:

Finally, our approach combines the entity-centric representation x^i_k and the relation-centric representation x̃^i_k to obtain the integral representation of the instance, denoted as x^i_k and formalized as follows. Overall, we construct class-general and class-specific attention to learn function-words variance in the few-shot setting, motivated by MAML [15], which learns general model parameters and fine-tunes them to fit a new task.

Instance-aware prototype learning
We propose an instance-aware prototype learning network that first updates instance representations through adaptive message passing and then assigns a specific weight to each support instance according to its similarity to the query instance.

Adaptive message passing
To reduce the intra-class redundancy brought on by function words and adaptively control the percentage of transferred inter-class messages, we propose an adaptive message passing module that includes graph construction and node update, as shown in Figure 3.
• Graph Construction. V is the set of instance features with |V| = N × K and E is the adjacency matrix, where [;] denotes row-wise concatenation and v_i ∈ R^d denotes the i-th row of the matrix V.
• Node Update. Following the work of [16], we design a novel node updating scheme that captures and transfers inter-class differences and intra-class commonalities between instance nodes, where N^0_i and N^1_i denote the sets of neighbors of v_i from different classes and from the same class, respectively, and d_i is the degree of instance node i. It is worth mentioning that we set a threshold ē_i to adaptively adjust the proportion of transferred inter-class difference information. Specifically, two nodes from different classes stop message passing if their difference is greater than the threshold ē_i.
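A minimal sketch of such a node update, under the assumptions that intra-class messages carry the neighbors' features, inter-class messages carry feature differences, and a fixed threshold gates the inter-class messages, could look as follows (the paper's exact update rule and learnable threshold ē_i are not reproduced):

```python
import torch

def adaptive_message_passing(V, labels, threshold=1.0):
    """Minimal sketch under stated assumptions: same-class neighbours pass
    commonality messages (their features), different-class neighbours pass
    difference messages (v_i - v_j), and a pair stops passing messages when
    the norm of their difference exceeds the threshold."""
    n = V.size(0)
    V_new = V.clone()
    for i in range(n):
        messages, degree = [], 0
        for j in range(n):
            if i == j:
                continue
            diff = V[i] - V[j]
            if labels[i] == labels[j]:
                messages.append(V[j])                 # intra-class commonality
                degree += 1
            elif diff.norm() <= threshold:            # adaptive gate on differences
                messages.append(diff)                 # inter-class difference
                degree += 1
        if degree > 0:
            V_new[i] = V[i] + torch.stack(messages).sum(0) / degree
    return V_new

V = torch.randn(4, 8)                  # N*K instance nodes (e.g., 2-way 2-shot)
labels = torch.tensor([0, 0, 1, 1])
print(adaptive_message_passing(V, labels).shape)   # torch.Size([4, 8])
```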

Prototype enhancing mechanism
As not all support instances contribute equally to the confidence of relation classification for a query instance, we design an instance-aware prototype enhancing mechanism to learn specific weights for support instances even if they belong to the same class.
For the relation r_i, its instance-aware prototype p_i is computed as follows, where K is the number of instances within the relation r_i and ω^i_j is the importance score of each support instance v^i_j with respect to the query instance x_q.
F(·) is a linear function.
We estimate the probability that the query instance x_q belongs to the relation r_i as follows. The final objective function is formally written as follows, where y is the class label and z_y is the predicted probability for the class y.
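The following sketch illustrates the overall flow with simple stand-ins: dot-product similarity to the query replaces the linear scoring function F(·), and a softmax over query-prototype similarities with cross-entropy serves as the training objective.

```python
import torch
import torch.nn.functional as F

def instance_aware_prototypes(support, labels, query, n_way):
    """Minimal sketch: per class, weight each support instance by the softmax
    of its dot-product similarity to the query (a stand-in for the linear
    scoring function F), then take the weighted sum as the prototype."""
    protos = []
    for c in range(n_way):
        V_c = support[labels == c]                       # (K, d) support features
        w = F.softmax(V_c @ query, dim=0)                # instance-aware weights
        protos.append((w.unsqueeze(1) * V_c).sum(0))     # prototype p_c
    return torch.stack(protos)                           # (N, d)

support = torch.randn(6, 16)                  # 2-way 3-shot support features
labels = torch.tensor([0, 0, 0, 1, 1, 1])
query = torch.randn(16)

P = instance_aware_prototypes(support, labels, query, n_way=2)
logits = P @ query                            # similarity of query to each prototype
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))   # true class y = 1
print(F.softmax(logits, dim=0), loss.item())
```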
In conclusion, we design a new prototype learning network that adaptively assigns specific weights to each support instance based on the designed node update method. Different from FAEA [13], this work adds an instance-aware prototype learning mechanism that derives weights from the similarity between support instances and the query instance and uses them to obtain the prototype representation instead of a simple averaging operation.

Theoretical analysis
In this section, we theoretically prove that intra-class differences increase when function words are introduced (Theorem 1), and that the designed message passing mechanism makes nodes from different classes more discriminative and nodes from the same class more similar (Theorem 2).
Theorem 1 Given any two instances x_i and x_j, let their keyword representations and their function-word representations be defined as follows, where ⟨·,·⟩ indicates the inner product between vectors.
Proof First, according to the definition of the similarity between instances, the similarity between instances considering only keywords is given as follows. Considering function words, the similarity between instances is as follows. We then make an assumption on the norms of the representations, from which the following inequality can be derived. Combining (24), (25) and (26), we obtain the claimed result. Therefore, the theorem is proved.
This theorem shows that when function words are taken into account, the normalized similarity between instances x_i and x_j becomes smaller, and thus the intra-class difference increases.
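As an illustrative derivation (not the paper's proof), assume the keyword parts a_i, a_j and the function-word parts b_i, b_j of two instances are ℓ2-normalized and that similarity is the cosine over the concatenated vectors; the symbols a and b are introduced here only for this sketch:

```latex
% Assume the keyword parts a_i, a_j and function-word parts b_i, b_j are
% l2-normalized, and similarity is cosine over the concatenated vectors.
\[
\operatorname{sim}\big([a_i; b_i], [a_j; b_j]\big)
  = \frac{a_i^{\top} a_j + b_i^{\top} b_j}
         {\sqrt{\|a_i\|^2 + \|b_i\|^2}\,\sqrt{\|a_j\|^2 + \|b_j\|^2}}
  = \frac{\cos(a_i, a_j) + \cos(b_i, b_j)}{2}.
\]
% If two same-class instances share keywords but differ in their
% instance-specific function words, i.e. cos(b_i, b_j) < cos(a_i, a_j),
% then the combined similarity is strictly smaller than cos(a_i, a_j),
% so introducing function words enlarges the intra-class difference.
```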
Theorem 2 Given any two instances x_i and x_j, their similarity is defined as follows, and the message passing between same-class instances and between different-class instances is defined as follows, respectively.
• If x_i and x_j belong to the same class, it holds that:
• If x_i and x_j belong to different classes, it holds that:
Proof Let D_1 and D_2 be the similarities of instance features before and after the transformation.
If two instances x_i and x_j belong to the same class, we obtain the following inequality, which indicates that the similarity D_2 is larger than the original similarity D_1, implying that the designed aggregation method makes the representations of nodes belonging to the same class more similar.
Next, if two instances x_i and x_j belong to different classes, we obtain the following inequality. Therefore, the theorem is proved.
The above theorem shows that, compared with the original similarity D(x_i, x_j), the message passing transfers difference information when D(x_i, x_j) becomes smaller, and otherwise transfers similar information.

Datasets
We select two well-established large-scale FSRC datasets, FewRel1.0 and FewRel2.0, to demonstrate the effectiveness and strong adaptability of our model. The details are as follows:
• FewRel1.0 [5] contains 100 relations, each with 700 labeled instances. Each instance has an average of 24.99 tokens, and there are 124,577 unique tokens in total. Furthermore, each instance is marked with designated head and tail entities. In addition, the name and description of each relation are provided as supplementary evidence, as shown in Table 2.
• FewRel2.0 [9] is a cross-domain dataset constructed by aligning PubMed and UMLS, containing 25 relations, each with 100 instances. In particular, its training set comes from the Wikipedia domain, while its test set is from the biomedical domain.
Table 3 shows a 2-way 2-shot scenario in FewRel1.0 and a 2-way 1-shot scenario in FewRel2.0. In the experiments, the dataset is divided into 64 base classes for training, 16 classes for validation (10 classes in FewRel2.0), and 20 novel classes for testing, following the splits used in the official benchmarks.

Setting
We evaluate the performance of our model using the average accuracy on N-way-K-shot tasks, following the setting in FewRel [5], where N and K represent the number of classes and the number of instances per class, respectively. We choose N as 5 and 10 and K as 1 and 5 to create four test scenarios, following [7, 9]. For a fair comparison, we adopt base-uncased BERT with 768-dimensional hidden states as the encoder. The maximum input length is set to 128. The AdamW [17] optimizer is applied with a learning rate of 2×10^-5 and a weight decay of 1×10^-2. In addition, the hyper-parameter λ is set to 0.6, and u_w is randomly initialized following Sukhbaatar et al. [18]. It is worth mentioning that we concatenate the name and description of each relation as input for FewRel1.0 and only input the relation name for FewRel2.0.
To investigate the influence of hyper-parameters on our model, we vary the learning rate and the maximum number of iterations and observe the change in accuracy. The best performance is achieved when the learning rate is 2×10^-5 and the maximum number of iterations is 30,000 on both FewRel1.0 and FewRel2.0, which is consistent with Peng et al. [19].
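For reference, the reported training configuration could be assembled as follows; the variable names are illustrative, and only the values quoted above come from the paper:

```python
import torch
from transformers import BertModel

# Hyper-parameters reported in the text; variable names are illustrative.
config = {
    "encoder": "bert-base-uncased",   # 768-dimensional BERT-base
    "max_length": 128,
    "learning_rate": 2e-5,
    "weight_decay": 1e-2,
    "lambda": 0.6,                    # hybrid-attention mixing weight
    "max_iterations": 30000,
}

encoder = BertModel.from_pretrained(config["encoder"])
optimizer = torch.optim.AdamW(
    encoder.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
print(optimizer)
```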

Baselines
We compare our model with the latest FSRC approaches, detailed as follows:
• To examine the usefulness of local-level features, we compare with instance-level models: the metric-based models Proto [8], Proto-HATT [7], MLMAN [20], BERT-PAIR [9], TPN [23], and HCRP [25], which learn with prototypical networks or by measuring the similarity of sentence pairs, and the gradient-based models MAML [15] and GNN [21].
• To demonstrate the importance of function words, we compare with word-level models: TD-Proto [10], which learns the importance distribution of generic content words with a memory network, and ConceptFERE [11], which measures class-specific content-word importance with a new attention mechanism.
• Models introducing external information: REGRAB [22], a Bayesian meta-learning approach that utilizes the global relation graph of a knowledge graph; CTEG [24], which decouples highly co-occurring relations with two kinds of external information; and PRM [26], a robust prototype network utilizing relation information (i.e., relation name and relation description).
• Pretrained RC methods: MTB [27], which proposes a pre-training task named matching the blanks on top of an existing BERT model, and CP [19], a contrastive pre-training framework with masked entities for relation extraction.
Since Soares et al. [27] employ a BERT-large network as the encoder backbone, we do not adopt their reported results. For a fair comparison, we take the results from the work of Peng et al. [19] and some recent works, which use BERT-base as the encoder.

Performance on few-shot relation classification
As can be seen from Table 4, our model is significantly superior to the previous FAEA and other strong baselines, especially under 5-shot settings. In particular, our model improves the accuracy of the 5-way-5-shot and 10-way-5-shot tasks by 1.13 and 3.13 points, respectively.
• Proto and GNN, which are widely used as baselines for few-shot learning, do not perform well on FSRC. While low-level patterns can be shared across different tasks in computer vision, words that are useful for one task might not be relevant to another. In contrast, FAEA+ leverages the entity-centric and relation-centric instance encoders to learn how the importance distribution of words changes with the relation.
• TD-Proto and ConceptFERE also use semantic-level attention to detect important content words, but they ignore the function words that maintain grammaticality and carry syntactic-structure differences. FAEA+ enhances the importance of class-related function words through hybrid attention, capturing structural differences that have not been accounted for in previous work, which is especially useful for the FSIRC task.
• Proto-HATT and TPN capture intra-class commonalities when computing relation prototypes without considering inter-class differences. FAEA+ proposes an instance-aware prototype computing mechanism to capture and leverage the differences between relations and query instances, yielding more discriminative prototype representations.
• We also evaluate our approach on top of CP, in which the BERT encoder is initialized with its pretrained parameters. The proposed FAEA+ achieves a consistent performance gain, which proves the significance of our method.
In addition, we conduct experiments on FewRel2.0, a cross-domain FSRC dataset, as shown in Table 5. FAEA+ again achieves the best performance. As with other models, the results are significantly lower than those on FewRel1.0 since the training and test sets belong to different domains, but FAEA+ declines more slowly than the other models, indicating a certain degree of stability and transferability. Different from FAEA, we take into account not only the importance of content words and function words but also the transition of the roles of head and tail entities. Specifically, FAEA+ leverages the entity-centric instance encoder to explore the differences caused by head-to-tail entity role changes.

Performance on few-shot inverse relation classification
To further demonstrate the effectiveness of the model for FSIRC, we evaluate models on the FewRel1.0 validation set under various scenarios, as shown in Table 6. We train models in the general training setting but evaluate them under two settings, 'Random' and 'Inverse'. Random is the generic evaluation setting that randomly samples 10,000 FSRC tasks from the validation set. In the Inverse setting, every evaluated task includes inverse relations, such as 'has part', 'part of', 'participant', and 'participant of' under the 4-way setting. The baselines achieve strong performance under the random setting but drop rapidly under the inverse setting, by around 26.98 points in the 1-shot scenario, which shows that FSIRC is very challenging. FAEA+ obtains the best results and drops less under the inverse setting, which proves that it can capture function words and handle FSIRC tasks effectively.
Compared with FAEA, the performance of FAEA+ is further improved, especially in multi-shot scenarios, since our proposed prototype computing mechanism adaptively adjusts the prototype representations according to the query instance.

Analysis
To take a deeper look at the improvements contributed by each part of our model, we conduct a fine-grained analysis from three aspects: the entity-centric instance encoder, the relation-centric instance encoder, and instance-aware prototype learning.

Analysis of relation-centric instance encoder
This section discusses the effect of the relation-centric instance encoder. As shown in Table 7, the performance decreases significantly after the relation-centric encoder is removed completely (Model 5); in particular, the accuracy drops by 2.91 and 4.95 points under the 5-way-1-shot and 10-way-5-shot settings, respectively. This indicates that relation-related function words play an important role in FSRC. In addition, as shown in Figure 4, with the introduction of function-words attention, the relation-centric instance encoder highlights the importance of 'are' and 'of' to compose the phrase "are part of", which appears in both the query instance and the support instance of the class 'part of'; that support instance therefore obtains a higher score, and our model classifies the query instance correctly.
To further confirm the validity of the three components of function-words attention, we remove them separately. We can see that the results decline for Models 6, 8, and 9 in Table 7. In addition, we visualize the attention scores of words in the query instance. As shown in Figure 4, TD-Proto mainly focuses on content words, such as 'part', 'Towada', and 'Lake'. FAEA+ without the general attention module increases the importance of content words unrelated to keywords, such as 'Hakkoda' and 'Lake'. FAEA+ without constituent attention raises the importance of some keyword-unrelated function words, such as 'along' and 'and'. FAEA+ without related attention ignores some function words that are far from the keywords but more important to the relation, such as 'are'. The full FAEA+ captures the correct function words to form the phrase 'are three parts of', demonstrating that all components contribute to enhancing the importance of function words.

Analysis of entity-centric instance encoder
To have a clear view of the role that the entity-centric instance encoder plays in FAEA+, we remove the entity-centric instance encoder (Model 2) and its two submodules (Models 3 and 4) in Table 7. The performance decreases severely, indicating that the role of entities (i.e., head or tail entity) is also important for relation representation. As shown in Figure 5, with the introduction of function-words attention, FAEA+ highlights the importance of 'is' and 'of' to form the phrase "is part of", which is also noticed by FAEA. Unfortunately, this phrase appears in support instances of both classes 'part of' and 'has part', so FAEA still struggles to classify the query correctly, especially when the query instance and the support instance of 'has part' share some other words such as 'Island'. In contrast, FAEA+ can distinguish the difference between the entity-pair patterns (Entity_head is, the Entity_tail.) and (larger Entity_head., Entity_tail is), which belong to the relations 'part of' and 'has part', respectively. Thus, the query instance is correctly classified as the relation 'part of'. This demonstrates that FAEA+ captures differences between similar relations with nearly identical words by utilizing the difference in head-to-tail entity phrase structure.

Analysis of instance-aware prototype learning
To have a clear view of the role of instance-aware prototype learning, we completely remove it (Model 10), as shown in Table 7, and observe that the performance decreases, especially under the 10-way-5-shot setting. This shows that the prototype representations are crucial for classifying query instances in FSRC. We then analyze it at a finer granularity in two parts: adaptive message passing and the prototype enhancing mechanism.

Adaptive message passing
As shown in Table 7, comparing the model without message passing (Model 11) and the model with message passing but without the adaptive threshold (Model 12), we can see that message passing improves performance, and using the threshold to control message transmission further enhances the accuracy.
To further prove the effectiveness of message passing, we select 15 classes from FewRel1.0 and visualize their intra-class and inter-class similarities, as shown in Figure 6. When only content words are considered, the inter-class similarity among some classes is high. In addition, we use the t-SNE tool to visualize the instance representation distribution before message passing (Figure 7a), after message passing without the threshold (Figure 7b), and after adaptive message passing (Figure 7c). With message passing but no threshold, the similarity between the relations 'part of' and 'notable work' decreases significantly from 0.79 to 0.38, much lower than 0.80, the similarity between the relations 'has part' and 'part of'. After message passing with the threshold, the similarity declines more gently to 0.65, which indicates the effectiveness of the threshold setting and shows that the adaptive message passing mechanism effectively enhances inter-class differences while maintaining the original relation semantics.
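The kind of t-SNE visualization described here can be reproduced on any embedding matrix with scikit-learn; the snippet below uses synthetic features purely as a stand-in for the model's instance representations:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Toy stand-in for instance embeddings after message passing:
# 3 relations x 20 instances, 64-dimensional features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
features = rng.normal(size=(60, 64)) + labels[:, None] * 2.0

# Project to 2-D with t-SNE and plot one colour per relation.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(features)
for c in range(3):
    pts = coords[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], label=f"relation {c}")
plt.legend()
plt.savefig("tsne_instances.png")
```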

Prototype enhancing mechanism
To evaluate the capacity of FAEA+ in computing prototypes, we remove the prototype enhancing mechanism (Model 13), as shown in Table 7. The performance is almost unchanged under the 5-way-1-shot task, while the accuracy decreases by 2.29 points under the 10-way-5-shot task. Since there is only one instance per relation in the support set under the 1-shot setting, the prototype representation cannot be adjusted according to the query instance, so adding the prototype enhancing mechanism has no effect in that case.
Besides, to further evaluate the contribution of the prototype enhancing mechanism, we also use the t-SNE tool to visualize the instance representation distribution and compute the distances between prototypes, or between the query instance and prototypes, after adaptive message passing (Figure 7c) and after prototype enhancing (Figure 7d). We observe that the prototype of each relation moves closer to the query instance thanks to the guidance of the query instance. Moreover, before the prototype enhancing mechanism is introduced, the prototype generated by averaging a few instances may contain noise and lead to poor robustness, whereas FAEA+ adaptively learns the prototype representation based on the similarity between the query instance and the support instances. This illustrates that the instance-aware prototype network captures more fine-grained details and performs better than the standard prototype network.

Related work
In this section, we compare our approach with related works, including few-shot relation classification and graph neural networks.

Few-shot relation classification
Few-shot relation classification attempts to identify novel relations from only a few labeled instances. Existing methods can be grouped into three categories: instance-level representation, word-level representation, and other methods.

Instance-level representation
This type of method uses different algorithms to aggregate instances, mainly based on the prototypical network [8], which embeds instances into a smaller space and aims to make instances of the same class gather together and instances of different classes separate. Considering that the data itself is noisy in the few-shot relation classification task, Gao et al. [7] utilized a hierarchical attention mechanism to enhance the robustness of the model. Xie et al. [28] reduced the model's sensitivity to noisy samples through heterogeneous graph networks and adversarial training. Wen et al. [23] and Li et al. [29] enhanced the prototype representation by increasing the contribution of instances similar to the query instance. Other works focus on the problem of losing differences between instances of different relations. Koch et al. [30] used a siamese network to encode instances of the support and query sets. Han et al. [25] shortened intra-class distances and increased inter-class distances through contrastive learning. BERT-PAIR [9] pairs each supporting instance with each query instance and predicts whether the pair belongs to the same category.

Word-level representation
This kind of method aims to capture local keywords to obtain more fine-grained information. Considering that different words in the same instance contribute differently to the relation representation, word-level attention mechanisms are designed to select more informative content words as context for learning relation representations. Bao et al. [12] and Yang et al. [11] created attentive generators to gauge the significance of each word for a given relation class. Sun et al. [10] introduced a memory network to memorize the importance distribution of general content words.
Other methods
In addition to the above strategies, other ideas have emerged in recent years. Some works combine prototypical networks with pretrained relation classification models and achieve impressive results. The pretrained model MTB proposed by Soares et al. [27] is pretrained on the relation classification task using a large amount of unsupervised data. Considering the model's over-reliance on entity information, Peng et al. [19] proposed a contrastive pretraining framework with masked entities. Li et al. [31] used causal intervention to weaken the influence of confounders caused by pretrained knowledge. Since the available information is limited in FSRC, some methods introduce external information. Yang et al. [32] introduced the inherent concepts of entities to offer cues for relation classification. Zhang et al. [33] incorporated general and domain-specific knowledge graphs into the RC model, and Tang et al. [34] introduced knowledge base information to combine with the sample information and improve the model's domain adaptability. There are also some gradient-based models. The MLMAN model proposed by Ye et al. [20] interactively encodes each support instance and the query instance by considering instance-level matching information. Ravi et al. [35] and Munkhdalai et al. [13] used an optimization-based meta-learning model as a framework to learn generic classification difference information in the meta-training phase, while the meta-test fine-tuning phase allows fast convergence on new tasks. Dong et al. [36] proposed a meta-information guided meta-learning framework that offers informative guidance for both initialization and adaptation using semantic concepts of classes.
In summary, few-shot relation classification methods based on instance-level representation learn the differences between instances from a global perspective while ignoring the semantic and syntactic differences within instances. Methods based on word-level representation consider the differences between content words across classes from a local perspective but ignore function words. However, in fine-grained few-shot inverse relation classification tasks, the syntactic-structure differences carried by function words play a decisive role, so capturing function words is both necessary and meaningful. In this study, we mainly address inverse relations and propose a hybrid function-words attention method that better models the subtle variations among inverse relations.

Graph neural network
Graph neural networks iteratively aggregate neighbor features to learn complex interactions between instances [37]. Satorras et al. [21] and Yang et al. [38] constructed a graph over the query instance and all instances in the support set, and used a graph neural network to update instance features. Some methods introduce a syntactic tree in the form of a graph neural network to capture structural differences between instances [39][40][41]. Specifically, Guo et al. [39] proposed an attention-guided graph convolutional network that converts a dependency tree into a weighted graph to distinguish the dependencies of nodes and edges for relation classification. On this basis, Sun et al. [40] proposed a learnable attention graph convolutional network that updates the node representations of the graph according to the syntactic structure. Tian et al. [41] applied an attention mechanism based on a graph convolutional network to the dependency tree to distinguish the importance of different word dependencies, thereby facilitating relation extraction. In addition, there are methods that modify the graph neural network structure. To reduce the over-smoothing propensity of deep GNNs, Wang et al. [42] used intra-layer neighborhood attention and inter-layer memory attention. Li et al. [43] appended two types of attention modules to a GCN to further improve feature representation by modeling semantic connections in the spatial and relational dimensions, respectively. Zhao et al. [44] proposed the MP-GCN model with graph pooling based on self-attention to access and choose significant nodes and to provide multiple representations without adding trainable parameters via the multi-head method. Yang et al. [45] proposed a robust method to learn domain-invariant graph representations by utilizing a GCN based on adversarial learning. Fang et al. [46] modeled multi-view image data with heterogeneous graphs to enrich intra-class and inter-class relationship information.
In short, existing methods pass messages by aggregating the information of neighbors [16], which works well when neighboring nodes belong to the same class, but struggles with relation classification when nodes belong to different classes. This work proposes an adaptive message passing mechanism that can learn not only intra-class common information but also inter-class difference information.

Conclusion
In this paper, we introduce FAEA+, a framework for handling few-shot inverse relations in knowledge graphs by enhancing the importance of related function words and employing an instance-aware prototype computing mechanism. Experiments demonstrate that FAEA+ achieves new state-of-the-art results on two FewRel datasets. Our future research will focus on creating a generalized function-words enhanced backbone network for a variety of NLP tasks, such as few-shot sentiment classification and dialogue intent classification.

Figure 1
Figure 1 The left figure shows the word-attention visualization of TD-Proto, where a darker unit indicates a higher value; it is obvious that the region of content words is darker than that of function words. The right figure is a 2-way-1-shot example of few-shot inverse relation classification, involving two relations, each with one instance. [·]_head and [·]_tail indicate the head and tail entities, respectively

Figure 2
Figure 2 The overall framework of FAEA+. The input of the instance encoder is an instance with its corresponding relation description. e_h and e_t are the head and tail entities, respectively, u_w is the memory unit, and x^i_k represents the representation of the k-th instance of the i-th relation. The modules marked by red light bulbs indicate the significant extensions over our conference paper

Figure 3
Figure 3 Adaptive message passing. S_i represents an instance consisting of a natural-language sentence, a head entity, and a tail entity. Nodes with the same color belong to the same relation, and nodes are updated by transferring inter-class differences and intra-class commonalities between instances. The color panel indicates the degree of commonality or difference

Figure 4
Figure 4 A 2-way-1-shot FSIRC example. In the top section, FAEA+ and TD-Proto are used to visualize the attention score of every word. The middle section shows the similarity between the query instance and the support instances. The lower section shows the attention scores of the words in the query instance. Support set: has part (part of this subject): "[Swedish] head canal enables ships to sail between Lake Vänern and its central part the [Dalsland] tail and southwestern Värmland lake districts."; part of (object of which the subject is a part): "Historically, [Bijar] head has been part of the [Garrus] tail administration unit." Query instance: "The Hakkoda Mountains, along with the [Oirase-Valley] head and Lake Towada are three parts of the [Towada-Hachimantai National Park] tail ."

Figure 5
Figure 5 A 2-way-1-shot FSIRC example visualizing the attention score of each word by FAEA+ and the previous FAEA. Darker units indicate higher values

Figure 6
Figure 6 The similarity of intra-class and inter-class of some classes, computed by dot product

Figure 7
Figure 7 t-SNE plots of instance embeddings obtained before message passing, after message passing without the threshold, after adaptive message passing, and after prototype enhancing. '•' indicates an instance in the dataset and ' ' indicates the prototype of the corresponding relation. The panel is only used to show the positions of instances in the t-SNE plot

Table 2
The description of some relations in FewRel1.0

Table 5
Accuracy (%) on the FewRel2.0 test set

Table 6
Accuracy (%) under different few-shot settings on FewRel1.0. 'Random' stands for the standard few-shot setting and 'Inverse' stands for evaluation with tasks that include inverse relations

Table 7
Ablation study on the FewRel1.0 validation set showing accuracy (%), where w/o (without) represents the ablation model without the corresponding modules