Context-aware Feature Attention Model for Coreference Resolution on Biomedical Texts

Background: Bio-entity coreference resolution is an important task for automatically gaining a complete understanding of biomedical texts. Previous neural network-based studies on this topic rely on the integration of domain-specific information. However, for identical mentions, such models tend to produce similar or even identical representations, which can be misleading and leads to wrongful predictions. Results: We propose a new context-aware Feature Attention model that distinguishes identical mentions effectively to better resolve coreference. The new model can represent identical mentions according to their different contexts by exploiting features adaptively. The proposed model substantially outperforms the state-of-the-art baselines on the BioNLP dataset with a 64.0% F1 score and further demonstrates superior performance on the differential representation and coreferential linking of identical mentions. Conclusion: The context-aware Feature Attention model exploits features adaptively and represents identical mentions according to their different contexts, which enables the system to obtain semantic information effectively and make more accurate predictions. Since this approach is still limited when context information is insufficient, we plan to exploit such information at a finer granularity with the help of external knowledge bases in coreference resolution.

Typical works on bio-entity coreference resolution use rule-based [4, 5, 6] and hybrid methods [7, 8], which rely on syntactic features and are limited to a specific corpus. Recently, neural network-based methods for automatically identifying coreference have received widespread attention. Some neural network-based systems integrate domain-specific information through pre-trained embeddings and static bio-related features [6, 9, 10]. However, for similar or identical mentions (a formal definition is given in the Task section), the above-mentioned methods often produce similar or even identical representations, which can be misleading and makes the coreference more puzzling. As shown in Figure 1, there are three identical mentions: "it". The "it" marked in red and the "it" marked in orange refer to two different concepts. As they are likely to receive a similar or even identical representation, a false coreferential link between them is often predicted.

Generally, the differential representation of identical mentions based on context is necessary, as the above problem accounts for a large proportion of the coreference dataset. For protein coreference, we collected statistics on the identical mentions of the BioNLP dataset. Table 1 shows the detailed statistics of identical mentions throughout the BioNLP dataset. We find that identical mentions account for more than half of the entire dataset, in both the training and the development sets, while among these identical mentions, less than 10% have a coreferential link. We further record the frequency ranges and the POS (part-of-speech) tags of these identical mentions. The results illustrate that mentions with a frequency greater than 100 account for a large proportion of these identical mentions.

Generally, the above work is summarized in Table 3. Our work is most closely related to [9], while we focus on the problem that identical mentions tend to receive similar or even identical representations, which further misleads the model into coreferential mistakes.

Task
In an end-to-end coreference resolution system, the input is a document D with T words, and the output is a set of mention clusters. Let N be the number of possible text spans in D; we consider all possible spans up to a predefined maximum width. START(i) and END(i) are the start and end indices of a span i in D, respectively. For each span i, the system needs to assign an antecedent a_i ∈ {ε, 1, ..., i−1}, either one of the preceding spans or a dummy antecedent ε. The dummy antecedent ε represents two cases: (1) the span i is not a mention, or (2) the span i is a mention but not coreferential with any previous span. Finally, all spans that are connected by a set of antecedent predictions are grouped into clusters.

Baseline Model
In this section, we briefly describe the baseline model [13], which we will later augment with the Feature Attention mechanism. The model encodes the document with a Bi-LSTM and then uses the attention mechanism [17] over the words in each span to learn a task-specific notion of headedness. The final representation g_i of span i is produced by:

g_i = [x*_START(i), x*_END(i), x̂_i, ϕ(i)]

where x* is the output of the Bi-LSTM, x̂_i is the head embedding encoded by the head attention mechanism, and ϕ(i) is the feature vector encoding the width of the span.
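As a minimal sketch of how this span representation can be assembled (assuming a PyTorch-style implementation; the class name, dimensions, and shapes are illustrative rather than taken from the original system):

```python
import torch
import torch.nn as nn

class SpanRepresentation(nn.Module):
    """Sketch of g_i = [x*_START(i), x*_END(i), x_hat_i, phi(i)] from the baseline [13]."""

    def __init__(self, hidden_dim: int, feat_dim: int, max_width: int):
        super().__init__()
        self.head_scorer = nn.Linear(hidden_dim, 1)           # scores each word as the span head
        self.width_embed = nn.Embedding(max_width, feat_dim)  # phi(i): span-width embedding

    def forward(self, x_star: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # x_star: (T, hidden_dim) Bi-LSTM outputs for the whole document
        span = x_star[start:end + 1]                          # (width, hidden_dim)
        alpha = torch.softmax(self.head_scorer(span), dim=0)  # head-attention weights
        x_hat = (alpha * span).sum(dim=0)                     # soft head embedding x_hat_i
        phi = self.width_embed(torch.tensor(end - start))     # width feature phi(i)
        return torch.cat([x_star[start], x_star[end], x_hat, phi], dim=-1)
```

The soft head x̂_i lets a long span be summarized by its most informative word rather than only by its boundary tokens.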

Scoring
The scoring functions, the mention score s_m and the antecedent score s_a, take the span representations as input:

s_m(i) = w_m · FFNN_m(g_i)
s_a(i, j) = w_a · FFNN_a([g_i, g_j, g_i • g_j, ϕ(i, j)])

where w_m and w_a are learned weights, • denotes element-wise multiplication, FFNN is a feed-forward neural network, and ϕ(i, j) is the pair-wise feature vector encoding the distance between the two spans.
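Under the same assumptions as above, the two scorers can be sketched as follows (the FFNN depth and hidden size are illustrative):

```python
import torch
import torch.nn as nn

def ffnn(in_dim: int, hidden: int = 150) -> nn.Sequential:
    # Small feed-forward scorer; the final linear layer plays the role of w_m / w_a.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

class MentionAndAntecedentScorers(nn.Module):
    def __init__(self, span_dim: int, pair_feat_dim: int):
        super().__init__()
        self.ffnn_m = ffnn(span_dim)                      # for s_m(i)
        self.ffnn_a = ffnn(3 * span_dim + pair_feat_dim)  # for s_a(i, j)

    def mention_score(self, g_i: torch.Tensor) -> torch.Tensor:
        return self.ffnn_m(g_i).squeeze(-1)

    def antecedent_score(self, g_i, g_j, phi_ij) -> torch.Tensor:
        # [g_i, g_j, g_i * g_j, phi(i, j)]: the element-wise product encodes similarity.
        pair = torch.cat([g_i, g_j, g_i * g_j, phi_ij], dim=-1)
        return self.ffnn_a(pair).squeeze(-1)
```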
Domain-specific Information Integration
Similar to [9], to integrate domain-specific information, we make the following variation: each word representation x_u is concatenated (⊕ denotes the concatenation operation) with f(x_u), the domain-specific feature embedding of word u.

Feature Attention
To use features adaptively, we apply the Feature Attention mechanism to the span features: span width, grammatical number, and MetaMap entity tags.
As shown in Figure 3, a new context-aware feature vector ϕ* is generated by the Feature Attention method, and the new span feature is applied to update the span representation:

ϕ*(i) = FA(x*_i, ϕ(i))
g_i = [x*_START(i), x*_END(i), x̂_i, ϕ*(i)]

where x*_i is the context vector generated by the Bi-LSTM for span i and FA is the Feature Attention mechanism.
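The exact parameterization of FA is not fully specified here, so the following is only one plausible reading: attention weights over the three feature embeddings are computed from the span's context vector, so that the same surface mention can receive different feature vectors in different contexts. All names and shapes below are assumptions:

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Plausible sketch of FA: context-conditioned attention over span features."""

    def __init__(self, context_dim: int, feat_dim: int):
        super().__init__()
        self.query = nn.Linear(context_dim, feat_dim)    # project context x*_i to a query

    def forward(self, x_star_i: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # x_star_i: (context_dim,) context vector of span i from the Bi-LSTM
        # feats:    (3, feat_dim) embeddings of width / grammatical number / MetaMap tag
        q = self.query(x_star_i)                         # (feat_dim,)
        scores = feats @ q                               # (3,) relevance of each feature
        alpha = torch.softmax(scores, dim=0)             # attention over the features
        return (alpha.unsqueeze(-1) * feats).sum(dim=0)  # context-aware phi*(i)
```

Because ϕ*(i) depends on x*_i, two occurrences of the same string in different sentences no longer share one feature vector, which is exactly what the identical-mention problem requires.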

Coreference Score
The final coreference score s(i, j) of spans i and j indicates (1) whether span i is a mention, (2) whether span j is a mention, and (3) whether j is an antecedent of i:

s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)

where s_m(i) is the mention score, s_a(i, j) is the antecedent score, s_c(i, j) = g_i^⊤ W_c g_j is a rough sketch of likely antecedents, and W_c is a learned weight matrix.
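A compact sketch of how these factors combine; the bilinear form of s_c follows the description of W_c as a learned weight matrix and should be read as an assumption rather than the paper's exact definition:

```python
import torch
import torch.nn as nn

class CoarseScore(nn.Module):
    """s_c(i, j) = g_i^T W_c g_j: a cheap bilinear filter over candidate antecedents."""

    def __init__(self, span_dim: int):
        super().__init__()
        self.w_c = nn.Linear(span_dim, span_dim, bias=False)  # the learned matrix W_c

    def forward(self, g_i: torch.Tensor, g_j: torch.Tensor) -> torch.Tensor:
        return torch.dot(g_i, self.w_c(g_j))

def coreference_score(s_m_i, s_m_j, s_c_ij, s_a_ij):
    # s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j);
    # the dummy antecedent epsilon is conventionally assigned a fixed score of 0.
    return s_m_i + s_m_j + s_c_ij + s_a_ij
```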

Experiments
The experiments are performed on the BioNLP Protein coreference dataset [14]. For evaluation, we employ the scorer [2] provided by the organizers to make fair comparisons with previous work. We use the baselines below:
[7]: The system proposes a hybrid approach that combines rule-based and learning-based methods.
[4]: The system develops a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems.

[5]: The system designs a general modular framework, which is based on the smorgasbord architecture, supports multiple coreference types, and allows fine-grained specification of resolution strategies.
[6]: The system presents (1) a rule-based method, which creates a set of syntactic rules or semantic constraints,
and (2) a neural network-based method using an LSTM network.

We follow the same hyperparameters as in [13] and use a window size of 10 for the LSTM inputs. SFA denotes our proposed model, which applies the Feature Attention mechanism to encode the span feature ϕ(i).
Table 4 shows the performance comparison of our model with the baselines. Figure 4 shows the detailed error statistics on the test dataset compared with some baselines ([6]-rule and [6]-neural). First, compared with [6], BioNeu performs much better in the reduction of spurious links, which indicates that the domain-specific information pre-trained on large-scale language models helps the model to predict more precisely. Second, compared with the neural network-based baselines [6]-neural and BioNeu, the proposed SFA model illustrates that the introduction of the Feature Attention mechanism further improves the model, greatly increasing the number of correct predictions and reducing both types of error. This shows that distinguishing identical or similar mentions based on context helps the system learn more precisely and make more accurate predictions.

Identical Mention Linking Evaluation
Figure 5 and Figure 6 display the performance of the BioNeu model and the SFA model on coreferential and non-coreferential identical mentions, respectively, with different frequencies (the number of times the identical mention appears in the document). As there are no coreferential identical mentions with a frequency greater than 3, we only show frequencies of 2 and 3. First, the performance of the SFA model on coreferential identical mentions is better than that of the BioNeu model. This indicates that distinguishing identical mentions helps, to a certain extent, in predicting the links between them. Second, for non-coreferential identical mentions, the SFA model performs better than the BioNeu model at all frequencies, and its superiority is significantly greater when the frequency of the identical mention is greater than 3. This illustrates that the Feature Attention mechanism does help to distinguish identical mentions based on context, precisely for the higher-frequency mentions that are the most difficult to predict, which provides further help in mention linking.
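For concreteness, the document-level frequency used in this evaluation can be tallied with a helper like the one below (hypothetical; the paper does not show its evaluation scripts):

```python
from collections import Counter

def identical_mention_frequencies(mentions: list[str]) -> dict[str, int]:
    """Count how often each mention surface form occurs within one document.

    A mention is treated as "identical" when its surface form occurs at least
    twice; the returned counts correspond to the frequency buckets of
    Figures 5 and 6.
    """
    counts = Counter(m.lower() for m in mentions)
    return {form: n for form, n in counts.items() if n >= 2}
```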

Mention Detection Subtask
To further understand the utility of the Feature Attention mechanism for the mention detection subtask, we list the mention detection performance in Table 5 and compare the proposed model with [9]. Returning to the example of Figure 1, we can find that the first two "it" mentions are coreferential, while the third refers to a different concept.

Figure 4: Detailed error analysis compared with some baselines.
Figure 5: The performance of the two models on coreferential identical mentions with different frequencies within the document.
Figure 6: The performance of the two models on non-coreferential identical mentions with different frequencies within the document.