3.2. Semantic Relation Classification Model
Performance Comparison between Pre-trained Language Models (Table 2)
While applying an effective input masking methodology and downstream layers to the model is crucial, it is equally important to select, as the base model, a pre-trained language model that best fits our task and data. Therefore, we compared the performance of various existing pre-trained language models on our dataset, using the hyperparameters specified in each original paper unless otherwise noted and setting the input masking method to the two masked sentence input for all models.
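In practice, this comparison varies only the pre-trained checkpoint beneath an otherwise identical classification setup. The snippet below is a minimal sketch of such a setup, assuming the HuggingFace transformers library and its publicly available checkpoint names; the exact checkpoints and loading code used in our experiments are not specified here.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public checkpoint names on the HuggingFace hub (illustrative; the paper cites
# the models, not specific checkpoints).
CANDIDATE_CHECKPOINTS = [
    "bert-base-uncased",                  # BERT
    "albert-base-v2",                     # ALBERT
    "roberta-base",                       # RoBERTa
    "SpanBERT/spanbert-base-cased",       # SpanBERT
    "xlnet-base-cased",                   # XLNet
    "allenai/scibert_scivocab_uncased",   # SciBERT
]

def build_model(checkpoint: str, num_relation_classes: int = 8):
    """Load a tokenizer and a sequence-classification head on top of the given encoder."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_relation_classes
    )
    return tokenizer, model
```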
<Table 2> Performance comparison of pre-trained language models

Model | F1
BERT [19] | 78.1
ALBERT [34] | 80.4
RoBERTa [35] | 81.3
SpanBERT [20] | 80.5
XLNet [37] | 80.2
SciBERT [33] | 81.7
The comparison shows that SciBERT, which is pre-trained on scientific publication datasets with the BERT architecture, performed better than ALBERT, RoBERTa, SpanBERT, and the autoregressive transformer XLNet, all of which modify BERT's model layers and pre-training methodology. This confirms that when building a downstream model on a pre-trained language model, the data used for pre-training should come from the same domain as the data used for fine-tuning. SciBERT, which performed best on our PubMed dataset, is therefore used as the base model in the later experiments.
Performance Comparison between Methods of Masking Input (Table 3)
For this experiment, we used SciBERT, the language model with the best performance in the comparative experiment above, as the base model and the CLS token layer as the downstream output layer. We trained the model for 10 epochs with a batch size of 8, a learning rate of 5e-5 with 1,100 warm-up steps, and a weight decay of 0.01 to prevent overfitting. We then evaluated each method of masking the input sentences to determine which performed best.
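For reference, the training configuration above maps directly onto standard fine-tuning settings. The following is a minimal sketch assuming the HuggingFace Trainer API rather than our actual training code; only the listed hyperparameters come from the text, and the output directory is hypothetical.

```python
from transformers import TrainingArguments

# Hyperparameters reported in the text; all other settings are library defaults.
training_args = TrainingArguments(
    output_dir="relation-classifier",   # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    warmup_steps=1100,
    weight_decay=0.01,
)
```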
We specifically compared the performance of the following methods: the masked input; the two masked sentence input; the two-sentence entity token input, which replaces entities with additional tokens other than [MASK]; and the entity marker–entity start method, which marks the span of each entity.
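The two masked sentence input can be illustrated as follows: the sentence is duplicated, the first entity is replaced with [MASK] in one copy and the second entity in the other, and the two copies are fed to the encoder as a sentence pair. The sketch below reflects our description of the method; the helper function, the example spans, and the tokenizer call are illustrative, not the actual preprocessing code.

```python
def two_masked_sentence_input(sentence, entity1, entity2, mask_token="[MASK]"):
    """Return two copies of the sentence: entity1 masked in the first copy,
    entity2 masked in the second (first occurrence only)."""
    return (sentence.replace(entity1, mask_token, 1),
            sentence.replace(entity2, mask_token, 1))

# Illustrative example, adapted from a sentence discussed later in this section:
s = "Hepatic knockdown of HFREP1 improved insulin resistance in ob/ob mice."
sent_a, sent_b = two_masked_sentence_input(s, "HFREP1", "insulin resistance")
# sent_a: "Hepatic knockdown of [MASK] improved insulin resistance in ob/ob mice."
# sent_b: "Hepatic knockdown of HFREP1 improved [MASK] in ob/ob mice."

# The two copies are then encoded as a standard sentence pair, e.g. with a HuggingFace tokenizer:
# encoding = tokenizer(sent_a, sent_b, truncation=True, return_tensors="pt")
```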
<Table 3> Performance comparison of masking input methods

Method | F1
Entity Marker–Entity Start | 79.9
Masked Input | 79.6
Two Masked Sentence Input | 81.4
Two-Sentence Entity Token Input | 80.5
The comparison of input methods showed that the two-sentence entity token input and the two masked sentence input, both of which use two combined sentences as input, performed better than the entity marker–entity start method and the original masked input method, which follows the way BERT is pre-trained. The two masked sentence input, which replaces entities with [MASK] tokens, also performed better than the variant that replaces entities with additional [E1] and [E2] tokens.
These findings show that leveraging [MASK] tokens is preferable to introducing additional tokens such as [E1] and [E2] to replace entities, and they confirm that maintaining consistency between the pre-training and fine-tuning stages can lead to improved performance. [MASK] tokens, which previously appeared only in the pre-training phase, can thus be appropriately utilized in downstream tasks. In addition, the two masked sentence input methodology performed better than the methodologies that enter only a single sentence as input; this suggests that, in such a relation classification task, using two copies of the sentence, each masking one of the two entities with [MASK], can lead to improved performance.
Performance Comparison of Downstream Layers (Table 4)
Additional downstream layer construction is essential for training specific NLP tasks with pre-trained language models. Relation extraction models require a classification output layer for relation prediction on top of the transformer encoder output. For the performance comparison between different layer structures, we apply the same two masked sentence input methodology to the same SciBERT model with the same hyperparameters as in the previous experiment. In this experiment, we compare the performance of the CLS token layer, which uses the output at the [CLS] token position; the two-mask token layer, which uses the outputs at the two [MASK] token positions; and the three-token layer, which uses the outputs at both the [CLS] and the [MASK] token positions.
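The three downstream layers differ only in which encoder output positions are fed to the classifier. The following PyTorch sketch illustrates the three variants under our assumption of a single linear layer over the concatenated vectors; the class and argument names are illustrative rather than the actual implementation.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Classification head over selected encoder output positions.
    mode "cls": use the [CLS] output only; "two_mask": concatenate the two
    [MASK] outputs; "three": concatenate [CLS] and both [MASK] outputs."""
    def __init__(self, hidden_size: int, num_classes: int, mode: str = "two_mask"):
        super().__init__()
        n = {"cls": 1, "two_mask": 2, "three": 3}[mode]
        self.mode = mode
        self.classifier = nn.Linear(n * hidden_size, num_classes)

    def forward(self, hidden_states, mask_positions):
        # hidden_states: (batch, seq_len, hidden)
        # mask_positions: (batch, 2) indices of the two [MASK] tokens
        cls_vec = hidden_states[:, 0]                      # [CLS] sits at position 0
        idx = mask_positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        mask_vecs = hidden_states.gather(1, idx)           # (batch, 2, hidden)
        if self.mode == "cls":
            feats = cls_vec
        elif self.mode == "two_mask":
            feats = mask_vecs.flatten(1)
        else:  # "three"
            feats = torch.cat([cls_vec, mask_vecs.flatten(1)], dim=1)
        return self.classifier(feats)
```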
<Table 4> Performance comparison of downstream layers

Layer | F1
CLS token layer | 81.4
Two-mask token layer | 81.7
Three-token layer | 81.5
The experiments showed no significant differences, although the two-mask token layer performed slightly better than the other two configurations. The small gap between downstream layers likely stems from the structure of the transformer encoder, in which multiple layers are stacked and every token position attends to all others in both directions. Because the representations of all tokens in the input sequence are repeatedly mixed as they pass through the stacked transformer encoders, a well-trained model can predict the relationship between entities regardless of the number or location of the output vectors used in the downstream layer.
Final Comparison of Model Performance (Table 5)
The experiments comparing different base models, input methodologies, and output layer structures on our dataset showed that using SciBERT as the base model with the two masked sentence input methodology and the two-mask token layer performs best, with an F1 score of 81.7. Finally, we compared this model with existing models presented in related work using our dataset. In addition to BERT-based models that have shown state-of-the-art performance in relation extraction tasks, such as [36], we also included models based on other deep learning algorithms, such as a CNN [38] and the entity attention Bi-LSTM [22], a semantic relation classification model that uses bidirectional LSTM networks with entity-aware attention and latent entity typing.
<Table 5> Overall performance comparison of the models

Model | F1
Word2vec + CNN [38] | 70.8
Entity Attention Bi-LSTM [22] | 78.7
Matching the Blanks [36] | 79.9
Our Model (Two Masked Sentences) | 81.7
The experimental results confirm that the final proposed SciBERT-based model with the two masked sentence input methodology and two-mask token layer performed best.
To determine how well our model predicts each class and to examine situations where it has limitations, we further analyzed the per-class performance for each of the eight relation types (Table 6).
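Per-class precision, recall, F1 score, and support of the kind reported in Table 6 can be obtained with scikit-learn's classification_report; the snippet below is a minimal illustration with hypothetical labels rather than our actual evaluation code.

```python
from sklearn.metrics import classification_report

# y_true and y_pred hold the gold and predicted relation labels for the test set
# (tiny hypothetical example shown).
y_true = ["Positive Cause", "Undirected Link", "Negative Decrease"]
y_pred = ["Positive Cause", "Undirected Link", "Negative Increase"]

# Prints precision, recall, F1-score, and support for each relation class.
print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```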
<Table 6> Per-class performance

Class | Precision | Recall | F1-score | Support
Directed Link | 0.928 | 0.850 | 0.888 | 167
Negative Cause | 0.845 | 0.893 | 0.868 | 140
Negative Decrease | 0.817 | 0.918 | 0.865 | 73
Negative Increase | 0.788 | 0.820 | 0.804 | 50
Positive Cause | 0.882 | 0.922 | 0.901 | 218
Positive Decrease | 0.720 | 0.667 | 0.692 | 27
Positive Increase | 0.738 | 0.818 | 0.776 | 55
Undirected Link | 0.900 | 0.839 | 0.868 | 279
Support: the number of instances of each class in the test data, which comprise 20% of the full dataset sampled in proportion to the class distribution.
In general, per-class performance depended on the number of data instances in each class. The negative increase, positive decrease, and positive increase classes, which had the fewest instances (50, 27, and 55, respectively), obtained the lowest F1 scores among the relation types. Apart from this issue, the model showed generally even scores across the classes.
We closely examined the data points where the model prediction differed from the annotated target value to objectively assess the limitations of our model or corpus and to obtain insights for future improvement. From this analysis, we identified two interesting patterns among the misclassified cases.
First, our model revealed a weakness when the verb between the entities, such as “improve,” “exacerbate,” or “aggravate,” did not directly convey an increase/decrease or a cause-and-effect relationship, making it difficult to infer the relation accurately from the context words surrounding the entities. In such cases, correctly determining the direction of change in the second entity requires knowing whether that entity itself carries a positive or negative meaning, as in the sentence below:
Moreover, hepatic knockdown of HFREP1 improved insulin resistance in both mice fed a high-fat diet and ob/ob mice.
The target relation type associating “HFREP1” with “insulin resistance” belongs to the negative decrease class, but the model incorrectly predicted the negative increase class. To classify their relationship accurately, the model needs to know whether insulin resistance itself has a positive or negative meaning. This type of error could be alleviated by a language model pre-trained on a richer biomedical literature, providing more comprehensive coverage of the semantics of biomedical vocabulary.
Second, we found several errors arising from a conflict between the annotators, who considered the findings of the literature as a whole, and the model, whose predictions rely only on the contextual words within each sentence when classifying the relationship between entities. An example follows:
Our data suggest that titanium particles may cause less leukocyte activation and inflammatory tissue responses than other particulate biomaterials used in total joint arthroplasty.
For this sentence, the annotator classified the relationship between titanium particles and inflammatory tissue responses as the negative cause class, whereas the model predicted the positive cause class. The annotator weighed the relationship between these two entities against the other entities in the sentence and focused on the intent of the sentence. However, if we consider only the directional association between the two entities of interest, the positive cause class, which the model predicted, could also be assigned.
To avoid this controversial gray area, data requiring abstract and complex consideration of context were excluded as much as possible at the corpus construction stage; consequently, few such cases were found. However, we paid particular attention to this example because it provides insight into how the model behaves in these special circumstances and into the directions future research should take to overcome this limitation. In the example sentence, the model prediction cannot be regarded as wrong, but the main finding conveyed by the sentence is that titanium particles cause “less” inflammatory reaction, not simply that they cause it. This case therefore demonstrates that relation types that better reflect the intent of the text and benefit researchers must capture not only the causality between entities and its direction but also the relative extent to which a particular entity is increased or decreased. Achieving this requires pushing beyond the limits of the current relation classification between entity pairs and addressing the subtle and complicated interactions among the multiple bio-entities appearing in a sentence.