SSACBKM: an integration model for Biomedical relationship extraction

Background: The cause of disease is one of the main subjects of biomedical research, and extracting effective relational information from large volumes of biomedical text has important applications for that research. At present, most biomedical relation extraction work relies on manual screening or on rule-based or feature-based pipeline models to obtain features. These methods require a great deal of time to design specific rules or features for specific tasks, and some non-compliant features still cannot be filtered out. Results: The model achieves micro-F1 scores of 0.802 and 0.876 on the ChemProt and DDI data sets, respectively. The resources used in this project can be found at https://github.com/HunterHeidy/DDICPI- . Conclusions: The experiments show that good results can be obtained without BERT by learning from BERT's core ideas.

… ChemProt based on the pre-trained BioBERT [1]. In most extraction tasks, the data are highly unbalanced. Data imbalance causes data sparseness problems: because there are not enough examples, the classifier's ability to describe the sparse samples is insufficient, and it is difficult to classify them effectively. When the sample proportions in a task are severely imbalanced, the model can use data augmentation methods to alleviate this phenomenon. One mitigation method is random sampling during preprocessing. Its biggest advantage is simplicity, but its disadvantages are also obvious: after upsampling, some samples appear repeatedly in the data set and the model partially overfits; after downsampling, the training set loses part of the data and the model only learns part of its features [28][29][30][31]. Some models instead modify the learning algorithm, giving classes with fewer examples more weight, or use ensemble learning [28]. However, this adds parameters to the model and reduces its generalization ability.
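As an illustration of the random-sampling mitigation described above, the sketch below upsamples minority classes by duplicating their examples until every class matches the size of the majority class. It is a minimal sketch, not the paper's actual preprocessing; the function name and the list-of-(sentence, label) input format are assumptions.

```python
import random
from collections import defaultdict

def random_oversample(examples, seed=42):
    """Duplicate minority-class examples until every class is as large as the majority class.

    `examples` is a list of (sentence, label) pairs; this format is an assumption for the sketch.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sent, label in examples:
        by_label[label].append((sent, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # sample with replacement to fill the gap; the duplicates are the known drawback (overfitting risk)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced
```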

Overview of the SSCBKM model

… weights. Although word vectors are also generated in the middle, the word vectors are a by-product. Word2vec is fast to train and effective, and it occupies much less memory than GloVe, so for our experimental equipment word2vec is more suitable. In the DDI data set there are 5748 distinct words in total, of which 401 are not covered by the pre-trained vectors. In the ChemProt data set there are 10848 distinct words in total, of which 1325 are not covered. Since there are relatively few out-of-vocabulary words, the model ignores them.
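For reference, out-of-vocabulary counts like the 401 and 1325 figures above can be computed by checking each corpus word against the pre-trained embedding vocabulary. The sketch below uses gensim; using gensim at all, the file name, and the `corpus_vocab` variable are assumptions for illustration, not details taken from the paper.

```python
from gensim.models import KeyedVectors

# Hypothetical path to the binary word2vec file (wikipedia-pubmed-and-PMC-w2v);
# the exact file name on disk is an assumption.
word_vectors = KeyedVectors.load_word2vec_format("wikipedia-pubmed-and-PMC-w2v.bin", binary=True)

# corpus_vocab: the set of distinct tokens in the DDI or ChemProt corpus (assumed to exist).
corpus_vocab = {"aspirin", "warfarin", "interacts"}  # placeholder tokens

oov = [w for w in corpus_vocab if w not in word_vectors]
print(f"{len(oov)} of {len(corpus_vocab)} words are out of vocabulary")
```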

Input layer

We introduce a mask mechanism to mark which positions in the input hold valid data. In data preprocessing, the usual practice is to use word embeddings to convert tokens into a numerical form that the computer can process. Because sentence lengths must be equal during text processing, the text is usually padded to a fixed maximum length, and "0" is used to fill sentences of insufficient length. For example, for a sentence S, suppose the maximum text length is 5 and we pad with '0', so that E_s = [2, 0, 30, 0, 0]. Since the model cannot tell whether a "0" is a valid value, it computes the average A_Es = (2 + 0 + 30 + 0 + 0)/5 = 6.4, whereas the real average over the valid values is (2 + 30)/2 = 16. The mask marks the padded positions so that they do not distort such computations.

… In this way, the parts that are most helpful to attention are kept and the others are masked out. Then, assuming that a larger score indicates a higher correlation, the model evaluates each element by its score. In the top-k option, the k highest scores are selected explicitly. This differs from dropout, where positions are dropped randomly. Such an explicit choice not only ensures that the important components are preserved, but also simplifies the model, because k is usually a small value such as 5 or 10. The step after top-k selection is normalization:

A = softmax(M(P, k)),

where A is the normalized score. Since the masking function M(P, k) assigns scores smaller than the k-th largest to negative infinity, their normalized scores, that is, their probabilities, are approximately zero.
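To make the top-k normalization concrete, the following sketch applies the masking function M(P, k) and a softmax to a vector of attention scores. It is a minimal NumPy illustration of the idea described above, not the paper's implementation; the function and variable names are assumptions.

```python
import numpy as np

def topk_sparse_softmax(scores, k):
    """Keep the k largest scores, set the rest to -inf (the masking function M(P, k)),
    then normalize with a softmax so the masked positions get probability ~0."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    threshold = np.sort(scores)[-k]          # the k-th largest score
    masked = np.where(scores >= threshold, scores, -np.inf)
    exp = np.exp(masked - masked.max())      # numerically stable softmax
    return exp / exp.sum()

# Example: only the 3 largest scores receive non-zero attention weight.
print(topk_sparse_softmax([0.1, 2.0, -1.3, 0.7, 3.5, 0.2], k=3))
```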

In addition, this sparse attention can be extended to contextual attention. Similar to but different from the self-attention mechanism, Q is no longer a linear transformation of the source context …

According to the input text feature space S_1 = w_1, w_2, the probabilities of S_2 = v_1, v_2 are derived respectively.

In the output stage, the state that can be output is controlled by z_o [29].
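For context, the sentence above refers to the role of the LSTM output gate. The standard textbook formulation is shown below; the symbols are the common ones and may differ from the z_o notation used in [29].

```latex
o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)
% the output gate o_t (the role of z_o above) controls which part of the cell state is exposed
```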

To deal with the MASK, the BiLSTM layer is rewritten as a class that can accept a mask. That …
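As one way to realize a mask-aware BiLSTM, the sketch below uses the Keras masking convention, where an Embedding layer with mask_zero=True produces a mask that the Bidirectional LSTM consumes, so padded "0" positions do not affect the hidden states. This is a minimal sketch under the assumption of a Keras-style stack; the hyperparameter names (max_len, vocab_size, emb_dim, hidden_units) are placeholders, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_len, vocab_size, emb_dim, hidden_units = 150, 10000, 200, 128  # placeholder sizes

tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
# mask_zero=True marks the padded "0" positions; downstream layers receive this mask
embedded = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(tokens)
# Bidirectional LSTM supports masking, so padded timesteps are skipped when encoding
encoded = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(embedded)
model = tf.keras.Model(tokens, encoded)
```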

The model experiments in this article are based on the wikipedia-pubmed-and-PMC-w2v vocabulary.

The way to alleviate the data imbalance is to replace words with synonyms, to add or delete words, and to insert …

This is done n times.

Random deletion (RD): For each word in the sentence, delete it randomly with probability p.
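As an illustration of the random deletion (RD) operation, the sketch below deletes each token independently with probability p. It is a minimal sketch of the operation as described; the safeguard of keeping at least one token is an assumption, not something stated in the paper.

```python
import random

def random_deletion(tokens, p=0.1, seed=None):
    """Randomly delete each token with probability p (the RD augmentation operation)."""
    rng = random.Random(seed)
    kept = [tok for tok in tokens if rng.random() > p]
    # assumption: keep at least one token so the augmented sentence is never empty
    return kept if kept else [rng.choice(tokens)]

print(random_deletion("aspirin increases the anticoagulant effect of warfarin".split(), p=0.2, seed=7))
```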

Depending on the data set, the four augmentation algorithms are given different priorities. For example, in the ChemProt data set, the ratio of CPR:9 categories to CPR:4 categories is approximately 3:1.