Bridging Heterogeneous Mutation Data to Enhance Disease-Gene Discovery

Background: Bridging heterogeneous mutation data fills the gap between different data categories and propels the discovery of disease-related genes. Genome-wide association studies (GWAS) infer significant mutation associations that link genotype and phenotype, but they are under-powered for pinpointing causal genes because of high false positive and false negative rates. Meanwhile, mutation events widely reported in the literature reveal typical functional biological processes, including mutation types such as gain-of-function and loss-of-function.

Methods: To bring together the heterogeneous mutation data, we propose a pipeline, "Gene-Disease Association prediction ...

Conclusion: Our model is capable of enhancing GWAS-based gene association discovery by combining it with text mining results. The positive result indicates that bridging heterogeneous mutation data contributes to the discovery of novel disease-related genes.

with the classification label encoded vector, and then a multi-label classifier is used for mutation type classification. Here, we train our model on the AGAC training set and test it on the development set. The AGAC corpus is designed for extracting gene-mutation type-disease triples from PubMed abstracts.

To select SNP-related documents in the AD case study, we first use the search criteria "Alzheimer disease" [MeSH Terms] OR Alzheimer's disease [Text Word] to download 137,473 abstracts from the PubMed database. Then regular expressions and the SNP recognition tool SETH are employed to retain the mutation-related texts. SETH is an extraction tool for recognizing SNPs and other short sequence variations. Finally, we obtain 9,430 mutation-related abstracts.

The retrieved mutation types and part of the sentence evidence are presented in Figure 3. For instance, the gene APP is predicted to be related to a GOF mutation type, and the sentence evidence comes from the PubMed abstract with ID 16685645. The sentence, "...Promoter mutations that increase amyloid precursor-protein expression are associated with Alzheimer disease...", clearly supports the GOF prediction. The full results are available in Supplementary file S2 and in the online data repository (https://hzaubionlp.com/agac-on-alzheimers-disease/).
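The filtering step can be illustrated with a minimal sketch. It is not SETH and does not reproduce its grammar: the three patterns below (dbSNP-style rs identifiers, single-letter protein substitutions such as A673V, and a few mutation keywords) and the function names are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns for illustration; SETH's actual grammar is far richer.
RSID = re.compile(r"\brs\d+\b", re.IGNORECASE)            # e.g. rs12345
PROTEIN_SUB = re.compile(                                  # e.g. A673V
    r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b"
)
KEYWORD = re.compile(r"\b(mutation|variant|polymorphism)s?\b", re.IGNORECASE)


def mentions_mutation(abstract: str) -> bool:
    """Return True if any of the illustrative patterns occurs in the text."""
    return any(p.search(abstract) for p in (RSID, PROTEIN_SUB, KEYWORD))


def filter_abstracts(abstracts):
    """Keep only the abstracts that appear to mention a mutation."""
    return [a for a in abstracts if mentions_mutation(a)]
```

A dedicated recognizer is still needed in practice, because simple patterns like these miss many nomenclature forms and can match unrelated tokens.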

"Synchronization Filter" module in GDAMDB
The "Synchronization Filter" module uses a strategy to obtain the largest gene set with the most significant literature support. It is designed to maximize the probability that most genes in the gene set maintain not only literature significance but also GWAS significance. The idea of integrating the two kinds of significance is analogous to "signal synchronization", hence the name "Synchronization Filter".

Formally, we consider the set of genes for which a mutation type has been retrieved by the "Mutation Type Retrieval" module. Since the p-value of every such gene is traceable from the GWAS summary data, we order all genes that have a retrieved mutation type by p-value, in descending order. Generally, the topmost genes have a greater chance of having mutation type information reported in the literature. However, in probabilistic terms, not every gene with a retrieved mutation type has greater significance in GWAS. Therefore, from all genes with a predicted value, we take the top genes according to that value and compare the GWAS significance of this gene set with that of a random set of the same size. The hypothesis test introduced for this case is the Wilcoxon test, whose null hypothesis H0 is that the top gene set shows no greater GWAS significance than a random set of the same size.

The number of top genes is increased gradually from 1 to the maximum. With each increment, the size of the gene set increases, while the p-value of the Wilcoxon test quantifies the advantage of the gene set over the whole. In most simulation tests, the plot of −log p against the number of top genes shows a bell shape. This plot suggests a trade-off between the size of the gene set and its overall significance over the whole. In the end, the peak value of −log p in the Wilcoxon test determines the proper number of top genes, which forms the synchronization filter.
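The sweep over gene-set sizes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GDAMDB implementation: genes are ranked by a literature score, the top-ranked set is compared with the remaining genes rather than with random sets, and the Wilcoxon rank-sum p-value is computed with a one-sided normal approximation without tie correction. The function names and the toy data are hypothetical.

```python
import math


def rank_sum_pvalue(sample, rest):
    """One-sided Wilcoxon rank-sum p-value (normal approximation) for the
    alternative that values in `sample` tend to be smaller than in `rest`."""
    n1, n2 = len(sample), len(rest)
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in rest])
    rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)


def synchronization_cutoff(genes):
    """genes: list of (literature_score, gwas_p) pairs.
    Returns the set size k with the largest -log10 p, and all scores."""
    ranked = sorted(genes, key=lambda g: g[0], reverse=True)
    pvals = [p for _, p in ranked]
    scores = {}
    for k in range(1, len(pvals)):  # keep at least one gene outside the set
        p = rank_sum_pvalue(pvals[:k], pvals[k:])
        scores[k] = -math.log10(max(p, 1e-300))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (made up): literature scores 10..1 paired with GWAS p-values.
toy = list(zip(range(10, 0, -1),
               [1e-8, 1e-7, 1e-6, 0.5, 0.6, 0.7, 0.8, 0.9, 0.4, 0.3]))
best_k, scores = synchronization_cutoff(toy)
```

In this toy example the score rises while strongly GWAS-significant genes are being added and falls once weaker genes dilute the set, which is the trade-off the bell-shaped curve expresses.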

The following mathematical setup defines the notation and symbols. For each disease and each gene, a vector in {0,1}^3 encodes the associated mutation type of the gene for the disease, i.e., LOF/GOF/NA captured from the literature, while a value in (0,1) refers to the p-value of the mapped mutation association of the gene for the disease in GWAS. Both are regarded as observations in the graphical model and are marked as dark circles in Figure 1.

The two kinds of mutation data, mutation association and mutation type, share commonalities but have different data characteristics, as shown in Figure 2 (a).

First, the two kinds of mutation data come from different resources. The mutation association is accessible from GWAS summary statistics, while the mutation type is available from literature mining. Second, both support the investigation of the gene-disease link, but the link is represented by a p-value in mutation association and by the confidence value of LOF or GOF from the text mining module in mutation type. Third, the evidence for the link rests on different grounds. Mutation association data come from exacting and reliable experiments, and GWAS is widely accepted as a powerful method to investigate the association between genes and disease. The mutation type is only retrieved when a published study reports that the gene is associated with the disease and also describes the mechanism linking the gene and the disease. Fourth, both have their weaknesses. For mutation association, the high false positive and false negative rates in GWAS indicate that not all vital SNPs pass multiple testing and that not all genes that pass are truly important for the disease. In addition, the lack of consideration of the biological mechanism between genes and phenotype makes GWAS insufficient to pinpoint causal variations. Besides, it is difficult to conduct a GWAS on a large case/control population. For mutation type, since it comes from reported literature, it represents only part of the whole knowledge after text mining.

Hence, considering the weaknesses and advantages of these two heterogeneous mutation data, we designed a model that bridges mutation association and mutation type and achieves data fusion. Generally, a disease causal gene is more likely to be identified by GWAS, and is also more likely to be discovered by other research and described in the literature. Therefore, bridging mutation association and mutation type integrates the mutation data in a complementary way. In a simplified situation, if a significant SNP association failed to pass or barely passed the threshold in GWAS, the mutation type of the gene helps to recover the association in the manner of data fusion.

(Figure 2: ... "Bridging" model. c. Pipeline of the GDAMDB model for gene-disease association prediction.)

Generative model bridges mutation association and mutation type
We designed a generative model that introduces a switch variable to bridge the mutation association and mutation type data. The switch variable considers both the significance of the mutation association mapped to the gene and the reported mutation type associated with the gene. In this way, more reliable disease-related genes are predicted through the integration method.

As shown in Figure 2 ... (1)

We assume that the switch variable follows a distribution whose parameter lies in (0,1).

As described above, ... can be generated by latent topics in LDA, with an index over the latent topics ... word. Therefore, we transferred it and fine-tuned its parameters on our joint model. As shown in Figure 2 (c), after pattern searching in PubMed and mutation filtering, the abstracts containing diseases and mutations are input into BERT, and the representation of each word in the abstracts is obtained. Subsequently, a fully connected layer and softmax are used to normalize the classification weights, while a CRF loss function is employed to optimize the entity recognition task. Finally, the model outputs the mutation type of the genes in the abstracts.

(2) The "SNP-Gene Mapping" module processes the mutation association data. Since the pipeline focuses on genes, the SNPs in the GWAS data are mapped to genes with bedtools [15]. The p-value of a gene is the lowest p-value among its SNPs.

(3) The "Synchronization Filter" module is designed to maximize the probability that most genes in the gene set maintain not only literature significance but also GWAS significance. Generally, a gene with a highly significant mutation association in GWAS is likely to be described in the literature with a mutation type, but not all genes follow this rule. Therefore, from all genes with a predicted value, we take the top genes according to their p value and compare the GWAS significance of this gene set with that of a random set of the same size. The hypothesis test introduced for this case is the Wilcoxon test.

Briefly, the mutation type is retrieved from PubMed by using the "Mutation Type Retrieval" module and the AGAC corpus, and genes with a significant mutation type are defined as "literature significant" genes. Meanwhile, the mutation association is extracted from GWAS summary data by applying SNP inclusion criteria, and "GWAS significant" genes are obtained by using the "SNP-Gene Mapping" module. To better synchronize these heterogeneous mutation data, the "Synchronization Filter" module creates a seed gene set consisting of genes with both a significant mutation type and a significant mutation association. After the observations of both kinds are fed into the "Mutation Data Bridging" model, the model parameters are ... a newly appearing gene is predicted to have a novel gene-disease association. All of the pipeline details are elucidated in Online Method.

The purpose of GDAMDB is to accelerate the discovery of gene associations from GWAS by integrating both mutation association and mutation type information. A case study on Alzheimer's disease (AD) was carried out to evaluate the performance of GDAMDB in supporting the discovery of novel gene-disease associations.

The mutation type of genes is widely implied in descriptions in the literature. Since AD is an important disease and mutation data for AD are available, we apply GDAMDB to AD to retrieve genes that remain undiscovered.

Mutation association data of Alzheimer's disease
In 2013, the International Genomics of Alzheimer's Project (IGAP) [12] performed a two-stage GWAS on individuals of European ancestry on 7,055,881 SNPs. In stage 1, they meta-analyzed four previous AD GWAS datasets including 17,008 AD cases and 37,154 controls. In stage 2, they tested 211,632 SNPs on 8,572 AD cases and 11,312 controls. The final result was obtained by a meta-analysis combining stages 1 and 2. We selected the summary statistics file that combines stages 1 and 2, which contains 1,513 genes. The p-value of each gene was set to that of the most significant SNP in the gene.
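As a minimal sketch of the gene-level rule stated above, the snippet below assigns each gene the smallest p-value among its SNPs. It assumes the SNP-to-gene pairs have already been produced by a mapping step; the function name, identifiers, and p-values are made-up placeholders, not IGAP data.

```python
def gene_pvalues(snp_to_gene, snp_pvalues):
    """Assign each gene the lowest p-value among the SNPs mapped to it.

    snp_to_gene: iterable of (snp_id, gene) pairs from a prior mapping step.
    snp_pvalues: dict from snp_id to GWAS p-value.
    SNPs without a p-value are skipped.
    """
    best = {}
    for snp, gene in snp_to_gene:
        p = snp_pvalues.get(snp)
        if p is None:
            continue
        if gene not in best or p < best[gene]:
            best[gene] = p
    return best


# Placeholder identifiers and values for illustration only.
pairs = [("snp1", "GENE_A"), ("snp2", "GENE_A"),
         ("snp3", "GENE_B"), ("snp4", "GENE_C")]
pvals = {"snp1": 0.03, "snp2": 1e-6, "snp3": 0.2}
per_gene = gene_pvalues(pairs, pvals)
```

Taking the minimum keeps the rule simple, but it favors genes with many mapped SNPs, which is worth keeping in mind when gene-level values are compared.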

Mutation type data of Alzheimer's disease
Meanwhile, the MeSH term "Alzheimer's disease" was used as the key word to query the PubMed database, and 137,473 abstracts were downloaded. To ensure that the literature contains descriptions of mutations, SETH [17] was applied to filter the abstracts. SETH is able to recognize SNPs and other mutation-related terms in text. Thus, after this procedure, only the abstracts that mention both AD and mutations remained. After mutation filtering, 9,430 abstracts were input into the "Mutation Type Retrieval" module. The module computes a confidence value for each abstract and each mutation type. The output of the module is the mutation type of the genes in each abstract, ... of which the mutation types are first recognized from 325 abstracts, each of which contains conclusive sentence evidence leading to a mutation type. Notably, the abundant AD-related LOF/GOF data obtained here are new to the AD community. The full result is provided in supplementary data S2 (Mutation Type Data.xlsx), which lists the genes with predicted mutation type information and the evidence sentences from PubMed texts.

... mutations, and this location lies between the sequence encoding the Beta-APP domain and the sequence encoding the APP amyloid domain, which forms beta-amyloid and is strongly implicated in the pathogenesis of AD. Moreover, the corresponding sentence evidence for these APP mutations is shown below. As introduced above, this module is able to recognize entities and classify the mutation type of a gene. For example, in the sentence in the middle, "The A673V mutation affected APP processing, resulting in enhanced beta amyloid (Abeta) production and formation of amyloid fibrils in vitro.", APP and enhanced are recognized by the module. Based on enhanced, the confidence value of GOF is higher than that of LOF in this abstract, hence the mutation type of APP is classified as GOF.

Therefore, in each of the 325 abstracts, at least one gene and its mutation type are recognized. In addition, all 325 abstracts carry clear semantics of the downstream biological processes after mutation, which can be divided into 8 types after manual curation. As shown in Figure 4, Gene Expression, Protein Activity, Interaction, Pathway Activity and Cell Activity are the fundamental biological processes, which follow the central dogma and range from the molecular level to the cell level. In addition, Phosphorylation, Abeta Accumulation and Ca2+ Concentration are frequently mentioned. Interestingly, these three biological processes are related to known hypotheses of AD pathogenesis. Abeta is a product of the APP gene, and its accumulation, especially of Abeta42, forms fibrillar amyloid plaques in the brain and impairs spatial learning and memory [18]. Phosphorylation relates to another hypothesis of AD pathogenesis, especially the phosphorylation of Tau protein, which is encoded by the MAPT gene. Hyperphosphorylation of Tau protein leads to neurofibrillary tangles in neurons and eventually results in neuronal apoptosis [19]. Intracellular Ca2+ concentration is also thought to be part of the cause of AD. Dysregulation of intracellular Ca2+ signaling disturbs many neural processes, which is implicated in the AD mechanism [20].

... in the sentence "...Promoter mutations that increase amyloid precursor-protein expression are associated with Alzheimer disease...", "increase" helps to confirm GOF and "amyloid precursor protein expression" helps to confirm that the biological process affected by the mutation is gene expression. Similarly, GSTM3 is grouped into the LOF of gene expression.

The biological process category of the genes and the evidence sentences can be found in supplementary data S2, or in the online data repository (https://hzaubionlp.com/agac-on-alzheimers-disease/).

Data fusion of heterogeneous mutation data
The data fusion by GDAMDB is shown in Figure 5. The two graphs on the left present the confidence value of the gene mutation type in each abstract and the p-value of the gene mutation association; both graphs show the rough distributions of the data. In the mutation type graph, 325 gene-mutation type-AD triples are predicted and manually checked from 9,430 abstracts, covering 65 unique genes, since some genes are mentioned more than once in these abstracts. The empirical threshold represents the model parameters and human filtering. In the mutation association graph, 23 mutation associations pass the final Bonferroni threshold, and they are mapped to 23 genes. After data fusion the graph becomes a bar graph, since the output of the model is binary, indicating whether or not the gene is associated with the disease. In total, 79 genes are predicted to be AD-related. The final prediction filters out some genes that pass the threshold in a single kind of mutation data but are recognized as false positives by the model, and also retrieves genes that fail to pass the threshold in a single kind of mutation data but are recognized as AD-related.

Figure 5: Data fusion of heterogeneous AD mutation data improves the discovery of novel AD-related genes.

For example, as shown in the circle above the mutation type graph, the mutation types of three genes, ABCA7, CLU and ADAM10, are retrieved by the "Mutation Type Retrieval" module and pass the empirical threshold. The circle above the mutation association graph contains four genes that pass the Bonferroni threshold: ABCA7, CLU, CR1 and ZCWPW1. Different limitations make the information contained in each kind of mutation data incomplete. Therefore, as marked on the graphs, ABCA7 and CLU pass the threshold in both kinds of mutation data, but ADAM10, ZCWPW1 and CR1 pass in only one. However, the −log p value of ADAM10 is close to the Bonferroni threshold, while the confidence values of ZCWPW1 and CR1 are close to the empirical threshold.

After data fusion, ABCA7, CLU, ADAM10, CR1 and ZCWPW1 are output by GDAMDB, which shows that GDAMDB is able to overcome the limitations of these two kinds of mutation data and rescue important genes that fail to pass a threshold. Besides, the genes NR1H3 and SQSTM1 are retrieved in neither the mutation type data nor the mutation association data, but are retrieved by GDAMDB after data fusion. This shows that GDAMDB does not simply merge the genes that are significant in one of the mutation data, but learns the latent regularity of the mutation data distribution.
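The overlap pattern described above can be checked with a few set operations. The gene names come from the text; the variable names are illustrative, and the fused set lists only the genes named in this section, not all 79 predictions.

```python
# Genes passing each single-source threshold, as named in the text.
type_pass = {"ABCA7", "CLU", "ADAM10"}            # empirical threshold
assoc_pass = {"ABCA7", "CLU", "CR1", "ZCWPW1"}    # Bonferroni threshold

# Genes named in the text as output by GDAMDB (subset of the 79 predictions).
fused_named = {"ABCA7", "CLU", "ADAM10", "CR1", "ZCWPW1", "NR1H3", "SQSTM1"}

both = type_pass & assoc_pass                     # pass in both sources
only_one = type_pass ^ assoc_pass                 # pass in exactly one source
neither = fused_named - (type_pass | assoc_pass)  # recovered only after fusion

print(sorted(both), sorted(only_one), sorted(neither))
```

The last set is the one a plain union of the two sources could not produce, which is why the fused output is more than a merge of single-source hits.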

Novel discovery of AD-related genes after heterogeneous mutation data fusion
An encouraging result of AD-related gene discovery is shown in Figure 6.

In data fusion for knowledge discovery, the knowledge can take any form of data in different formats. In our research, the association between gene and disease can be expressed as the p-value in GWAS, named mutation association, where a smaller p-value represents a more significant relevance between gene and disease. This association can also be expressed as the mutation type in the literature, where the description of the mechanism of mutations in disease pathogenesis directly indicates the details of the relation. When different data reveal the relation from different aspects, taking both aspects into consideration leads to more comprehensive knowledge discovery. Besides, since the advantages and weaknesses vary across heterogeneous data, data fusion helps to enhance the quality of both data. The relevance between a gene and a disease is adjusted after data fusion, especially when a false negative mutation association of a gene fails to pass the significance test in GWAS but is found to be active with mutation type information in the literature.

A generative model is capable of learning the data distribution of observations from two heterogeneous categories and of generating novel data that represent the statistical characteristics of both observations, thus achieving the fusion of heterogeneous data. Therefore, by bridging mutation type data and mutation association data, GDAMDB is capable of retrieving important AD-related genes that fail to pass multiple testing in GWAS or have not been reported in the literature. Eventually, our model retrieved 79 AD genes; 57 of them are not reported in the source GWAS study, yet 47 of these 57 are supported by convincing evidence of being AD-related, which positively reflects the reliability of the model.

As a generative model, GDAMDB offers a way to enhance disease-related gene discovery from a single kind of mutation data, and the implementation procedure shows that the model can be flexibly adapted to any given disease, provided that GWAS summary data and sufficiently abundant literature are available. All the results in this research indicate that data fusion sheds light on novel knowledge discovery.

This research offers a novel perspective on the data forms of mutations, namely mutation associations obtained from GWAS experiments and mutation types extracted by text mining. It is known that GWAS associations are under-powered for pinpointing causal genes due to high false positive/negative rates, and integrating other mutation information is possibly an effective addition. Thus, we used a PubMed-wide text mining strategy to pinpoint vital genes which carry the core semantics of the mutation effect, and came up with the mining of mutation