TVAR: Assessing Tissue-specific Functional Effects of Non-coding Variants with Deep Learning

Abstract


Introduction
[...] database and the computer power, most computational models (such as the above approaches) only describe the functional effects of variants at the organism level. Only recently have a wide range of tissue/cell type-specific functional annotations (such as ENCODE, Roadmap, and GTEx [19]) been published, and computational methods have begun to estimate the tissue-specific functionality of variants. Due to the paucity of known noncoding variants associated with complex diseases, it is challenging to train machine learning approaches using well-validated labels. For this reason, very few methods, e.g. [...] tissues, resulting in loss of accuracy as well as potential incompatibility among tissues. The challenge of estimating the functionality of variants in various tissues is far from being considered resolved.

The following issues must be resolved urgently to meet the challenge of predicting tissue-specific non-coding functional variants. As most eQTLs detectable with sufficient power are common, how to leverage the eQTL resources to predict the functionality of rare noncoding variants is a key challenge and opportunity. An [...]

[...] data of 4 complex diseases, i.e. coronary artery disease, breast cancer, Type 2 diabetes, and Schizophrenia, with tissue-specific annotations from heart, breast, pancreas, and brain tissues in GTEx. We found that the top-scoring rare variants (MAF < 0.01) have significantly smaller P values than the background variants for each of the diseases, supporting that TVAR can be used to prioritize noncoding rare variants in a tissue-specific manner for complex diseases. Finally, we compared the performance of the [...]

Evaluation of TVAR's performance across 49 tissues in the GTEx dataset

Through the TVAR framework, we can predict the functional variants corresponding to the 49 different human tissues in GTEx (V8 release) (see Fig. 1 for the list of tissues). Since TVAR is a supervised learning approach, identifying tissue-specific functional variants requires sufficient, high-quality labels for the training samples. To obtain the training labels, we first collected the variants that are strongly associated with the eGenes in the GTEx dataset (q-value cutoff 0.01). Due to linkage disequilibrium (LD), most detected eQTLs are proxies for the functional regulatory variants [29]. We used a fine-mapping strategy to nominate the credible functional variants in an eQTL LD block. Specifically, we used LINSIGHT, a state-of-the-art functional variant annotation tool, to filter out the non-functional variants (Methods). For each tissue, we retained the 1500 variants with the highest LINSIGHT scores and merged the variant labels of the 49 tissues into a matrix in which each row is a variant and each column is a tissue. This 0-1 label matrix is used for the multi-label DNN. Since the variants in the label matrix show functionality in at least one tissue, we set these variants as the positive sample [...]

The TVAR framework is a deep feedforward network based on multi-label learning. The network describes the functionality of the variant-tissue pairs through fully connected layers, which can learn the differences and similarities among the 49 tissues. The output of TVAR is a 49-dimensional vector that represents the functional scores of the variant-tissue pairs. To show that TVAR can learn the functionality of the variants in each tissue, we used five-fold cross-validation to train and test the model. In each training process, we randomly selected 80% of the data for model training and 20% for model testing (see Supplementary Note 1). Receiver Operating Characteristic (ROC) curves were used to compare the predictive power of different methods in all testing processes. As an overall evaluation, we combined all variants across all tissues into a single evaluation set and obtained an average AUC = 0.770 across the 49 GTEx tissues, indicating that through multi-label learning, the network can extract valid features, predictive of eQTLs, from the 1247-dimensional input annotations. Based on the average accuracy rate, we divided the 49 GTEx tissues into the 'high accuracy group' and the 'low accuracy group' (Fig. 1A).
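The label-matrix construction and the pooled evaluation described above can be illustrated with a small sketch. This is not the TVAR implementation: the function names (`build_label_matrix`, `auc`, `pooled_auc`), the toy tissue names, and the variant identifiers are hypothetical, and pooling every variant-tissue pair into one ranking is only one possible reading of "a single evaluation set". The AUC is computed directly from its rank definition, so only the Python standard library is needed.

```python
from typing import Dict, List, Sequence, Tuple


def build_label_matrix(
    top_variants: Dict[str, Sequence[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Merge per-tissue variant lists into a 0-1 matrix.

    Rows are variants, columns are tissues; an entry is 1 when the
    variant was retained for that tissue (hypothetical input format).
    """
    tissues = sorted(top_variants)
    variants = sorted({v for vs in top_variants.values() for v in vs})
    members = {t: set(top_variants[t]) for t in tissues}
    matrix = [[1 if v in members[t] else 0 for t in tissues] for v in variants]
    return variants, tissues, matrix


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pooled_auc(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Flatten all variant-tissue pairs into one evaluation set and score it."""
    flat_labels = [y for row in label_matrix for y in row]
    flat_scores = [s for row in score_matrix for s in row]
    return auc(flat_labels, flat_scores)


# Toy data: two tissues, three variants (assumed identifiers).
variants, tissues, labels = build_label_matrix(
    {"heart": ["rs1", "rs2"], "liver": ["rs2", "rs3"]}
)
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]  # made-up model outputs
print(tissues, variants, labels, pooled_auc(labels, toy_scores))
```

In a cross-validation setting, `pooled_auc` would be applied to the held-out fold of each split and the resulting values averaged; a per-tissue AUC is obtained by calling `auc` on a single column instead.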

In the high accuracy group (Fig. 1A), the AUCs of 15 tissues exceed 0.80, including [...] (such as brain tissues) are also similar (Table S2).

After the above analysis, we found that TVAR achieved the best performance on [...] and negative variants on all data sets (Fig. 3, Table 1). The exception is DANN, which achieved significance on GWAS data of all diseases except SCZ (Fig. 3, Table 1). As a [...]

Although the sample sizes of the four test scenarios are on the same scale, we found that the performance advantage of TVAR is considerably higher on CAD, BRC, and T2D, but comparable to other algorithms on SCZ. In fact, the low accuracy group of the TVAR scores contains a large number of brain-related tissues, suggesting that the functional prediction of variants in brain-related tissues is more challenging. [...] tissues, failed to separate the positive and negative samples on all data sets (Fig. 3, Table 2). Even FUN-LDA, which has tissue-specific scores, struggles to achieve significant discrimination, with T2D being close to significance (P = 7.63e-2). [...]

However, we found that there are still many functional variants in brain-related tissues that cannot be explained by the functional annotation of the input features of TVAR.

This is not unexpected, however, as brain tissues are highly heterogeneous, with diverse cell types that have different functions. We also uncovered that the use of multi- [...]

The specific network structure of the DNN model is shown in [...]

The G-score algorithm of TVAR

Although TVAR is designed to score variant-tissue pairs, the G-score algorithm allows TVAR to also support variant scoring at the organism level. For a score that does not distinguish between tissues, only one score s is needed for each variant. The G-score algorithm is designed as a multi-instance learning approach: the scores y on the 49 distinct tissues for each variant x are regarded as a bag. For bag y, we use a function f() to find its general score:

$$s = f(y)$$

In this study, we simply take the maximum score among tissues as the G-score for each variant, i.e. f() is the max function.

The TVAR source code and its scores on the ClinVar catalog, fine-mapped GWAS loci, [...]
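The G-score aggregation described in this section, where the per-tissue scores of a variant form a bag that f() reduces to one value, can be sketched in a few lines. This is an illustration under assumptions, not the released TVAR code: the name `g_score`, the plain-sequence input format, and the pluggable `f` argument are hypothetical, with `max` as the default because it is the choice stated in the text.

```python
from statistics import mean
from typing import Callable, Sequence


def g_score(
    tissue_scores: Sequence[float],
    f: Callable[[Sequence[float]], float] = max,
) -> float:
    """Collapse a bag of per-tissue scores into one organism-level score.

    `tissue_scores` stands for the vector y of one variant (49 values in
    TVAR); `f` is the bag-level function, max by default as in the text.
    """
    if len(tissue_scores) == 0:
        raise ValueError("a variant needs at least one tissue score")
    return f(tissue_scores)


# Toy bag with three tissues instead of 49 (made-up values).
bag = [0.12, 0.81, 0.40]
print(g_score(bag))          # max aggregation, as used in the study
print(g_score(bag, f=mean))  # an alternative f, shown only for contrast
```

Taking the maximum means a variant scores highly at the organism level as soon as it scores highly in any single tissue, which matches the usual multi-instance assumption that a bag is positive when at least one of its instances is.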