MTBAN: An Enhanced Variant Effect Predictor Based on a Deep Generative Model

The development of an accurate and reliable variant effect prediction tool is important for research in human genetic diseases. A large number of predictors have been developed towards this goal, yet many of these predictors suffer from the problem of data circularity. Here we present MTBAN (Mutation effect predictor using the Temporal convolutional network and the Born-Again Networks), a method for predicting the deleteriousness of variants. We apply a form of knowledge distillation technique known as the Born-Again Networks (BAN) to a previously developed deep autoregressive generative model, mutationTCN, to achieve an improved performance in variant effect prediction. As the model is fully unsupervised and trained only on the evolutionarily related sequences of a protein, it does not suffer from the problem of data circularity which is common across supervised predictors. When evaluated on a test dataset consisting of deleterious and benign human protein variants, MTBAN shows an outstanding predictive ability compared to other well-known variant effect predictors. We also offer a user-friendly web server to predict variant effects using MTBAN, freely accessible at http://mtban.kaist.ac.kr. To our knowledge, MTBAN is the first variant effect prediction tool based on a deep generative model that provides a user-friendly web server for the prediction of deleteriousness of variants.


Introduction
While recent sequencing technologies have resulted in a tremendous amount of sequence variant data, the identification of deleterious variants is still a difficult problem. Development of a reliable computational tool to predict the effects of sequence variants would aid in the treatment of many human genetic diseases. To achieve this goal, many predictors have been developed based on different approaches. Among these methods, supervised methods learn from labelled variant data consisting of known deleterious and benign variants, and many of them show good predictive ability. However, many supervised methods face the problem of data circularity, which can be divided into two types according to Grimm et al. 1 The type I circularity arises due to the overlap between training data and test data. The type II circularity occurs when all variants in a given gene are labelled either all deleterious or all benign, which results in the model predicting the same label for all variants in that gene. Previous studies [1][2][3] have suggested that this problem of data circularity can result in an inflation of the reported performances of many supervised predictors. On the other hand, unsupervised methods do not learn from labelled variant data and learn solely from the evolutionary information contained in multiple sequence alignments. A recent study which carried out an extensive comparison of variant effect predictors claimed that a class of unsupervised models, namely the deep generative model, is a promising area of research for variant effect prediction 3 .
Here, we introduce MTBAN (Mutation effect predictor using the Temporal convolutional network and the Born-Again Networks), an enhanced method to predict the deleteriousness of single amino acid variants. We previously developed a method called mutationTCN 4 based on a deep autoregressive generative model, and showed that it demonstrates state-of-the-art performances on the prediction of functional effects of variants.
In this work, we apply a knowledge distillation technique called the Born-Again Networks (BAN) 5 to the mutationTCN model and develop an improved model called MTBAN. In machine learning, knowledge distillation is a process involving the transfer of knowledge learned by one machine learning model to another. In this scheme, the former model is referred to as the "teacher network" and the latter as the "student network." Using the Born-Again Networks allows the student network to achieve an improved predictive power compared to the teacher network. When evaluated on human variant datasets with deleterious and benign variants, MTBAN shows superior predictive performance compared to other variant effect predictors. Our model is fully unsupervised and does not depend on labelled data for training. This gives the model an advantage over supervised predictors, for which data circularity is an inherent problem. We also offer a freely accessible web server for variant effect prediction with MTBAN.

MTBAN model
We previously developed a deep autoregressive generative model for variant effect prediction, called mutationTCN 4 . As a generative model, it is trained by maximizing the likelihood of the training data, which consists of the evolutionarily related sequences of a given protein; equivalently, the model is optimized by minimizing the negative log likelihood between the input sequence and the predicted output. After training, the model can predict the probability of observing a given protein sequence under the parameters of the trained model. The deep autoregressive generative model is implemented using the temporal convolutional network architecture 6 and is composed of an embedding layer followed by a series of dilated causal convolution layers, an attention layer, and a fully connected layer (Figure 1). We showed that this model can effectively capture information from evolutionarily related sequences and use this information to predict the functional effects of variations in a sequence 4 . MTBAN combines this model with a knowledge distillation technique in machine learning, known as the Born-Again Networks (BAN) 5 . Knowledge distillation is a process of model compression which involves transferring the knowledge from a teacher network to a student network with a smaller capacity 7 .
This allows for the reduction of model size while maintaining predictive power similar to that of the original model. In the setting of BAN, the student network has the same capacity as the teacher network, which enables the student network to outperform the teacher network 5 . We found that the BAN framework in which both the teacher and the student networks are implemented with mutationTCN outperforms the original mutationTCN model. The model structure of MTBAN is shown in Figure 1. In the first step, only the teacher network is trained, with the loss function being the label loss, which refers to the cross entropy loss between the input sequence and the softmax output distribution of the teacher network. In the next step, only the student network is trained, with the loss being the sum of the label loss and the teacher loss. Here, the teacher loss refers to the cross entropy loss between the softmax output distribution of the student network and the softmax output distribution of the teacher network. This softmax output probability distribution can be expressed as p_i = exp(z_i / T) / Σ_j exp(z_j / T), where z_i is the logit computed for class i and T is the temperature parameter 7 . Using higher temperatures leads to more "softened" output distributions; in our implementation, we used a temperature of 4. By training the student network to learn the softened outputs of the teacher network, the student network can acquire the knowledge previously learned by the teacher network. Both the teacher and student networks are trained for 500,000 iterations using mini-batches of size 128. For both networks, the learning rate is set to 0.001 for the first 3,000 training iterations and 0.0001 thereafter.
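The two-step training objective described above can be sketched in a few lines of plain Python. This is a minimal illustration of the temperature-softened distillation loss, not the actual implementation: the function names are hypothetical, and the real model is a temporal convolutional network whose logits are merely stood in for by plain lists here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_dist, pred_dist, eps=1e-12):
    """Cross entropy H(target, pred) = -sum_i t_i * log(p_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_dist, pred_dist))

def student_loss(student_logits, teacher_logits, true_label, temperature=4.0):
    """BAN student objective: label loss + teacher loss.
    - label loss: cross entropy between the one-hot true label and the
      student's ordinary (T=1) softmax output
    - teacher loss: cross entropy between the teacher's and the student's
      temperature-softened softmax outputs (T=4 in the paper)
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if i == true_label else 0.0 for i in range(num_classes)]
    label_loss = cross_entropy(one_hot, softmax(student_logits, 1.0))
    teacher_loss = cross_entropy(softmax(teacher_logits, temperature),
                                 softmax(student_logits, temperature))
    return label_loss + teacher_loss
```

The teacher network in the first step is trained with the label loss alone; the student adds the teacher loss term shown above.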
We computed the predictions of MTBAN for a total of 1,032 human protein alignments provided by Hopf et al. 8 These pre-computed predictions on the Hopf dataset were saved and later used to obtain MTBAN predictions for human protein variants.

Model Outputs
For a given variant, the model outputs the log probability score, the z-score, the probability of deleteriousness, and the predicted label. First, the log probability score is given by log p(x_mutant | θ) − log p(x_wild-type | θ), where p(x_mutant | θ) and p(x_wild-type | θ) are the probabilities assigned to the mutant sequence and the wild-type sequence, respectively, by the generative model with parameters θ. The log probability score is easily computed as the negative of the loss, as the model loss function is the negative log likelihood 4 . The smaller the score, the more likely the variant has a deleterious effect. Second, the z-score is computed by normalizing the distribution of log probability scores for all possible missense variants against the target sequence of a protein.
This normalization is done because score distributions vary across proteins. Third, the probability of deleteriousness for each variant, ranging from 0 to 1, is computed. This probability is determined from the set of variants in the Humsavar database (release 03/2020) 9 that overlap with our pre-computed model predictions for the Hopf dataset, comprising 1,221 deleterious and 1,221 benign variants. We obtained the z-score distribution for this set of variants, divided the distribution into equal-length z-score intervals, and calculated the proportion of deleterious variants in each interval. Finally, using the same z-score intervals, we determined a z-score threshold that maximizes the classification accuracy (Supplementary Fig. S1). This threshold is used to assign a predicted label, either deleterious or benign, to a given variant.
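The scoring steps above can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, and the bin width used for the z-score intervals is an assumption, since the paper does not state the interval length here.

```python
import math
import statistics

def log_probability_score(log_p_mutant, log_p_wildtype):
    """Score = log p(x_mut | theta) - log p(x_wt | theta); more negative
    scores suggest a more deleterious variant."""
    return log_p_mutant - log_p_wildtype

def z_scores(scores):
    """Normalize the score distribution over all possible missense variants
    of one protein, making scores comparable across proteins."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

def deleterious_fraction_by_bin(zs, labels, bin_width=0.5):
    """For labelled calibration variants (label 1 = deleterious), compute
    the proportion of deleterious variants in each equal-width z-score bin;
    this proportion serves as the probability of deleteriousness."""
    bins = {}
    for z, y in zip(zs, labels):
        b = math.floor(z / bin_width)
        n_del, n_tot = bins.get(b, (0, 0))
        bins[b] = (n_del + y, n_tot + 1)
    return {b: n_del / n_tot for b, (n_del, n_tot) in bins.items()}
```

A z-score threshold maximizing accuracy over the same bins would then convert scores into deleterious/benign labels.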

Evaluation Datasets
To evaluate the ability of the model to classify human protein variants as deleterious or benign, we created a test dataset by combining the variant data from datasets used by Grimm et al. 1 and Mahmood et al. 2 Details regarding the datasets can be found in Table 1. We identified variants in these datasets for which predictions exist in our pre-computed Hopf dataset and used them for comparison with other methods. Since the number of deleterious variants was significantly larger than that of benign variants, we randomly subsampled the deleterious variants to match the number of benign variants. This resulted in a balanced test set of 1,244 deleterious and 1,244 benign variants.
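The balancing step described above amounts to random downsampling of the majority class; a minimal sketch, with hypothetical names and a fixed seed for reproducibility:

```python
import random

def balance_by_downsampling(deleterious, benign, seed=0):
    """Randomly subsample the larger class so both classes have equal size,
    yielding a balanced test set."""
    rng = random.Random(seed)
    n = min(len(deleterious), len(benign))
    return rng.sample(deleterious, n), rng.sample(benign, n)
```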

Evaluation Criteria
The following metrics were used for evaluating the classification ability of the variant effect predictors.
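Metrics for this kind of balanced binary classification task are typically derived from the confusion matrix. A minimal sketch follows; the specific metric set shown (accuracy, sensitivity, specificity) is illustrative and may differ from the exact set used in the paper.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix based metrics for binary classification
    (1 = deleterious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }
```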

Evaluation on human protein variant datasets
We assessed MTBAN and other variant effect predictors on the task of classifying human protein variants as deleterious or benign. As described in the Methods section, our test dataset combines the disease-associated variants from Grimm et al. 1 and the functional assay-derived variants from Mahmood et al. 2 (Table S2). Overall, MTBAN shows an outstanding classification ability on both disease-associated variant data and functional assay-derived variant data.

Web Server
We offer a user-friendly web server which predicts variant effects using MTBAN (Supplementary Fig. S2). The server takes as input a protein UniProt accession and a list of amino acid variants. Upon receiving input, it determines the target protein sequence region and checks whether pre-computed predictions exist for the given variants. If they exist, the server immediately returns the predictions to the user. Otherwise, it checks whether a multiple sequence alignment of the target protein sequence region is present in the database. If an alignment is present, it is used for the subsequent computations; if not, one is generated using a profile HMM homology search tool 23 and saved in the database.
During the computation, alignment columns with more than 30% gaps are dropped. If any input variants fall in these dropped columns, those variants are excluded from prediction and flagged in the results. The next step is the computation of sequence weights based on the similarity of sequences in the alignment; this step reduces any sequence bias present in the multiple sequence alignment 4 . Afterwards, the prediction model is trained, and the server returns predictions to the user. After job processing, the predictions are saved so that the server can immediately return results when the same set of mutations is later submitted. In the web server implementation, due to time constraints, the MTBAN teacher and student networks are each trained for 200,000 iterations with a learning rate of 0.001.
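The alignment preprocessing steps (gap-column filtering and sequence reweighting) can be sketched as follows. The 30% gap cutoff comes from the text above; the 80% identity threshold for reweighting is an illustrative assumption, as the exact similarity criterion is not stated here.

```python
def drop_gappy_columns(alignment, max_gap_frac=0.30):
    """Remove alignment columns with more than max_gap_frac gaps ('-')."""
    n_seqs = len(alignment)
    keep = [c for c in range(len(alignment[0]))
            if sum(seq[c] == '-' for seq in alignment) / n_seqs <= max_gap_frac]
    return [''.join(seq[c] for c in keep) for seq in alignment], keep

def sequence_weights(alignment, identity_threshold=0.8):
    """Down-weight redundant sequences: each sequence gets weight 1/n, where
    n is the number of sequences at least identity_threshold identical to it
    (a common reweighting scheme; the threshold here is an assumption)."""
    def identity(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)
    weights = []
    for s in alignment:
        n = sum(identity(s, t) >= identity_threshold for t in alignment)
        weights.append(1.0 / n)
    return weights
```

Variants that map to dropped columns can be reported back to the user by checking their positions against the returned `keep` index list.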

Discussion
A considerably large proportion of the target sequences were conserved.
The results of our work show that the deep generative model is a powerful tool for predicting the effects of sequence variations. We expect that deep generative models will continue to play an important role in discovering the effects of genetic variants. In addition, to our knowledge, MTBAN is the first variant effect prediction tool based on a deep generative model that provides a user-friendly web server for the prediction of deleteriousness of variants.
This method is expected to be a useful tool for the prioritization and identification of variants involved in human genetic diseases.

Data availability statement
The datasets generated during and/or analysed during the current study are available at https://github.com/ha01994/MTBAN.