DeepAmp: A Convolutional Neural Network Based Tool for Predicting Protein AMPylation Sites From Binary Profile Representation

AMPylation is an emerging post-translational modification that occurs on the hydroxyl group of threonine, serine, or tyrosine via a phosphodiester bond. AMPylators catalyze this process, the covalent attachment of adenosine monophosphate to the amino acid side chain of a peptide. Recent studies have shown that this post-translational modification is directly responsible for the regulation of neurodevelopment and neurodegeneration and is also involved in many physiological processes. Despite the importance of this post-translational modification, there is no peptide sequence dataset available for conducting computational analysis. Therefore, so far, no computational approach has been proposed for predicting AMPylation. In this study, we introduce a new dataset of this distinct post-translational modification and develop a new machine learning tool, called DeepAmp, that uses a deep convolutional neural network to predict AMPylation sites in proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, and 0.55 in terms of Accuracy, Sensitivity, Specificity, and Matthews Correlation Coefficient (MCC), respectively, on the AMPylation site prediction task. As the first machine learning model for this task, DeepAmp demonstrates promising results, which highlight its potential to solve this problem. Our presented dataset and DeepAmp as a standalone predictor are publicly available.


Introduction
Post-translational modification (PTM) is the enzymatic or chemical modification of a protein after it is translated, or synthesized, in the ribosome. PTMs occur via removal of parts of a translated protein, covalent modifications, or degradation of modified proteins 1,2. These modifications provide important insight into various cellular functions and biological processes of proteins such as cellular dynamics and elasticity.
PTMs are important mechanisms for increasing proteomic diversity and play a vital role in functional proteomics because they regulate activity, localization, and interaction with other cellular molecules such as proteins, nucleic acids, lipids, and cofactors 3. They can impact the structure, electrophilicity, and interactions of proteins. PTMs also regulate protein folding by targeting specific subcellular compartments, by interacting with ligands or other proteins, or by initiating a change in functional state, including signaling or catalytic activity 4. A wide range of PTMs have been identified so far. Common PTMs include phosphorylation, glycosylation, ubiquitination, nitrosylation, methylation, acetylation, lipidation, and proteolysis, which influence almost all aspects of normal cell biology and pathogenesis 5.
AMPylation is an emerging post-translational modification mediated by a bacterial virulence factor that transfers adenosine monophosphate (AMP) from adenosine triphosphate (ATP) to a threonine residue of eukaryotic substrates 6,7. More generally, AMPylation is the covalent attachment of AMP to the amino acid side chain of a protein or peptide 8. It has been studied exclusively with the Fic domain proteins, which are conserved and found in organisms ranging from bacteria to humans. By adding AMP to Rho-family GTPases, these enzymes can mediate both bacterial pathogenesis and eukaryotic signaling 9,10. The most common and stable form of AMPylation occurs on the hydroxyl group of threonine, serine, or tyrosine: a phosphodiester bond forms between the hydroxyl group of the residue undergoing AMPylation and the phosphate group of the adenosine monophosphate nucleotide (i.e., adenylic acid) [14]. The enzymes that are capable of catalyzing this process are called AMPylators. Threonine (T) and tyrosine (Y) are the usual targets of AMPylation, while this PTM is sometimes observed on serine (S) as well.
Recent proteomics studies have demonstrated that this PTM is more widespread than generally acknowledged and that it is emerging as a significant regulatory mechanism in both eukaryotic and prokaryotic cells. It is implicated in a wide range of biological processes, from regulation of nitrogen metabolism in bacteria and regulation of signaling pathways to pathogenesis in several animal species [11][12][13][14]. AMPylation has also been found to play a significant role in the regulation of neurodevelopment and neurodegeneration 15.
However, to the best of our knowledge, no computational approach has so far been proposed for predicting AMPylation sites of Fic domain proteins. One of the main reasons is that no AMPylation dataset is available for this task. In this study, we present a new dataset of protein AMPylation sites. Furthermore, we propose a new deep Convolutional Neural Network (CNN) model called DeepAmp for predicting protein AMPylation sites on this newly built dataset of AMP-modified proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, and 0.55 in terms of Accuracy, Sensitivity, Specificity, and Matthews Correlation Coefficient (MCC), respectively, on the AMPylation site prediction task. As the first machine learning model for this task, DeepAmp demonstrates promising results, which highlight its potential to solve this problem. We believe this study will help researchers close the current research gap in this subject. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.

Evaluation Metrics
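To standardize the evaluation of our model and to provide more insight into our results, we calculate Accuracy, Sensitivity, Specificity, and Matthews Correlation Coefficient (MCC) as the evaluation metrics. These metrics are defined by the following equations:

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

$$\mathrm{Sensitivity} = \frac{tp}{tp + fn}$$

$$\mathrm{Specificity} = \frac{tn}{tn + fp}$$

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

where tp denotes true positives, and tn, fp, and fn denote true negatives, false positives, and false negatives, respectively.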
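The four metrics can be computed directly from the confusion-matrix counts. The following is a minimal sketch using the standard definitions of these metrics; the function name `classification_metrics` and the example counts are illustrative assumptions and are not taken from the DeepAmp code or results.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Accuracy, Sensitivity, Specificity, and MCC from confusion counts.

    Hypothetical helper for illustration; assumes at least one positive and
    one negative sample, so the sensitivity/specificity denominators are > 0.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # fraction of true sites that are recovered
    specificity = tn / (tn + fp)  # fraction of non-sites that are rejected
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
example = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Unlike Accuracy, MCC uses all four counts in both numerator and denominator, which makes it informative when the two classes are of different sizes, as is the case for the dataset in this study.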

Comparison with Different Machine Learning Techniques
Since DeepAmp is the first computational model proposed to predict AMPylation, it is not possible to compare its performance with that of other studies. However, to investigate the effectiveness of using a CNN to build DeepAmp, we compare it with other ML models on the same problem. The results achieved by DeepAmp and by other ML models, including Support Vector Machine (SVM), Random Forest (RF), Linear Regression (LR), Decision Tree (DT), and K-Nearest Neighbor (KNN), using the same set of features are presented in Tables 1 and 2 for 5-fold and 10-fold cross-validation, respectively. For every metric, Tables 1 and 2 report the average over 10 runs of 5-fold and 10-fold cross-validation. As shown in these tables, DeepAmp achieves significantly better results in terms of all four metrics than the other machine learning methods investigated in this study. As shown in Table 1, DeepAmp achieves 73.9%, 78.6%, 71.2%, and 0.49 in terms of Accuracy, Sensitivity, Specificity, and Matthews Correlation Coefficient (MCC), respectively, using 5-fold cross-validation. According to Table 2, DeepAmp achieves 77.7%, 79.1%, 76.8%, and 0.55 in terms of the same metrics using 10-fold cross-validation. As shown in Tables 1 and 2, the results using 10-fold cross-validation are slightly better than those using 5-fold cross-validation. This can be attributed to the larger number of samples used to train our model in each round of 10-fold cross-validation (nine tenths of the data, compared with four fifths in 5-fold cross-validation). It suggests that, given a larger benchmark, DeepAmp may achieve even better results. Moreover, the consistency between the results of 5-fold and 10-fold cross-validation demonstrates the generality of DeepAmp.
In Figure 1, the receiver operating characteristic (ROC) curves illustrate the ability of the DeepAmp model to distinguish AMPylation sites from non-AMPylation sites. Also, as shown in Tables 1 and 2, the other ML models display mediocre classification quality in terms of MCC, whereas DeepAmp shows a significant improvement. This demonstrates the effectiveness of DeepAmp over the other classifiers in consistently identifying positive and negative samples.

Methods and Materials
This section describes the proposed method and benchmark dataset presented in this study.

Benchmark Dataset
Kielkowski et al. 9 identified AMPylation in intact cancer cells via LC-MS/MS as well as imaging methods. They identified a total of 162 protein sequences involved in this distinct modification. We investigated these proteins through the UniProt database and identified a total of 133 unique protein sequences, which are used to build our dataset. We then used CD-Hit to remove proteins with over 40% sequence similarity in order to eliminate redundancy in the dataset 29. The resulting dataset contains 130 unique proteins with less than 40% sequence similarity. After that, for each AMPylation and non-AMPylation site, we extracted a 31-residue peptide containing the central AMPylation or non-AMPylation site with 15 residues upstream and 15 residues downstream. We tried different peptide lengths, among which 31-residue peptides attained the best results. For sites near either end of a protein, with fewer than 15 neighboring amino acids on one side, we pad the peptide with "X" residues to equalize its length. As a result, a total of 153 peptides with AMPylated sites and 28872 peptides with non-AMPylated sites were extracted from the 130 protein sequences. From the 28872 non-AMPylated sites, we randomly selected 250 sequences to balance our dataset, giving a negative-to-positive ratio of roughly 1.6:1. Our final dataset thus consists of 403 peptide sequences, with 153 AMPylated peptides and 250 non-AMPylated peptides. This dataset is available at .
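The window extraction and negative subsampling described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `extract_window` and `undersample`, the 0-based site index, and the fixed random seed are all assumptions introduced here.

```python
import random


def extract_window(sequence, site_index, flank=15, pad="X"):
    """Return the (2 * flank + 1)-residue peptide centered on site_index.

    site_index is 0-based (an assumption of this sketch). Positions that fall
    outside the protein are filled with the pad character.
    """
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index is outside the sequence")
    start = site_index - flank
    end = site_index + flank + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    core = sequence[max(0, start):min(len(sequence), end)]
    return left_pad + core + right_pad


def undersample(negatives, n, seed=0):
    """Randomly keep n negative peptides (seed fixed only for repeatability)."""
    return random.Random(seed).sample(negatives, n)


# Toy protein with a small flank, so the padding is easy to see.
near_start = extract_window("ACDEFG", 0, flank=2)   # "XXACD"
near_end = extract_window("ACDEFG", 5, flank=2)     # "EFGXX"
```

With the default `flank=15`, every returned peptide has 31 residues and the candidate site always sits at position 16 of the peptide, regardless of how close it is to a protein terminus.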

Feature Encoding
Feature encoding is an important step in building an effective machine learning model. Binary profile features are straightforward, yet have been shown to be very effective for predicting different functionalities in multi-omics datasets 30,31. In this study, we generate a binary profile for each peptide by representing each amino acid as a 20-dimensional one-hot vector. For instance, Alanine is replaced by the one-hot vector of size 20 [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. As a result, a sequence of length L is represented by a matrix of dimensions L × 20. Considering L = 31 (the length of our peptides), we extract 620 features for each peptide (31 × 20). This feature encoding process is depicted in Figure 2. Since we use a Convolutional Neural Network to build our model, the binary profile can potentially provide extensive information for training it.
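The binary profile encoding can be sketched in a few lines of plain Python. Two details below are assumptions of this sketch rather than statements from the paper: the column order (alphabetical by one-letter code, which is at least consistent with Alanine occupying the first position) and the choice to encode the padding residue "X" as an all-zero row.

```python
# Assumed column order: the 20 standard amino acids, alphabetical by code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
COLUMN = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def binary_profile(peptide):
    """Encode a peptide as an L x 20 list of one-hot rows.

    Residues outside the 20-letter alphabet, including the padding residue
    "X", become all-zero rows (an assumption of this sketch).
    """
    profile = []
    for residue in peptide:
        row = [0] * len(AMINO_ACIDS)
        if residue in COLUMN:
            row[COLUMN[residue]] = 1
        profile.append(row)
    return profile


def flatten(profile):
    """Concatenate the rows into a single feature vector of length L * 20."""
    return [value for row in profile for value in row]


# A padded 31-residue peptide yields a 31 x 20 matrix, i.e. 620 features.
features = flatten(binary_profile("XXXXXXXXXXXXXXXTACDEFGHIKLMNPQR"))
```

The L × 20 matrix form is what a one-dimensional convolutional layer consumes, while the flattened 620-element vector is the natural input for classifiers that expect a fixed-length feature vector.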

Classification Technique
Convolutional neural networks (CNNs) are widely used in computational biology for predicting different biological and chemical functionalities and entities from multi-omics datasets. They have shown tremendous success in the prediction of different PTMs, cancer cell type classification, prediction of origins of replication, and many other tasks [32][33][34]. Like any other neural network, a CNN consists of an input layer, hidden layers, and an output layer. What distinguishes the CNN architecture from regular neural networks is the extraction of feature maps using the convolution operation. Unlike the hidden layers of a regular neural network, which are constructed from sets of fully connected neurons, the hidden layers of a CNN mainly consist of convolutional layers, pooling layers, and fully connected layers 35.
The CNN architecture we used is depicted in Figure 3. The input is the L × 20 matrix, where L is the length of the peptide sequence (31). We applied one-dimensional kernels to the input. The output of our first 1-D convolutional layer, which can also be thought of as a motif scanner, is passed to a max-pooling layer. Of the three convolutional layers we used, max-pooling was applied after the first two. The output of the last convolutional layer is passed directly to a fully connected layer and then to the prediction layer. The Rectified Linear Unit (ReLU) was used as the activation function for each intermediate layer, as it is popular for its simplicity and effectiveness 36,37. In each of the convolutional layers and in the fully connected layer, we used dropout to avoid overfitting 38.
Even though deeper CNN models provide the best results for computer vision problems 39, different studies have shown that, for biological sequence data presented as a matrix, increasing the depth of the network does not necessarily improve prediction accuracy, especially for smaller datasets similar to ours 40. Furthermore, a shallower architecture reduces the chance of overfitting and requires fewer instances for training 38,41. Considering these issues, a shallow CNN architecture is used for constructing DeepAmp in this study.
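To make the "motif scanner" view of the first layer concrete, the sketch below implements a single 1-D convolution filter followed by ReLU and max-pooling in plain Python. It illustrates the operations only and is not the DeepAmp network: the kernel width, weights, bias, pool size, amino acid column order, and all function names are assumptions chosen for the example, and the real model learns many filters and adds dropout and further layers.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order


def one_hot(peptide):
    """L x 20 binary profile; unknown residues such as 'X' become zero rows."""
    return [[1 if residue == aa else 0 for aa in AMINO_ACIDS] for residue in peptide]


def conv1d_relu(profile, kernel, bias=0.0):
    """Slide a (width x 20) kernel along the sequence axis, then apply ReLU.

    No padding is used, so an input of length L gives L - width + 1 outputs.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(profile) - width + 1):
        total = bias
        for offset in range(width):
            row = profile[start + offset]
            total += sum(x * w for x, w in zip(row, kernel[offset]))
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map


def max_pool1d(values, size=2):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hand-written filter that responds only to the dipeptide "ST".
motif_kernel = one_hot("ST")
feature_map = conv1d_relu(one_hot("AASTA"), motif_kernel, bias=-1.0)
pooled = max_pool1d(feature_map, size=2)
```

Here the filter fires only at the window that matches the motif, and max-pooling keeps that response while halving the length of the feature map. In a trained network the kernel weights are learned rather than written by hand.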

Evaluation Methods
To measure the efficacy of DeepAmp, we use k-fold cross-validation. In k-fold cross-validation, the dataset is split into k subsets (folds). In each round, k-1 folds are used for training and the remaining fold is used for validation, so that over the k rounds every sample is used for both training and validation. Since a larger k leaves more samples for training in each round, classifiers tend to show better results as k increases. We used stratified k-fold cross-validation, which maintains a fixed ratio of negative and positive sites in the training and validation sets 42. In this study, we evaluate our model using k = 5 and k = 10, two common values for this parameter.
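The stratified splitting scheme can be sketched as follows. This is a simplified, hand-rolled version written for illustration (the function name `stratified_kfold` and the seed are assumptions); in practice an existing implementation such as scikit-learn's `StratifiedKFold` would typically be used, and the paper does not state which implementation the authors relied on.

```python
import random


def stratified_kfold(labels, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k stratified folds.

    Indices of each class are shuffled and dealt round-robin into the k folds,
    so every fold keeps approximately the overall class ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(
            index for j, fold in enumerate(folds) if j != i for index in fold
        )
        yield train, validation


# Class sizes taken from the benchmark dataset: 153 positives, 250 negatives.
labels = [1] * 153 + [0] * 250
splits = list(stratified_kfold(labels, k=10))
```

With these class sizes and k = 10, each validation fold holds 15 or 16 positive and 25 negative peptides, and each training set holds 362 or 363 of the 403 peptides. With k = 5 the training sets shrink to 322 or 323 peptides.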

Conclusion
In this study, we presented a new dataset that can be used to evaluate computational methods, especially machine learning based models, for predicting AMPylation. In addition, we proposed a new deep learning-based tool called DeepAmp for predicting AMPylation sites using a CNN and binary profile feature vectors. DeepAmp achieves an accuracy of 77.7%, a sensitivity of 79.1%, a specificity of 76.8%, and an MCC of 0.55 for 10-fold cross-validation. DeepAmp also significantly outperforms widely used machine learning models, including Support Vector Machine, K-Nearest Neighbor, and Random Forest, for predicting AMPylation sites. Because of the limited sample size available, achieving high prediction accuracy is difficult. In future refinements of our work, we aim to incorporate new AMPylation sites into the dataset and create a larger database for this PTM. Furthermore, we aim to improve our predictor's performance by using different feature sets and deeper CNN architectures. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.