Phosphorylation is one of the most important post-translational modifications [1] and is a key mechanism in many biological processes, including DNA repair, transcriptional regulation, environmental stress response, apoptosis, metabolism, immune responses, signal transmission, and cellular differentiation [2]. In eukaryotes, phosphorylation occurs on serine (S), threonine (T), and tyrosine (Y) residues. In prokaryotes, it likewise occurs mainly on S, T, and Y, but also on additional amino acids, including arginine (R), histidine (H), cysteine (C), and aspartic acid (D).
In the last few decades, phosphorylation site prediction has attracted much attention, and the development of accurate prediction methods has become very important. Existing methods for identifying phosphorylation sites fall into two broad categories: biological experimental methods, which are expensive and time-consuming, and computational methods, which are fast and inexpensive. Moreover, experimental identification is labor-intensive and requires specialized equipment and technical knowledge. Phosphosite prediction algorithms are therefore increasingly used to produce a list of candidate phosphorylation sites in a protein of interest, which experimental methods then verify. Many predictors have been introduced for PTM sites in general (such as phosphorylation and methylation) [3], [4], but predictors designed specifically for phosphorylation appear to give more accurate results.
Computational phosphorylation site prediction tools fall into three categories: general (non-kinase-specific) prediction, kinase-specific prediction, and global prediction. General tools predict sites that can be phosphorylated, kinase-specific tools predict sites that can be phosphorylated by a specific kinase, and global tools predict both general and kinase-specific phosphorylation sites [5].
Non-kinase-specific tools can predict phosphosites for which the associated kinase is unknown or has few known substrate sequences [2]. Moreover, with recent advances in sequencing technology, many genomes of non-model organisms have been sequenced and more kinases have been discovered in those reconstructed genomes, some of which lack sufficient substrate information to train kinase-specific prediction algorithms. Thus, there is increased interest in developing non-kinase-specific tools that cover a wider variety of species and offer high specificity for whole-genome annotation [6].
To date, several general phosphorylation site prediction models have been proposed. They differ mainly in the machine learning algorithm and the feature engineering used to capture the complex patterns surrounding phosphorylated residues. The most widely used machine learning methods in general prediction tools include artificial neural networks (ANNs), support vector machines (SVMs), logistic regression (LR), and random forests (RFs). For instance, NetPhos uses neural networks to identify phosphorylation sites [7], while DISPHOS uses amino acid frequency and disorder information to train an LR model [8]. In PPRED, Biswas et al. combine the evolutionary information of proteins with SVMs to predict phosphorylation sites [9]. Musite integrates three sets of features (K-nearest-neighbor scores, protein disorder scores, and amino acid frequencies) to train an SVM [4], and PhosphoSVM, one of the most recent SVM-based tools, combines eight different sequence-level scoring functions [6]. RF algorithms can provide insight into the relative importance of each feature; thus, RF classifiers have been applied to various bioinformatics problems [10]. For example, RF-Phos uses a random forest with sequence and structural features to predict general phosphorylation sites [11].
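As an illustration of the kind of sequence-level feature these tools rely on, the following minimal Python sketch extracts fixed-length windows centred on candidate S/T/Y residues and computes amino acid frequencies for each window. The function names, the padding symbol "X", and the window half-width are assumptions made for this example; they are not taken from any of the cited tools.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for positions beyond the sequence termini


def extract_windows(sequence, half_width=7, targets="STY"):
    """Return (position, window) pairs centred on each candidate residue."""
    padded = PAD * half_width + sequence + PAD * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Position i in the sequence sits at i + half_width in `padded`,
            # so this slice has the candidate residue at its centre.
            windows.append((i, padded[i:i + 2 * half_width + 1]))
    return windows


def aa_frequencies(window):
    """Relative frequency of the 20 standard amino acids, ignoring padding."""
    counts = Counter(ch for ch in window if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


# Toy sequence (not real data): candidate sites at S2, T4 and Y6 (0-based).
windows = extract_windows("MKSPTAYQ", half_width=2)
features = [aa_frequencies(w) for _, w in windows]
```

Each window yields a 20-dimensional vector that a classifier such as an SVM or random forest could consume, typically alongside other features.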
A major recent advance in machine learning is the introduction of deep artificial neural networks. Deep learning is now one of the most effective fields in machine learning and has made breakthroughs in image and speech recognition, natural language processing, and, most recently, computational biology [12]. In contrast to traditional machine learning techniques, a deep neural network takes the raw data at the lowest (input) layer, automatically discovers complex representations, and adaptively captures high-level abstractions from the training data for classification. Thus, the application of deep learning to biological sequence analysis is growing. For example, DeepBind uses a convolutional neural network (CNN) to predict the sequence specificities of DNA- and RNA-binding proteins [13]; MusiteDeep uses a CNN with a two-dimensional attention mechanism for site prediction [14]; DeepNitro uses a multi-layer deep neural network to predict nitration and nitrosylation sites [15]; DeepPhos improves upon the performance of MusiteDeep with a multi-layer CNN architecture [16]; and DeepPSP extracts both local and global features from protein sequences with two parallel modules [17].
In addition to deep learning methods based on neural networks, deep models built on other learners have also been introduced, including deep forests [18]. A deep forest is a classification method that consists of several layers, each containing several random forests. It has fewer hyperparameters than conventional deep learning methods, and the complexity of the model can be determined automatically from the data. Another advantage is that it can produce good results without using backpropagation [19].
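The layered structure described above can be written compactly as follows. This is a sketch of the cascade as it is commonly described for deep forests, with notation chosen for this example (the symbols x, z, p, M, C, and L are assumptions, not the notation of [18] or [19]): x is the original feature vector, each layer holds M forests, and each forest outputs a C-dimensional class-probability vector that is concatenated with x to form the input of the next layer.

```latex
% Assumed notation: x in R^d is the input feature vector, M forests per layer,
% C classes, p_l^{(m)} is the class-probability output of forest m in layer l.
\begin{align}
  z_0 &= x, \\
  z_\ell &= \bigl[\, x,\; p_\ell^{(1)}(z_{\ell-1}),\; \dots,\; p_\ell^{(M)}(z_{\ell-1}) \,\bigr]
          \in \mathbb{R}^{d + MC}, \qquad \ell = 1, \dots, L-1, \\
  \hat{y} &= \arg\max_{c \in \{1,\dots,C\}} \; \frac{1}{M} \sum_{m=1}^{M}
             \bigl[ p_L^{(m)}(z_{L-1}) \bigr]_c .
\end{align}
```

Under this formulation the number of layers L need not be fixed in advance: layers can be added until performance on held-out data stops improving, which is one way the model complexity can be set from the data.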
In this study, we focused on developing a new phosphorylation site predictor by seeking a more informative encoding scheme and the best-performing machine learning method. After a preliminary assessment of 37 different encoding schemes across 9 machine learning methods (the architecture of our phosphorylation predictor is shown in Fig. 1), we found that the composition of k-spaced amino acid pairs (CKSAAP) combined with a deep forest is well suited to phosphorylation prediction. We therefore present DF-Phos, a general protein phosphorylation site predictor that uses a deep forest and the CKSAAP feature extraction method to predict phosphorylation sites from protein sequence information. We collected human and mouse data from two databases, dbPAF [20] and Phospho.ELM [21]. To avoid a biased classifier, the training set was created with a positive-to-negative ratio of 1:1. The optimal window length was determined using 10-fold cross-validation and an independent test set. We then evaluated our predictor using a 10-fold cross-validation procedure and compared it with several existing phosphosite predictors.
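To make the encoding and the 1:1 balancing step concrete, the following Python sketch computes a CKSAAP feature vector for a sequence window and undersamples the larger class. It follows the usual definition of CKSAAP (for each gap k, the frequency of each ordered amino acid pair separated by k residues) and is not the DF-Phos implementation. The function names, the default k_max, the handling of non-standard characters, and the use of random undersampling are all assumptions made for illustration.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(window, k_max=3):
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    Returns 400 * (k_max + 1) values. Pairs containing a character outside
    the 20 standard amino acids (e.g. padding) are skipped (an assumption).
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_positions = len(window) - k - 1  # number of k-spaced pairs
        for i in range(n_positions):
            pair = window[i] + window[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(
            counts[p] / n_positions if n_positions > 0 else 0.0 for p in PAIRS
        )
    return features


def balance_one_to_one(positives, negatives, seed=0):
    """Randomly undersample the larger class to reach a 1:1 ratio."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


# Toy window (not real data): with k = 0 the pairs are AS, SA, AS.
vec = cksaap("ASAS", k_max=1)
pos, neg = balance_one_to_one(list(range(3)), list(range(10)))
```

For the toy window "ASAS", the k = 0 block assigns 2/3 to the pair AS and 1/3 to SA, and the k = 1 block assigns 1/2 each to AA and SS. The resulting vectors would then be passed to the classifier.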