Computational prediction of protein-protein interactions in plants using only sequence information

Protein-protein interactions (PPIs) in plants play a significant role in plant biology and the functional organization of cells. Although a large amount of plant PPI data has been generated by high-throughput techniques, the complexity of the plant cell means that the PPI pairs currently obtained by experimental methods cover only a small fraction of the complete plant PPI network. In addition, the experimental approaches for identifying PPIs in plants are laborious, time-consuming, and costly. Hence, it is highly desirable to develop more efficient approaches to detect PPIs in plants. In this study, we present a novel computational model that combines a weighted sparse representation-based classifier (WSRC) with a novel inverse fast Fourier transform (IFFT) representation scheme, which is applied to the position-specific scoring matrix (PSSM) to extract features from plant protein sequences. When the proposed method was applied to the plant PPI datasets of Maize, Rice and Arabidopsis thaliana (Arabidopsis), it achieved accuracies of 89.12%, 84.72% and 71.74%, respectively. To further assess the prediction performance of the proposed approach, we compared it with the state-of-the-art support vector machine (SVM) classifier. To the best of our knowledge, we are the first to employ protein sequence information to predict PPIs in plants. Experimental results demonstrate that the proposed method has great potential to become a powerful tool for exploring plant cell function.


Introduction
In plants, the prediction of protein-protein interactions (PPIs) provides important information for understanding the molecular mechanisms underlying biological processes. Recently, a large number of high-throughput experimental approaches have been developed to identify PPIs, such as affinity purification coupled to mass spectrometry (AP-MS) [1] and yeast two-hybrid (Y2H) screens [2][3][4][5]. Although a large amount of plant PPI data has been accumulated [6][7][8], these experimental approaches have some inevitable drawbacks: they are costly, laborious and time-consuming. Moreover, these traditional biochemical experiments often suffer from high false positive and false negative rates. Because of the complexity of plant growth and development systems, large-scale experimental methods cannot be readily adopted in the plant domain, and only a small fraction of the whole plant PPI network has been detected so far. Therefore, it is of great significance to develop efficient computational approaches to identify PPIs in plants [9,10].
In recent years, much effort has been made to develop PPI identification methods based on different data types, including literature mining [11], gene fusion [12] and protein structure information [13,14]. A large number of PPI databases have been built, such as TAIR [15], PRIN [16], and MINT [17]. There are also approaches that combine data and information from different sources [18][19][20] to predict PPIs. However, these methods cannot be applied without prior knowledge of the corresponding proteins.
Recently, PPI prediction methods that extract information directly from amino acid sequences have received much attention [21][22][23][24]. Many researchers have worked on sequence-based methods to detect novel PPIs, and experimental results indicate that PPIs in plants can be accurately identified using only sequence information [25][26][27][28]. For example, Sun et al. [29] presented a sequence-based method that uses a deep-learning algorithm, the stacked autoencoder (SAE), to predict PPIs in human datasets. This model, which was based on autocovariance coding of protein sequences, obtained its best results under 10-fold cross-validation. Another excellent work that uses protein sequence information to predict PPIs was presented by Shen et al. [30]. This method is based on an SVM model combined with a conjoint triad feature and a kernel function for describing amino acids. Specifically, the 20 amino acids are clustered into seven classes according to their volumes and dipoles, and the conjoint triad method then abstracts the features of protein pairs. Wang et al. [31] proposed a computational method for detecting PPIs from sequence information that combines a Zernike moments descriptor with a stacked autoencoder. First, they applied the Zernike moment feature representation to a position-specific scoring matrix. Second, a stacked autoencoder was used for noise reduction. Finally, a probabilistic classification vector machine (PCVM) was used to handle the classification problem. You et al. [32] also developed a computational approach called PCA-EELM to predict PPIs. The main improvement of this study is that the PCA method was adopted to construct the most discriminative new feature set. In addition, many other methods based on amino acid sequences have been developed in the literature [33,34]. While these studies have achieved some progress, there is still room for improvement in the efficiency and accuracy of the models.
In the present work, we propose a novel computational method to detect PPIs in plants from protein sequence information, which employs the position-specific scoring matrix (PSSM) and combines the weighted sparse representation-based classifier (WSRC) with the inverse fast Fourier transform (IFFT). This feature representation approach combined with the WSRC performs remarkably well in the prediction of PPIs in plants. The proposed model consists of three steps. First, each plant protein sequence is represented as a position-specific scoring matrix so that the evolutionary information between different types of amino acids can be obtained. Second, the inverse fast Fourier transform (IFFT) is used to extract a 400-dimensional vector from the PSSM of each plant protein, so that each protein pair is described by an 800-dimensional feature vector. Third, a powerful classifier, the weighted sparse representation-based classifier, is employed to perform PPI prediction on three plant PPI datasets: Maize, Rice and Arabidopsis thaliana (Arabidopsis). We also compared the proposed model with the state-of-the-art support vector machine (SVM) classifier to further evaluate the prediction performance. The experimental results demonstrate that our approach performs well in distinguishing interacting from non-interacting plant protein pairs, and they further show that the proposed approach is promising and reliable for the prediction of protein-protein interactions in plants. The source code and datasets explored in this work are available at: https://github.com/jie-pan111/protein_sequence.

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + FN + TN}$$
where true positive (TP) denotes the number of interacting plant protein pairs correctly classified as interacting, true negative (TN) denotes the number of non-interacting pairs correctly predicted as non-interacting, false positive (FP) denotes the number of non-interacting pairs incorrectly classified as interacting, and false negative (FN) denotes the number of interacting plant protein pairs incorrectly predicted as non-interacting. In addition, we adopted receiver operating characteristic (ROC) curves to assess the prediction performance of the proposed approach, and the area under the ROC curve (AUC) was calculated to quantify the quality of the prediction model.
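As a minimal sketch of how these criteria can be computed from the four confusion counts: only the accuracy formula survives in the text above, so the precision, sensitivity and MCC definitions used here are the standard ones and are an assumption, not taken from the source. The counts in the example are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, sensitivity and MCC from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Standard textbook definitions (assumed; not recoverable from the source).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Guarding the denominators keeps the function defined even when a fold contains no predicted positives.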

Assessment of Prediction Ability.
In this article, we used 5-fold cross-validation to evaluate the predictive ability of our model on three plant datasets: Maize, Rice and Arabidopsis. In this way, we can guard against overfitting and test the stability of the proposed method. More specifically, the whole dataset is partitioned into five roughly equal parts; four of them are used to construct the training set and the remaining one is used as the testing set. Rotating the held-out part yields five models, one for each split. Cross-validation has the advantage that it minimizes the impact of data dependency and improves the reliability of the results.
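The partitioning described above can be sketched as follows. This is an illustrative outline only: the shuffling seed and the helper name `five_fold_indices` are hypothetical, and the sample count of 12,500 simply mirrors the size reported for the Maize dataset.

```python
import random


def five_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives folds of roughly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


splits = list(five_fold_indices(12500))
```

Each of the five `(train_idx, test_idx)` pairs would be used to fit one model and score it on the held-out fifth, and the reported figures would then be the mean and standard deviation over the five folds.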
The 5-fold cross-validation results of the proposed approach on the three plant datasets are listed in Tables 1-3. From Table 1, we can observe that on the Maize dataset the proposed method obtained an average accuracy, precision, sensitivity, and MCC of 89.12%, 87.49%, 91.32%, and 80.59%, with corresponding standard deviations of 0.59%, 1.38%, 0.64%, and 0.94%, respectively. On the Rice dataset, the average accuracy, precision, sensitivity and MCC were 84.72%, 85.04%, 84.44% and 84.10%, with standard deviations of 0.73%, 0.85%, 0.65% and 1.00%, respectively. On the Arabidopsis dataset, the proposed approach obtained an average accuracy, precision, sensitivity and MCC of 71.74%, 69.33%, 77.02% and 58.97%, with standard deviations of 0.48%, 0.58%, 1.15% and 0.38%, respectively. Figures 1-3 show the ROC curves for the proposed approach on Maize, Rice and Arabidopsis. The average AUC values range from 79.19% to 93.76% (Maize: 93.76%, Rice: 88.75% and Arabidopsis: 79.19%), suggesting that our method is well suited to predicting PPIs in plants from amino acid sequences.
These results collectively indicate that protein sequence information alone can be sufficient to predict PPIs in plants, and that strong prediction capability can be obtained by combining the weighted sparse representation-based classifier with IFFT features. This prediction performance derives from the feature extraction method for plant protein sequences and the choice of machine learning classifier. The high accuracies and low standard deviations of these criteria indicate that our proposed model is feasible and effective for predicting PPIs in plants.

Comparison of the proposed model with different classifiers.
Although the WSRC model performed well in predicting plant PPIs, its prediction ability needs further verification. We therefore applied the same feature extraction approach to the Maize, Rice and Arabidopsis datasets and compared the prediction accuracy of the WSRC model with that of the state-of-the-art SVM. We employed the LIBSVM tool to run this classification, and 5-fold cross-validation was also adopted in these experiments. To obtain better performance from the SVM classifier, its parameters must be optimized. In this study, the penalty parameter C and the kernel parameter g of the SVM model were optimized by the grid search method. For the Maize and Rice datasets, we set c=5, g=0.5 and c=6, g=0.5, respectively; for the Arabidopsis dataset, we set c=7, g=0.03.
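The grid search over the two SVM parameters can be illustrated with the generic sketch below. It assumes that g is the width parameter of an RBF kernel of the form exp(-g * ||u - v||^2), which is how LIBSVM defines its radial basis kernel; the source does not state the kernel type. The scoring function `toy_cv_accuracy` is a hypothetical placeholder standing in for a real 5-fold cross-validation run, and the candidate grids are likewise made up for illustration.

```python
import itertools
import math


def rbf_kernel(u, v, g):
    """RBF kernel exp(-g * ||u - v||^2); g is assumed to be the RBF width."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-g * squared_distance)


def grid_search(score_fn, c_values, g_values):
    """Return (best_c, best_g, best_score) over the Cartesian grid."""
    best = None
    for c, g in itertools.product(c_values, g_values):
        score = score_fn(c, g)
        if best is None or score > best[2]:
            best = (c, g, score)
    return best


def toy_cv_accuracy(c, g):
    """Placeholder objective, NOT a real SVM: peaks at c=5, g=0.5 by design."""
    return -((c - 5) ** 2) - (g - 0.5) ** 2


best_c, best_g, best_score = grid_search(
    toy_cv_accuracy, c_values=range(1, 11), g_values=[0.03, 0.1, 0.5, 1.0]
)
```

In practice `score_fn` would train an SVM with the given (c, g) on four folds and return the accuracy on the fifth, averaged over the five splits.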
As shown in Table 4, when the SVM model was applied to the Maize dataset, it yielded an average accuracy, precision, sensitivity, MCC and AUC of 81.77%, 83.10%, 79.78%, 70.16% and 88.04%, respectively. On the Rice dataset, the SVM classifier yielded an average accuracy, precision, sensitivity, MCC and AUC of 79.13%, 78.27%, 80.91%, 66.93% and 86.62%, respectively. On the Arabidopsis dataset, the average accuracy, precision, sensitivity, MCC and AUC were 62.55%, 63.49%, 58.97%, 53.03% and 66.87%, respectively. For all three plant datasets, the classification results of the SVM-based models are lower than those of the proposed approach. The ROC curves of these experiments are also shown in Figures 1-3. In Figure 1, the AUC of the SVM model on the Maize dataset is 0.8804, whereas that of the WSRC is 0.9376. In Figure 2, the average AUC of the SVM classifier on the Rice dataset is 0.8662 and that of the WSRC method is 0.8875. In Figure 3, on the Arabidopsis dataset, the SVM-based method obtains an average AUC of 0.6687 and the WSRC method obtains 0.7919. In all of these experiments, the average AUC of the WSRC method is larger than that of the SVM-based method. From these results, we conclude that the weighted sparse representation-based classifier is an effective and robust model for PPI prediction in plants.
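For readers unfamiliar with the AUC values compared above, the following sketch computes the AUC as the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair, with ties counted as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve; it is not necessarily how the authors computed their figures, and the labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for two interacting and two non-interacting pairs.
example_auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The quadratic loop is fine for illustration; for datasets of the size used here a rank-based implementation would be preferable.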
(a) ROC curves of the SVM method; (b) ROC curves of the WSRC method on the Arabidopsis dataset.

Data collection and data set construction.
We verified the proposed model on three plant PPI datasets. The first dataset is Maize. Maize is one of the most important food, feed and industrial crops in the world and also an excellent model for plant genetics. To better demonstrate the prediction performance of the proposed model and to understand the molecular mechanisms underlying various traits of maize, we selected it as the first plant dataset in this study. We collected the Maize dataset from agriGO [38] and the Protein-Protein Interaction Database for Maize (PPIM) [39], which covers 2,762,560 interactions among 14,000 proteins. After strict inclusion and exclusion screening, we selected 6250 protein pairs from 6497 maize proteins. As a result, the whole Maize dataset consists of 12,500 maize protein pairs.
Rice is one of the most important staple foods for more than half of the world's population. To validate the generality of the proposed method, we also applied it to the Rice PPI dataset. We collected the Rice dataset from the protein reference database agriGO [38] and PRIN [16]. To construct the negative dataset, we selected 4800 additional protein pairs whose proteins are located in different subcellular compartments and assumed that they do not interact with each other. As a result, the whole Rice dataset consists of 9600 protein pairs from 3760 rice proteins.
Arabidopsis thaliana (Arabidopsis) is a well-known model plant, and we chose it as the third dataset in this study, which we collected from the public PPI databases TAIR [15], IntAct [40] and BioGRID [41]. After removing redundant PPIs, the remaining 4120 protein pairs, involving 6013 Arabidopsis proteins, were used to build the positive dataset [42]. To construct the negative dataset, we randomly selected the same number of non-interacting protein pairs. The entire Arabidopsis dataset therefore consists of 8240 protein pairs.
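The random construction of a negative set of equal size can be sketched as below. The function name, the protein identifiers and the seed are hypothetical, and the sketch makes the simplifying assumption that any pair absent from the positive set can be treated as non-interacting; it does not implement the subcellular-localization filter used for the Rice dataset.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=0):
    """Randomly draw unordered protein pairs that are absent from the positive set."""
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    max_pairs = len(proteins) * (len(proteins) - 1) // 2 - len(positives)
    if n_pairs > max_pairs:
        raise ValueError("not enough candidate pairs to sample from")
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


# Hypothetical identifiers, for illustration only.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = [("P1", "P2"), ("P3", "P4")]
negatives = sample_negative_pairs(proteins, positives, n_pairs=len(positives))
```

Using `frozenset` makes the check order-independent, so (A, B) and (B, A) are treated as the same pair.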

Position-Specific Scoring Matrix (PSSM).
The Position-Specific Scoring Matrix (PSSM) was introduced by Gribskov [43] and has achieved great success in protein binding site prediction, protein secondary structure prediction and the prediction of disordered regions [44][45][46]. A PSSM is a matrix of N rows and 20 columns. Each protein sequence can be transformed as follows:

$$M = \{ M_{\alpha,\beta} : \alpha = 1, \dots, N;\ \beta = 1, \dots, 20 \} \tag{5}$$

where N denotes the length of a given plant protein sequence and the 20 columns correspond to the 20 amino acids. For each query sequence, the PSSM assigns the value $M_{\alpha,\beta}$ to the β-th amino acid at position α. $M_{\alpha,\beta}$ can be calculated as:

$$M_{\alpha,\beta} = \sum_{k=1}^{20} p(\alpha, k) \times q(\beta, k)$$

where $q(\beta, k)$ is the value of Dayhoff's mutation matrix between the β-th and k-th amino acids, and $p(\alpha, k)$ is the occurrence frequency score of the k-th amino acid at position α of the probe. Hence, a high value indicates a strongly conserved position, whereas a low value indicates a weakly conserved one.
In this study, we employed the Position-Specific Iterated BLAST (PSI-BLAST) tool [47] to generate the PSSM for each protein sequence. We set the e-value to 0.001 and used 3 iterations. All other parameters were left at their default values to obtain highly and widely homologous sequences.
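As a rough illustration of these settings, the sketch below assembles a command line for the `psiblast` program from the NCBI BLAST+ suite. The source does not say which PSI-BLAST distribution or which sequence database was used, so the choice of BLAST+, the database name `swissprot` and the file names are all assumptions; only the e-value of 0.001 and the 3 iterations come from the text.

```python
def build_psiblast_command(fasta_path: str, db_name: str, pssm_path: str) -> list:
    """Return a psiblast argument list with the e-value and iteration settings above."""
    return [
        "psiblast",
        "-query", fasta_path,          # single-sequence FASTA file (hypothetical path)
        "-db", db_name,                # search database (assumed, not stated in the text)
        "-evalue", "0.001",            # e-value threshold from the text
        "-num_iterations", "3",        # number of PSI-BLAST iterations from the text
        "-out_ascii_pssm", pssm_path,  # write the PSSM as a plain-text table
    ]


command = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and a formatted database are installed.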

Inverse Fast Fourier transform.
The Fast Fourier Transform (FFT) [48] is one of the most important algorithms in computational science and engineering and is indispensable in digital signal processing. However, the FFT algorithm is not suitable for many practical applications in which the data are not uniformly sampled. For this reason, we adopted the inverse fast Fourier transform (IFFT) [49] to obtain the transient response in the time domain.
In the FFT, the irregularities of the twiddle factors can be handled through the sine and cosine transforms of the signal. The cosine and sine transforms of the input signal are added together to obtain the FFT of the two-dimensional signal. The sine and cosine matrices required for the FFT and IFFT are defined in Equations (7) and (8), and the 2-D FFT is obtained by adding the sine and cosine transforms, as in Equation (9). In our study, each protein sequence in the three plant datasets is converted into a 400-dimensional vector by means of the inverse fast Fourier transform.
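The text does not spell out how a variable-length N x 20 PSSM is reduced to exactly 400 values, so the following is only one plausible reading, offered as a sketch: apply a 2-D inverse FFT to the PSSM (zero-padding sequences shorter than 20 residues), keep the first 20 rows of the result, and take magnitudes to obtain a real 20 x 20 block. The truncation rule, the use of magnitudes and the function names are assumptions, not the authors' documented procedure.

```python
import numpy as np


def ifft_features(pssm: np.ndarray, n_rows: int = 20) -> np.ndarray:
    """Map an N x 20 PSSM to a fixed 400-dimensional vector (assumed scheme)."""
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an N x 20 PSSM")
    # Zero-pad short sequences so that at least n_rows rows are available.
    padded_len = max(pssm.shape[0], n_rows)
    spectrum = np.fft.ifft2(pssm, s=(padded_len, 20))
    # Keep the first n_rows rows; magnitudes turn complex values into reals.
    return np.abs(spectrum[:n_rows, :]).ravel()


def pair_features(pssm_a: np.ndarray, pssm_b: np.ndarray) -> np.ndarray:
    """Concatenate the two 400-dimensional vectors into one 800-dimensional vector."""
    return np.concatenate([ifft_features(pssm_a), ifft_features(pssm_b)])


# Random integer matrices standing in for real PSSMs (illustration only).
rng = np.random.default_rng(0)
pssm_long = rng.integers(-5, 6, size=(57, 20)).astype(float)
pssm_short = rng.integers(-5, 6, size=(12, 20)).astype(float)
pair_vector = pair_features(pssm_long, pssm_short)
```

Whatever reduction is actually used, the key property shown here is that proteins of different lengths map to vectors of the same size, so each pair yields an 800-dimensional input for the classifier.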

Weighted sparse representation-based classifier.
Recently, with the development of linear representation-based methods (LRBM) and compressed sensing (CS) theory, the sparse representation-based classification (SRC) algorithm [50,51] has been widely applied in signal processing, pattern recognition and computer vision. SRC assumes a training sample matrix $X \in \mathbb{R}^{d \times n}$ and that a test sample $y \in \mathbb{R}^{d}$ belonging to the kth class can be represented as a linear combination of the kth-class training samples:

$$y = \alpha_{k,1} x_{k,1} + \alpha_{k,2} x_{k,2} + \dots + \alpha_{k,n_k} x_{k,n_k}$$

where $x_{k,i}$ is the ith training sample of the kth class and $n_k$ is the number of training samples in that class. When the whole training set is taken into account, this can be written as:

$$y = X \alpha_0 \tag{12}$$

where $\alpha_0 = [0, \dots, 0, \alpha_{k,1}, \dots, \alpha_{k,n_k}, 0, \dots, 0]^T$. The nonzero entries of $\alpha_0$ are associated only with the kth class, so when the number of samples is large, $\alpha_0$ becomes sparse. The key step of the SRC algorithm is to find the vector $\alpha$ that satisfies formula (12) while minimizing its $l_0$-norm:

$$\hat{\alpha}_0 = \arg\min \|\alpha\|_0 \quad \text{subject to} \quad y = X\alpha \tag{13}$$

Problem (13) is NP-hard and difficult to solve exactly. According to CS theory, if $\alpha$ is sparse enough, the related convex $l_1$-minimization problem can be solved instead of the $l_0$-minimization problem:

$$\hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad y = X\alpha \tag{14}$$

To deal with occlusion, Eq. (14) can be extended to the stable $l_1$-minimization problem:

$$\hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad \|y - X\alpha\| \le \varepsilon \tag{15}$$

where $\varepsilon > 0$ represents the tolerance for the reconstruction error. Eq. (15) can be solved using standard linear programming methods.
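After obtaining the sparsest solution $\hat{\alpha}_1$, SRC assigns the test sample y to class k via the following rule:

$$\min_{k} r_k(y) = \| y - X \hat{\alpha}_1^{k} \|, \quad k = 1, \dots, K$$

where $\hat{\alpha}_1^{k}$ keeps only the coefficients of $\hat{\alpha}_1$ associated with class k and K is the number of classes. However, some studies [52][53][54] have reported that, in some cases, the locality structure of the data is more important than sparsity, and traditional SRC cannot guarantee locality.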
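To solve this problem, Lu et al. [55] developed a variant of SRC called the weighted sparse representation-based classifier (WSRC). The main improvement of this method is that it combines the locality structure of the data with sparse representation. By mapping the training data into a higher-dimensional kernel-induced feature space, it can achieve better classification performance. A Gaussian kernel-based distance is used in WSRC to calculate the weights:

$$d_G(s_1, s_2) = \exp\left( -\frac{\|s_1 - s_2\|^2}{2\sigma^2} \right)$$

where $s_1, s_2 \in \mathbb{R}^{d}$ denote two samples and $\sigma$ is the Gaussian kernel width. In this way, WSRC preserves the locality structure of the data by solving the following problem:

$$\hat{\alpha}_1 = \arg\min \|W\alpha\|_1 \quad \text{subject to} \quad y = X\alpha$$

and specifically,

$$\operatorname{diag}(W) = [d_G(y, x_{1,1}), \dots, d_G(y, x_{K,n_K})]^T$$

where $x_{k,i}$ is the ith training sample of class k. To deal with occlusion, the stable form of the problem is solved:

$$\hat{\alpha}_1 = \arg\min \|W\alpha\|_1 \quad \text{subject to} \quad \|y - X\alpha\| \le \varepsilon$$

where $\varepsilon > 0$ denotes the tolerance value. The steps of the WSRC algorithm are summarized in the Algorithm listing.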

Support Vector Machine.
Various machine learning models can be used to predict PPIs, and the support vector machine is one of the most popular classifiers. The SVM was developed by Cortes and Vapnik [56] in 1995; it is a generalized linear model commonly used for classification and regression tasks. The idea of the SVM algorithm is to find the optimal hyperplane that maximally separates the training data of the two classes, which can be cast as a convex quadratic programming problem. The formal definition of the SVM can be expressed as:

$$\min_{w, b, S} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} S_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - S_i, \quad S_i \ge 0 \tag{21}$$

where w represents the normal vector that defines the hyperplane, C is the classifier parameter, and n denotes the number of vectors in the training dataset. The first part of the objective function in Equation (21) expresses the maximization of the margin. The quantity $1/\|w\|^2$ reflects the margin, which is determined by the distance between the hyperplane and the nearest vectors. Thus, minimizing $\|w\|^2$ is equivalent to maximizing $1/\|w\|^2$, that is, maximizing the margin. The goal of the objective function is to maximize the margin, because a larger margin increases the separation between the classes and ensures a cleaner separation. However, if the margin is increased, the probability of misclassification also increases. The rate of misclassification is estimated by the slack variable $S_i$, which is 0 for well-classified vectors, between 0 and 1 for vectors located in the separation region, and above 1 for misclassified vectors. The training examples that define the separating hyperplane are called support vectors. The parameter C weights the misclassification penalty: if the value of C is too high, the penalties for misclassification increase and the margin is reduced. The flow chart of our method is shown in Figure 4.
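The three regimes of the slack variable described above can be made concrete with a small sketch. It assumes the standard soft-margin objective (1/2)||w||^2 + C * sum(S_i) with S_i = max(0, 1 - y_i (w . x_i + b)); the hyperplane, the sample points and the value of C are hypothetical and chosen only so that each regime appears once.

```python
def soft_margin_terms(w, b, xs, ys, c):
    """Return the slack of each sample and the soft-margin objective value."""
    slacks = []
    for x, y in zip(xs, ys):
        functional_margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Slack is zero when the sample lies on the correct side of the margin.
        slacks.append(max(0.0, 1.0 - functional_margin))
    objective = 0.5 * sum(wi * wi for wi in w) + c * sum(slacks)
    return slacks, objective


# Hypothetical hyperplane and three positive-class samples.
w, b = [1.0, 0.0], 0.0
xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
ys = [1, 1, 1]
slacks, objective = soft_margin_terms(w, b, xs, ys, c=5.0)
```

The first sample is well classified (slack 0), the second lies inside the separation region (slack between 0 and 1), and the third is on the wrong side of the hyperplane (slack above 1). Raising `c` makes the last two more expensive, which pushes the optimizer toward a narrower margin.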

Conclusions
In this study, we present an effective and accurate computational method that uses amino acid sequence information to predict PPIs in plants. The method is based on a weighted sparse representation-based classifier combined with the inverse fast Fourier transform and the position-specific scoring matrix. The main advantage of this approach is that it exploits the strengths of the WSRC method, including better generalization, simplicity, and consideration of the sparsity and continuity of plant protein sequence data. The whole prediction model is composed of the following steps. First, all plant protein sequences were converted into PSSMs so that the evolutionary information of each sequence could be obtained. Second, we employed the inverse fast Fourier transform to extract a feature vector from each PSSM. Finally, the weighted sparse representation-based classifier was used as the machine learning classifier. The proposed approach performs well on three plant PPI datasets: Maize, Rice and Arabidopsis. To demonstrate the efficiency and reliability of the proposed model, we also compared its prediction performance with that of the state-of-the-art SVM model. All of these experimental results indicate that our method can improve the accuracy of PPI prediction in plants. In conclusion, the proposed method is a reliable, efficient and powerful prediction model for future proteomics research. To the best of our knowledge, this is the first time that computational methods based only on protein sequence information have been used to predict PPIs in plants.

Figure 1. Comparison of the ROC curves obtained by the WSRC and SVM-based methods on the Maize dataset (5-fold cross-validation).

In the SRC formulation, the training set contains n samples with d-dimensional feature vectors, and it is assumed that there are sufficient training samples belonging to the kth class; $n_k$ denotes the number of samples of the kth class and $l_i$ represents the label of the ith sample. The sample matrix X is built from the training samples of each class k, and the total number of classes is denoted K. SRC then represents a test sample as a sparse combination of the training samples and finally assigns it to the class that minimizes the residual between the sample and its class-wise reconstruction.

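Algorithm. Weighted Sparse Representation-based Classifier (WSRC)
1. Input: the matrix of training samples $X \in \mathbb{R}^{d \times n}$ and a test sample $y \in \mathbb{R}^{d}$.
2. Normalize the columns of X to have unit $l_2$-norm.
3. Calculate the Gaussian distances between y and each sample in X and use them to adjust the training sample matrix X to X'.
4. Solve the stable $l_1$-minimization problem.
5. Compute the residual $r_c(y)$ of y with respect to each class c.
6. Output: the prediction label of y as $\text{identity}(y) = \arg\min_c (r_c(y))$.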
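The algorithm steps above can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions, not the authors' implementation: the constrained stable l1 problem is replaced by its penalized form 0.5 * ||y - X a||^2 + lam * ||W a||_1, which is solved with plain iterative soft-thresholding (ISTA); the weights are taken as exp(-||y - x_i||^2 / (2 sigma^2)); and the weights act on the penalty instead of rescaling X to X', which is an equivalent change of variables. The values of `sigma`, `lam` and `n_iter`, and whether nearer samples should in fact receive smaller penalties, should be checked against Lu et al. [55].

```python
import numpy as np


def wsrc_predict(X, labels, y, sigma=1.0, lam=0.01, n_iter=500):
    """Classify y by weighted sparse representation over the columns of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    y = np.asarray(y, dtype=float)
    # Step 2: normalize every training column to unit l2-norm.
    X = X / np.linalg.norm(X, axis=0, keepdims=True)
    # Step 3: Gaussian weights between y and each training sample (assumed form).
    squared_dist = np.sum((X - y[:, None]) ** 2, axis=0)
    weights = np.exp(-squared_dist / (2.0 * sigma ** 2))
    # Step 4 (assumed solver): ISTA on the penalized weighted-l1 problem.
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest squared singular value
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):
        v = alpha - step * (X.T @ (X @ alpha - y))
        alpha = np.sign(v) * np.maximum(np.abs(v) - step * lam * weights, 0.0)
    # Steps 5-6: class-wise residuals, then pick the class with the smallest one.
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c]) for c in classes
    ]
    return classes[int(np.argmin(residuals))]


# Toy data: two training samples per class in two dimensions (illustration only).
X_train = np.array([[1.0, 0.9, 0.0, 0.1],
                    [0.0, 0.1, 1.0, 0.9]])
y_train = [0, 0, 1, 1]
predicted = wsrc_predict(X_train, y_train, [0.95, 0.05])
```

In the setting of this paper each column of `X_train` would be an 800-dimensional pair vector and the two classes would be interacting and non-interacting pairs.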

In the SVM formulation, $x_i$ are the training vectors with m features; $S_i$ are the slack variables; $y_i$ is either 1 or -1 and gives the class of each $x_i$; and b denotes the coefficient that determines the axis intercepts.

Figure 4. Flow chart of the proposed method.

Table 2. 5-fold cross-validation results achieved on the Rice dataset using the proposed method.

Table 3. 5-fold cross-validation results achieved on the Arabidopsis dataset using the proposed method.

Table 4. The comparison of the WSRC method with the SVM-based method on three plant datasets.

In the WSRC formulation, $n_k$ is the number of training samples in class k and W is a block-diagonal matrix of the locality adaptor. To deal with occlusion, the stable $l_1$-minimization problem is solved in the final step.