A Computational Method for Predicting Self-Interacting Proteins Using a Recurrent Neural Network and Protein Evolutionary Information

Abstract: Self-interacting proteins (SIPs) play crucial roles in the biological activities of organisms. Many high-throughput methods can be used to identify SIPs, but they are both time-consuming and expensive, so developing effective computational approaches for identifying SIPs is a challenging task. In this paper, we present a novel computational method called RNN-SIFT, which combines a Recurrent Neural Network (RNN) with the Scale Invariant Feature Transform (SIFT) to predict SIPs from protein evolutionary information. The main advantage of the proposed RNN-SIFT model is that it uses SIFT to extract key features from the evolutionary information embedded in the PSI-BLAST-constructed position specific scoring matrix (PSSM), and then employs an RNN classifier to carry out classification based on the extracted features. Extensive experiments show that RNN-SIFT obtained average accuracies of 94.34% and 97.12% on the yeast and human datasets, respectively. We also compared its performance with the Back Propagation Neural Network (BPNN), the state-of-the-art support vector machine (SVM) and other existing methods, and found that RNN-SIFT is significantly better than all of them. We therefore conclude that the proposed RNN-SIFT model is a useful tool that performs very well for predicting SIPs, as well as for other bioinformatics tasks. To facilitate wider study and encourage future proteomics research, a freely available web server called RNN-SIFT-SIPs was developed; it is available at http://219.219.62.123:8888/RNNSIFT/ and includes the source code and SIPs datasets.
We also compared its performance with the Back Propagation Neural Network (BPNN), the state-of-the-art support vector machine (SVM) and other existing methods, and found that RNN-SIFT is significantly better than all of them. This is mainly due to the following three reasons: (1) the PSSM contains not only the position information but also the evolutionary information of a protein sequence, and retains plenty of prior information, which makes a number of key features available for extraction; (2) SIFT uses the concept of "scale space" to capture features at multiple scale levels, which not only increases the number of available features but also makes the method highly tolerant to scale changes, thereby allowing it to extract the evolutionary information embedded in the PSSM and capture self-interaction information; (3) self-interacting protein sequences are nonlinear sequence data, and RNNs, with their memory, parameter sharing and Turing completeness, have an advantage in learning from the nonlinear characteristics of sequences.


Introduction
Protein-protein interaction (PPI) prediction has revealed multiple roles in many important biological activities. An interesting research problem, however, is whether a protein can interact with copies of itself. Self-interacting proteins (SIPs) are considered a special type of PPI in which two or more copies of the same protein, encoded by the same gene, interact with each other; this can bring about the formation of homo-oligomers. Many recent studies have shown that SIPs play a vital role in various cellular physiological functions and in the evolution of protein-protein interaction networks (PPINs) [1][2][3]. Determining whether a protein can self-interact is therefore very important for interpreting its functions. Research on SIPs can help us better understand the regulation of protein function, the molecular mechanisms involved in biological activity, and the underlying cellular and genetic disease mechanisms. Many studies have addressed homo-oligomerization, which is vital for biological activity and plays an essential role in a wide range of biological processes, such as signal transduction, gene expression regulation, enzyme activation and immune response [4][5][6][7][8]. In addition, many previous studies have demonstrated that the functional diversity of proteins can be extended through SIPs without increasing genome length. SIPs can also help improve protein stability and prevent protein denaturation by reducing surface area [9,10]. It is therefore becoming more and more important to develop reliable and highly effective sequence-based computational approaches for predicting SIPs.
A large number of studies have been devoted to developing reliable and highly effective computational approaches for predicting PPIs. Gao et al. [11] proposed a novel computational method called RF-AC, which combined the Rotation Forest (RF) classifier with the Auto Covariance (AC) approach based on the PSSM. Huang et al. [12] presented a new computational approach that used weighted sparse representation-based classification (WSRC) and employed global encoding (GE) for feature extraction to predict PPIs. Pan et al. [13] proposed a novel latent Dirichlet allocation-random forest model (LDA-RF) for predicting human PPIs from protein primary sequences, which has a strong ability to process large-scale datasets. Zhang et al. [14] proposed a novel sequence-based approach that used Random Tree and a Genetic Algorithm for predicting PPIs and obtained good prediction results. Yang et al. [15] presented a new approach that used local descriptors to represent protein sequences and employed k-nearest neighbors to carry out classification. Guo et al. [16] adopted an autocorrelation feature extraction technique to generate feature vectors and used an SVM classifier to identify PPIs. An et al. [17] proposed a compound-kernel relevance vector machine (RVM) classification algorithm based on the grey wolf optimization algorithm and k-fold cross-validation, which fully considers the local and global features of protein-protein interaction sites and obtained good prediction results. An et al. [18] proposed a feature extraction approach based on local protein sequence PSSM matrix coding and serial multi-feature fusion; it captures both continuous and discontinuous interaction information from protein sequences through the local PSSM matrix coding, and integrates much of the key feature information contained in protein sequences through serial multi-feature fusion.
These methods usually explore correlational information between protein pairs, such as coevolution, co-localization and co-expression. However, this information is not sufficient for predicting SIPs. In addition, PPI datasets do not contain interactions between identical protein partners. For these reasons, such computational approaches are not suitable for predicting SIPs. In a previous study, Liu et al. [1] proposed a method that integrates multiple representative known properties into a prediction model called SLIPPER to predict SIPs. As far as we know, a number of recent studies on PPIs may also be relevant to SIPs [19][20][21]. However, these methods have an obvious drawback: they cannot deal with proteins not covered by the current human interactome. For all of the reasons above, developing efficient computational approaches for predicting SIPs is an urgent task.
In this study, we propose a novel computational method named RNN-SIFT, which combines the Recurrent Neural Network (RNN) with the Scale Invariant Feature Transform (SIFT) to predict SIPs from protein evolutionary information. The major advantage of the proposed RNN-SIFT model is that it uses SIFT to extract key features from the evolutionary information embedded in the PSI-BLAST-constructed position specific scoring matrix (PSSM), and employs an RNN classifier to carry out classification based on the extracted features. Extensive experiments show that RNN-SIFT obtained average accuracies of 94.34% and 97.12% on the yeast and human datasets, respectively. We also compared its performance with the Back Propagation Neural Network (BPNN), the state-of-the-art support vector machine (SVM) and other existing methods, and found that RNN-SIFT is significantly better than all of them. We therefore conclude that the proposed RNN-SIFT model is a useful tool that performs very well for predicting SIPs, as well as for other bioinformatics tasks.

Dataset
The UniProt database contains 20,199 curated human protein sequences [22]. PPI data can be downloaded from several databases, including DIP [23], BioGRID [24], IntAct [25], InnateDB [26] and MatrixDB [27]. The PPI data used in this paper contain only pairs in which the two interacting protein sequences are identical and whose interaction type is defined as 'direct interaction' in the relevant databases. As a result, 2,994 human self-interacting protein sequences were obtained. To verify the performance of the RNN-SIFT model, we constructed the experimental datasets in the following three steps [28]: (1) protein sequences shorter than 50 residues or longer than 5,000 residues were removed from the whole human proteome; (2) we selected SIPs for the positive dataset that satisfy at least one of the following conditions: (a) the self-interaction has been detected by at least two kinds of large-scale experiments or by one small-scale experiment; (b) the protein has been annotated as a homooligomer (including homodimer and homotrimer) in UniProt; (c) the self-interaction has been reported by at least two publications; (3) to construct the negative dataset, we removed all types of SIPs from the whole human proteome (including proteins annotated as 'direct interaction' and the broader 'physical association') and from the UniProt database. Consequently, we selected 15,938 non-SIPs as negative samples and 1,441 SIPs as positive samples to create the human dataset [28]. In addition, we used the same strategy to construct the yeast dataset, which contains 5,511 negative and 710 positive samples [28].
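Step (1) of the dataset construction is a simple length filter. The following is a minimal sketch of that step; the function name and the dictionary data layout are our own illustrative choices, not part of the original pipeline.

```python
# Sketch of step (1): drop sequences shorter than 50 or longer than 5000
# residues. Sequences are assumed to be held as an {id: sequence} mapping
# (a hypothetical layout for illustration).

def filter_by_length(proteome, min_len=50, max_len=5000):
    """Keep only sequences with min_len <= length <= max_len residues."""
    return {pid: seq for pid, seq in proteome.items()
            if min_len <= len(seq) <= max_len}

# toy example with synthetic sequences
proteome = {
    "P1": "M" * 30,    # too short: removed
    "P2": "M" * 300,   # kept
    "P3": "M" * 6000,  # too long: removed
}
kept = filter_by_length(proteome)   # only "P2" survives
```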

Position Specific Scoring Matrix (PSSM)
The position specific scoring matrix (PSSM) contains not only the position information but also the evolutionary information of a protein sequence; we therefore used the PSSM to extract evolutionary information in this paper. Position Specific Iterated BLAST (PSI-BLAST) [29] was used to convert each sequence into a PSSM. Assuming the length of a given protein sequence is L, its PSSM can be expressed as an L × 20 matrix, where L is the length of the sequence and 20 is the number of amino acid types. Figure 1 shows the schematic of a PSSM. The entry $S_{i,j}$ represents the score of the j-th amino acid at the i-th position of the query sequence; it can be greater than, less than, or equal to 0. If $S_{i,j}$ is greater than 0, the residue at the i-th position mutates easily into the j-th amino acid during evolution, and a larger value indicates a higher mutation probability. Conversely, if $S_{i,j}$ is less than 0, the position is conserved and the probability of mutation is small; smaller values indicate stronger conservation. To extract evolutionary information from the protein sequences, each SIP sequence was converted into a PSSM using the PSI-BLAST tool. To obtain highly and widely homologous sequences, PSI-BLAST's e-value parameter was set to 0.001 and three iterations were selected.
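As a concrete illustration, the ASCII PSSM that PSI-BLAST emits (e.g. via its `-out_ascii_pssm` option) can be parsed into the L × 20 score matrix described above. The parser below is a hedged sketch: it assumes the usual layout where each data row starts with the position index and residue followed by 20 log-odds scores, and the toy input is synthetic, not real PSI-BLAST output.

```python
import numpy as np

# Sketch: read the 20 log-odds columns of a PSI-BLAST ASCII PSSM into an
# L x 20 matrix S, where S[i, j] is the score of amino acid j at position i.

def parse_pssm(lines):
    rows = []
    for line in lines:
        parts = line.split()
        # data rows start with the 1-based position index, then the residue,
        # then at least 20 integer scores
        if len(parts) >= 22 and parts[0].isdigit():
            rows.append([int(v) for v in parts[2:22]])
    return np.array(rows)

# toy two-residue "PSSM" (synthetic, for illustration only)
toy = [
    "  1 M " + " ".join(["1"] * 20) + " 0",
    "  2 K " + " ".join(["-2"] * 20) + " 0",
]
pssm = parse_pssm(toy)   # shape (2, 20)
```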

Scale Invariant Feature Transform (SIFT)
The Scale Invariant Feature Transform (SIFT) is an image descriptor developed by David Lowe for image matching and recognition [30,31]. The original SIFT descriptor is computed from the image intensities around interesting locations in the image domain, called interest points or key points. These interest points are obtained from the scale-space extrema of differences-of-Gaussians (DoG) within a difference-of-Gaussians pyramid. Lindeberg [32,33] proposed a method for finding interest points that can be viewed as a variant of scale-adaptive blob detection, in which blobs with associated scale levels are detected from the scale-space extrema of the scale-normalized Laplacian. The scale-normalized Laplacian is normalized with respect to the scale level in scale space and is defined as

$$\nabla^2_{\mathrm{norm}} L(x, y; \sigma) = \sigma^2 \nabla^2 L(x, y; \sigma) = \sigma^2 (L_{xx} + L_{yy}),$$

where $L(x, y; \sigma)$ is the scale-space representation of the image. To obtain the extrema of the DoG images at different scale magnifications, the SIFT algorithm convolves a given original image with Gaussian kernels of different widths; the scale-variable Gaussian function is defined as

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/(2\sigma^2)},$$

where $\sigma$ denotes the standard deviation (and $\sigma^2$ the variance) of the Gaussian kernel. The Gaussian-blurred images are grouped according to their scale magnification, so that the number of Gaussian-blurred images processed in each group is the same. A DoG image is then obtained by subtracting two adjacent Gaussian-blurred images in the same group.
The difference-of-Gaussians operator constitutes an approximation of the Laplacian operator and is defined as

$$D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma),$$

where $I(x, y)$ is the input image and $k$ is the constant factor between adjacent scale levels. By the implicit normalization of the difference-of-Gaussians responses obtained with the self-similar distribution of scale levels $t_{i+1} = k^2 t_i$ (with $t = \sigma^2$) used by Lowe, the DoG also constitutes an approximation of the scale-normalized Laplacian, with

$$\Delta t \, \nabla^2 L = (k^2 - 1)\, t \, \nabla^2 L = (k^2 - 1)\, \nabla^2_{\mathrm{norm}} L,$$

thus implying that the DoG response approximates the scale-normalized Laplacian up to the constant factor $k^2 - 1$. After the DoG images are obtained, their local maxima and minima are found and referred to as key points. To find the key points quickly, each pixel of a DoG image is compared with its eight surrounding pixels and with the nine pixels at the same position in each of the two adjacent-scale DoG images of the same group; pixels that are extrema over all of these neighbors are taken as key points. The key point detection in the SIFT algorithm is thus a variant of blob detection that uses the Laplacian to find extrema in each magnification space, with the difference of Gaussians serving as an approximation of the Laplacian operator. SIFT employs the concept of "scale space" to capture features at multiple scale levels or image resolutions, which not only increases the number of available features but also makes the method highly tolerant to scale changes.
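The scale-space construction above can be sketched in a few lines: blur the input with Gaussians of increasing width and subtract adjacent blurred copies to form the DoG stack in which extrema are sought. The parameter choices ($k = \sqrt{2}$, four scales) are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Minimal sketch of the DoG pyramid: D(x, y, sigma) = L(x, y, k*sigma) -
# L(x, y, sigma), computed from Gaussian blurs of increasing width.

def dog_stack(img, sigma0=1.0, k=2 ** 0.5, n_scales=4):
    blurred = [gaussian_filter(img, sigma0 * k ** i) for i in range(n_scales)]
    # difference of each pair of adjacent blurred images
    return np.stack([blurred[i + 1] - blurred[i] for i in range(n_scales - 1)])

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 20))   # toy L x 20 "PSSM image"
dog = dog_stack(img)                  # shape (3, 32, 20)
```

In the full SIFT pipeline, each voxel of this stack would then be compared with its 26 neighbors (8 in-plane plus 9 in each adjacent scale) to locate key points.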
In this paper, we treat each PSSM as an image matrix. We therefore used the SIFT feature extraction method to generate feature vectors, each of dimensionality 128. The technology roadmap of the proposed method is shown in Figure 2.

Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) is designed for problems in which the input training samples are continuous sequences of varying lengths, such as time-series problems. A basic neural network only establishes weighted connections between layers; the biggest difference of an RNN is that weighted connections are also established between the hidden-layer neurons across time steps [34][35][36]. The structure of an RNN is shown in Figure 3. As can be seen from Figure 3, the output of an RNN at any moment is related to both the current input and the previous hidden state. The RNN's forward propagation is a combination of multiplication, addition and activation operations, and a sequence of length t leads to t evaluations of the hidden layer. The current hidden state $h(t)$ is determined by the current input $x(t)$ and the previous hidden state $h(t-1)$:

$$z(t) = U x(t) + W h(t-1) + b,$$
$$h(t) = \sigma(z(t)) = \sigma\big(U x(t) + W h(t-1) + b\big),$$

where $\sigma$ represents the activation function. The output of the current hidden layer can then be calculated as

$$o(t) = V h(t) + c.$$

The softmax function can be used to carry out classification and output the final prediction probabilities:

$$\hat{y}(t) = \mathrm{softmax}(o(t)) = \mathrm{softmax}\big(V h(t) + c\big).$$

Here, the predicted output $\hat{y}(t)$ differs from the true label $y(t)$, and in practice we can select different loss functions according to the problem at hand, such as the log loss or the squared loss. The loss of the RNN model at moment t can be expressed as

$$L(t) = \mathrm{loss}\big(\hat{y}(t), y(t)\big),$$

and the global loss over all N moments as

$$L = \sum_{t=1}^{N} L(t).$$

The gradients of the global loss with respect to the three parameter matrices U, V and W, namely $\partial L / \partial U$, $\partial L / \partial V$ and $\partial L / \partial W$, are computed by backpropagation through time. The most commonly used method for this optimization problem is gradient descent.
In this paper, the gradient updates for the three parameters can be expressed as

$$U \leftarrow U - \eta \frac{\partial L}{\partial U}, \qquad V \leftarrow V - \eta \frac{\partial L}{\partial V}, \qquad W \leftarrow W - \eta \frac{\partial L}{\partial W},$$

where $\eta$ is the learning rate. The advantage of the RNN model in learning from nonlinear sequential data is well known and has been exploited in language modeling and sequence labeling. Since the SIPs dataset is also nonlinear sequence data, we used an RNN model to predict SIPs in this study. The prediction flowchart of the RNN-SIFT model is displayed in Figure 4.
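The forward equations above can be sketched directly in numpy. This is a minimal illustration of the recurrence, not the paper's trained model; the dimensions (8 hidden units, 2 output classes) and the tanh activation are illustrative choices, with only the 128-dimensional input matching the SIFT feature size.

```python
import numpy as np

# z(t) = U x(t) + W h(t-1) + b;  h(t) = tanh(z(t));
# o(t) = V h(t) + c;             y_hat(t) = softmax(o(t)).

def softmax(v):
    e = np.exp(v - v.max())          # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    h = np.zeros(W.shape[0])         # h(0) = 0
    outs = []
    for x in xs:                     # one recurrence step per sequence element
        h = np.tanh(U @ x + W @ h + b)
        outs.append(softmax(V @ h + c))
    return np.array(outs)

rng = np.random.default_rng(1)
d_in, d_h, d_out, T = 128, 8, 2, 5   # 128-dim SIFT features per step
U = rng.standard_normal((d_h, d_in)) * 0.1
W = rng.standard_normal((d_h, d_h)) * 0.1
V = rng.standard_normal((d_out, d_h)) * 0.1
b, c = np.zeros(d_h), np.zeros(d_out)
xs = rng.standard_normal((T, d_in))
probs = rnn_forward(xs, U, W, V, b, c)   # shape (T, 2); each row sums to 1
```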

Performance Evaluation
In this paper, we employed the following measures to assess the performance of RNN-SIFT:

$$Ac = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP},$$
$$Pe = \frac{TP}{TP + FP}, \qquad Mcc = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where Ac is accuracy, Sn is sensitivity, Sp is specificity, Pe is precision and Mcc is the Matthews correlation coefficient. TP and TN are the numbers of true interacting and true non-interacting pairs that were correctly predicted, respectively; FP and FN are the numbers of true non-interacting and true interacting pairs that were falsely predicted, respectively. In addition, we used the Receiver Operating Characteristic (ROC) curve to further evaluate the performance of RNN-SIFT.
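The five measures are straightforward to compute from the confusion-matrix counts; the sketch below uses made-up counts purely to exercise the formulas.

```python
import math

# Ac  = (TP + TN) / (TP + TN + FP + FN)     accuracy
# Sn  = TP / (TP + FN)                      sensitivity
# Sp  = TN / (TN + FP)                      specificity
# Pe  = TP / (TP + FP)                      precision
# Mcc = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

def metrics(tp, tn, fp, fn):
    ac = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pe = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ac, sn, sp, pe, mcc

# illustrative counts, not the paper's results
ac, sn, sp, pe, mcc = metrics(tp=80, tn=90, fp=10, fn=20)
# ac = 0.85, sn = 0.80, sp = 0.90
```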

Performance of the proposed RNN-SIFT model
In the experiment, we used the yeast and human datasets to evaluate the proposed RNN-SIFT model. Because overfitting generally affects experimental results, we divided each dataset into a training dataset and an independent test dataset. Specifically, we split the yeast dataset into 6 parts, selected 5 of them as the training set, and used the remaining part as the independent test set; the human dataset was processed with the same strategy. Five-fold cross-validation tests were employed to evaluate the performance of RNN-SIFT for fair comparison, and several parameters of the RNN model were optimized by grid search to ensure fairness. Here, we set the learning rate to 0.001, the number of training steps to 1000 and the number of hidden units to 200. Table 1 reports the prediction results of RNN-SIFT on the yeast dataset, where it achieved an average accuracy of 94.34%. As shown in Table 2, RNN-SIFT also achieved good prediction results on the human dataset, with average accuracy, sensitivity, precision and MCC of 97.12%, 83.70%, 85.24% and 79.35%, respectively. The proposed RNN-SIFT model therefore has high research value.
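The evaluation protocol just described (hold out 1/6 as an independent test set, then run five-fold cross-validation on the remaining 5/6) can be sketched as index bookkeeping. The fixed shuffle seed is our own choice for reproducibility, not a setting from the paper.

```python
import numpy as np

# Sketch of the split: 1/6 held out for independent testing, the rest
# partitioned into five folds for cross-validation.

def split_indices(n, n_parts=6, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(idx, n_parts)
    return np.concatenate(parts[:-1]), parts[-1]   # train (5/6), test (1/6)

def five_folds(train_idx, k=5):
    folds = np.array_split(train_idx, k)
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        yield trn, val

# yeast dataset size: 5511 negatives + 710 positives
train_idx, test_idx = split_indices(6221)
n_folds = sum(1 for _ in five_folds(train_idx))   # 5 train/validation rounds
```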
The good experimental results for predicting SIPs are mainly attributable to the SIFT feature extraction method and the RNN classifier. The main advantage of the RNN-SIFT model is that the SIFT method can extract key evolutionary features from the PSSM and the RNN classifier is well suited to processing sequence data. As discussed, this is mainly due to the following three reasons: (1) the PSSM contains not only the position information but also the evolutionary information of a protein sequence, and retains plenty of prior information, which makes a number of key features available for extraction; (2) SIFT uses the concept of "scale space" to capture features at multiple scale levels, which not only increases the number of available features but also makes the method highly tolerant to scale changes, thereby allowing it to extract the evolutionary information embedded in the PSSM and capture self-interaction information; (3) recurrent neural networks have memory, parameter sharing and Turing completeness, which gives them an advantage in learning from the nonlinear characteristics of sequences, so an RNN was used to carry out classification for predicting SIPs. The results demonstrate two things: first, the SIFT method is very suitable for extracting self-interacting protein sequence features; second, the RNN classifier performs well for predicting SIPs.

Comparison with the Method of BPNN-based and SVM-based
The RNN-SIFT model is clearly well suited to predicting SIPs and obtains good prediction results. To further evaluate its performance, we compared the results of the RNN classifier with those of the Back Propagation Neural Network (BPNN) classifier and the Support Vector Machine (SVM) classifier, using the same SIFT features, on the yeast and human datasets, respectively. To ensure a fair comparison, several parameter settings of the BPNN were optimized by grid search; specifically, the epochs, the eta, the BS and the WS of the BPNN were set to 100, 0.006, 0.5 and 0.7, respectively. Similarly, the RBF kernel parameters of the SVM were optimized with the same strategy, with c = 0.5 and g = 10.8, and the other parameters left at their default values. The SVM classifier was implemented with the LIBSVM tool [37].
Tables 3-6 show the experimental results of BPNN-SIFT and SVM-SIFT on the yeast and human datasets, and Figures 5-6 compare the ROC curves of RNN, BPNN and SVM on the two datasets. As outlined in Tables 3-4, the BPNN-SIFT model achieved 91.31% average accuracy and the SVM-SIFT model 89.58% average accuracy on the yeast dataset. Similarly, as can be seen from Tables 5-6, the BPNN-SIFT and SVM-SIFT models obtained average accuracies of 93.84% and 91.79% on the human dataset, respectively. Comparing our results to those of BPNN-SIFT and SVM-SIFT, the performance of the RNN classifier is significantly better than that of the other two classifiers; as Figures 5 and 6 show, its ROC curves are also clearly superior. A major reason for the good prediction results is that self-interacting protein sequences are nonlinear sequence data, and the RNN classifier's memory, parameter sharing and Turing completeness give it an advantage in learning from the nonlinear characteristics of sequences. From the above analysis, we conclude that the proposed RNN-SIFT model is a useful tool for identifying SIPs, as well as for other bioinformatics tasks.

Comparison with Other Methods
To further validate the performance of the proposed RNN-SIFT model, we compared its prediction results with those of previous methods such as SLIPPER [38], CRS [28], SPAR [28], DXECPPI, PPIevo [39] and LocFuse [40]. Tables 7-8 show the detailed comparison results on the yeast and human datasets. It can be seen from Table 7 that the average accuracy of RNN-SIFT is clearly higher than those of the other six approaches on the yeast dataset. Similarly, Table 8 shows that the prediction accuracy obtained by the RNN-SIFT model is also significantly better than those of the other six methods on the human dataset. Comparing the results in Tables 7-8 leads to the same conclusion: the proposed RNN-SIFT model has excellent prediction capability and can be used for high-quality SIP prediction. This is a result of using a robust RNN classifier and an effective SIFT feature extraction technique. These comparison results are further evidence that RNN-SIFT is suitable for predicting SIPs.

Conclusion
In this study, we proposed a novel computational method named RNN-SIFT, which combines the Recurrent Neural Network (RNN) with the Scale Invariant Feature Transform (SIFT) to predict SIPs from protein evolutionary information. Extensive experiments show that RNN-SIFT obtained average accuracies of 94.34% and 97.12% on the yeast and human datasets, respectively. We also compared its performance with the Back Propagation Neural Network (BPNN), the state-of-the-art support vector machine (SVM) and other existing methods, and found that RNN-SIFT is significantly better than all of them. This is mainly due to the following three reasons: (1) the PSSM contains not only the position information but also the evolutionary information of a protein sequence, and retains plenty of prior information, which makes a number of key features available for extraction; (2) SIFT uses the concept of "scale space" to capture features at multiple scale levels, which not only increases the number of available features but also makes the method highly tolerant to scale changes, thereby allowing it to extract the evolutionary information embedded in the PSSM and capture self-interaction information; (3) self-interacting protein sequences are nonlinear sequence data, and RNNs, with their memory, parameter sharing and Turing completeness, have an advantage in learning from the nonlinear characteristics of sequences. We therefore conclude that the proposed RNN-SIFT model is a useful tool that performs very well for predicting SIPs, as well as for other bioinformatics tasks.