An efficient computational model for class imbalance problem in Self-Interaction Proteins Prediction

Background: Self-interacting proteins (SIPs) play a key role in a variety of biological activities of organisms. High-throughput experimental methods for identifying SIPs are time-consuming and expensive, and the numbers of positive and negative samples in SIPs datasets are highly imbalanced. Developing accurate and efficient computational approaches to assist and accelerate the identification of SIPs is therefore a challenging task. [...] The proposed WELM-SURF model is competent for predicting SIPs with high accuracy and robustness. It is anticipated that the WELM-SURF method will be a useful computational tool to facilitate a wide range of bioinformatics studies related to SIPs prediction. To encourage future proteomics research, we developed a freely available web server called WELM-SURF-SIPs. It is available at http://219.219.62.123:8888/WELMSURF/ and includes the SIPs datasets and source code.


Background
A large number of studies have shown that protein-protein interactions (PPIs) play key roles in many important biological activities. Determining whether a protein can interact with its partners is therefore an important research direction in proteomics. Self-interacting proteins (SIPs) are proteins for which two or more copies of the same protein, encoded by the same gene, can interact with each other; they are regarded as a special type of PPI. Such self-interaction can lead to the formation of homo-oligomers. In recent years, many studies have shown that SIPs play an important role in the evolution of cellular physiological functions and of protein-protein interaction networks (PPINs) [1][2][3]. Many proteins thus carry out their function through self-interaction. Research on SIPs can help to better understand the molecular mechanisms involved in biological activity, the regulation of protein function, and the cellular and genetic mechanisms underlying disease. Homo-oligomerization is an important feature of biological activity and plays an essential role in gene expression regulation, signal transduction, immune response and enzyme activation [4][5][6][7][8]. In addition, many previous studies have revealed that SIPs can expand the functional diversity of proteins to varying degrees without increasing genome length. SIPs can also improve protein stability and prevent denaturation by reducing surface area [9,10]. As a result, it is increasingly important to develop reliable and efficient computational methods to predict SIPs from protein sequences.
Many studies have been devoted to developing reliable and effective computational approaches for predicting PPIs. You et al. [11] proposed a Multi-scale Local Descriptor (MLD) feature extraction method based on protein sequences and used Random Forest (RF), an ensemble learning approach, for classification; MLD can capture multi-scale local information. Huang et al. [12] proposed a computational method called WSRC-GE that combines weighted sparse representation (WSRC) with global coding (GE) for predicting PPIs. Wang et al. [13] presented a computational method that combines the Discrete Cosine Transform (DCT) feature extraction method with an ensemble Rotation Forest (RF) classifier for predicting PPIs. An et al. [14] proposed a computational model called MKRVM-GWO, a multi-kernel RVM classification algorithm based on grey wolf optimization. To capture protein interaction information, that method takes full account of both the local and the global positional characteristics of protein-protein interactions, and it achieves good experimental results. Zhang et al. [15] proposed a prediction model that combines Random Tree with a Genetic Algorithm to predict PPIs from protein sequences, and it obtained good prediction results. Yang et al. [16] used k-nearest neighbors for classification and employed local descriptors to extract features from protein sequences. Guo et al. [17] presented a computational model called SVM-AC, which uses autocorrelation to generate feature vectors from protein sequences and employs an SVM classifier to predict PPIs. An et al. [18] proposed a feature extraction method that captures both continuous and discontinuous protein-protein interaction information by encoding local protein sequence segments with the PSSM; a number of key features are then integrated by serial multi-feature fusion.
The above methods exploit correlational information between protein pairs, such as coevolution, co-localization and co-expression. However, this information is not sufficient to predict SIPs. In addition, PPI datasets do not contain interactions between identical protein partners, and SIPs datasets are highly imbalanced. In a previous study, Liu et al. [1] proposed a prediction model called SLIPPER for predicting SIPs, which integrates multiple representative known properties. Several other studies on SIPs have also been reported recently [19][20][21]. However, these methods have an obvious disadvantage: they cannot handle proteins that are not covered by the current human interactome, and they do not address the class imbalance problem in SIPs datasets. For these reasons, there is an urgent need for efficient computational approaches that address imbalanced classification in SIPs prediction.
In this paper, we propose a new computational method called WELM-SURF for predicting SIPs. More specifically, to exploit protein sequence features, the Position Specific Scoring Matrix (PSSM) is used to capture protein evolutionary information, and Speeded-Up Robust Features (SURF) is employed to extract key features of the protein sequence from the PSSM. The Weighted Extreme Learning Machine (WELM) has a short training time and good generalization ability and, most importantly, can efficiently classify imbalanced data by optimizing a weighted loss function. Therefore, the WELM classifier is used to perform classification on the extracted features. Experiments show that the average accuracy of WELM-SURF is 95.25% on the yeast dataset and 98.79% on the human dataset. We also compared its performance with the Extreme Learning Machine (ELM), the state-of-the-art Support Vector Machine (SVM), and other existing methods. The comparison shows that WELM-SURF clearly outperforms ELM, SVM and the other previous methods. These experimental results indicate that the proposed WELM-SURF model can predict SIPs with high accuracy and robustness. It is anticipated that the WELM-SURF method will be a useful computational tool to facilitate a wide range of bioinformatics studies related to SIPs prediction.

Datasets
The PPI data were collected from previous research, including DIP [22], BioGRID [23], IntAct [24], InnateDB [25] and MatrixDB [26], and from the UniProt database, which contains 20,199 curated human protein sequences [27]. To construct the SIPs experimental dataset, the proteins that interact only with themselves, and whose interaction type is defined as "direct interaction" in the relevant database, were selected from the above PPI datasets. To assess the performance of WELM-SURF, 2994 human self-interacting protein sequences were screened to create the experimental dataset using the following three steps [28]: (1) protein sequences shorter than 50 residues or longer than 5000 residues were removed from the whole human proteome; (2) to create the positive samples, a protein had to satisfy at least one of the following conditions: (a) its self-interaction has been detected by at least two large-scale experiments or one small-scale experiment; (b) it is annotated as a homo-oligomer in UniProt; (c) its self-interaction has been reported by at least two publications; (3) to construct the negative samples, all types of SIPs (including proteins annotated with "direct interaction" or, more broadly, "physical association", and those annotated in the UniProt database) were removed from the entire human proteome. Finally, 1441 SIPs were selected as positive samples and 15,938 non-SIPs as negative samples to construct the human dataset [28]. Using the same strategy, we also constructed the yeast dataset, which contains 710 positive samples and 5511 negative samples [28]. The negative samples therefore outnumber the positive samples by a factor of about 8 in the yeast dataset and about 11 in the human dataset, so both SIPs datasets are highly imbalanced.
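As a quick check of the imbalance stated above, the negative-to-positive ratios can be computed directly from the sample counts reported for the two datasets. This is only illustrative arithmetic; the function and variable names are our own.

```python
def imbalance_ratio(n_negative, n_positive):
    """Number of negative (non-SIP) samples per positive (SIP) sample."""
    return n_negative / n_positive

# (negatives, positives) as reported for each dataset in the text
datasets = {"yeast": (5511, 710), "human": (15938, 1441)}

ratios = {name: imbalance_ratio(neg, pos) for name, (neg, pos) in datasets.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} negatives per positive")
```

The ratios come out close to 8 for yeast and 11 for human, which is the scale of imbalance the weighted classifier is meant to handle.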

Position Specific Scoring Matrix (PSSM)
Because proteins are functionally conserved, prediction performance can be improved by using the evolutionary information of protein sequences. The position-specific scoring matrix (PSSM) contains not only the position information of a protein sequence but also evolutionary information that reflects the functional conservation of the protein, which is why we used it to extract sequence evolutionary information in this paper. In the experiment, each protein sequence was converted into an L × 20 PSSM using the Position Specific Iterated BLAST (PSI-BLAST) tool [29], where L is the length of the protein sequence. The diagram of the PSSM is displayed in Figure 1. The PSSM can be written as:
$$P = \{p_{i,j}\}, \quad i = 1, \dots, L, \; j = 1, \dots, 20$$
where $p_{i,j}$ is a score reflecting the probability that the $i$-th amino acid in the sequence mutates to the $j$-th type of amino acid during biological evolution. The value of $p_{i,j}$ can be greater than, equal to, or less than 0. A positive $p_{i,j}$ indicates that the $i$-th amino acid mutates easily to the $j$-th amino acid, and a larger value means a higher mutation probability. Conversely, a negative $p_{i,j}$ means that the mutation probability is small, and a smaller value indicates a more conserved position. To use the evolutionary information of protein sequences and capture more key features, we converted each SIP sequence into a PSSM with PSI-BLAST, setting the e-value to 0.001 and using three iterations to obtain broadly and highly homologous sequences.
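The PSI-BLAST settings described above could be expressed as a command line for the BLAST+ `psiblast` program, as sketched below. This is an assumption-laden sketch, not the authors' actual invocation: the file names and the database name `swissprot` are hypothetical, and mapping the paper's "e-value" onto the `-evalue` option (rather than the inclusion threshold) is our guess. The code only builds the argument list and does not run the tool.

```python
def build_psiblast_command(fasta_path, db_name, pssm_path,
                           evalue=0.001, iterations=3):
    """Return the argument list for a BLAST+ psiblast run that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", fasta_path,             # single-sequence FASTA file (hypothetical path)
        "-db", db_name,                   # protein database to search (assumed name)
        "-evalue", str(evalue),           # e-value setting quoted in the text
        "-num_iterations", str(iterations),  # three PSI-BLAST iterations
        "-out_ascii_pssm", pssm_path,     # L x 20 scoring matrix in text form
    ]

cmd = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a formatted database are installed.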

Speeded-Up Robust Features (SURF)
Speeded-Up Robust Features (SURF) [30] is a feature extraction algorithm that improves on the Scale-Invariant Feature Transform (SIFT) algorithm [31,32] and runs faster than SIFT. SIFT uses the Difference of Gaussians to approximate the Laplacian of Gaussian (LoG) when building the scale space, whereas SURF approximates the LoG with box filters. The major advantage of SURF is that convolution with a box filter is easy to compute using the integral image and can be done in parallel at different scales. SURF relies on the determinant of the Hessian matrix for both scale and location. The SURF algorithm consists of two steps: feature point detection and description of the neighborhood of each feature point.

1) Feature Point Detection
SIFT processes the image with cascaded Gaussian filters of different scales and detects scale-invariant feature points through the Difference of Gaussians. SURF instead approximates Gaussian smoothing with square (box) filters. Filtering with a square filter is very fast when the integral image is used, which is defined as:
$$S(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j)$$
With the integral image, the sum of the image over any rectangle is obtained from the values at the four corners of that rectangle only. SURF uses a blob detector based on the Hessian matrix to identify feature points. The determinant of the Hessian matrix measures the local change around a pixel, and feature points are chosen where this determinant reaches a local extremum (maximum or minimum). To achieve scale invariance, SURF also uses the determinant of the Hessian to select the scale σ. Given a point p = (x, y) in the image, the Hessian matrix at scale σ is:
$$H(p, \sigma) = \begin{pmatrix} L_{xx}(p, \sigma) & L_{xy}(p, \sigma) \\ L_{xy}(p, \sigma) & L_{yy}(p, \sigma) \end{pmatrix}$$
where $L_{xx}(p, \sigma)$, $L_{xy}(p, \sigma)$ and $L_{yy}(p, \sigma)$ are the convolutions of the second-order derivatives of the Gaussian with the gray-level image at p. Unlike SIFT, the scale space of SURF is not built by repeated Gaussian smoothing and down-sampling; instead, it is determined by the size of the square filters. The lowest (initial) scale uses a 9 × 9 square filter, which approximates a Gaussian filter with σ = 1.2. Higher scales use progressively larger filters, such as 15 × 15, 21 × 21, 27 × 27, and so on. The scale corresponding to a given filter size is:
$$\sigma_{approx} = \text{current filter size} \times \frac{\text{base filter scale}}{\text{base filter size}} = \text{current filter size} \times \frac{1.2}{9}$$

2) Feature Neighborhood Description
The SURF descriptor is based on the Haar wavelet transform. To ensure rotation invariance, each feature point is assigned an orientation. The Haar wavelet responses in the x and y directions are computed within a neighborhood of radius 6σ around the feature point. Within each orientation interval, the x and y components of the wavelet responses are summed to form a vector. The longest of these vectors (the one with the largest x and y components) defines the orientation of the feature point.
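The integral-image trick used in the detection step, and the mapping from filter size to scale, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; the toy 6 × 6 array stands in for an image (or PSSM), and the function names are our own.

```python
import numpy as np

def integral_image(image):
    """S[r, c] = sum of image[0..r, 0..c] (inclusive on both axes)."""
    return image.cumsum(axis=0).cumsum(axis=1)

def box_sum(S, top, left, bottom, right):
    """Sum of image[top..bottom, left..right] using at most four lookups in S."""
    total = S[bottom, right]
    if top > 0:
        total -= S[top - 1, right]
    if left > 0:
        total -= S[bottom, left - 1]
    if top > 0 and left > 0:
        total += S[top - 1, left - 1]
    return total

def filter_scale(filter_size, base_size=9, base_scale=1.2):
    """Approximate Gaussian scale represented by a square filter of the given size."""
    return filter_size * base_scale / base_size

# Toy example: a 6 x 6 array standing in for an image
image = np.arange(36, dtype=np.int64).reshape(6, 6)
S = integral_image(image)

print(box_sum(S, 1, 2, 3, 4), image[1:4, 2:5].sum())  # both values agree
print([round(filter_scale(s), 2) for s in (9, 15, 21, 27)])
```

Because each box sum costs a constant number of lookups regardless of the box size, larger filters are no more expensive than small ones, which is what makes the scale space cheap to build.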
After the orientation of the feature point has been selected, its descriptor is built from the surrounding pixels. A region of 20 × 20 sample points around the feature point is divided into 16 sub-regions of 5 × 5 points each. Within each sub-region, the sums of the Haar wavelet responses and of their absolute values in the x and y directions ($\sum d_x$, $\sum |d_x|$, $\sum d_y$ and $\sum |d_y|$) are calculated. Concatenating these four values over the 16 sub-regions yields a feature vector of dimension 16 × 4 = 64.
In the experiment, we used the SURF method to create 64-dimensional feature vectors. Figure 2 shows the flow diagram of our method.
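To make the 64-dimensional layout concrete, the sketch below assembles a descriptor from two 20 × 20 arrays of Haar responses. It assumes that each of the 16 sub-regions contributes the four sums of responses and absolute responses, as in standard SURF; computing the Haar responses themselves and rotating the region to the feature orientation are omitted, and the function name is hypothetical.

```python
import numpy as np

def surf_style_descriptor(dx, dy, region=20, cell=5):
    """Build a 64-dim vector from region x region arrays of Haar responses.

    Each cell x cell sub-region contributes four values:
    sum(dx), sum(|dx|), sum(dy), sum(|dy|).
    """
    features = []
    for r in range(0, region, cell):
        for c in range(0, region, cell):
            block_x = dx[r:r + cell, c:c + cell]
            block_y = dy[r:r + cell, c:c + cell]
            features.extend([
                block_x.sum(), np.abs(block_x).sum(),
                block_y.sum(), np.abs(block_y).sum(),
            ])
    return np.array(features, dtype=float)

# Toy responses: random values standing in for real Haar wavelet responses
rng = np.random.default_rng(0)
descriptor = surf_style_descriptor(rng.normal(size=(20, 20)),
                                   rng.normal(size=(20, 20)))
print(descriptor.shape)
```

Keeping both the signed and the absolute sums lets the descriptor distinguish a flat sub-region from one whose responses cancel out.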

Weighted Extreme Learning Machine (WELM)
Because samples are not always evenly distributed among classes, efficiently classifying imbalanced data is a challenging task. To address this problem, Zong et al. [33] proposed the Weighted Extreme Learning Machine (WELM), which is based on the Extreme Learning Machine (ELM). Because the SIPs datasets are imbalanced, we also built a WELM model based on ELM for predicting SIPs.

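The network structure of ELM is as follows:
$$\sum_{h=1}^{L} \beta_h \, g(w_h \cdot x_i + b_h) = o_i, \quad i = 1, \dots, N \qquad (1)$$
where $\beta_h$ is the output weight of the $h$-th hidden-layer neuron, $g$ is the activation function of the hidden-layer neurons, $w_h$ and $b_h$ are the input weight and bias of the $h$-th hidden-layer neuron, $x_i$ is the $i$-th input sample, $o_i$ is the actual output for the $i$-th training sample, and $t_i$ is the expected output for the $i$-th training sample. According to the literature [15], given $N$ training samples $\{(x_i, t_i)\}_{i=1}^{N}$, $x_i \in R^n$, there exist $(w_h, b_h)$ and $\beta_h$ such that $\sum_{i=1}^{N} \| o_i - t_i \| = 0$, so that a single-hidden-layer feedforward network (SLFN) can approximate the training set $\{(x_i, t_i)\}_{i=1}^{N}$ with zero error. Equation (1) can then be simplified as follows:
$$H \beta = T$$
where $H$ and $\beta$ are the output matrix and the output weight matrix of the hidden layer, respectively, and $T$ is the expected output matrix of the training samples. The output weight of the hidden layer can be expressed as follows:
$$\beta = H^{T} \left( \frac{I}{C} + H H^{T} \right)^{-1} T$$
where $C$ is the regularization parameter. The output function of ELM can be defined as follows:
$$f(x) = h(x) \beta = h(x) H^{T} \left( \frac{I}{C} + H H^{T} \right)^{-1} T$$
where $h(x)$ is the hidden-layer output for input $x$. WELM has two weighting strategies [34]. The first is automatic weighting, defined as follows:
$$W_1: \; W_{ii} = \frac{1}{\#(t_i)}$$
where $\#(t_i)$ is the number of training samples belonging to class $t_i$. The second sacrifices some classification accuracy on the majority class in order to gain accuracy on the minority class. It weights the majority and minority classes in the ratio 0.618 : 1 (the golden ratio) and is defined as follows:
$$W_2: \; W_{ii} = \begin{cases} \dfrac{0.618}{\#(t_i)}, & \#(t_i) > AVG(\#(t_i)) \\ \dfrac{1}{\#(t_i)}, & \#(t_i) \le AVG(\#(t_i)) \end{cases}$$
where $AVG(\#(t_i))$ is the average number of samples per class. The output weight of the WELM hidden layer can be represented as follows:
$$\beta = H^{T} \left( \frac{I}{C} + W H H^{T} \right)^{-1} W T$$
where the weighting matrix $W$ is an $N \times N$ diagonal matrix whose diagonal elements correspond to the $N$ training samples. Different weights are assigned to different classes, and samples of the same class receive the same weight.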
WELM has a short training time and good generalization ability, and it can efficiently classify imbalanced data by optimizing a weighted loss function. Because the SIPs datasets are highly imbalanced and WELM is well suited to imbalanced classification, the WELM classifier with the automatic weighting strategy was used to predict SIPs. The prediction flow diagram of the WELM-SURF model is shown in Figure 4.
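A weighted ELM of the kind described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: it uses a sigmoid activation, random Gaussian input weights, targets coded as ±1, and the algebraically equivalent L × L form of the solution, β = (I/C + HᵀWH)⁻¹HᵀWT, which is cheaper when the number of hidden neurons is smaller than the number of samples. The synthetic two-cluster data and all function names are our own.

```python
import numpy as np

def sample_weights(y, scheme="W1"):
    """Diagonal of W: 1/#(class), or 0.618/#(class) for above-average classes under W2."""
    classes, counts = np.unique(y, return_counts=True)
    count_of = dict(zip(classes.tolist(), counts.tolist()))
    average = counts.mean()
    weights = np.empty(len(y), dtype=float)
    for i, label in enumerate(y.tolist()):
        n = count_of[label]
        weights[i] = 0.618 / n if (scheme == "W2" and n > average) else 1.0 / n
    return weights

def hidden_output(X, W_in, b):
    """Sigmoid hidden-layer output matrix H (samples x hidden neurons)."""
    return 1.0 / (1.0 + np.exp(-(X @ W_in + b)))

def train_welm(X, y, n_hidden=50, C=200.0, scheme="W1", seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.normal(size=(X.shape[1], n_hidden))  # random, untrained input weights
    b = rng.normal(size=n_hidden)                   # random hidden biases
    H = hidden_output(X, W_in, b)
    T = np.where(y == 1, 1.0, -1.0)                 # targets coded as +1 / -1 (assumption)
    HtW = H.T * sample_weights(y, scheme)           # equals H^T W for diagonal W
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtW @ H, HtW @ T)
    return W_in, b, beta

def predict_welm(model, X):
    W_in, b, beta = model
    return (hidden_output(X, W_in, b) @ beta > 0).astype(int)

# Synthetic imbalanced data: 200 negatives and 20 positives in two clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(200, 2)),
               rng.normal(2.0, 0.5, size=(20, 2))])
y = np.array([0] * 200 + [1] * 20)

model = train_welm(X, y)
pred = predict_welm(model, X)
print("training accuracy:", (pred == y).mean())
```

Under the automatic scheme the weights of each class sum to 1, so the minority class contributes as much to the loss as the majority class, which is the mechanism that moves the decision boundary away from the minority samples.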

Performance Evaluation
The following measures were used to evaluate the prediction performance of WELM-SURF in this work:
$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$
$$TPR = \frac{TP}{TP + FN}$$
$$PPV = \frac{TP}{FP + TP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
where Acc is accuracy, TPR is sensitivity, PPV is precision and MCC is the Matthews correlation coefficient. TP and TN are the numbers of truly interacting and truly non-interacting protein sequence pairs that are correctly predicted, and FP and FN are the numbers of truly non-interacting and truly interacting protein sequence pairs that are wrongly predicted. In addition, the Receiver Operating Characteristic (ROC) curve was used to further assess the prediction performance of WELM-SURF.
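The four measures named above can be computed from a confusion matrix as in the following sketch, which uses their standard definitions. The counts in the example are made up for illustration and are not results from the paper.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (TPR), precision (PPV) and MCC from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"Acc": acc, "TPR": tpr, "PPV": ppv, "MCC": mcc}

# Hypothetical counts, for illustration only
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data, accuracy alone can look high even when few positives are found, which is why TPR, PPV and MCC are reported alongside it.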

Performance of the proposed WELM-SURF model
In this work, we proposed a computational model called WELM-SURF to predict SIPs, which uses WELM to handle the imbalanced classification and SURF to extract effective features. The performance of WELM-SURF was first evaluated on the benchmark datasets. Because overfitting can bias prediction results, the whole dataset was divided into a training dataset and an independent test dataset: we randomly divided the human dataset into 5 equal parts, of which 4 parts were used as the training dataset and the remaining part as the independent test dataset. The same strategy was applied to the yeast dataset. To evaluate the ability of WELM-SURF to predict SIPs, the model was run on the yeast and human datasets under five-fold cross-validation. To ensure a fair comparison, several parameters of the WELM classifier were optimized by grid search: the number of hidden-layer neurons was set to 3000 and C = 200, and the other parameters were left at their default values. Tables 1 and 2 show the five-fold cross-validation results of the WELM-SURF model on the yeast and human datasets, respectively.
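The five-fold protocol described above can be sketched with the standard library alone. The sketch assumes a plain random split into five nearly equal parts (whether the original split was stratified by class is not stated), and the sample count 6221 is simply the yeast total of 5511 + 710; the function name is our own.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold serves once as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(five_fold_indices(6221))  # yeast: 5511 negatives + 710 positives

for k, (train, test) in enumerate(splits, start=1):
    print(f"fold {k}: {len(train)} training samples, {len(test)} test samples")
```

Each sample appears in exactly one test fold, so the averaged metrics cover the whole dataset once.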
As can be seen from Table 1, under five-fold cross-validation the proposed WELM-SURF achieved an average accuracy of 95.25%, an average TPR of 93.05%, an average PPV of 94.35% and an average MCC of 86.44% on the yeast dataset. As shown in Table 2, the WELM-SURF model also obtained very good results on the human dataset, with an average accuracy, TPR, PPV and MCC of 98.79%, 95.15%, 96.65% and 91.89%, respectively. These prediction results demonstrate that WELM-SURF is suitable for SIPs prediction.
WELM-SURF obtains very good prediction results because SURF can capture key features from the PSSM and the WELM classifier has a strong classification ability for imbalanced data. Specifically, there are three main reasons. (1) The PSSM contains not only the position information of the protein sequence but also evolutionary information that reflects the functional conservation of the protein, together with other prior information. It therefore helps to extract the evolutionary information of protein sequences and to capture key SIP features. (2) SURF is computationally faster than SIFT. Its main advantage is that it uses the concept of "scale space" to capture features at multiple scale levels, which not only increases the number of available features but also makes the method highly tolerant to scale changes. This enables it to capture self-interaction information and to extract effective features from the PSSM. (3) The SIPs datasets are highly imbalanced, and WELM has a short training time and good generalization ability and can efficiently classify imbalanced data by optimizing a weighted loss function. WELM was therefore used for classification and performed much better at identifying SIPs in this study. More specifically, by assigning larger weights to the minority-class samples, WELM better captures the distribution of the imbalanced classes and pushes the separating boundary away from the minority class towards the majority class. Assigning different weights in this way amounts to cost-sensitive learning. The results thus demonstrate two things: first, the SURF feature extraction approach is suitable for extracting SIP features from the PSSM of a protein sequence; second, the WELM classifier can obtain good results in predicting SIPs through imbalanced learning.

Comparison of the WELM-SURF method with ELM-based and SVM-based methods
The experimental results above show that the WELM-SURF model can predict SIPs accurately and efficiently. To demonstrate the performance improvement brought by WELM, the WELM classifier was compared with the ELM classifier and the SVM classifier, using the same SURF feature extraction method, on the yeast and human datasets. For a fair comparison, the parameters of ELM were optimized with the same grid search method: the number of hidden-layer neurons of ELM was set to 126, and the other parameters took their default values. The RBF kernel parameters of the SVM were optimized using the same strategy, giving c = 0.3 and g = 5.2, with the other parameters left at their default values. The LIBSVM tool [35] was used to perform the SVM classification. Tables 3-6 display the five-fold cross-validation results of ELM-SURF and SVM-SURF on the yeast and human datasets, and the ROC curves of WELM, ELM and SVM on the yeast and human datasets are compared in Figures 5 and 6. As can be seen from Tables 3 and 4, the ELM-SURF and SVM-SURF models obtained average accuracies of 92.04% and 89.58% on the yeast dataset, respectively. Similarly, as shown in Tables 5 and 6, ELM-SURF obtained an average accuracy of 94.04% and SVM-SURF an average accuracy of 91.79% on the human dataset. These results show that the classification ability of WELM is clearly better than that of the other classifiers. As can be seen from Figures 5 and 6, the ROC curves of WELM are also clearly better than those of the other classifiers. One important reason is that, unlike ELM and SVM, WELM is designed for imbalanced classification.
WELM has a short training time and good generalization ability and can efficiently classify imbalanced data by optimizing a weighted loss function. Specifically, by assigning larger weights to the minority-class samples, WELM better captures the distribution of the imbalanced classes and pushes the separating boundary away from the minority class towards the majority class. From the above analysis, we conclude that the proposed WELM-SURF model is a useful tool for predicting SIPs and may also be useful for other bioinformatics tasks.

[Figure: Comparison of ROC curves between WELM, ELM and SVM on the human dataset. Legend: WELM+PSSM+SURF, ELM+PSSM+SURF, SVM+PSSM+SURF. Axis label: Sensitivity.]

[...] prediction model on the yeast and human datasets. Tables 7 and 8 show that the prediction accuracy of WELM-SURF is clearly better than that of the other six prediction models on the yeast and human datasets. The results in Tables 7 and 8 lead to a similar conclusion: the proposed WELM-SURF method has very good predictive ability and can be used for high-quality prediction of SIPs. This is mainly because WELM is a robust and efficient classifier and SURF can extract useful feature information from protein sequences. These comparison results further demonstrate that WELM-SURF is suitable for identifying SIPs.

Conclusion
In this paper, we put forward a new computational method called WELM-SURF, which combines the Weighted Extreme Learning Machine (WELM) with Speeded-Up Robust Features (SURF) to predict SIPs from the evolutionary information of protein sequences. The experimental results show that the proposed WELM-SURF model can predict SIPs with high accuracy and robustness and that its prediction ability is clearly better than that of ELM, SVM and other previous methods in the domain. The excellent performance of WELM-SURF is mainly attributable to the following factors. (1) The PSSM contains not only the position information of the protein sequence but also evolutionary information that reflects the functional conservation of the protein, together with other prior information. It therefore helps to extract the evolutionary information of protein sequences and to capture key SIP features. (2) SURF is computationally faster than SIFT. Its main advantage is that it uses the concept of "scale space" to capture features at multiple scale levels, which not only increases the number of available features but also makes the method highly tolerant to scale changes. This enables it to capture self-interaction information and to extract effective features from the PSSM. (3) The SIPs datasets are highly imbalanced, and WELM has a short training time and good generalization ability and can efficiently classify imbalanced data by optimizing a weighted loss function. By assigning larger weights to the minority-class samples, the WELM classifier better captures the distribution of the imbalanced classes and pushes the separating boundary away from the minority class towards the majority class.
Therefore, we conclude that the proposed WELM-SURF model is a useful tool that performs very well in predicting SIPs and may also be useful for other bioinformatics tasks.

Declarations
Ethics approval and consent to participate: Not applicable.
Consent for publication: Not applicable.
Availability of data and material: The experimental datasets used in this study, covering yeast and human, can be obtained from the publicly available DIP [23], BioGRID [24], IntAct [25], InnateDB [26] and MatrixDB [27] databases.
Competing interests: The authors declare no conflict of interest.
Funding: This work is supported by the Fundamental Research Funds for the Central Universities (2019XKQYMS88). The funding recipient is Ji-Yong An, who is the first author and the corresponding author.
Author contributions: AJY and ZY conceived the algorithm, carried out analyses, prepared the datasets, carried out experiments, and wrote the manuscript; YZJ and ZYJ designed, performed and analyzed experiments and wrote the manuscript; all authors read and approved the final manuscript.