Sequence-Order Frequency Matrix - Sampling and Machine learning with Smith-Waterman (SOFM-SMSW) for Protein Remote Homology Detection

Background: Protein Remote Homology Detection (PRHD) is used to find homologous proteins that are similar in function and structure but share low sequence identity. The Sequence-Order Frequency Matrix (SOFM) has been used for protein remote homology detection. In the SOFM Top-n-gram (SOFM-Top) algorithm, the probability of substrings is calculated based on the highest probability value of the substrings. The SOFM-Smith Waterman (SOFM-SW) algorithm combines the SOFM with local alignment for protein remote homology detection. However, the computational complexity of SOFM-based PRHD is high since it processes all protein sequences in the SOFM.
Objective: The Sequence-Order Frequency Matrix - Sampling and Machine learning with Smith-Waterman (SOFM-SMSW) algorithm is proposed for protein remote homology detection. The SOFM-SMSW algorithm uses the PVS method to select the optimum target sequences based on the uniform distribution measure.
Method: This research work considers only the most important sequences for PRHD by introducing Proportional Volume Sampling (PVS). After sampling the protein sequences, a feature vector is constructed and labeling is performed based on the concatenation between two protein sequences. Then, a substitution score which represents the structural alignment is learned using k-Nearest Neighbor (k-NN). Based on the learned substitution score and the alignment score, the protein homology is detected using the Smith-Waterman algorithm and a Support Vector Machine (SVM). Selecting the most important sequences reduces the computational complexity of PRHD, and using structural alignment along with local alignment improves its accuracy.
Results: The performance of the proposed SOFM-SMSW algorithm is tested on the SCOP database and compared with existing algorithms such as SVM Top-N-gram, SVM pairwise, GPkernel, Long Short-Term Memory (LSTM), SOFM Top-N-gram and SOFM-SW.
Conclusion: The experimental results illustrate that the proposed SOFM-SMSW algorithm has better accuracy, precision, recall, ROC and ROC 50 for PRHD than the other existing algorithms.


INTRODUCTION
In living organisms, proteins are important functional units that are involved in various biological processes [1]. Proteins in the same family have similar functions and structures, so further details on an unknown protein can be gained from its protein family [2][3][4][5][6]. Protein Remote Homology Detection (PRHD) methods aim to find the family of a protein. Protein remote homology detection also supports the development of new drugs for specific diseases. Generally, PRHD methods are categorized as discriminative methods, ranking methods and sequence-based alignment methods. Sequence-based alignment methods detect protein homology according to the similarity between a pair of protein sequences [7]. Discriminative methods [8] extract features from the initial protein sequences and differentiate protein families based on the extracted features. Ranking methods map all the proteins into a feature space and calculate the homology relationship of the proteins according to their distance in that space. Among these methods, the alignment-based methods achieve state-of-the-art performance for PRHD.
A Sequence-Order Frequency Matrix (SOFM) [9] was used for Protein Remote Homology Detection-Fold Recognition (PRHD-FD). The SOFM combines the sequence-order effects of amino acids with the Multiple Sequence Alignment (MSA). After the construction of the SOFM, Top-n-gram was applied to the matrix to transform it into a fixed-length vector, and the resulting SOFM-Top was used for PRHD-FD. In order to find the similarity between any two SOFMs, the Smith-Waterman local alignment algorithm was used [10,20]. The local alignment similarity was given as input to a Support Vector Machine (SVM) for PRHD-FD. In this research work, the Proportional Volume Sampling (PVS) method is introduced to consider only the target proteins for PRHD-FD, which reduces the computation time of SOFM-based PRHD.
Furthermore, the error rate of the SVM is reduced by considering the protein structural alignment along with the protein local alignment for PRHD. A substitution score is predicted using kNN and used in the Smith-Waterman algorithm to refine the sequence alignment. After that, MSA is applied on the sequence alignment to obtain the refined SOFM matrix and alignment score. The SVM is then trained on the alignment score for PRHD. The remaining sections of this research work are organized as follows: Section 2 reviews the existing techniques for protein remote homology detection, Section 3 describes the methodology of the proposed SOFM-SMSW method, Section 4 presents the results and discussion for the SCOP database and Section 5 gives the conclusion.

LITERATURE SURVEY
A multi-layer Support Vector Machine (SVM) classifier is used for homology detection and fold recognition. One layer of the multi-layer SVM detects the superfamily and family in the Structural Classification of Proteins (SCOP) [11] hierarchy by using fine-tuned binary SVM classification rules and a Bio-kernel function. Another layer detects the protein fold level in the SCOP hierarchy using a discriminative SVM with a string kernel. However, the high-dimensional feature vector affects the accuracy and processing time of the homology detection and fold recognition process [12].
A tool is developed to detect protein remote homology using Markov Random Fields (MRF) and stochastic search. The MRF is used to capture the standard Hidden Markov Model (HMM) and the pairwise association between amino acid residues bonded together in β-sheets. Nevertheless, in many real cases the MRF is computationally impractical, so stochastic search is used to provide an optimal or near-optimal solution for protein homology detection. However, this tool requires a template built from a set of protein chains [13].
A feature extraction technique utilizes the Position Specific Scoring Matrix (PSSM) to calculate the tri-grams of a protein sequence and predict the protein fold [14]. Based on the tri-grams, a matrix is constructed using the PSSM, which determines the fold of a protein sequence. However, this technique still needs further improvement in terms of recognition accuracy.
The Soft Ngrams technique is utilized for protein homology detection [15]. Soft Ngrams is a profile-based representation for protein sequences that allows the whole information in the profile to be considered. The representation is converted into a feature vector by using a hybrid generative-discriminative scheme. Finally, the feature vectors are processed in an SVM to detect the protein homology. However, Soft Ngrams is computationally expensive.
HMM-HMM alignment and dynamic programming are used for effective recognition of the protein fold [16]. Initially, a Profile HMM (PHMM) matrix is extracted from the protein sequence by applying HMM-HMM alignment on the protein sequence. After that, kernelized dynamic programming is used to calculate the distance between the corresponding PHMM matrices. Based on the distance between the two proteins, the protein fold is recognized. The recognition accuracy could be improved further by including other features from physicochemical attributes.
Protein fold recognition is achieved using a computational predictor [17]. In the computational predictor, the sequence features are extracted from the protein sequences and a dictionary is constructed to hold the extracted features. The dictionary is given as input to a Sparse Representation Classifier (SRC) for protein fold recognition. Advanced machine learning methods could further enhance the fold recognition.
The characteristics of protein sequences are extracted to enhance Deep Extreme Learning Machine (DELM) based protein fold prediction [18]. Principal Component Analysis (PCA) is applied to the extracted features to reduce their dimensionality. The extracted features and the original features are processed in DELM and Linear Discriminant Analysis (LDA) to recognize the protein fold [16]. However, it is limited to high dimensional data.
A Bacterial Foraging Optimization-Genetic Algorithm (BFO-GA) is used to carry out multiple sequence alignment and improve multiple objective measures [19]. A deep learning technique named Protein Remote Homology Detection based on Bidirectional Long Short-Term Memory (ProDec-BLSTM) [21] and an ensemble classifier named SVM-Ensemble [22] are used to detect protein remote homologies. The PATSIM [23] tool is used to analyze protein patterns based on the Self Optimized Prediction Method (SOPM) server. The computational methods for protein remote homology detection are reviewed and divided into three groups: discriminative, alignment and ranking methods [24]. The CONVERT method treats homology detection as a translation task and presents a concept of illustrative protein [25]. A discriminative method named ReFold-MAP extracts comprehensive features based on Motif-PSSM, ACC-PSSM and PDT Profile [26]. Machine learning algorithms are used to predict the protein homology of un-annotated sequences [27].
To overcome the disadvantages of the existing research works, this research work proposes the SOFM-SMSW algorithm for protein remote homology detection.

METHODOLOGY
Initially, the SOFM is constructed for the protein sequences based on Multiple Sequence Alignment (MSA). Then, PVS is applied on the SOFM of each sequence to select the most important sequences (i.e., the target sequences), which reduces the computational complexity of PRHD. The Smith-Waterman technique is used to refine the sequence alignment of the target sequences using local alignment and structural alignment. From the structural alignment and local alignment of the target sequences, a feature vector and a label are generated. Then, kNN is trained on the feature vectors and labels to forecast the match of the position (i.e., the substitution score).
The substitution score is used in the Smith-Waterman algorithm to refine the sequence alignment, and the SVM is trained on the alignment score of the refined matrix for efficient PRHD. In this section, the proposed SOFM-SMSW is described in detail for protein remote homology detection. The overall framework of the proposed SOFM-SMSW algorithm is shown in Fig.1.

Input Protein sequence
The Structural Classification of Proteins (SCOP) 1.53 and SCOP 1.67 benchmark datasets are used for the experimental analysis. Fig.1(a) shows the sample subset of protein sequence for the remote homology detection.
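A substring with $h$ amino acids starting at position $j$ of sequence $S_i$ is denoted as
$$s_{i,j}^{h} = S_i[j]\, S_i[j+1] \cdots S_i[j+h-1] \qquad (3)$$
In Eq. (3), $h$ ($h = 1, 2, \ldots$) denotes the length of the substring $s_{i,j}^{h}$. Let $G_j^{h}$ denote the group of all substrings with $h$ amino acids at position $j$:
$$G_j^{h} = \{ s_{1,j}^{h}, s_{2,j}^{h}, \ldots, s_{N,j}^{h} \} \qquad (4)$$
where the elements of $G_j^{h}$ are repeatable and the size of $G_j^{h}$ equals the number $N$ of protein sequences in the MSA.
In order to create the profile, the sequence-order information is incorporated by using these substrings in every column of the MSA. The SOFM alignment scores are calculated according to the probability of each substring appearing. The SOFM can be represented in matrix format as follows:
$$\mathrm{SOFM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,L-h+1} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,L-h+1} \\ \vdots & \vdots & \ddots & \vdots \\ p_{20^{h},1} & p_{20^{h},2} & \cdots & p_{20^{h},L-h+1} \end{bmatrix} \qquad (5)$$
In Eq. (5), $L$ is the length of the protein sequence, 20 is the number of standard amino acids and $20^{h}$ is the total number of all possible substrings ($m = 1, 2, \ldots, 20^{h}$) of length $h$. The entry $p_{m,j}$ ($0 < p_{m,j} < 1$) is the occurrence probability of substring $m$ at position $j$ ($j = 1, 2, \ldots, L-h+1$) during the evolutionary process, which is calculated from the substrings in $G_j^{h}$. The alignment is shown in Fig.1 (b).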
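As an illustration of the construction described above, the following Python sketch estimates the occurrence probability of every length-h substring at every start column of an MSA. It is a minimal sketch and not the authors' implementation: the function name `build_sofm`, the toy alignment, the use of relative frequency as the probability estimate and the decision to skip substrings containing a gap character are all assumptions made for this example.

```python
from collections import Counter

def build_sofm(msa, h):
    """Return one {substring: probability} dict per start column of the MSA.

    Assumption: the probability is the relative frequency of the substring
    among the aligned sequences; substrings containing '-' are skipped.
    """
    n_seqs = len(msa)
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    sofm = []
    for j in range(length - h + 1):
        # Group of length-h substrings starting at column j (repeats allowed).
        group = [row[j:j + h] for row in msa]
        counts = Counter(s for s in group if "-" not in s)
        sofm.append({s: c / n_seqs for s, c in counts.items()})
    return sofm

# Hypothetical toy alignment with four sequences of length 4.
msa = ["ACDA", "ACDC", "GCDA", "ACEA"]
sofm = build_sofm(msa, h=2)
print(sofm[0])  # probabilities of the substrings starting at the first column
```

The sketch stores only the substrings that actually occur in a column, which is equivalent to a sparse view of the full matrix with one row per possible substring.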

Selection of target sequence using PVS
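After the computation of the alignment score for each protein sequence, the most important sequences are selected using PVS. It chooses a target set $S$ of size $k$ with probability proportional to $\mu(S)$ times $\det\left(\sum_{i \in S} v_i v_i^{\top}\right)$ for a measure $\mu$ (the uniform distribution), where $v_i$ denotes the SOFM input of the $i$-th protein sequence. Let $S \subseteq [n]$ be of size no more than an integer $k$. Then,
$$\Pr[\mathcal{S} = S] \propto \mu(S)\, \det\Big(\sum_{i \in S} v_i v_i^{\top}\Big)$$
In the PVS method, the SOFM of each protein sequence is given as input along with an integer $k$, and the uniform distribution measure $\mu$ on the SOFM is found. Then, a convex relaxation is solved to get a fractional solution $x$ with $\sum_{i=1}^{n} x_i = k$. After that, the SOFM of each protein sequence is sampled with $\Pr[\mathcal{S} = S] \propto \mu(S)\det(V_S V_S^{\top})$, where $V_S$ collects the $v_i$ with $i \in S$ and $\mu(S)$ may be defined using the solution $x$. When $|S| < k$, $k - |S|$ additional sequences are added to $S$, and finally the set $S$, which holds the optimal information in $V$, is returned. The selection of the target sequences using the Proportional Volume Sampling algorithm is shown in Fig.1 (c).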

Proportional Volume Sampling Algorithm
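Step 1: Given an input $V = [v_1, v_2, \ldots, v_n]$, a positive integer $k$ and a measure $\mu$ on subsets of $[n]$ of size at most $k$.
Step 2: Solve the convex relaxation to obtain a fractional solution $x$ with $\sum_{i=1}^{n} x_i = k$.
Step 3: Sample a set $S$ with $\Pr[\mathcal{S} = S] \propto \mu(S)\det(V_S V_S^{\top})$, where $\mu(S)$ may be defined using the solution $x$.
Step 4: If $|S| < k$, add $k - |S|$ additional sequences to $S$, and return $S$.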
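The sampling rule can be illustrated with a small brute-force sketch in Python. It follows the standard definition of proportional volume sampling, in which a subset is drawn with probability proportional to its measure times the determinant of the summed outer products of its vectors. Several simplifications are assumed here: the measure is uniform, the convex relaxation step is skipped, every subset has exactly size k, and each sequence is summarized by a short hypothetical vector. All function names are illustrative, and the exhaustive enumeration is only practical for very small inputs.

```python
import itertools
import random

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for col in range(c, n):
                a[r][col] -= factor * a[c][col]
    return result

def outer_product_sum(vectors, subset):
    """Sum of v_i v_i^T over the indices in subset."""
    d = len(vectors[0])
    m = [[0.0] * d for _ in range(d)]
    for i in subset:
        v = vectors[i]
        for r in range(d):
            for c in range(d):
                m[r][c] += v[r] * v[c]
    return m

def pvs_weights(vectors, k, mu=lambda s: 1.0):
    """Unnormalized weight mu(S) * det(sum of outer products) per size-k subset."""
    subsets = list(itertools.combinations(range(len(vectors)), k))
    weights = [mu(s) * det(outer_product_sum(vectors, s)) for s in subsets]
    return subsets, weights

def proportional_volume_sample(vectors, k, mu=lambda s: 1.0, rng=random):
    """Draw one size-k subset with probability proportional to its weight."""
    subsets, weights = pvs_weights(vectors, k, mu)
    return rng.choices(subsets, weights=weights, k=1)[0]

# Hypothetical 2-dimensional summaries of three sequences; the first and
# third are identical, so choosing both together carries no extra volume.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
subsets, weights = pvs_weights(vectors, k=2)
chosen = proportional_volume_sample(vectors, k=2)
```

In this toy case the pair of identical vectors receives zero weight, which shows why the rule favours diverse, informative subsets.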

Generate feature vector and label for target sequences
The local alignment and structural alignment of known homologues are used to learn the substitution score, which is used in the Smith-Waterman algorithm [20] for refining the protein sequence alignment. Let $(q, t)$ be the query and target protein sequences (1Y64 and 1UX4) [28][29], respectively. Initially, the feature vector at query position $i$ and target position $j$ is the concatenation of the query and target residue feature vectors:
$$x_{i,j} = [\, f_i^{q},\; f_j^{t} \,] \qquad (9)$$
where $f_i$ is the concatenation of the residue features of the sequence around residue $i$:
$$f_i = [\, r_{i-\lfloor w/2 \rfloor}, \ldots, r_i, \ldots, r_{i+\lfloor w/2 \rfloor} \,] \qquad (10)$$
In Eq. (10), $w$ is the window size. This feature vector is defined at each residue pair of the query and target protein sequences. However, it is calculated only within the region covered by the window as it moves along the alignment path, since residue pairs that are far from the alignment path are not informative. A label $y_{i,j}$ of 0 or 1 is assigned at positions $i$ and $j$. The generation of the feature vector and label for the target sequence is shown in Fig.1 (d).
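The windowed concatenation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each residue is represented by a short numeric feature row (for example a profile row), the window size is odd and centred on the residue, positions outside the sequence are zero-padded, and the label is taken to be 1 when the residue pair appears in a hypothetical set of structurally aligned pairs. None of these details, nor the function names, are confirmed by the original method.

```python
def window_features(profile, center, w, pad):
    """Concatenate the feature rows in a window of w rows centred on `center`.

    Assumption: w is odd; rows outside the sequence are replaced by `pad`.
    """
    half = w // 2
    out = []
    for pos in range(center - half, center + half + 1):
        row = profile[pos] if 0 <= pos < len(profile) else pad
        out.extend(row)
    return out

def pair_feature(query_profile, target_profile, i, j, w):
    """Feature vector of residue pair (i, j): query window then target window."""
    pad = [0.0] * len(query_profile[0])
    return (window_features(query_profile, i, w, pad)
            + window_features(target_profile, j, w, pad))

def pair_label(i, j, aligned_pairs):
    """Assumed labeling rule: 1 if (i, j) is a structurally aligned pair, else 0."""
    return 1 if (i, j) in aligned_pairs else 0

# Hypothetical per-residue features with two values per residue.
query_profile = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_profile = [[2.0, 2.0], [3.0, 3.0]]
x = pair_feature(query_profile, target_profile, i=0, j=1, w=3)
y = pair_label(0, 1, aligned_pairs={(0, 1), (1, 2)})
```

With a window of three rows and two values per residue, each pair is described by 12 numbers: six from the query window and six from the target window.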

Train KNN for prediction of substitution score
After labeling the sequences, the pairwise protein structural alignment is calculated using the Smith-Waterman method. It needs a substitution score for every residue pair, which is learned using k-Nearest Neighbor (k-NN). The substitution score is used to forecast the match of the position. It is calculated by finding the distance between $x$ and $x'$, where $x$ is the feature vector of the query and target protein sequences in the testing dataset and $x'$ is the feature vector of the query and target protein sequences in the training dataset. The distance values are sorted and the K minimum distances are chosen. Then, the mean of the values in the K distances is assigned as the substitution score for the testing data. The training of the KNN algorithm for the prediction of the substitution score is shown in Fig.1 (e).

Input: feature vectors $x$ and $x'$, size of k
Output: Substitution score
Step 1: For the query and target protein sequences, calculate the distance between $x$ and $x'$.
Step 2: Sort the distances in ascending order and select the k minimum distances.
Step 3: Take the mean of the values and return it as the substitution score of the query and target protein sequences.
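The three steps above can be sketched in a few lines of Python. The text does not state precisely which values are averaged over the k nearest neighbours; this sketch assumes they are the 0/1 labels of the nearest training pairs, so that the score is the fraction of neighbours labeled as a match. The Euclidean distance, the function name and the toy data are also assumptions.

```python
import math

def knn_substitution_score(x, train_vectors, train_labels, k):
    """Mean label of the k training vectors closest to x.

    Assumptions: Euclidean distance; the averaged values are the 0/1 labels.
    """
    # Step 1: distance from the test vector to every training vector.
    scored = [(math.dist(x, v), y) for v, y in zip(train_vectors, train_labels)]
    # Step 2: sort in ascending order and keep the k smallest distances.
    nearest = sorted(scored)[:k]
    # Step 3: the mean of the neighbours' labels is the substitution score.
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical training pairs (feature vector, label) and one test vector.
train_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_labels = [1, 1, 0, 0]
score = knn_substitution_score([0.1, 0.1], train_vectors, train_labels, k=3)
```

Under this assumption the score lies between 0 and 1 and can be used directly as the similarity term of the alignment recurrence.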

Use substitution score in Smith-Waterman algorithm to refine the sequence alignment
The substitution score is used in the Smith-Waterman method for PRHD. The Smith-Waterman technique performs sequence alignment to identify similar regions between two protein sequences $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$. A similarity $s(a, b)$ is given between sequence elements $a$ and $b$. A matrix $H$ is constructed to find pairs of segments with a high degree of similarity. Initially set
$$H_{k,0} = H_{0,l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m$$
The preliminary values of $H$ have the interpretation that $H_{i,j}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$, respectively. These values are obtained from the relationship
$$H_{i,j} = \max\Big\{ H_{i-1,j-1} + s(a_i, b_j),\; \max_{k \ge 1}\{H_{i-k,j} - W_k\},\; \max_{l \ge 1}\{H_{i,j-l} - W_l\},\; 0 \Big\} \qquad (13)$$
In Eq. (13), $1 \le i \le n$ and $1 \le j \le m$; $H_{i-1,j-1} + s(a_i, b_j)$ is the score of aligning $a_i$ and $b_j$, where $s(a_i, b_j)$ is the substitution score; $H_{i-k,j} - W_k$ is the score if $a_i$ is at the end of a gap of length $k$; $H_{i,j-l} - W_l$ is the score if $b_j$ is at the end of a gap of length $l$, where $W_k$ is the penalty of a gap of length $k$; and 0 means there is no similarity up to $a_i$ and $b_j$. Starting at the highest score in the matrix $H$ and ending at a matrix cell which has a score of 0, the traceback follows the source of every score recursively to produce the best sequence alignment. The use of the substitution score in the Smith-Waterman algorithm to refine the sequence alignment is shown in Fig.1 (f).
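A compact Python sketch of the Smith-Waterman recurrence and traceback is given below for illustration. It assumes a linear gap penalty (a gap of length k costs k times a constant), in which case the maximum over gap lengths reduces to a single-step comparison. The `score` argument is a placeholder for whatever substitution score is plugged in; here a simple match/mismatch function is used, not the learned k-NN score. The function names and the example sequences are assumptions.

```python
def smith_waterman(a, b, score, gap=2.0):
    """Local alignment of a and b with a linear gap penalty.

    Returns (best score, aligned part of a, aligned part of b).
    """
    n, m = len(a), len(b)
    # First row and first column stay at 0 (the initial condition).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align a_i with b_j
                H[i - 1][j] - gap,                            # a_i against a gap
                H[i][j - 1] - gap,                            # b_j against a gap
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

def match_mismatch(x, y):
    """Placeholder substitution score: +3 for a match, -3 for a mismatch."""
    return 3.0 if x == y else -3.0

result = smith_waterman("TGTTACGG", "GGTTGACTA", match_mismatch, gap=2.0)
print(result)  # (13.0, 'GTT-AC', 'GTTGAC')
```

To use a learned substitution score instead, `score` could be replaced by a function that looks up the predicted value for the residue pair, provided the traceback compares against the same values that were used to fill the matrix.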

Apply MSA on refined sequence alignment to get refined SOFM matrix and alignment score
After obtaining the best sequence alignment, MSA is applied on it to get the refined SOFM and the refined alignment score. The application of MSA on the refined sequence alignment to obtain the refined SOFM matrix and alignment score is shown in Fig.1 (g).
6. Generate the feature vector and label of the target sequences using Eq. (9) and Eq. (11).
7. Process the feature vector and label of the target sequences in kNN for the prediction of the substitution score.
8. Calculate the substitution score using k-NN.
9. Construct a scoring matrix $H$ using the Smith-Waterman algorithm.
10. Apply MSA on the optimal protein sequence alignment and get a refined alignment score.
11. Train the SVM on the alignment score for PRHD.
The refined alignment score is given as input to the SVM for PRHD (1A6K & 1MTJ), as shown in Fig.1 (h) [30][31]. The LIBSVM package is used for the protein remote homology detection. The Radial Basis Function (RBF) kernel is used for the SVM algorithm to predict the remote homologues. The regularization parameter of the SVM is set to 1.0 and the kernel coefficient gamma is set to 'scale'. The pseudocode of the proposed algorithm is shown in Fig.2.
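To make the kernel setting concrete, the sketch below computes an RBF kernel value in plain Python. The value 'scale' is the name used by scikit-learn's `SVC` class (which is built on LIBSVM), where it means gamma = 1 / (n_features * variance of all entries of the training matrix); the sketch assumes that this is the convention intended here. The function names and the toy data are illustrative, and the regularization parameter does not appear because it affects training rather than the kernel itself.

```python
import math

def scale_gamma(X):
    """gamma = 1 / (n_features * variance of all entries of X).

    Assumption: this mirrors the 'scale' convention of scikit-learn's SVC.
    """
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)

def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * squared Euclidean distance between x and y)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical alignment-score features for two proteins.
X = [[0.0, 2.0], [2.0, 0.0]]
gamma = scale_gamma(X)
similarity = rbf_kernel(X[0], X[1], gamma)
```

With this convention the kernel width adapts to the spread of the input features, so the alignment scores do not need a hand-tuned gamma.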

RESULTS AND DISCUSSION
This

Accuracy
The accuracy metric measures the ratio of correct protein remote homology detections over the total number of proteins evaluated. It is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively. For the SCOP 1.53 dataset, the accuracy of SOFM-SMSW is greater than that of SVM Top Ngram, 6.74% greater than SVM pairwise, 5.61% greater than GPkernel, 8.98% greater than LSTM, 7.86% greater than SOFM-Top and 3.37% greater than SOFM-SW. For the SCOP 1.67 dataset, the accuracy of SOFM-SMSW is 10.22% greater than SVM Top Ngram, 9.09% greater than SVM pairwise, 6.81% greater than GPkernel, 5.68% greater than LSTM, 9.09% greater than SOFM-Top and 4.54% greater than SOFM-SW. This analysis shows that the proposed SOFM-SMSW algorithm has higher accuracy than the other methods on the SCOP 1.53 and SCOP 1.67 datasets.

Precision
Precision is defined as the fraction of aligned positions that are correctly aligned. It is evaluated for the SVM Top-N-gram, SVM pairwise, GPkernel, LSTM, SOFM-Top, SOFM-SW and SOFM-SMSW methods. Precision is calculated using Eq. (16), and the results are given in Table 2.

Recall
Recall is the fraction of alignable residues that are correctly aligned. It is evaluated for the SVM Top-N-gram, SVM pairwise, GPkernel, LSTM, SOFM-Top, SOFM-SW and SOFM-SMSW methods. It is calculated using Eq. (17), and the results are given in Table 3.
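The three metrics defined in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: accuracy is computed from 0/1 detection labels, while precision and recall follow the alignment-based wording above, with an alignment represented as a set of (query position, target position) pairs. The function names, the pair representation and the toy data are not taken from the original work.

```python
def accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def alignment_precision_recall(predicted_pairs, reference_pairs):
    """Precision: correct pairs over predicted pairs.
    Recall: correct pairs over reference (alignable) pairs."""
    correct = len(predicted_pairs & reference_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(reference_pairs) if reference_pairs else 0.0
    return precision, recall

# Hypothetical detection labels (1 = homologous, 0 = not homologous).
acc = accuracy([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])

# Hypothetical predicted and reference alignments as sets of position pairs.
predicted = {(0, 0), (1, 1), (2, 3), (3, 4)}
reference = {(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)}
precision, recall = alignment_precision_recall(predicted, reference)
```

In this toy case three of the four predicted pairs are correct and three of the five reference pairs are recovered.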

CONCLUSION
In this research work, the SOFM-SMSW algorithm is proposed for PRHD. The SOFM-SMSW algorithm uses the PVS method to select the optimum target sequences based on the uniform distribution measure. Initially, a SOFM matrix is constructed from the MSA and then the uniform distribution of each protein's SOFM is calculated. Based on it, the target sequences are obtained. After that, labeling is performed on the concatenation of two protein sequences, and the result is processed in kNN to predict the substitution score. The substitution score is used in the Smith-Waterman algorithm to refine the sequence alignment, and MSA is applied on the refined alignment to return the refined alignment score. Finally, the alignment score is processed in the SVM for PRHD. The experimental results illustrate that the proposed SOFM-SMSW algorithm has better accuracy, precision, recall, ROC and ROC 50 for PRHD than the other existing algorithms.

Availability of Data and Materials
The data were collected from the Astral Sequences & Subsets section of the SCOPe online repository.

Human and Animal Rights
No animals/humans were used for this study.