Protein Remote Homology Detection Based on Deep Convolutional Neural Network

Background: Protein remote homology detection has long received great attention in bioinformatics, but the low sequence similarity among remotely homologous proteins heavily limits detection accuracy. Recently, deep learning methods such as LSTM have been adopted to address the problem. However, LSTM-based models are slow to train because of their recurrent connections, and the problem becomes more serious for long protein sequences.
Results: In this paper, we propose a CNN-based network, called ConvRes, which combines a variant Inception block and a Resnet block, to address these shortcomings. Experimental results show that (1) ConvRes classifies the families of remote homology proteins with precision comparable to the existing state-of-the-art method (ProDec-BLSTM) on the SCOP benchmark dataset, and (2) ConvRes trains faster, taking about 15000 seconds whereas ProDec-BLSTM takes about 150000 seconds.
Conclusion: This paper shows that the proposed ConvRes network outperforms other existing models in detecting remote homology proteins. The experimental results show ConvRes to be a viable and efficient model for remote homology protein detection. In future work, we will improve the performance of ConvRes by using other datasets and explore new representations that accommodate variable-length sequences.

similar structures and functions 1,2 . Over the past decades, a variety of techniques and algorithms have been developed to solve this problem.
Many alignment-based methods, including BLAST 3 , FASTA 4 , UCLUST 5 , CD-HIT 6 , profile alignment [7][8][9][10][11][12] and HMM alignment methods [13][14][15], have been proposed to compute the similarity of protein sequences. These methods align the sequences and derive a similarity score from the alignment. However, their performance is limited by the low sequence similarity of remote homology proteins.
Traditional machine learning methods have been successfully applied to pattern recognition by using given fixed features as input. Inspired by this, some researchers proposed discriminative methods for protein remote homology detection, which train a classifier on positive and negative samples and then classify protein sequences at the prediction stage. Several kinds of kernels have been applied in this research, such as the LA kernel 16 , the motif kernel 17 , and the mismatch kernel 18 . In addition, studies that incorporate physicochemical properties into the protein representation to improve detection accuracy [19][20][21][22] continue to emerge. However, the classification performance of these methods relies largely on fixed features extracted using prior knowledge.
Compared with traditional machine learning, deep learning can automatically capture patterns in the input data without prior knowledge. Several deep learning architectures, including Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), have shown their merits in feature extraction and representation, especially in image processing [23][24][25] and Natural Language Processing (NLP) 26 . Meanwhile, a variety of deep learning-based methods have been successfully applied to bioinformatics in recent years, for example in protein classification 27 , protein structure prediction 28,29 , and protein subcellular localization 30,31 . Biological sequences can be considered a special language, and some researchers have used RNN, especially Long Short-Term Memory (LSTM), to process biological sequences and find motifs among different sequences. LSTM has also been applied to the detection of remote homology proteins, where ProDec-BLSTM 35 achieves the best performance by using a bidirectional LSTM (BLSTM) as the classifier.
ProDec-BLSTM converts the initial protein sequences into pseudo proteins and encodes these sequences with one-hot encoding. It then adopts a BLSTM as the classifier to recognize the family of the input sequences. However, training a BLSTM consumes much time because of its recurrent connections, and this problem becomes more challenging for long protein sequences. By contrast, a CNN focuses on local sequence patterns and is therefore a promising choice for biological sequences. This paper proposes ConvRes, a CNN-based deep neural network, and shows it to be an efficient remote homology protein detector. Compared with 10 existing related methods, the proposed model achieves state-of-the-art performance (evaluated by AUROC) on the SCOP benchmark dataset. Furthermore, its training time is greatly reduced in contrast to the existing state-of-the-art method (ProDec-BLSTM).

SCOP Benchmark Dataset
The Structural Classification of Proteins (SCOP) database has been widely used to evaluate the performance of protein classification methods, including ProDec-BLSTM. This work therefore uses the SCOP 1.67 dataset (the same as ProDec-BLSTM), which is accessible online.
Positive and negative samples for training and testing are randomly selected for each of the 102 families contained in our dataset, with an average of 9077 sequences in each training dataset. There are 507,119 different sequences in total in this dataset, with a minimum length of 13 and a maximum length of 1264 residues. Sequences shorter than 400 residues account for 96% of the dataset. Hence, the sequence length is constrained to 400 residues in this study, which means that sequences longer than 400 residues are truncated at the 400th residue.
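The length constraint can be illustrated with a minimal sketch. It assumes that truncation keeps the leading part of a sequence; the helper names `truncate` and `fraction_within` and the toy sequences are hypothetical and are not taken from the SCOP data.

```python
MAX_LEN = 400  # length cap stated in the text


def truncate(seq: str, max_len: int = MAX_LEN) -> str:
    """Keep at most the first max_len positions of a sequence."""
    return seq[:max_len]


def fraction_within(seqs, max_len: int = MAX_LEN) -> float:
    """Fraction of sequences that already fit within the cap."""
    return sum(len(s) <= max_len for s in seqs) / len(seqs)


# Toy sequences (not real SCOP entries) with lengths 13, 400 and 1264.
toy = ["M" * 13, "A" * 400, "G" * 1264]
clipped = [truncate(s) for s in toy]
print([len(s) for s in clipped])  # [13, 400, 400]
```

Only the longest toy sequence is shortened; sequences at or below the cap pass through unchanged.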

Sequences Representation
Since the physiological properties of protein rely on the physiological properties of amino acids, this study uses physiological properties of amino acids to denote protein sequences. Table 1

Inception and Resnet
This section illustrates our proposed ConvRes (shown in Fig.1), which combines a variant Inception block and a Resnet block. Input data are fed into the variant Inception block, which extracts abstract features of protein sequences by using various kernel sizes. The Inception block enhances the features of protein sequences because different kernel sizes act as different window sizes over the sequence. The Resnet block is then employed as a detector that takes these features as input. Finally, the architecture recognizes whether the input sequence belongs to a certain family.
More details will be clarified in the following subsections.

1-D Inception Block
The Inception network is a frequently used structure in the field of Convolutional Neural Networks (CNN), which extracts features with several kernels of different sizes. It can capture more abstract features even when the objects in a set of pictures appear at different sizes. For biological sequences, the window size plays a vitally important role in classification accuracy. However, no previous study stipulates the optimal window size. This paper therefore adopts a variant Inception, called the 1-D Inception block, which combines the Inception structure with 1-dimensional convolution (shown in Fig.1). The enhanced features extracted by this block are concatenated along the channel dimension and sent to the following Resnet classifier to generate the final result.
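The idea of parallel 1-dimensional convolutions with different window sizes, concatenated along the channel dimension, can be sketched in plain Python. This is an illustration only and not the ConvRes implementation: the kernel sizes (1, 3, 5), the channel counts, the zero padding, and the random weights are all assumptions, and the function names are hypothetical.

```python
import random


def conv1d_same(x, filters):
    """1-D convolution (no kernel flip) with zero padding, output length = input length.

    x:       list of L positions, each a list of C_in channel values
    filters: list of C_out filters, each a k x C_in nested list (k odd)
    returns: list of L positions, each a list of C_out values
    """
    k = len(filters[0])
    c_in = len(x[0])
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for pos in range(len(x)):
        window = padded[pos:pos + k]
        out.append([
            sum(f[i][c] * window[i][c] for i in range(k) for c in range(c_in))
            for f in filters
        ])
    return out


def inception_1d(x, branches):
    """Apply every branch to the same input and concatenate along channels."""
    outputs = [conv1d_same(x, filters) for filters in branches]
    return [sum((branch[pos] for branch in outputs), []) for pos in range(len(x))]


def random_filters(rng, c_out, k, c_in):
    """Hypothetical random weights, standing in for trained parameters."""
    return [[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(k)]
            for _ in range(c_out)]


rng = random.Random(0)
seq_len, c_in, c_out = 12, 3, 4   # toy sizes, not the paper's settings
kernel_sizes = (1, 3, 5)          # assumed window sizes
x = [[rng.random() for _ in range(c_in)] for _ in range(seq_len)]
branches = [random_filters(rng, c_out, k, c_in) for k in kernel_sizes]
features = inception_1d(x, branches)
print(len(features), len(features[0]))  # 12 12
```

Each branch preserves the sequence length, so the three branches with 4 output channels each yield 12 channels per position after concatenation.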

Resnet classifier
The deep residual network (Resnet) extends the conventional CNN with a residual (shortcut) operation between every two convolutional layers. Because of these residual operations, Resnet alleviates the vanishing gradient problem to some degree and thus achieves better performance than a conventional CNN. This paper employs 18 layers of Resnet as the classifier for remote homology protein detection. This Resnet classifier contains an independent convolutional layer followed by a max-pooling layer, and 4 residual blocks (with 2 convolutional layers in each block) followed by an average-pooling layer and a fully connected layer. The concatenated features extracted by the 1-D Inception block are sent to this Resnet classifier, in which each layer takes both the extracted features and the initial input of the previous layer as input and provides feature extraction at a higher level of abstraction. The final fully connected layer recognizes whether the input sequence belongs to the current family or not.
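The residual operation can be sketched on a single-channel signal as output = relu(input + F(input)), where F stands for the two convolutional layers of a block. This is a simplified illustration under assumptions (one channel, zero padding, no normalization layers, hypothetical function names), not the classifier used in this work.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def conv1d_same(values, kernel):
    """Single-channel 1-D convolution (no kernel flip) with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(values))
    ]


def residual_block(values, kernel_a, kernel_b):
    """Two convolutions plus an identity shortcut: relu(values + F(values))."""
    hidden = relu(conv1d_same(values, kernel_a))
    residual = conv1d_same(hidden, kernel_b)
    return relu([v + r for v, r in zip(values, residual)])


signal = [0.5, 1.0, -0.25, 2.0]
zero_kernel = [0.0, 0.0, 0.0]

# With all-zero weights the convolutional branch contributes nothing,
# so the block reduces to relu(signal): the shortcut carries the input through.
passthrough = residual_block(signal, zero_kernel, zero_kernel)
print(passthrough)
```

Because the input reaches the output through the shortcut regardless of the convolution weights, gradients also have a direct path backward, which is the usual explanation for why residual connections ease the training of deeper networks.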

Performance evaluation
In this paper, the area under the receiver operating characteristic curve (AUROC) is used to evaluate the performance of our method and the existing methods. The Receiver Operating Characteristic (ROC) curve is plotted with the false positive rate on the x axis and the true positive rate on the y axis as the classification threshold varies. AUROC refers to the area under the ROC curve, and its score lies between 0 and 1. The better the classification performance, the closer the AUROC score is to 1.
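As a minimal sketch, AUROC can be computed from its equivalent pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auroc` and the toy labels and scores are illustrative assumptions, not data from this study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 3 positives and 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auroc(labels, scores))
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the score is 8/9. The pairwise form is quadratic in the number of samples, so it suits small illustrations rather than large test sets.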
As described in Section 1, the ProDec-BLSTM model includes two essential parts: pseudo protein processing and a BLSTM classifier. To further compare our ConvRes model with ProDec-BLSTM, this work measures the training time of BLSTM (ProDec-BLSTM with the pseudo protein processing removed) and of ConvRes on 16 families. The result (shown in Fig.3) shows that training the ConvRes model is nearly 10 times faster than training the BLSTM model. For one protein family, BLSTM takes about 150000s to train for 150 epochs, while ConvRes takes only about 15000s. The CNN framework thus trains much faster than the BLSTM framework.
Moreover, ProDec-BLSTM also processes pseudo proteins by using PSI-BLAST to generate PSSMs, which consumes additional time. Our model therefore requires much less training time while achieving performance comparable to ProDec-BLSTM.

Conclusion
This study proposes a CNN-based network that combines an Inception block and a Resnet block to detect remote homology proteins. The proposed network can precisely classify proteins into the specific family that they belong to. Experimental results show that ConvRes achieves the top performance in comparison to other related methods on the SCOP benchmark dataset. Furthermore, this model requires much less training time than the existing state-of-the-art method (ProDec-BLSTM), which benefits from the local pattern detection properties of CNN. In addition, the different window sizes in the Inception block enhance the features of protein sequences.
In future work, we will improve the performance of the ConvRes network by using other datasets and explore new representations that accommodate variable-length sequences.

Declarations

Acknowledgements
Not applicable.

Funding
Not applicable.

Availability of data and materials
The SCOP benchmark dataset supporting the conclusion of this article was published in [34], which is available on http://www.bioinf.jku.at/software/LSTM_protein/.

Figure 1
The architecture of our work.

Figure 2
Performance of several related methods evaluated by mean AUROC.

Figure 3
Training time of BLSTM and ConvRes model.
