Localnet: A Simple Recurrent Neural Network Model for Protein Secondary Structure Prediction Using Local Amino Acid Sequences Only


Background: Protein secondary structure prediction (PSSP) is important for protein structure modeling and design. Over the past few years, deep learning models have shown promising results for PSSP. However, the current top performers for PSSP often require evolutionary information such as multiple sequence alignments and even real protein structures (templates), entire protein sequences, and amino acid property profiles.
Results: In this study, we used a fixed-size window of adjacent residues and only amino acid sequences, without any evolutionary information, as inputs, and developed a very simple yet accurate RNN model: LocalNet. The accuracy for three states of secondary structure is as high as 85.15%, indicating that the local amino acid sequence itself contains enough information for PSSP, a well-known classical view. By comparison with other predictors, we also achieve state-of-the-art accuracy on the CASP11, CASP12, and CASP13 datasets.
Conclusion: The well-trained models are expected to have good applications in protein structure modeling and protein design. This model can be downloaded from https://github.com/lake-chao/protein-secondary-structure-prediction.


Background
A protein is a large organic molecule that acts as the main undertaker of life activities in all living organisms, and its function mainly depends on the spatial structure attained through correct conformational folding and structural transitions. Protein secondary structure forms first, and acts as the seed in determining how proteins fold [1,2] and how fast they fold [3]. Three-dimensional structures of over 150,000 proteins, mainly determined by experimental approaches such as X-ray crystallography and nuclear magnetic resonance spectroscopy, are available from the Protein Data Bank (PDB) [4]. Even though these approaches have advanced at an ever-faster rate, most experimental methods remain expensive, time-consuming, and insufficient [5]. Therefore, fast and high-precision computational protein secondary structure prediction (PSSP) from amino acid sequences is of great significance for understanding the function of proteins in the field of bioinformatics [6][7][8].
However, models with good performance for PSSP often require evolutionary information such as multiple sequence alignments and even real protein structures (templates), entire protein sequences, and amino acid property profiles as input. SPIDER3-single [27], which was based upon entire amino acid sequences only, reported a Q3 accuracy of only 72.5% using a deep neural network model of LSTM-BRNN. SPIDER3-single was inferior to its original model SPIDER3, and higher prediction accuracy seems to depend upon a good combination of evolutionary information and long-range interactions.
Deep learning [28,29] methods allow deep neural networks to discover representations from raw data for specific tasks such as classification and pattern detection. Among the various deep learning models, the most commonly used in bioinformatics are the artificial neural network (ANN), composed of an input layer, hidden layers, and an output layer (Fig. 1a), and the recurrent neural network (RNN).
An RNN has loops, which allow information to be passed from one step of the network to the next (Fig. 1b). In the past decade, RNNs have had incredible success in problems such as speech recognition, language modeling, and translation, and one key to these successes is the use of the LSTM model, a special kind of RNN. The LSTM model, introduced by Hochreiter & Schmidhuber [30], includes the cell state (Fig. 1c), which is like a conveyor belt running through a sequence. LSTM allows information to flow along a sequence unchanged, enables linear interactions with each element in the sequence, and is thus able to capture and model long-range interactions.
An RNN seems a natural architecture for PSSP because it is created for, and takes advantage of, continuous sequence data. In this work, we designed a simple LSTM-RNN model, called LocalNet, for the prediction of secondary structures. Unlike SPIDER3 and other high-performance models, we did not use any evolutionary information such as multiple sequence alignments, and only used amino acid sequences from windows of fixed size as the input.

Model optimization
In this study, the output from the RNN is directly connected to the output layer. We experimented with various numbers of fully connected hidden layers and observed no noticeable improvement in accuracy, only a worsened overfitting problem. For the LSTM cell in the RNN network, we tried various numbers of units and found that 32 units produced the best accuracy with the minimum number of weights.
Multi-class cross-entropy loss is used as the cost function. For the ADAM optimizer, a learning rate of 0.001 produced a satisfactory result. Training over the training data set was terminated at 20 epochs due to noticeable overfitting after that point, and it took about 18 minutes on the Linux workstation we used.
Here we plot the cost and accuracy versus epoch for the window size of 19 as an example. In Fig. 2 (right), the cost dropped significantly during the first 3 epochs, followed by a gradual decrease. Correspondingly, the accuracy increased dramatically during the first 3 epochs, followed by a gradual change. After 8 epochs, the accuracy on the training data set kept increasing, but that on the validation data set started to degrade, apparently due to overfitting. The optimal accuracy on the validation data set is 0.836 at the 8th epoch (Fig. 2 (left)). The models with the best accuracy on the validation data set were saved and used to benchmark the testing data set.
The finally minimized cost and all benchmark metrics for the three data sets of training, validation and testing are given in Table 1. As shown in Table 1, LocalNet generally performs better as window sizes increase. For the validation data set, the best accuracy is reached at the window size of 21, and for the test data set, the best window size is 19. We tried to extend the window size further and observed no significant improvement in prediction accuracy, and even degraded performance on the validation data set. This implies that protein secondary structures are mainly determined by local sequences; long-range interaction, as claimed in several studies, does not seem necessary to achieve good prediction accuracy.
Performance on three states of helix, strand and coil
We measured the performance of the optimal model for a window size of 19 residues on the CASP11 [31], CASP12 [32] and CASP13 [33] datasets, which contain 105, 96, and 125 domain sequences, respectively.
The performance of LocalNet is comparable across these four data sets (Fig. 3). Taking CASP11 as an example, the detailed prediction accuracies of Q3, H, E, and C are 85.0%, 92.6%, 82.2%, and 60.5%, respectively. Moreover, the prediction accuracy of H is higher than 90% for CASP11, CASP12, CASP13, and the culled PDB.

Comparison with recent predictors
We compared the performance of LocalNet and other state-of-the-art models on three independent datasets: CASP11, CASP12, and CASP13. All protein targets (template-based and free-modeling targets) were used to evaluate LocalNet, and the results are listed in Table 2. For the CASP11 and CASP13 data sets, LocalNet's accuracy (85.0%) is comparable to those of DCRNN, MUFOLD-SS and Ensemble of Contextnet. For CASP12, LocalNet performs worse than these three top performers, with a Q3 accuracy of 80.5%, but it is still better than SPIDER3, RaptorX and DeepProf.
DCRNN used both deep convolutional and recurrent neural networks, with multiscale CNNs and three layers of BGRU, and is much more complicated than LocalNet. DCRNN's input includes the protein amino acid sequence, long-range contacts, sequence patterns, and other amino acid profiles. DCRNN's performance on the three CASP data sets is only marginally better than that of LocalNet, which used only a single RNN module and local amino acid sequences.
In terms of input, both SPIDER3-single and our model are based upon amino acid sequences only. The LSTM-BRNN structure of SPIDER3-single is similar to SPIDER3, but its accuracy is significantly lower. The authors attribute the accuracy of 72.5% to using the whole protein sequence as input and capturing long-range interactions between residues. LocalNet, having a much simpler structure than LSTM-BRNN, achieved better accuracy, and the sliding window strategy may account for the enhancement. By using a short window of amino acid sequence instead of the entire protein sequence as input, we are able to generate a much larger number of samples to train LocalNet. A sufficient sample size is particularly crucial for deep learning models to extract the functional relationship between variables.
For each feature and each dataset, the best three scores are marked in bold. Models which implement RNN or LSTM algorithms are marked in italic. Empty cells represent predictions that were not reported. The Q3 accuracies are taken from the respective papers [22,34-36]. For DeepSeqVec [36], the average running time for a single protein was 0.08 s, with a minimum of 0.006 s for the batch containing the shortest sequences (67 residues on average) and a maximum of 14.5 s (9860 residues on average). The only preprocessing LocalNet needs is to break protein sequences into continuous fragments, and it took less than a millisecond even for proteins of 9860 residues.

Discussions
In this study, we used only sliding windows of fixed size over protein amino acid sequences, i.e., fragments, as input, and we did not utilize any evolutionary information such as multiple sequence alignments or long-range interactions. For a window size of 19, LocalNet achieved a Q3 accuracy of 85.2%, comparable to other top performers. Our results are consistent with the traditional view that secondary structures are of a local nature and mainly determined by local amino acid sequences [5]. Long-range contacts may have some impact on protein secondary structure, but the impact is likely insignificant.
Two factors may have contributed to LocalNet's excellent performance. First, even for a window size of 19 residues, the training data set consists of over 750,000 samples. For deep learning models of high dimension, a sufficient training data size is crucial to reasonably approximate the unknown underlying mapping function from inputs to outputs; it is common knowledge that too little training data results in poor approximation. Second, as described above, in LocalNet a fragment is considered to form H if the 5 consecutive residues at the center are assigned as H by DSSP, and a fragment is considered to form E or C if the 3 consecutive residues at the center are assigned as E or C, respectively, by DSSP. These rules are consistent with well-known knowledge and helpful in removing noisy data; data quality is another crucial factor in building good deep learning models.
LocalNet's Q3 accuracy is comparable to those of other top performers, but LocalNet's accuracy for H is significantly higher than other models' and its accuracy for C is significantly lower. There are two reasonable explanations for these differences. First, among H, E, and C, only H is a relatively stable secondary structure. E is not a locally stabilized secondary structure; instead, it is stabilized by forming hydrogen bonds with distant residues on the amino acid sequence. Unlike H, the geometry of E is irregular, and C is even more irregular and rarely stable. Second, other top performers generally use evolutionary information, and their better accuracy in the prediction of C is likely derived from multiple sequence alignment.
This explanation is also consistent with the observation that models utilizing templates have improved accuracy.
The high-quality model for PSSP generated in this study is expected to have good applications in protein structure modeling and protein design. For protein folding problems, for example, if the alpha helix could be reliably predicted, it would significantly reduce the sampling space for locating the global free energy minimum. In protein engineering, such as antibody engineering, amino acids need to be changed to improve a protein's physical, chemical, and other properties. An accurate model for secondary structure prediction is obviously helpful in guiding such design.

Conclusions
We developed a very simple yet highly accurate model for Q3 prediction using an LSTM-RNN algorithm.
The high accuracies show that the local amino acid sequence itself contains sufficient information for secondary structure prediction without any homologous information. The trained models are expected to have good applications in protein structure modeling and protein design, and they may also help in understanding protein folding mechanisms.

Materials And Methods
Dataset and Hardware
12,358 protein X-ray structures in the precompiled culled PDB list [38] from the PDB were selected for this study. This list used a cutoff of 30% amino acid sequence identity, and all proteins in this list have a resolution better than 2.0 Å and an R-factor smaller than 0.25 (Table 3). We removed proteins with 40 residues or fewer and generated a refined list of 11,897 proteins. The DSSP software [39] was used to assign the proteins' Q3 secondary structures. Labels H, G, and I are assigned to class H; E and B to E; and S, T, and C to C. The 11,897 proteins were randomly split into a training data set of 10,719 entries, a test data set of 581 entries, and a validation data set of 597 entries.
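The 8-to-3 state reduction described above can be sketched as a simple lookup table (a minimal illustration; the helper name and the fallback for unexpected labels are our own, not part of DSSP):

```python
# Hypothetical sketch of the 8-state-to-3-state reduction described above.
# DSSP labels: H (alpha helix), G (3-10 helix), I (pi helix), E (strand),
# B (bridge), S (bend), T (turn), C (coil/other).
DSSP_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helices -> H
    "E": "E", "B": "E",             # strands and bridges -> E
    "S": "C", "T": "C", "C": "C",   # everything else -> C
}

def reduce_to_q3(dssp_string):
    """Map a per-residue DSSP assignment string to the three Q3 states."""
    # Unknown labels default to coil (an assumption for robustness).
    return "".join(DSSP_TO_Q3.get(label, "C") for label in dssp_string)
```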

Model
As illustrated in Figure 4, LocalNet starts with an RNN module, followed by one output classification layer. The LSTM cell consists of 32 units. The input to the RNN is the fragment sequence, with each residue encoded as a one-hot vector. A softmax layer is used as the classification layer, and the output layer consists of 3 nodes for H, E, and C.
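The one-hot residue encoding can be illustrated as follows (a minimal sketch; the 20-letter alphabet ordering and the function name are our assumptions, not taken from the paper):

```python
# Standard 20 amino acid one-letter codes; the ordering here is an
# assumption for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(fragment):
    """Encode a window of residues as a list of 20-dimensional one-hot vectors."""
    vectors = []
    for residue in fragment:
        v = [0] * len(AMINO_ACIDS)
        v[AMINO_ACIDS.index(residue)] = 1  # single 1 at the residue's index
        vectors.append(v)
    return vectors
```

A window of size 19 thus becomes a 19 x 20 binary matrix fed to the LSTM one residue per time step.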
Backpropagation is used for training the network [40]. Optimization of the loss function is carried out with mini-batches of size 128 and the ADAM optimizer [41], which is implemented as tf.train.AdamOptimizer in the TensorFlow library [42].

Input features and preprocessing
In this study, we focused on Q3 prediction of H, E and C [43]. The Q3 accuracy was calculated by the following equation:

Q3 = ((N_H + N_E + N_C) / N_total) x 100%

where N_total is the total number of residues and N_i is the number of correctly predicted residues in state i [34].
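The Q3 metric above can be computed with a few lines of code (a minimal illustration; the function name is ours):

```python
def q3_accuracy(predicted, observed):
    """Q3 = (correctly predicted residues over all states) / (total residues) * 100."""
    assert len(predicted) == len(observed)
    # Summing per-residue matches over H, E and C is equivalent to
    # summing N_H + N_E + N_C in the equation above.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```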
For a residue with a secondary structure assigned by DSSP, this residue and its neighboring residues are extracted from a protein sequence to form a data sample. To test the impact of window size on prediction accuracy, 8 window sizes (7, 9, 11, 13, 15, 17, 19, and 21) were used. A fragment is considered to form H if the 5 consecutive residues at the center are assigned as H by DSSP, and a fragment is considered to form E or C if the 3 consecutive residues at the center are assigned as E or C, respectively, by DSSP. These rules are based upon known biochemistry and are used to ensure data quality. A typical α helix contains about ten amino acids (about three turns) due to stabilizing interactions [44], and β sheets are typically 3 to 10 amino acids long with the backbone in an extended conformation [45].
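The center-residue labeling rules above can be sketched as follows (a hypothetical helper; how windows failing both rules are handled is our assumption, here they are discarded by returning None):

```python
def label_window(q3_window):
    """Assign a class to an odd-length window of Q3 labels, or None.

    Following the rules above: the window is labeled H only if the
    5 consecutive residues at its center are H, and labeled E or C
    only if the 3 consecutive residues at its center are E or C.
    """
    mid = len(q3_window) // 2
    if q3_window[mid - 2 : mid + 3] == "HHHHH":
        return "H"
    center3 = q3_window[mid - 1 : mid + 2]
    if center3 == "EEE":
        return "E"
    if center3 == "CCC":
        return "C"
    return None  # noisy sample, discarded
```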
The sizes of the extracted samples of three data sets at different window sizes are given in Table 4.

Availability of data and materials
The data and model can be downloaded from https://github.com/lake-chao/protein-secondary-structure-prediction.
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Fig. 1 Illustration of ANN structure (a) and RNN structure (b) with LSTM memory cell, which contains forget gate, input gate, output gate and cell state (c).

Fig. 3 Prediction accuracy of Q3, H, E, and C for CASP11, CASP12, CASP13 and the culled PDB database.