Determining the three-dimensional structure of proteins is one of the most important research problems in structural biology. The structure of a protein is directly related to its function, and structure prediction is an important goal of bioinformatics and theoretical chemistry, with great potential benefits for medicine and biotechnology. Hence, how to predict three-dimensional structures from protein sequences has long been a significant unsolved problem. Although the amino acid sequence determines protein structure, other factors also contribute to structural modification, which calls for an efficient technique to delineate the global properties of protein structure space [1–4]. Current techniques for determining protein structures include X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and structure alignment. Modern machine learning methods, such as neural networks and support vector machines, have also been applied to protein structure prediction [5–18]. For example, Chou developed methods that account for the coupling among the different amino acid components of a protein by means of a covariance matrix [8, 9]. Brevern defined a structural alphabet that allows local approximation of the 3D protein structure, using a Bayesian approach based on the relationship between protein blocks and amino acid propensities [11]. Wood proposed a method called DESTRUCT, which uses a sequence and structure representation together with an iterative prediction algorithm [12]. Jung created a web server that provides structural information and analysis based on the backbone torsional representation of a protein structure [13]. A growing number of structure prediction software tools have also appeared recently, covering homology modeling, protein threading, ab initio methods, secondary structure prediction, and transmembrane helix and signal peptide prediction; examples include RaptorX [19], I-TASSER [20], and HHpred [21].
However, these methods often require time-consuming analysis of experimental results, especially for large protein molecules, which limits their reliability and effectiveness for structure prediction. Both computational speed and accuracy therefore still have room for improvement. Moreover, there are many known examples of proteins that have the same amino acid sequence but different structures. Many existing methods are limited in predicting the structures of such sequences, because they return only the single most likely structure for each sequence. It is therefore necessary to develop a more accurate, fast, and effective method to delineate the relationship between sequence code and structure space.
Here, we have therefore attempted to develop a methodology that uses primary amino acid sequence information to make a precise and effective prediction of the possible structures of a particular protein, and to visualize the comparison between the native structure and the predicted structure. Our method is based on the integration and analysis of torsion angle information from the Protein Data Bank (PDB), which contains information on over 10 million torsion angles. By taking into account the torsion angles associated with protein sequences, our algorithm improves secondary structure prediction in general. It not only determines the class of the most likely structure for a given amino acid sequence, but can also predict and model multiple structures of the same sequence, something many other software tools cannot do. We applied our method to the prediction of protein domain structures in two large CATH protein structure classification datasets [23] and compared our results with previously published methods [8, 9, 22]. The CATH database provides a hierarchical classification of protein domains on the basis of class (C), architecture (A), topology (T), and homologous superfamily (H). The new prediction method performed well, with an average accuracy of 92.5% for structure classification, a substantial improvement over Rackovsky's previous research. The method was also applied to a single amino acid sequence to model four different known protein structures. We also used RaptorX to predict the structure of the same sequence and compared its results with those of our method. The precision and reliability of our results were verified by calculating the dissimilarity between the predicted and actual protein structures, using both the root-mean-square deviation (RMSD) and the Yau-Hausdorff distance [24, 25].
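To make the notion of a torsion angle concrete, the following minimal Python sketch computes the signed dihedral angle defined by four consecutive atoms. It is an illustration only and not the authors' pipeline: the function name `torsion_angle` and the sample coordinates are hypothetical, and the sign convention assumed is the usual one in which a clockwise rotation, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle(p0, p1, p2, p3):
    """Signed torsion (dihedral) angle in degrees, in (-180, 180].

    The angle is measured about the central bond p1 -> p2. For a protein
    backbone, phi uses the atoms C(i-1), N(i), CA(i), C(i) and psi uses
    N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical coordinates chosen so that the geometry is easy to check.
    print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applied along a backbone, a function of this kind turns a chain of atomic coordinates into a sequence of (phi, psi) pairs, which is the sort of torsional representation the text refers to.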
The Yau-Hausdorff distance is a metric that measures the difference between two proteins of any lengths from the three-dimensional coordinates of their atoms, without requiring the two structures to be aligned and superimposed [24, 25]. Our results demonstrate that this new approach is efficient and reliable for protein structure prediction, can obtain multiple different structures for the same sequence, and can improve protein fold recognition, the classification of structural motifs, and the refinement of sequence alignments.
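The contrast between the two kinds of dissimilarity measure can be sketched in a few lines of Python. This is a hedged illustration, not the procedure of [24, 25]: `rmsd` assumes two coordinate lists of equal length that have already been paired and superimposed, while `hausdorff` is the classical Hausdorff distance between two point sets, which accepts sets of different sizes. The Yau-Hausdorff distance is a different construction and is not reproduced here; in particular, the classical distance below still depends on how the two coordinate sets are placed relative to each other. All function names and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points.

    Assumes the two structures have the same number of atoms, listed in
    corresponding order, and are already superimposed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("RMSD needs two non-empty lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def hausdorff(coords_a, coords_b):
    """Classical (symmetric) Hausdorff distance between two point sets.

    The sets may contain different numbers of points, and no pairing of
    atoms is required.
    """
    if not coords_a or not coords_b:
        raise ValueError("both point sets must be non-empty")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(s, d) for d in dst) for s in src)

    return max(directed(coords_a, coords_b), directed(coords_b, coords_a))


if __name__ == "__main__":
    # Hypothetical toy coordinates, not taken from any real structure.
    short_chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    shifted_chain = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    long_chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

    print(rmsd(short_chain, shifted_chain))    # equal lengths required
    print(hausdorff(short_chain, long_chain))  # lengths may differ
```

The equal-length and pairing requirements in `rmsd` are what make a prior alignment step necessary in practice, which is the limitation that a set-based distance avoids.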