The primary objective of the present study is to design a novel procedure that identifies, characterizes and predicts the structure of the protein under study, demonstrates its functional capability and predicts its therapeutic potential for the treatment of diseases. Although these steps are standard in protein analysis, a single article presenting them in order with benchmarked tools was lacking.
Proteins belonging to the same family usually share a series of conserved sequence motifs that act as a fingerprint, enabling the identification of novel sequences. The Cystatin C amino acid sequence of Danio rerio obtained from the NCBI database should therefore contain a Cystatin domain. The sequence motifs of the Cystatin superfamily consist of an N-terminal glycine and a QXVXG sequence motif [36]. The Conserved Domain (CD) search revealed the presence of a Cystatin domain with the two evolutionarily conserved sequence motifs. When the starting point is a DNA sequence coding for a novel protein, it is recommended to commence the study with NCBI’s BLASTx search [37].
Conserved segments in DNA or protein sequences play a role in the functioning of the macromolecules encoded by them [38]. Likewise, proteins of similar function belong to the same family. Therefore, the positioning of a predicted protein among others of shared function, together with the presence of these evolutionarily conserved sequences, enables the validation of a predicted protein [39].
The phylogenetic tree of the Cystatin superfamily constructed with the MEGA 6 software revealed a Cystatin C dendrogram that coincides with the existing validated Cystatin protein family. The generated phylogenetic tree depicted in Fig. 2 showed a common ancestry between the Type 1 Cystatins, Cystatin A and B, with a branching off to a shared ancestry between Type 2 and 3 Cystatins. A completely separate branch is formed by the Type 4 Cystatins, or Fetuins. This phylogenetic pattern, which was successfully reproduced in the study, is characteristic of the Cystatin superfamily [14]. The protein under study was positioned among the Type 2 Cystatin C proteins, which supports the accuracy of the predicted phylogenetic tree.
Performing a Multiple Sequence Alignment (MSA) after the phylogenetic analysis enables the accurate recognition of evolutionarily conserved sequence motifs among closely related proteins that are truly involved in their characteristic functions. It also enables the identification of the sequence motifs that are most critical in governing the protein’s function [40]. Cystatin C proteins are expected to have three such evolutionarily conserved sequence motifs, namely an N-terminal glycine, a QVVAG motif and a PW motif [14]. The MSA revealed that all three sequence motifs are conserved in the polypeptide chain of the protein under study.
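As an illustration of how the motif check described above could be automated for a single sequence, the following minimal Python sketch scans for the QXVXG and PW motifs and for a glycine near the N-terminus. The function name, the size of the N-terminal window and the toy sequence are assumptions made for this example; they are not taken from the study, and the sketch does not replace the MSA itself.

```python
import re

# Motif patterns as described in the text; "X" (any residue) becomes "." in regex.
MOTIFS = {
    "QXVXG": re.compile(r"Q.V.G"),
    "PW": re.compile(r"PW"),
}

def scan_cystatin_motifs(sequence, n_terminal_window=10):
    """Return 0-based start positions of each motif in a protein sequence.

    n_terminal_window is an illustrative assumption: the number of leading
    residues searched for the conserved N-terminal glycine.
    """
    seq = sequence.upper()
    hits = {name: [m.start() for m in pattern.finditer(seq)]
            for name, pattern in MOTIFS.items()}
    hits["N-terminal G"] = [i for i, aa in enumerate(seq[:n_terminal_window])
                            if aa == "G"]
    return hits

# Toy sequence invented for illustration (NOT the D. rerio Cystatin C sequence).
toy_sequence = "MKLGAPVDASQVVAGTNYRLEVKCPWQ"
print(scan_cystatin_motifs(toy_sequence))
```

A sequence in which any of the three lists comes back empty would be flagged for closer inspection in the alignment.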
The step following protein characterization is functional analysis, which requires the prediction of the protein’s native tertiary (3D) structure. This is done iteratively, beginning with the identification of the protein’s secondary structure [2]. The secondary structure of a typical Cystatin C protein must have two alpha helices, one short and one long, and between them five antiparallel beta sheets [41]. The JPred server showed all of these features to be present in the amino acid sequence under study.
The prediction of the protein’s tertiary structure through computational techniques offers a two-fold advantage. First, the conventional methods of NMR and X-ray crystallography are expensive and not widely accessible. Second, the structure of proteins with a short half-life cannot be determined using these methods [1]. The current pipeline successfully utilizes the two main methods of protein modelling available, namely homology and ab-initio modelling [42].
The I-TASSER server was utilized for the homology-based prediction of the protein structure. I-TASSER takes a hierarchical approach to protein structure prediction [43], and its reliability has been demonstrated in the Critical Assessment of methods of protein Structure Prediction (CASP) experiments [44]. The predicted structure contained all the structural components that must be present in a Cystatin C protein: a short and a long alpha helix, five antiparallel beta sheets and two disulfide bonds [21].
Ab-initio or de-novo protein modelling was conducted utilizing the QUARK server. In the absence of known templates, this technique can be utilized to predict the protein structure from the amino acid sequence alone [42]. The QUARK server was selected due to its reliability, demonstrated in the CASP experiments [45]. However, the ab-initio model failed the validity test: both of its alpha helices were long, and it had no disulfide bonds.
The function of a protein depends on its structure. A valid tertiary structure should be at a free energy minimum and be stereochemically stable [46]. Testing these two properties with the ProSA-web server and Ramachandran plots, respectively, revealed that the structure had a few discrepancies. In such an event, refinement of the protein model has proven to be an efficient solution. The ReFOLD server, validated through the CASP12 experiments, was capable of removing these discrepancies, and the refined model showed significant improvements on retesting [47].
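A Ramachandran plot is built from the backbone dihedral angles phi (defined by the atoms C(i-1), N, CA, C) and psi (defined by N, CA, C, N(i+1)) of each residue. The following minimal Python sketch shows how one such dihedral angle can be computed from four atomic coordinates. It is a generic geometric illustration, not the procedure used by the validation servers in the study; the function names and the coordinates in the usage example are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (range -180 to 180) defined by four points."""
    b0 = _sub(p0, p1)                      # bond pointing from p1 back to p0
    b1 = _sub(p2, p1)                      # central bond (rotation axis)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)       # unit vector along the axis
    # Remove the components of b0 and b2 that lie along the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates chosen so that the four points form a right-angle twist.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this function to every residue of a model yields the (phi, psi) pairs that a Ramachandran plot places in favoured, allowed or disallowed regions.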
It is known that electrostatically charged and hydrophobic residues on a protein’s surface play a key role in long-distance protein-protein interactions and are vital for drug design. Therefore, these regions were mapped before the virtual screening process was conducted [48]. The Cystatin C under study had 35 such regions.
Cystatin C proteins are considered the most active protease inhibitors of the entire Cystatin family. They are capable of competitively blocking the active site of Papain and of Cathepsins B, H, L and S [14]. Protein-protein docking supported this statement, for in each instance the cysteine active site of the Cathepsin protein was blocked by the Cystatin C protein. According to the Gibbs free energy principle, proteins forming valid complexes should have negative binding free energies [1]. The inhibition of the Cathepsins occurred in every case under favourable negative binding energies, further supporting the results of the pipeline’s chosen docking software.
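The link between a negative binding free energy and a stable complex can be made concrete with the standard thermodynamic relation ΔG = RT ln(Kd), where Kd is the dissociation constant relative to a 1 M standard state. The sketch below converts between the two quantities. The function names and the energy values in the usage example are hypothetical and are not results from the study; docking scores are also only approximations of true binding free energies, so the resulting Kd values should be read as rough guides.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def dissociation_constant(delta_g, temperature=298.15):
    """Kd in mol/L from a binding free energy in kcal/mol (dG = RT ln Kd)."""
    return math.exp(delta_g / (R_KCAL * temperature))

def binding_free_energy(kd, temperature=298.15):
    """Binding free energy in kcal/mol from Kd in mol/L."""
    return R_KCAL * temperature * math.log(kd)

# Hypothetical binding energies (kcal/mol) for illustration only.
for label, dg in [("complex A", -12.0), ("complex B", -8.0)]:
    print(f"{label}: dG = {dg} kcal/mol, Kd = {dissociation_constant(dg):.2e} M")
```

The more negative the binding free energy, the smaller the dissociation constant and hence the tighter the predicted complex.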
Residues that consistently take part in the formation of protein-protein complexes are believed to be related to the protein’s function [4]. Such residues form the protein’s active site [49]. Utilizing this principle, the active site of the Cystatin C protein was identified by recognizing the amino acid residues that were always involved in the inhibition of the Cathepsin proteins (Fig. 10).
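The principle described above amounts to taking the intersection of the interface residue sets across all docked complexes. A minimal Python sketch under that assumption follows. The function name and the residue labels are hypothetical placeholders, not the residues identified in Fig. 10.

```python
def consensus_interface(interfaces):
    """Return the residues present in the interface of every docked complex.

    interfaces maps a complex name to an iterable of inhibitor residue labels
    found at the interface of that complex.
    """
    residue_sets = [set(residues) for residues in interfaces.values()]
    if not residue_sets:
        return set()
    return set.intersection(*residue_sets)

# Hypothetical interface residues, invented for illustration only.
example = {
    "Cathepsin B": ["GLY10", "GLN55", "VAL57", "TRP106", "LYS30"],
    "Cathepsin L": ["GLY10", "GLN55", "VAL57", "TRP106"],
    "Cathepsin S": ["GLY10", "GLN55", "VAL57", "TRP106", "ASP15"],
}
print(sorted(consensus_interface(example)))
```

Residues that appear in only some complexes are excluded, so the result is a conservative estimate of the shared binding site.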
Genetic interaction mapping reveals the functional pathways involved in molecular mechanisms [50]. Some of these mechanisms can become clinically significant if compromised and lead to disease conditions such as tumorigenesis. Therefore, the study of gene-gene interaction maps of model organisms and humans has been incorporated into drug discovery and design [51]. Gene interaction mapping of Cathepsins B, L1 and S revealed a number of clinically compromised pathways caused by Cathepsin overexpression (Fig. 11). Because the Cystatin C protein is a competitive inhibitor, comparing the binding affinity of the Cathepsins for their natural substrates with their affinity for the Cystatin C protein should reveal the potential of the protein under study to act as a drug [52]. The viable candidates were diseases caused by overexpression of Cathepsin B and L1.
The benchmarking of the pathway began with the prediction of the structure and function of the Cystatin C of D. rerio. To further test the reliability and accuracy of the proposed pathway and the software used, the AChE proteins of both humans and rats were selected. This also enabled the extension of the pathway from protein-protein analysis to protein-ligand analysis.
The hAChE macromolecule was used to test the reliability of the I-TASSER software against already identified protein structures, since the tertiary structure of the hAChE protein already exists in the Protein Data Bank (PDB). Even though the server predicted a structure closely resembling the true native model after refinement, minor deviations remained, resulting in an RMSD value of 0.609. Apart from this, both models were successfully analyzed for their functionality.
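For reference, the RMSD between a predicted model and a native structure is the square root of the mean squared distance between equivalent atoms. The following minimal Python sketch computes it for two coordinate lists, assuming that the structures have already been optimally superposed and that atoms are listed in matching order. The function name and the coordinates are hypothetical, and the sketch is not the tool used to obtain the value reported above.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and that the i-th atom
    of coords_a corresponds to the i-th atom of coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical C-alpha coordinates, invented for illustration only.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.1, 0.0, 0.0), (3.8, 0.2, 0.0), (7.6, 0.0, -0.1)]
print(round(rmsd(native, model), 3))
```

In practice the superposition step matters as much as the formula, since an RMSD computed without optimal alignment overstates the difference between two structures.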
The AChE protein was utilized to compare AutoDock Vina and Glide docking. The resultant data showed that, in contrast to Glide, AutoDock Vina failed to accurately predict the ligand binding site: instead of forming a complex in the active site gorge, the ligand was bound to the surface of the protein. This indicates that the Glide docking software was more reliable in the prediction of protein-ligand interactions.