1. Identification of sequence homology
The BLASTp search with the FASTA sequence identified homologous proteins in the non-redundant and Swiss-Prot databases (Tables 1 and 2). A phylogenetic tree constructed from the multiple sequence alignment of the BLASTp hits shows the evolutionary relationship of the selected hypothetical protein (WP_130598461.1) to these proteins (Figure 2).
Table 1: Similar proteins obtained from the non-redundant database.
Table 2: Similar proteins obtained from Swissprot database
Entry | Protein names | Identity | Score | E-value
A0A396TZK2 | Uncharacterized protein (Colwellia sp. RSH04) | 74.2% | 894 | 1.3e-120
A0A545UCJ6 | VHL domain-containing protein (Aliikangiella sp. M105) | 34.3% | 81 | 8.3e-28
A0A1Z4R2C0 | VHL domain-containing protein (Calothrix sp. NIES-4101) | 36.6% | 150 | 1.5e-9
A0A1I6H391 | Por secretion system C-terminal sorting domain-containing protein (Robiginitalea myxolifaciens) | 37.1% | 133 | 7e-6
A0A2S7JPT4 | VHL domain-containing protein (Limnohabitans sp. TS-CS-82) | 35.1% | 124 | 2e-5
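As a rough illustration of how hits such as those in Table 2 can be ranked, the sketch below sorts the tabulated entries by E-value and keeps those at or below a cutoff. The cutoff of 1e-5 and the helper name `rank_hits` are assumptions made for this example and are not taken from the study.

```python
# Hits transcribed from Table 2: (entry, identity %, score, E-value).
hits = [
    ("A0A396TZK2", 74.2, 894, 1.3e-120),
    ("A0A545UCJ6", 34.3, 81, 8.3e-28),
    ("A0A1Z4R2C0", 36.6, 150, 1.5e-9),
    ("A0A1I6H391", 37.1, 133, 7e-6),
    ("A0A2S7JPT4", 35.1, 124, 2e-5),
]

E_VALUE_CUTOFF = 1e-5  # assumed threshold, not a value reported in the study


def rank_hits(rows, cutoff=E_VALUE_CUTOFF):
    """Keep hits with an E-value at or below the cutoff, most significant first."""
    kept = [row for row in rows if row[3] <= cutoff]
    return sorted(kept, key=lambda row: row[3])


ranked = rank_hits(hits)
for entry, identity, score, e_value in ranked:
    print(f"{entry}\t{identity:.1f}%\t{score}\t{e_value:.1e}")
```

With a stricter or looser cutoff the retained set changes, so the threshold should be chosen to match the purpose of the search.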
2. Analysis of physicochemical properties
The physicochemical properties of a protein can be estimated from the properties of its constituent amino acids. The hypothetical protein is acidic, with a theoretical pI of 4.22, and the total numbers of positively charged (Arg + Lys) and negatively charged (Asp + Glu) residues were 10 and 27, respectively. The computed instability index (II) was 32.71, which classifies the protein as stable (II below 40). The aliphatic index was 77.37, which suggests that the protein is stable over a wide temperature range. All other properties are summarized in Table 3.
Table 3: Physicochemical properties of the hypothetical protein (WP_130598461.1)
Properties | Value
Molecular weight (Da) | 23229.44
Theoretical pI | 4.22
Total number of negatively charged residues (Asp + Glu) | 27
Total number of positively charged residues (Arg + Lys) | 10
Instability index (II) | 32.71
Formula | C1024H1552N262O346S5
Total number of atoms | 3189
Aliphatic index | 77.37
Grand average of hydropathicity (GRAVY) | -0.261
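The entries in Table 3 can be cross-checked against one another. The following minimal sketch parses the molecular formula, sums the atoms, and estimates the molecular weight from rounded average atomic masses. The mass values and the helper name `parse_formula` are assumptions of this example, so the estimate is expected to agree with the tabulated weight only approximately.

```python
import re

# Rounded average atomic masses in Da (assumed values for this sketch).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

FORMULA = "C1024H1552N262O346S5"  # from Table 3


def parse_formula(formula):
    """Return {element: count} for a formula such as 'C1024H1552N262O346S5'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


composition = parse_formula(FORMULA)
total_atoms = sum(composition.values())
estimated_mass = sum(ATOMIC_MASS[el] * n for el, n in composition.items())

# Net balance of charged residues from Table 3 (Arg + Lys minus Asp + Glu).
charge_balance = 10 - 27

print(total_atoms, round(estimated_mass, 2), charge_balance)
```

The negative residue balance is in line with the acidic theoretical pI reported for the protein.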
3. Secondary structure analysis
The secondary structure of a protein can provide useful information about its function. According to SOPMA, the query hypothetical protein consists of 21.13% alpha-helix, 9.91% beta-turn, 33.33% extended strand, and 36.15% random coil. These results were cross-checked with the PSIPRED server, which gave a similar summary. The representative secondary structure of the hypothetical protein (WP_130598461.1) is shown in Figure 3.
4. Assessment and validation of protein 3-dimensional structure
The PROCHECK program was used to validate the predicted tertiary structure; the distribution of the φ and ψ angles of the model is shown in Table 4 and Figure 4. According to the Ramachandran plot statistics, the model was judged to be good, with 91.1% of residues in the most favored regions. Finally, the structure validation servers Verify3D and ERRAT were used to verify the 3D model built for the target sequence. In the Verify3D graph, 93.75% of the residues have an averaged 3D-1D score ≥ 0.2, which indicates that the environmental profile of the model is good, and the overall quality factor predicted by the ERRAT server was 60.7143, which indicates a quality model. From ProFunc, the average G-factor of the hypothetical protein was calculated to be -0.20, which indicates a usual protein model.
Table 4: Ramachandran plot statistics of the predicted 3D model for the target protein EMK97_00595 (WP_130598461.1)
Plot statistics | Number of amino acid residues | Percentage (%)
Residues in the most favored regions [A, B, L] | 51 | 91.1
Residues in additional allowed regions [a, b, l, p] | 4 | 7.1
Residues in generously allowed regions [~a, ~b, ~l, ~p] | 0 | 0.0
Residues in disallowed regions | 1 | 1.8
Number of non-glycine and non-proline residues | 56 | 100.0
Number of end-residues (excl. Gly and Pro) | 2 |
Number of glycine residues (shown as triangles) | 4 |
Number of proline residues | 2 |
Total number of residues | 64 |
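The percentages in Table 4 follow directly from the residue counts, since they are expressed relative to the non-glycine, non-proline residues. The sketch below recomputes them from the tabulated counts; the variable names are chosen for this example only.

```python
# Residue counts per Ramachandran region, transcribed from Table 4.
region_counts = {
    "most favored": 51,
    "additional allowed": 4,
    "generously allowed": 0,
    "disallowed": 1,
}

# Percentages are taken over non-glycine, non-proline residues.
non_gly_pro = sum(region_counts.values())
percentages = {
    region: round(100 * count / non_gly_pro, 1)
    for region, count in region_counts.items()
}

# End-residues, glycines, and prolines are counted separately in Table 4.
end_residues, glycines, prolines = 2, 4, 2
total_residues = non_gly_pro + end_residues + glycines + prolines

for region, pct in percentages.items():
    print(f"{region}: {pct}%")
print("total residues:", total_residues)
```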
5. Active site calculation
The active site of the selected hypothetical protein comprises 11 amino acids, with an area of 52.957 and a volume of 22.609. The residues involved in the active site, located on chain X of the hypothetical protein, are F, V, Y, Y, T, L, E, V, T, Q, and W (supplementary Figure 6, A and B).
6. Assessment of protein subcellular localization and topology
The hypothetical protein is predicted to be an extracellular, secreted protein that carries a signal peptide. Protein-sol and SOSUI both predict that the protein is soluble, and HMMTOP and TMHMM predict that it is a non-transmembrane protein (Table 5). The predicted topology of the protein is shown here from the N-terminus to the C-terminus.
Table 5: Assessment of subcellular localization
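Prediction | Servers | Results
Prediction of subcellular localization | Busca | Extracellular space, Signal peptide
Prediction of subcellular localization | Cello | Extracellular
Prediction of subcellular localization | PsortB | Unknown, Signal Peptide
Prediction of subcellular localization | Cell-PLoc | Extracellular
Prediction of subcellular localization | PSLpred | Extracellular protein
Prediction of subcellular localization | SOSUIgramN | Outer membrane
Signal Peptide prediction | Predisi | Secreted protein, Signal peptide
Signal Peptide prediction | SignalP-5.0 Server | Signal Peptide
Prediction of protein solubility | SOSUI | Soluble protein
Prediction of protein solubility | Protein-sol | Soluble protein
Prediction of Transmembrane helices | HMMTOP | None
Prediction of Transmembrane helices | TMHMM | None
Prediction of Transmembrane helices | Sable | No transmembrane domain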
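Because the localization servers in Table 5 do not all agree, a simple majority vote is one way to summarize them. The sketch below is only an illustration of that idea: the coarse labels produced by `normalize` are an assumed scheme for this example, not a procedure described in the study.

```python
from collections import Counter

# Subcellular localization results transcribed from Table 5.
predictions = {
    "Busca": "Extracellular space, Signal peptide",
    "Cello": "Extracellular",
    "PsortB": "Unknown, Signal Peptide",
    "Cell-PLoc": "Extracellular",
    "PSLpred": "Extracellular protein",
    "SOSUIgramN": "Outer membrane",
}


def normalize(result):
    """Map a free-text server result to a coarse compartment label (assumed scheme)."""
    text = result.lower()
    if "extracellular" in text:
        return "extracellular"
    if "outer membrane" in text:
        return "outer membrane"
    return "unknown"


votes = Counter(normalize(result) for result in predictions.values())
consensus, support = votes.most_common(1)[0]
print(f"{consensus}: {support} of {len(predictions)} servers")
```

A vote of this kind treats every server equally; weighting servers by their reported accuracy would be a natural refinement.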
7. Functional annotation of the hypothetical protein
The initial protein domain was obtained from the Conserved Domain Database (CDD) of NCBI. The domain region and the superfamily and family classifications were determined with the following servers: CDD, Pfam, SMART, Interpro, SCOP, Supfam, MotifFinder, ProFunc, Phyre 2, and CATH-Gene3D. The domain, superfamily, and family were selected on the basis of the lowest e-value, and hits with higher e-values were filtered out of the selection. The e-values of 9.11e-05 for the VHL beta domain from ProFunc, 2.71e-09 for the VHL superfamily from SCOP, and 8.1e-03 for the VHL family from Supfam indicate good alignments. The overall alignment range was 133-212 for the VHL beta domain and 144-200 for the VHL superfamily and family. The coiled-coil nature of the protein was assessed with PCoils from the Bioinformatics Toolkit server. According to Phyre 2, the fold of the hypothetical protein is prealbumin-like, whereas PEF-FunSeqE classifies the protein as immunoglobulin-like. Both fold types occur in secreted, soluble proteins and therefore support the similarity to the VHL protein (Table 6).
Table 6: Functional annotation of the hypothetical protein through the analysis of protein domain/superfamily/family
Servers | Domain/Superfamily/Family | e-value/Confidence | Region/Alignment
Functional annotation from sequence
Conserved Domain Database (CDD) | Superfamily: pVHL | 6.22e-05 | 146-197
Pfam | Family: VHL (VHL beta domain) | 1.3e-02 | 144-200
SMART | VHL | 1.2e-02 | 133-205
Interpro | VHL superfamily | - | 144-199
Interpro | VHL beta domain | - | 131-212
Superfamily 1.75 (SCOP) | Superfamily: VHL | 2.71e-09 | 144-199
Superfamily 1.75 (SCOP) | Family: VHL | 8.1e-03 |
Supfam | Superfamily: VHL | 1.54e-09 | 144-199
Supfam | Family: VHL | 8.1e-03 |
Motif (From Pfam) | VHL beta domain | 8.1e-03 | 146-200
Functional annotation from the 3D structure
ProFunc | VHL beta domain | 9.11e-05 | 131-191
Phyre 2 | Superfamily: VHL; Family: VHL | 99.8% (Confidence) | 135-212
CATH-Gene3D (From Interpro) | VHL beta domain | - | 131-212
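The alignment regions reported by the different servers in Table 6 overlap but are not identical. As an illustrative sketch, and not part of the reported analysis, the code below computes the widest span covered by any server and the core region shared by all of them. Only rows with an explicit region in Table 6 are included, and the dictionary keys are labels invented for this example.

```python
# Alignment regions (start, end) transcribed from Table 6.
regions = {
    "CDD": (146, 197),
    "Pfam": (144, 200),
    "SMART": (133, 205),
    "Interpro superfamily": (144, 199),
    "Interpro beta domain": (131, 212),
    "SCOP superfamily": (144, 199),
    "Supfam superfamily": (144, 199),
    "Motif": (146, 200),
    "ProFunc": (131, 191),
    "Phyre 2": (135, 212),
    "CATH-Gene3D": (131, 212),
}

starts = [start for start, _ in regions.values()]
ends = [end for _, end in regions.values()]

# Widest span: from the earliest start to the latest end.
span = (min(starts), max(ends))

# Shared core: residues covered by every region (empty if start > end).
core = (max(starts), min(ends))
has_core = core[0] <= core[1]

print("span:", span)
print("core:", core if has_core else None)
```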
8. Analysis of protein network
The STRING interaction network of the VHL protein from Homo sapiens is shown in Figure 8 as a model. VHL interacts with various proteins, which are listed with their combined scores in Table 7. The network has 11 nodes and 40 edges, with an average node degree of 7.27 and a local clustering coefficient of 0.819. The expected number of edges is 18, and the protein-protein interaction enrichment p-value of 7.07e-06 indicates that the network has significantly more interactions than expected.
Table 7: Interacting proteins and their combined score from STRING 11.0 server
Interacting protein | Combined score
AKT1 (RAC-alpha serine/threonine-protein kinase) | 0.997
AKT2 (RAC-beta serine/threonine-protein kinase) | 0.994
CUL2 (Cullin-2; Core component of multiple cullin-RING-based ECS E3 ubiquitin-protein ligase complexes) | 0.999
EGLN1 (Egl nine homolog 1) | 0.989
EPAS1 (Endothelial PAS domain-containing protein 1) | 0.994
HIF1A (Hypoxia-inducible factor 1-alpha) | 0.999
PPP2CA (Serine/threonine-protein phosphatase 2A catalytic subunit alpha isoform) | 0.993
RBX1 (E3 ubiquitin-protein ligase RBX1) | 0.982
TCEB1 (Elongin-C) | 0.999
TCEB2 (Elongin-B) | 0.998
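The network statistics quoted for the STRING model are related by simple arithmetic: in an undirected network, the average node degree is twice the number of edges divided by the number of nodes. The minimal sketch below checks this for the reported values and ranks the partners in Table 7 by combined score; the variable names are chosen for this example.

```python
# Network summary values reported for the STRING model.
nodes, edges, expected_edges = 11, 40, 18

# Each edge contributes to the degree of two nodes.
average_degree = 2 * edges / nodes
observed_to_expected = edges / expected_edges

# Combined scores transcribed from Table 7.
combined_scores = {
    "AKT1": 0.997,
    "AKT2": 0.994,
    "CUL2": 0.999,
    "EGLN1": 0.989,
    "EPAS1": 0.994,
    "HIF1A": 0.999,
    "PPP2CA": 0.993,
    "RBX1": 0.982,
    "TCEB1": 0.999,
    "TCEB2": 0.998,
}

# Partners sorted from highest to lowest score, ties broken by name.
ranked = sorted(combined_scores.items(), key=lambda item: (-item[1], item[0]))
best_score = ranked[0][1]
top_partners = [name for name, score in ranked if score == best_score]

print(f"average node degree: {average_degree:.2f}")
print(f"observed/expected edges: {observed_to_expected:.2f}")
print("highest-scoring partners:", ", ".join(top_partners))
```

The ten partners in Table 7 plus VHL itself account for the 11 nodes of the network.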