3.1. AF2-scores represent residue flexibility
AF2 provides the per-residue pLDDT (predicted local distance difference test) scores for each residue of the final model. This score is in the range of [0,100] and represents the confidence of the predicted structure compared to the “true” (ground truth) structure. We used the AF2-scores, as a reversed normalization of the pLDDT scores (see Methods section), and found that the AF2-scores are highly correlated with the residue flexibility. In reality, even the “true” X-ray crystallographic structure represents an ensemble of protein configurations embedded in the crystal lattices. The flexibilities of the atoms are usually recorded as the temperature factors (or B-factors) in the PDB files. From MD simulations, the residue flexibility can be calculated as the root-mean-square fluctuation (RMSF, see Methods).
Figure 2 compares various measures related to residue flexibility for the AF2 models built from the sequences of four proteins. The first protein (Fig. 2a, 133 AA) is the apo-form of the full length lanmodulin (LanM) protein. LanM is an interesting protein that could shed light on rare earth element sequestration48–50. The LanM protein solved by NMR binds to three yttrium ions and has its N-terminus (residues M1 to A22) cleaved51 (PDB ID 6MI5). For the apo LanM, MD simulation shows high-flexibility of the N-terminal tail, consistent with the AF2-scores. The RMSF of all residues is highly correlated with the AF2-scores (Pearson’s correlation coefficient, or PCC = 0.843, p = 0). However, the disorder prediction by IUPRED2 interprets that the N-terminus is in a well-ordered state (the median disorder content of 0.07), contradicting both the AF2 score and the RMSF calculated from MD.
The second protein studied (300 AA) is a dehalogenase recently sequenced from the bacterium Delftia acidovorans strain D4B52. Because D. acidovorans had been cultured in presence of perfluorooctanoic acid (PFOA)52, this dehalogenase might be involved in defluorination of PFOA (or other fluorinated compounds). For this model, the RMSF from MD correlates well with the AF2-score for the dehalogenase (PCC = 0.941, p = 0), as shown in Fig. 2b. Moderate correlation is observed between RMSF and the IUPRED2 scores (PCC = 0.183, p = 0).
The third protein is part of the human PAS-A-domain containing serine/threonine kinase (1323 AA, Q96RG2 in the AF2 database). Here the PAS-A domain sequence (M130-R237, M130 is mutated from the original P130) is used to build the PAS-A domain model. The PAS-A domain is speculated to regulate the kinase activity by sensing environmental stimuli. In general, PAS domains are found in all three domains of life and have a well-conserved tertiary structure, albeit with diverse sequences53. It is shown in Fig. 2c that the high flexibility of the central region of the PAS-A domain (residues 45 to 70) revealed by the RMSF profile is also reproduced by the high AF2-scores (low pLDDT scores). However, the IUPRED2 score does not correlate with the RMSF for this protein.
The fourth protein is an antifreeze protein (AFP, Fig. 2d, 66 AA), which also has a high-resolution X-ray crystallographic structure (PDB entry 1HG754, resolution 1.15 Å), such that the experimental B-factors are available for comparisons. In this example, the AF2 score has a near-perfect correlation with the RMSF (PCC = 0.972, p = 0). However, the IUPRED score does not show positive correlation with the RMSF. The crystal lattice effect in the X-ray structure may affect the B-factor profile (square-root used, see Eq. 2), but it also shows good correlation with the RMSF (PCC = 0.578, p = 0).
The data for all four models in Fig. 2 indicate that the AF2 scores can be used to predict the residue flexibilities, as measured by the RMSF from MD simulations. However, for these proteins, the disorder predictors (such as IUPRED2) for these proteins do not seem to predict residue flexibility. It had been shown that combining both the flexibility (B-factor) and the disorder contents the protein sequences can be classified into four different categories: low-B-factor ordered, high-B-factor ordered, short-disordered and long-disordered regions55. This also explains why the IUPRED2 scores and the RMSF values do not correlate well. The above results indicate that in addition to solving 3D structures from amino acid sequences, AF2 accurately predicts residue flexibilities from the pLDDT scores (or AF2-scores). It should be pointed out that all protein sequences in Fig. 2 have relatively large MSA depths (> 1000)一here, the MSA depth is defined as the number of aligned or partially aligned sequences from the BFD (see Table 1 for the MSA depth of all models used in the present work).
3.2. PAE from AF2 is associated with protein dynamics
The predicted aligned errors (PAE) derived from the predicted template modeling (pTM) scores clearly show that the residues within the same domain exhibit lower PAEs than the inter-domain residues. The AF2 model that serves as the tutorial for the PAE is the human GNE protein, a two-domain, bifunctional enzyme playing a key role in sialic acid biosynthesis56. Using this model structure, an MD simulation was performed and the distance variations (DVs) among the Cɑ atoms of residues were computed to compare with the PAE map.
For multi-domain systems, all-atom structural superposition based on the minimization of the overall RMSD57 may be inappropriate. For these systems, a principal component analysis can be used to examine the relationship between the domains58. The AF2 profile of the two domain GNE protein (Fig. 3a) shows that the linker (residue 381 to 401) between the two domains has high AF2-scores, together with both the C- and N-termini, indicating high flexibility of these regions. Instead of all-atom structural superposition, we used a domain-specific approach: first, only the residues of domain 1 (1 to 380) were superimposed and the RMSF values for domain 1 were acquired based on this superposition; then only the residues of domain 2 (402 to 722) are superimposed and the RMSF values for domain 2 calculated. However, the RMSF values of the linker region (residues 381 to 401) were averaged from both superpositions. This domain-specific superposition approach yielded RMSF of the whole protein highly consistent with the AF2-scores (Fig. 3a). This analysis is in line with our hypothesis that AF2 correctly predicts the residue flexibility via the pLDDT or AF2-scores (Fig. 2). We also calculated the RMSF values using an all-atom superposition approach as those for the one-domain proteins (Fig. 2), however, this approach cannot correctly address the flexibility, especially that of the linker between the domains (Figure S2 in the SI). The RMSF profile of the LanM protein shown in Fig. 2a is obtained from an all-atom superposition. Similar to the domain-specific approach used for GNE, because LanM has a long disordered N-terminus (residues 1–22), if we average the RMSFs from superposition with or without the N-terminus (residues 23 to 133), the AF2-scores have a better correlation to the RMSF plot, as shown in Figure S3 in the SI.
MD simulation studies often use the RMSD (root mean square deviation) profile with respect to the first frame of the trajectory, to determine whether the system has equilibrated. However, in the two-domain system, we observed that the principal movement (PC1 from PCA) corresponds to the anti-correlated movement of the two domains (Fig. 3b and Supplementary movie). With the large amplitude interdomain movement, the RMSD profile of this protein does not converge to a plateau. This has been observed previously for the MerR homodimer system regarding opening-and-closing dynamics59. We also monitored the interdomain center-of-mass (COM) distances from the MD trajectory, observing large amplitude fluctuations during the 100 ns MD with the interdomain COM varying from 47.5 to 60 Å (Figure S4 in the SI). Interestingly, the RMSD profiles monitored using the two extreme interdomain COM configurations as the reference states exhibited mirror symmetry owing to the anticorrelation of the RMSD profiles. This hidden symmetry from MD simulations of vibrating systems had also been observed in both MerR and CurR homodimer proteins59.
We calculated the distance variations (DV) among the Cɑ atoms of all residues. The DV map and PAE map from AF2 (Fig. 3c) show highly consistent patterns: the distance variations (or predicted errors) of residues within the same domain are relatively small; whereas the variations/errors of residues from different domains are relatively large. For this two domain protein, the maximal PAE reported from AF2 is 31.8 Å, and the maximal calculated DV is 18.0 Å. This may be due to that the DV is estimated as the IQR, i.e., the Cɑ-Cɑ distance at the 75% quantile subtract that at the 25% quantile, which may be roughly half of difference between maximal and minimal Cɑ-Cɑ distance. Also the DV calculation is a 1D simplification of the real PAE (Fig. 1), which may also give errors to the estimation. However, the consistent trends between PAE and DV maps (PCC = 0.732, p = 0, Fig. 3c) indicate that PAE originates from protein dynamics, and that the Evoformer neural network of AF2 decodes this dynamics information (encoded in the protein sequences) through multisequence alignment.
The PAE and DV heatmaps for the models in Fig. 2 (see Figure S5) show statistically significant correlations similar to Fig. 3c. These results substantiate the usefulness of the PAE’s predicted by AF2 for capturing protein dynamics.
3.3. Large proteins
It remains a challenge for AF2 to model extremely large proteins, such as the human Titin protein (34,350 AAs) which include a long IDPR of over 2100 AAs60. The AlphaFold database of the human structurome does contain 3D models for smaller fragmental (1400 AA) of the Titin 4. Other proteins containing residues as large as 2,180 AA’s (with no structural homologues) have also been reported with the TM-score of 0.963. Here, we have also modeled two large proteins (> 1000 amino acids), including one with considerable disordered regions.
Figure 2c represents the result for the PAS-A domain protein, which is only the regulatory part of the PAS-A domain containing kinase (PAS-kinase) from H. sapiens53. We modeled the structure of the full length PAS-kinase, which contains 1,323 AAs. This structural model has been also modeled by the AF2 team4 (AF2 entry: Q96RG2) and can be obtained from the AF2 database at (https://www.alphafold.ebi.ac.uk/entry/Q96RG2). In the present work, both the AF2-scores and the PAE map (Fig. 4a) indicate the PAS-kinase model indicates two structural domains: the PAS-domain that comprises both the PAS-A domain (residue 130 to 237, Fig. 2c) and PAS-B domain (residue 238 to 401) as well as the kinase-domain (residues 892 to 1269). However, the other regions of the full PAS-kinase are mostly disordered (see the PAS-kinase structure in Figure S1). Both the AF2 and IUPRED2 scores correlate well with the RMSF calculated from the MD simulation (Fig. 5a). The PAE map also indicates the existence of two structure regions (or domains): the PAS-domain and the kinase domain, which is also supported by the DV map (Fig. 4c). Moreover, the interdomain regions in the DV map have relatively small distance variations, which is consistent with the PAE map and reflects the interactions between the two domains.
We also analyzed the ice nucleation protein inaZ from Pseudomonas syringae (UniProt code P06620). This 1,200 amino acid protein enables the microbial organism to facilitate the crystallization of supercooled water61. The ice nucleating properties of P. syringae are key for their biological function62,63, and confer them a role in cloud glaciation and precipitation64,65, as well as in artificial snow making66. Fragments of the inaZ protein had been modeled67 but the structure of the full-length protein has never been predicted. The AF2 structure (Figure S1 in the SI) indicates that the inaZ has a N-terminal domain in ɑ/β fold (residues M1 to A110) and a ca. 30 nm-long domain constituted by antiparallel β-strands (residues Q171 to K1189), followed by C-terminal tail in coil state (residues P1180 to K1200).
For the inaZ protein, the AF2-scores are also strongly correlated with the RMSF from the MD simulation (Fig. 4b). The PAE map from AF2 and DV map from MD (Fig. 4d) both indicate the existence of two separated segments in the inaZ protein. The AF2 profile (Fig. 4c) shows repeating peaks from the β-strands, which is also reflected in the RMSF profile. Although the magnitudes of these peaks are different, the peak positions are precisely consistent. This consistency is also shown in other systems (Figs. 2 and 3). For the other large protein PAS-kinase (Fig. 4a), overall correlation between AF2 and RMSF profiles is observed, but not at the finger-print-like accuracy of inaZ, which may be owing to the IDPRs in the PAS-kinase (Figure S1 in the SI). Similarly, the PAE and DV maps are also consistent, but the PAE from AF2 modeling generally propose larger error ranges than the DV from MD. Not only because DV can be regarded as a 1D simplification as PAE (see Methods), this may also be derived from the PAE evaluation, which is based on the calculation of the predicted template modeling scores of the predicted residues compared to the imaginary “true” models3.
The modular protein-protein interaction network (PIN) in a living cell suggests that the protein functions are dependent on their interactions68,69. The AlphaFold-Multimer8 has been incorporated into AF2 to model the protein multimers一both homomers and heteromers一that is applicable to analyze the interactions and dynamics of the PPIs, at least in silico. RoseTTAFold7 was also applied with a similar approach to model cellular core complexes involved in key cellular functions such as transcription, translation and DNA repair9. The multimer models from AF2 or/and RoseTTAFold are extremely useful for understanding the PINs and modular protein functions. Known PINs are mainly based on curations of the experimental results such as those from the yeast two-hybrid experiments70. These networks are binary, i.e., the strength, or affinity, of the two interacting proteins are unknown71. The multimer models also provide the opportunity to simulate the protein-protein interaction strengths. Here, we modeled and simulated both a heterodimer and a homodimer to test the trends we observed in the monomers, as shown in Fig. 5.
In the heterodimer model, we used the sequences of the PAS-A domain and kinase domain as two independent entries, and applied AlphaFold-Multimer to construct the dimer structure. In this model, consistent with the monomer models, the AF2-scores correlate well with the RMSF from MD, indicating that the AlphaFold-Multimer also captures the residue flexibility (Fig. 5a). In addition, both the PAE and DV heatmaps (Fig. 5c) show interactions between the PAS-A and kinase “proteins”, consistent with the full PAS-kinase model. Therefore, beyond the multimer structures, AF2 can also predict the dynamics characteristics associated with the protein-protein interactions.
We used the sequence of a probable MerR family transcriptional regulatory protein from Mycobacterium tuberculosis (UniProt ID O53384). The monomer of this protein is available at the AF2 database at https://alphafold.ebi.ac.uk/entry/O53384. Using AlphaFold-multimer, the homodimer structure of this protein was constructed. Again, the AF2-score profile is consistent with the RMSF from MD (Fig. 5b), and the strong interactions between the two chains of this homodimer are shown in both the PAE and DV heatmaps (Fig. 5d), agreeing well with the known structures and dynamics of the Hg(II)-dependent MerR homodimer59. Moreover, the opening-and-closing dynamics of this homodimer was shown as the largest amplitude movement (PC1 from the principal component analysis of the MD trajectory), consistent with the previous simulations of the MerR family homodimers59.
3.5. Intrinsically disordered and randomized proteins
Intrinsically disordered proteins (IDPs) or protein regions (IDPRs) are abundant in all organisms72–74. Proteins with structures deposited in the protein data bank75, even proteins with high-resolution X-ray structures, also contain significant disorder contents in their sequences60,76. It has been shown that the pLDDT scores provided by AF2 can also be applied to detect disorder19. For example, pLDDT scores lower than 50 are often indications of disorder in a protein25,31. The human structruome constructed by AF2 covers 58% of residues with a confident prediction (pLDDT > 70)4, indicating the prevalence of IDPs and IDPRs in proteomes77.
We examined a fully disordered protein, NVJP-1 from the marine sandworm Nereis virens78. For this protein, no MSA hit has been found by AF2. The NVJP-1 protein is fully disordered, reflected by the IUPRED2 scores and the pLDDT scores (Fig. 6a). For clarity, the pLDDT scores are divided by 100 and are not normalized. Consistent with the IUPRED2 trend, all residues in NVJP-1 have a median pLDDT score of 42.8, and an IQR of 7.3, with 334 out of 387 residues demonstrating pLDDT scores lower than 50.0, suggesting disorder for these residues31. The RMSF profile of the NVJP-1 protein does not correlate with the AF2 (or pLDDT) scores, however, it shows a moderate correlation with the IUPRED scores (Fig. 6a). In addition, as indicated in the PAE and DV maps (Fig. 6c), all-atom superposition for the RMSF calculation is insufficient to estimate the flexibility of the residues due to large distance variations among the residues (also see Fig. 3A and Figure S2).
We also compared the PAE and DV maps of NVJP-1 (Fig. 6c), which exhibited significant similarity (PCC = 0.529, p = 0). Unlike other globular proteins (Fig. 3c and Figure S1 in the SI), the PAE or DV maps indicated that all residues in the protein have considerably high PAEs and DVs to other residues, even to their closely adjacent residues. The definition of “disordered” is as ambiguous as the definition of “ordered”,79 given that rapid configurational dynamics in any protein continually occur in the cells80. Moreover, many IDPs may acquire folded structures upon interaction with a variety of partners, particularly, in the crowded cellular environment81. Here, we show that the AF2 models can be used to qualify the states of intrinsic disorder in proteins: large PAEs (and DVs) among adjacent residues serve as a signature of disorder.
We randomly constructed the amino acid sequence of a protein using the methods described previously60. For the randomized protein, AF2 did not find any MSA hits, similar to NVJP-1. However, unlike the NVJP-1 model that did not show any folded elements in its structure, the randomized protein contained folded regions (Figure S1), indicating that fully disordered IDPs such as NVJP-1 do not occur by chance. This is in line with the previous results that randomized proteins have roughly half folded and half unfolded regions based on the disorder content calculations60. For both NVJP-1 and the randomized protein (Fig. 6a/b), the AF2-scores exhibit no or little correlation to the RMSF, indicating that AF2 cannot extract dynamics without co-evolutionary information from MSA (Table 1). As shown in Fig. 6d, the PAE/DV maps (PCC = 0.482, p = 0) of the random protein are featureless, and resemble those of NVJP-1. A recent study starts with the random “hallucination” sequences that also yields featureless 2D contact maps15. However, the contact maps can be optimized by interactions of Markov chain Monte Carlo simulations, and the optimized contact maps corresponded to highly featured protein folds validated by X-ray crystallography15. Therefore, the neural network used in AF2 and RoseTTAFold contains sufficiently rich structure and dynamics information for useful protein engineering.
We show here that the structural models predicted by AlphaFold2 not only produce the atomic coordinates (or 3D fold) of the proteins, but also give important information regarding residue flexibility associated with the protein dynamics, which are comparable to the results from molecular dynamics simulations. For globular proteins and protein complexes, the AF2-scores derived from the pLDDT scores (i.e., the confidence scores evaluated by AF2) are highly consistent with the RMSF profiles from MD. For these protein models, the PAE maps predicted by AF2 also showed consistent trends as the distance variation maps from MD. Anfinsen's rule illustrates that the protein structure is determined by its primary sequence. This rule also indicates that the protein structure determines its function, which is derived from protein dynamics. Therefore, protein dynamics information has been encoded in the protein primary sequences to determine protein function. Our results suggest that AF2 is capable of decoding dynamics characteristics which are crucial to understand protein interactions and functions. The low pLDDT scores in the AF2 models might not result from “low confidence” but be attributed to high residue flexibility, which contributes to protein function.