Proteins present a unique problem in that they are too large to be used in conventional parameter optimization operations. This operation uses reference data for a large number of small systems, so the properties for each datum can be calculated rapidly, allowing a complete parameter optimization to be carried out in only one or two days using a 3GHz computer. However, calculation of even one datum for an entire protein system would require more computational effort than that for all the other systems combined, so simply adding proteins to the optimization was impractical.
An alternative, using a single proxy reference datum to represent each particular type of clash, eliminated the need for any proteins to be present in the optimization.
A consequence of this use of a single proxy datum to define three new parameters was that these parameters would not be uniquely defined, an ambiguity that was resolved by the presence of the large amount of reference data on small systems. Their presence in the optimization automatically had an effect on the new parameters, allowing them to become uniquely defined.
Optimizing parameters for use in modeling proteins has hitherto been impractical because of the very large computational effort that would be required. However, in the approach described here, by constructing proxy reference data that represent individual interactions of the type that occur in proteins but do not occur in small chemical systems, a considerable reduction in time has been achieved. When these proxy reference data representing longrange interactions are used together with enough reference data on small systems to allow shortrange chemical properties to be properly represented, optimization of parameters for use in modeling proteins requires approximately the same time as that for a conventional optimization.
A comparison was made of individual clashes in original PDB structures and their equivalent in PM6ORG optimized structures. As expected, all the results from geometry optimization were consistent with the assumption that the largest changes would involve the most flexible coordinates. This can be illustrated using the potassium channel protein 1JMV, the protein whose experimental geometry had the largest clashscore. Within this system, the largest clash involved Ala31 and Leu35 of chain A, as shown in Fig. 2, together with the optimized PM6ORG geometry. An inspection of the leucine sidechain showed that the large increase in O – Hγ distance, 0.78 Å, was being caused mainly by a rotation of the CCαCβCγ torsion angle.
A deficiency in two semiempirical methods, PM7 and PM6D3H4, which had allowed pairs of atoms in proteins to approach closer than expected, has been corrected by the addition of a simple empirical function that represented the vdW repulsion, a term that had not been important when only small systems were being modeled. After this correction was made, the clashscores decreased to less than 20% of their values when PM7 and PM6D3H4 were used, and to 60% of the original PDB clashscore.
Previous semiempirical methods modeled chemical systems by using two types of interatomic interaction: those that represented strong covalent interactions, such as chemical bonds, and the much weaker non-covalent stabilization interaction caused by hydrogen bonds, electrostatics, and dispersion effects. The addition of the vdW repulsion interaction represents a new and third type of term in NDDO semiempirical methods. Although this term is the smallest and weakest of the three interactions, its absence from the computational model was the direct cause of large geometric errors in the predicted structures of proteins.
Optimizing semiempirical models to more accurately reproduce properties of proteins was achieved by the development of reference data that acted as proxies for individual diatomic interactions in proteins. This strategy is highly flexible, and would likely be applicable to addressing other errors found while modeling biochemical systems.
Comparison of methods for modeling small molecules
Compared with PM7, PM6-ORG represents a reduction of 8% in AUE in predicted heats of formation, a 21% reduction in AUE in dipole moments, a 10% reduction in AUE in bond-lengths, and a 12% reduction in AUE in bondangles. One metric, the ionization potential, showed an increase of 17%.
As noted earlier, because the D3H4 correction was added to the un-modified PM6, the heats of formation predicted by PM6-D3H4 were shifted by more than 10 kcal mol− 1, on average, and would have rendered any comparison of AUE meaningless. This particular fault was corrected in the current parameterization.
Comparison of methods for modeling intermolecular interactions
The most dramatic change was in the repulsive contacts, where the range of errors decreased considerably, from about 30 kcal mol− 1 for PM6D3H4 and PM7, to about 12 kcal mol− 1 for PM6ORG. Because the range of errors was significantly reduced, the incidence of large errors in non-covalent interactions would also be reduced. This would be expected to result in a corresponding improvement in the accuracy of the prediction of protein - ligand interaction energies, particularly those involving unusual combinations of elements.
Comparison of methods for modeling proteins
Average errors for clashscores, backbone deviation, and percent volume change are shown in Table 8.
The largest reduction in errors occurred in the clashscores, which implies that the computational model is more realistic and should help alleviate some of the doubts regarding the usefulness of these methods for modeling protein behavior. Both the RMSD and the volume change are intermediate between PM6-D3H4 and PM7, although the RMSD for PM6-ORG was significantly worse than that for PM6-D3H4. This deterioration could be attributed to the possible presence of large distortions in the Xray structures of the sodium and potassium channel proteins, as implied by their reported unusually large PDB clashscores. When these two proteins were removed from the RMSD calculation, the PM6-OPT value dropped to 0.91 Å.
Table 8
Average errors in Proteins
|
Quantity
|
PDB
|
PM6ORG
|
PM6-D3H4
|
PM7
|
Clashscore
|
7.48
|
4.72
|
32.41
|
28.76
|
RMS deviation (Å)
|
|
0.96
|
0.83
|
0.97
|
Volume (% change)
|
|
-5.80
|
-5.13
|
-7.60
|
Hydrogen bond lengths
An anomaly was found in the PM6-D3H4 distribution of hydrogen-bond lengths involving two oxygen atoms in the region between 1.7 and 2.1 Å, in that the number of hydrogen bonds predicted using PM6-D3H4 was significantly larger than that predicted using the original PDB geometry, as shown in Fig. 3. Examination of the types of hydrogen bonds showed that there was an increased propensity for water molecules to form hydrogen-bonds, for example, in chymotrypsin, H2O306 formed two hydrogen bonds, both of length 2.01 Å, with the peptide oxygen of Ala111. A survey was run to determine the frequency of a water molecule forming two hydrogen bonds with the same oxygen atom. No examples were found when the PDB or PM6ORG geometry was used, but 35 examples were found when PM6-D3H4 was used. These are listed in the Supplementary Material.
Proteins
Although the objective of the new method is to improve the ability to model enzyme mechanisms and other dynamic phenomena that occur in proteins, any investigation into such phenomena would, because of their great complexity, require considerable effort. Therefore, for the purposes of this report, only static properties were examined.
Individual proteins
Bacteriorhodopsin is a good example of a protein that contains a large amount of α-helix. Its structure consists of a stack of seven α-helices surrounding a protonated Schiff base formed by Nζ on Lys216 and a retinal group. This Schiff base forms[29] strong hydrogen bonds with Asp85 and H2O402, and in turn the H2O402 moiety forms strong hydrogen bonds with Asp85 and Asp212.
All these features, i.e., the helices, the Schiff base and the compact hydrogen bonded structure, were reproduced by PM6-ORG. In both the original PDB structure and in the optimized PM6ORG structure the charge distribution in the Schiff base was delocalized over the extended conjugated π-system of the retinal. This prediction is consistent with a recent report[30] on related rhodopsins, where, as the authors of that paper noted, it is in variance with the current consensus opinion that the positive charge would be localized at the site of the Schiff base.
Green fluorescent protein (GFP) provides a good example of the other main secondary structure in proteins: the anti-parallel βsheet. It consists of a single protein chain folded into 11 β-strands that form a barrel. Inside the barrel is a chromophore of the type phydroxybenzylidene-imidazolidone composed from residues Ser65, Tyr66, and Gly67. This chromophore is held in place by a strong electrostatic interaction between the ionized guanidine group on Arg96 and the O2 on the imidazole ring. Like the chromophore in bacteriorhodopsin, the chromophore in GFP has[31] an extended conjugated π-system. The positive charge in the Schiff base was stabilized by delocalization, which suggested the possibility of a similar stabilization in GFP. Several attempts were made to transfer a proton from the guanidinium group of Arg96 to the oxygen on the imidazole ring, in the hope that the resulting cationic charge on the chromophore would be stabilized in a similar manner. These all failed. This negative result could be regarded as confirmation that the original 1994 description[31] of the neutral chromophore was correct.
The secondary structures of both bacteriorhodopsin and GFP are held together by a large number of hydrogen bonds. As a result, both of these structures are relatively rigid.
In contrast to these two proteins, the structure of barnase involves two α-helices and a multistrand antiparallel βsheet, as well as a β-hairpin bend, and several intrinsically disordered regions, and, as such, provided a useful test for the ability of PM6-ORG to reproduce the experimentallyobserved geometry. Given that the PM6-ORG RMSD of the barnase backbone, 0.86 Å (Table 6), is slightly lower than the average, 0.95 Å (Table 8), the inference can be made that the predictive power of PM6-ORG to reproduce the geometry of proteins is not significantly impaired by disorder in the protein geometry.
The smallest protein examined, crambin, with only 46 residues, contains two short α-helices; the rest of the structure is disordered. Because of these features crambin has been used extensively as a test case for experimental work and for computational modeling, and for the same reasons was chosen as a test case in the early stages of this work. Only one caveat is made regarding its usefulness: because RMSD tends to increase with system size, and because crambin was the smallest protein, the RMSD, at 0.80 Å, is, to a degree, artificially small. Nevertheless, there is no indication that this would compromise the significance of the results.
Two proteases, chymotrypsin and 3CLpro were selected. To catalyze the peptide bond hydrolysis, chymotrypsin uses a catalytic triad Ser195, His57, and Asp102, and 3CLpro uses the dyad His41 and Cys145. In both enzymes the binding site is adjacent to the reaction site. Simulation of the catalytic mechanism of chymotrypsin has already been modeled[1], and, as mentioned earlier, is not controversial.
Of topical interest is the binding site of 3CLpro. Examination of the optimized PM6-ORG geometry of this site indicated that it was reproduced with useful accuracy, and tests to establish that it had the capability of binding known ligands are currently underway; preliminary results suggest that their results are encouraging.
Proteins containing other elements
Sodium
In a biochemical environment, sodium atoms are invariably ionic. That is, they do not form covalent bonds. Thus, modeling their behavior presents problems; but, a simple test would be to model the behavior of sodium ions in a channel. In such an environment the ions could be expected to be able to migrate along the channel in response to an electrophoretic force: that is, there should be little resistance to motion of the ions along the channel in response to an electric field.
In the X-ray structure of the prokaryotic sodium channel protein 4CBC, four chains, A, B, C, and D, form the channel, the sides of which consist of oxygen atoms from the residues Thr176, Leu177, Glu178, and Ser179 of each chain. One oxygen atom from each residue contributes to the formation of a square, and each of these four squares is perpendicular to the axis of the channel. Three sodium ions are positioned on the axis, with two of them separated by 2.2 Å, as shown in Fig. 4.
Geometry optimization resulted in the structure of the channel being conserved, albeit the width of the channel decreased about 12%, and the sodium ions migrated along the axis of the channel. Migration of this type would normally be regarded as an error, but in this system it is evidence that the computational model predicts that there would be little resisitance to motion of sodium ions along the axis.
Magnesium
In biochemical systems, magnesium atoms occur in two important environments. One is at the center of a porphyrin ring, as in chlorophyll, the other is at the center of a set of oxygen atoms, normally six arranged in an approximately octahedral coordination.
In the PDB entry for peridinin-chlorophyll, entry 2X20, the magnesium atom forms bonds that range in length from 2.01 to 2.07 Å with the four nitrogen atoms of the porphyrin ring, and a fifth, non-covalent bond, of length 2.21 Å, with the oxygen atom of a water molecule located above the ring; for these quantities, PM6-ORG predicts the Mg-N bonds to be 2.06–2.15 Å long, and the Mg-O bond to have a length of 1.99 Å.
In the PDB structure of magnesium–loaded ALG-2, a magnesium atom is in an approximately octahedral environment of oxygen atoms, with five of the Mg-O distances ranging from 2.005–2.284 Å, and one at 2.858 Å. PM6-ORG predicts a similar environment, with five of the Mg-O distances ranging from 1.920–2.218 Å, and one at 3.459 Å.
Potassium
Potassium, like sodium, is invariably ionized, forming no covalent bonds, and therefore the only property of interest is its ability to migrate through a protein. In the X-ray structure of the KcsA potassium channel protein, PDB ID: 1JVM, the potassium ions were replaced by rubidium ions[32]. For the purpose of this work, these ions were replaced by the original, slightly smaller, potassium ions. The tunnel in the filter contains three K+ ions, a tetrabutylammonium cation, and one water molecule. All these species were retained in the modeling.
In the X-ray structure of 1JVM, two of the potassium ions are located at the center of a slightly distorted square antiprism of oxygen atoms, and the third ion is at the center of a slightly distorted tetragonal prism. After the geometry was optimized these structures were still present, but were significantly distorted, with the K – O distances ranging from 2.55 to 3.88 Å, whereas, in the X-ray structure, they ranged from 2.90–3.36 Å.
Calcium
The largest error in the predicted geometry of 1UOW involved one of the carboxylate oxygen atoms on the highly-conserved[33] Asp309 bonding to calcium. In the PM6-ORG geometry, this distance was 2.06 Å, while in the PDB geometry the smallest Ca – O separation was 2.41 Å. There is evidence of a significantly larger covalent interaction in the PM6ORG geometry, in that the atomic partial charge on Ca bound to Asp309 was + 1.24 in the PDB geometry, whereas in the PM6-ORG geometry this decreased to + 1.05.
Iron
Two common forms involving a heme ring system were modeled. In one form, found in cytochrome P450, the iron atom is at the center of a porphyrin ring system, and covalently bonds to Cys332 at Sγ, and non-covalently binds to O3 of Gol402. A comparison of the X-ray and calculated environment of the iron atom is shown in Fig. 5. This system, PDB ID: 7TTP, with 345 residues and 217 water molecules, was one of the larger systems studied.
The other form, found in human hemoglobin is tetrameric. The X-ray structure of this protein, PDB ID: 5WOH, contains 566 residues and is very large. For computational convenience, only the first subunit consisting of 137 residues and 197 water molecules was modeled; in this subunit, iron forms a covalent bond with His87 at Nε2 and a non-covalent bond with the oxygen of H2O312, in addition to the standard porphyrin ring system. The environment of the iron atom is similar to that in P450, except that in P450 there is a Fe-S covalent bond, in hemoglobin there is a Fe-N covalent bond. PM6-ORG predicts this bond to be1.993 Å long, versus the PDB value of 2.135 Å.
Cobalt
In PDB ID: 2V3N, the cobalt atom is octahedrally coordinated to the four nitrogen atoms of a corrin ring, a nitrogen atom of a dimethylbenzimidazole group, and a carbon atom of a cyano group. A comparison of its X-ray and calculated environment is shown in Fig. 6.
Zinc:
In the zinc finger, the average of the six Zn-N distances was predicted to be 2.054 Å, identical to that in the Xray geometry, and the average Zn-S distance was 2.311 versus the X-ray value of 2.298 Å.
The various bond-lengths in zinc endoprotease are shown in Table 9.
Table 9
Bond lengths from Zinc in Zinc endoprotease (Å)
|
Bond
|
PM6-ORG
|
PDB
|
Zn-O (H2O202)
|
1.998
|
1.933
|
Zn-O (Asp93)
|
1.894
|
1.948
|
Zn-N (His87
|
1.984
|
2.014
|
Zn-N (His83)
|
1.975
|
2.006
|
In biochemical systems, zinc invariably occurs as ZnII, and almost always adopts a tetrahedral coordination, bonding to a combination of nitrogen, oxygen, and sulfur atoms, and therefore modeling its chemistry is relatively uncomplicated.
Selenium
In proteins, selenium occurs most often as selenomethionine, the selenium analogue of methionine, where it forms two covalent bonds with carbon atoms. The average of the Se-C bond lengths in plasmodium falciparum, PDB ID: 1D5C, and adenylyltransferase, PDB ID: 1O6B, was 2.000 Å, slightly longer than the average of the X-ray structures of 1.921 Å. The CSe-C angles averaged 97.4⁰ compared to the X-ray value of 100.0⁰.
Comparison of physics and computational chemistry results
There are fundamental differences in the ways that experimental physics methods and computational chemistry model proteins.
In the physics approach, information from experimental samples is used to generate threedimensional models of proteins and any associated small molecules. The general objective of these methods is to produce structures that are as similar as possible to those that would exist in vivo so that they can be a source of information for use in biochemistry. The Protein Data Bank contains a large curated collection of experimentallydetermined structures, together with validation reports on their quality. These usually include the results of a Molprobity analysis of the structure; analyses that have high clashscores normally indicate lowerquality structures.
Experimentally-determined protein structures provide an invaluable insight into the biochemical processes that occur in enzymes. For example, examination of the structure of chymotrypsin allowed the complicated catalytic mechanism used in the hydrolysis of a peptide bond to be elucidated, including the discovery of the significance of structures such as the catalytic triad and the oxyanion hole and their role in reducing the activation barrier.
On the other hand, the objective of computational chemistry methods is to model physical and chemical phenomena that occur in protein chemistry. Two of the potentially most important applications involve modeling the binding of ligands to various sites in enzymes, and modeling the mechanisms used by enzymes to catalyze reactions.
Another difference is that computational chemistry is dominated by energy considerations: calculation of the heats of formation and their derivatives with respect to geometry is essential for locating stable intermediates and transition states. Calculated heats of formation of reactants, transition states, intermediates, and products for a reaction mechanism can not only be used in understanding the mechanism, they provide considerably more insight than can be obtained when only the static experimental geometries are used. For example, a quantitative estimate can be made[1] of the relative importance of the oxyanion hole and the catalytic triad in lowering the activation barrier in chymotrypsin. Another example, also in chymotrypsin, concerns the role of the catalytic triad residue Asp102. In the literature, the general consensus has been that at all stages in the mechanism this particular residue is ionized, yet the suggestion has also been made[34] that at one point in the catalytic cycle it might be neutral. When the mechanism was modeled, the results of the computational method agreed with the consensus, predicting that in the lowest energy path Asp102 was always ionized, and that neutralizing it at any point inevitably resulted in an increase in energy.
Correcting the large clashscore error in PM7 and PM6D3H4 required only the addition of a very simple function that represented the missing vdW repulsion between wellseparated atoms. As mentioned earlier, from a practical perspective the precise form of the repulsion function was unimportant. A singlesided Gaussian was used in this work, but any similar alternative, such as the hyperbolic tangent function, would also have been perfectly acceptable. Adding this correction resulted in a large reduction in clashscores without introducing any significant errors. The significance of this improvement is that existing semiempirical chemistry models lacked a vdW repulsion term, and consequently the predicted protein geometries had large clashscores, but, when the repulsion term was added, the fault was corrected and the clashscores decreased dramatically.
Importance of appropriate weightings
The first step in converting a reference datum into a form suitable for use in optimizing parameters is to render it dimensionless. In this work, data representing energies were assigned weighting factors that depended on the energy range involved, from about 1 up to 100/(kcal mol− 1).
In addition to the weighting factors depending on the types of reference data used in the training set, weighting factors can also be used in altering the focus or objective of the method. In the present work, the aim was to reduce the clashscores and to improve the accuracy of prediction of intermolecular interactions of the type found in ligands non-covalently bound to proteins. The results of using this particular focus can be seen in Table 3 and Table 8. If a different focus were to be desired, for example, to increase the accuracy of intermolecular interactions, then the appropriate weighting factors would need to be changed: i.e., reduce the factors for clashscores or increase the factors for intermolecular interactions or both, then the parameter set reoptimized. To assist in such a parameterization project, all files that were used in generating PM6ORG are made available in the Supplementary Material. Operations of this type are straightforward, and would allow a method tailored to any specific need to be developed.
Accuracy of PM6ORG
The following summary of the features of the new method can be used in deciding its applicability to any specific protein system.
Overall, clashscores improved in going from an average of 32.4 for PM6-D3H4 and 28.5 for PM7 to 4.8 for PM6-ORG.
The average RMS errors in the geometry of protein backbones increased by 14% relative to PM6D3H4, but were still very slightly smaller than those in PM7. This increase was accompanied by an increase of 0.6% in the difference of the volumes of the proteins.
Two proteins optimized using PM6D3H4 had severe faults. In cobalamin, a spurious covalent bond formed between Cys98 and H2O2031, and in zinc endoprotease a spurious bond formed between Cys112 and Arg79. Neither of these faults was present in the PM6ORG optimized structures, nor were any new faults found.
Limitations
Most of the limitations in the applicability of PM6ORG to modeling proteins are caused by the underlying software limitations of the MOPAC program. Because of their size, solving the SCF equations for proteins always requires the MOZYME procedure to be used. In its current form, MOZYME is limited to closedshell systems, so openshell systems of the type encountered in free radical biochemistry and photochemistry, e.g., photosynthetic pathways and other proton pumps, cannot be modeled.
Another class of systems that could cause problems involves heterocycles with transition metal ions in their center, such as corrin with CoIII as in cobalamin, and porphyrin with FeII in P450. In their ground states the geometries of these systems can be modeled using MOZYME, as shown in 2V3N and 7TTP, but modeling electronic phenomena of the type that occur in reaction mechanisms when the oxidation state of a metal atom changes would not be possible.
About 7,000 atoms is the practical upper limit of the size of system that can be modeled. The most timeconsuming step in a modeling study is the initial geometry optimization. For a system of 7,000 atoms and using a 3GHz computer this would require about one to two CPU weeks. This limit can be avoided for proteins in which allosteric behavior is not important. In such systems, all chemical effects on binding sites and active sites caused by distant atoms can be ignored, so that the system could be trimmed down to only include atoms within about 12 Ångstroms of any atom of interest. The resulting system would then be in the 1,000–2,000 atom range and simulations would run much faster. In practice, initial geometry optimizations of trimmed systems required only one or two days and subsequent optimizations would run in just a few hours.
Use of PM6ORG for investigating protein chemistry
Although the addition of the vdW repulsion term to the computational method improved the clashscores for proteins, in order to carry out meaningful simulations other criteria relating to the computational model must also be satisfied. These criteria all involve issues relating to chemical behavior. Resolving some of these might present difficulties, so the following suggestions are provided in the hope that they might prove useful.
Preparing proteins
Before attempting to model protein behavior it is essential that several steps must be carried out. The first, and by far the most important of these, is that the model should be as realistic as possible.
A good starting point for this is to hydrogenate the experimentallygenerated geometry of a protein together with any associated small molecules such as water and, if present, a ligand. Three steps are involved.
First, add hydrogen atoms to neutralize all sites in the protein except for non-covalently bound atoms that would normally be ionized, such as Ca+ 2, Na+, K+ and Cl−. Second, ionize all ionizable residues. Third, add and delete hydrogen atoms to ensure that the starting model structure is correctly ionized.
To simulate the in vivo environment of a protein, the use of implicit solvation is essential. For this, the COSMO model is ideal.
Generating a selfconsistent field for macromolecules is best done using the MOZYME localized molecular orbital approach. MOZYME begins by generating a Lewis structure for the entire system. One result of this process is the generation of a list of all ionized sites, information which is extremely useful when checking for errors in hydrogenation and bonding. The LMO’s are then constructed and used in solving the SCF equations. An incidental beneficial result of using Lewis structures as the starting LMO’s is that a common problem when working with proteins, ensuring that the net charge is correct, is eliminated.
Problematic protein geometries
Experimental geometries that have low clashscores typically do not present any problems, but geometries with high clashscores can present problems with hydrogenation and with solving the SCF equations. The default method in MOPAC for hydrogenating systems relies on the topology of the system, but large clashes can alter the topology to such a degree that the resulting system becomes unrecognizable. By using keywords to selectively edit the topology, a chemically-sensible structure can be generated and hydrogenated. Because the starting LMO’s used by MOZYME depend on the topology, an incorrect topology could cause MOZYME to generate an incorrect SCF. This particular fault can also be avoided by using the same keyword that was used to edit the topology for hydrogenation to edit the topology used in constructing the starting LMO’s.
An example of a nonstandard topology can be seen in PDB ID: 1PY4, a protein of 388 residues. 1PY4 has a reported clashscore of 104 and consists of four chains, A, B, C, and D. Chain D is unusual in that the carboxylate group of Glu16 is in close proximity to the four backbone atoms of Ser20, and Cγ on Glu16 is close enough to Cγ on Lys19, 1.73 Å, to give rise to the incorrect assumption that these atoms are covalently bonded together. The orientation of these residues is shown in Fig. 7. This disorder was indicated by the reported fractional occupation of the atoms in these residues of 0.01.
Five topologic bonding errors were generated by MOPAC and are shown in green. By selectively deleting these connections, the correct topology would be revealed and the system could then be correctly hydrogenated. The same set of deletions that was used in hydrogenation would also be used to correct the topology in preparing the LMO’s for MOZYME. When the geometry was optimized using PM6ORG the calculated clashscore dropped from 104.50 to 1.58. As would be expected from a system with such a high clashscore, the RMSD, 1.564 Å, was unusually large.
Geometry optimization
The next step would be to optimize the geometry of the system and compare the results with the experimental geometry. Known faults in the calculated geometry, such as systematic backbone motion, can safely be ignored; instead, attention should be focused on the local environment of sites of interest. In general, the geometries of reaction and binding sites are quite strongly conserved[35], so any significant distortion from the experimental geometry would be a cause for concern. Whenever that occurs, further simulations should be postponed until the cause of the distortion is found, and, if necessary, a correction made. Two of the most common causes are incorrect ionization of residues and missing water molecules. Identifying incorrectly ionized sites is straightforward – the geometric change caused by the incorrect presence or absence of a charged site is normally obvious. Deciding whether a water molecule is missing is more problematic. In one case[2], the experimental structure contained two copies, “A” and “B,” of the protein being studied. Copy “A” was selected as the starting system. When the optimized geometry was examined, an unusually large distortion was found in the binding site. Examination of the “B” structure showed that there was a water molecule at that site that was missing in the “A” structure. When the missing water molecule was added to the “A” structure, the distortion vanished.
In side-chains of the residues glutamine and asparagine, where incorrect assignment of oxygen and nitrogen can occur, and histidine, where flipping is possible, errors in conformer orientation can usually be detected by the large distortions that appear in the optimized conformer, or by obviously incorrect or missing hydrogen bonds. More exotic errors, such as a Mg2+ ion being mistaken for a water molecule, can also be identified by the relatively large distortions that occur in water clusters.