Definition and Meaning of Objective Variable and Box Correlation Coefficient (bCC)
In this study, we defined bCC as a score used to evaluate local protein structure and used it as an objective variable in machine learning. (Fig. 2a)
Since we defined the high-resolution structure as the correct one, it was necessary to quantify the agreement between the structure to be evaluated and its high-resolution counterpart. Specifically, the electron density map of the correct structure — rather than its coordinates — was used for this quantification. This is because the coordinates of a high-resolution structure are only a model used to describe the electron density map; therefore, the electron density map was considered more appropriate for this purpose.
We selected structures determined at a resolution higher than 1.5 Å as the correct structures. The corresponding correct electron density map had a 2mFo-DFc electron density, which is commonly used in crystallography and free from the bias of the structural coordinates used in electron density calculation . Hereinafter, this electron density is referred to as ρcorrect, obs.
The coordinates to be evaluated were converted to electron density using the atom scattering factors in X-ray crystallography to compare with ρcorrect, obs. The electron density of an atom is distributed depending on the distance from the atom center and a B-factor, according to a Gaussian function .
As we wanted to evaluate the coordinates without considering B-factors, we set the B-factor to be isotropic and fixed at 2.0, which is a sufficiently small value. The electron density of an atom is given by [see Methods]
where r is the distance from the center of the atom, ai and bi are the atomic scattering factors , and Biso is the isotropic temperature (B-factor) fixed at 2.0.
The electron density ρatom, calc (r) was calculated for all atoms in the structure to produce the ρmodel, calc electron density map. Using both the ρcorrect, obs. and ρmodel, calc maps, a cube centered on the center of gravity of the amino acids of interest was extracted. The size and grid of the cubes were arbitrarily determined and were the same for all the residues.
The objective variable is defined as follows with the cubical boxes:
where var is the sample variance and cov is the sample covariance.
Because bCC is a correlation coefficient, it ranges from 0 to 1, and values closer to 1 indicate that the electron density and coordinates are consistent. The bCC value can be described as follows.
First, bCC would have the maximum value at the location where the coordinates of the correct structure best match the electron density ρcorrect, obs. However, the location where the maximum bCC is obtained may vary because the thermal vibrations of the protein could affect its location and electron density noise may also be present (Fig. 2b 1-1, 2-1). Different states of electron density have different possible maxima; therefore, the bCC values have relative implications. The bCC decreases when the evaluated structure deviates further from the correct structure (Fig. 2b 1-2,3,4 and 2-2,3).
By using bCC as an objective variable, we can predict the degree of agreement between the putative high-resolution electron density and the coordinate structure in a resolution-independent manner. Depending on the size of the box, bCC also includes the electron density derived from neighboring amino acids, water, compounds, and noise in the box. The box size used in this study is 12 Å × 12 Å × 12 Å.
Hereafter, the value of bCC calculated from the actual electron density of the correct structure in the training data is referred to as bCCact. and the value of the predicted bCC is referred to as bCCpred.
Data Preparation and Overview of QAEmap
We obtained 22 protein structure data (coordinate and structure factor files) with a resolution higher than 1.5 Å from PDB and set them as the correct structures (Fig. 3a and Supplementary Table 1). Using the coordinates and the structure factor file as a starting point, we created structures containing various errors at various resolutions and the corresponding electron density maps using the crystallographic refinement and homology modeling techniques (see Methods). Here, water and other compounds were removed from the initial coordinate files, and only the atoms belonging to the proteins were retained for simplicity. Next, both the electron density map and coordinates were cut into 12 Å cubes for all the amino acids in the structure (Fig. 3b). The coordinates were further divided by atom species and converted into electron density using the objective variable calculation. These were used as the input data for training as three-dimensional descriptors. As the created structures corresponded to the correct structures in the initial state, bCCact., defined earlier, could be calculated.
QAEmap was trained separately for each of the amino acid species. The amount of input data for each species differed depending on the amino acid, but no adjustments were made, and the data for all 263,689 amino acid residues were used (Supplementary Fig. 1).
Our QAEmap 3D-CNN architecture was based on SqueezeNet  and implemented using TensorFlow (Release ver. 1.6.0) (Fig. 3c). All training procedures and parameters were the same for all amino acid types, the initial hyperparameter values were used, and the learning rate was set to 1e-05. The three-dimensional input data were rotated to all possible orientations at intervals of 90°, i.e., the model was trained with the objective variable for all 24 rotations. The QAEmap model was trained until convergence, which was reached after 40 epochs.
Evaluation of QAEmap on Test Data
The trained QAEmap was evaluated on test datasets prepared from Biliverdin Reductase (PDB1LC0) and SET Domain Methyltransferase (PDB 3F9X, chain D) (Supplementary Fig. 1). QAEmap predicted bCCpred., and the correlations between bCCpred. and bCCact. were observed per amino acid for three different resolutions (Fig. 4a, Supplementary Figs. 2 and 3). The correlation coefficients varied among the amino acids, and the maximum and minimum observed correlation coefficients were 0.865 for Proline and 0.691 for Phenylalanine, respectively. The correlation decreased for amino acids with lower resolutions; for those acids with resolutions under 4.5 Å — such as Glutamine and Tyrosine — it decreased to 0.5 or even lower values.
Subsequently, we compared the correct structure of 3F9X and its simulated low-resolution structure, which was refined with structure factor data truncated at 3.0 Å resolution (Fig. 4b and Supplementary Fig. 4). The two structures were almost identical with a root-mean-square difference (RMSD) of 0.16 Å for all atoms, and the bCCact. values exceeded 0.6 for almost all amino acid residues. When bCCs were predicted using the electron density ρmodel, obs of each resolution as input data, bCCpred. correlated well with bCCact. (for the correct structure, the average difference between bCCpred. and bCCact. was -0.016 and standard deviation was 0.020, and for the simulated low-resolution structure, the average difference and standard deviation were 0.041 and 0.029, respectively). It was shown that the bCC of the correct coordinate structures could be estimated independent of the resolution.
As examples of incorrect structures, we examined two model structures, which were refined 5V2N-templated homology models, against the structure factors for 1.25 Å and 3.0 Å resolution. No residue exceeded the correct structure’s bCCpred., except for the two terminal residues, and the structure with the maximum bCCpred. was the correct structure, as expected.
The model structures had conformational errors at a1-3, and the bCCact. values were as low as 0.2–0.4 (Fig. 4b and Supplementary Fig. 4). Most of bCCpred. values were 0.4 or less, implying that they were precisely predicted and not correlated with the correct electron density. However, the bCCpred. values of some residues were as high as 0.6, which is approximately 0.2 higher than bCCact., and these structures were predicted to be well correlated with the correct structure, despite the main chain being incorrect. (For example, for the 1.25 Å resolution structure of Ile154, bCCact. = 0.395 and bCCpred. = 0.668, and for the 3.0 Å resolution structure of Lys150, bCCact. = 0.315 and bCCpred. = 0.583 (Supplementary Fig. 4).) This problem should be considered and solved as follows.
When predicting bCC for the correct structure in cases where the main chain is incorrect, it is sufficient to indicate that no correlation exists. For residues where the main chain is almost correct or well-correlated to the correct structure, it is necessary to predict the relative bCCs precisely, that is, to distinguish which state is better correlated to the correct structure. In future work, we will proceed to optimize the training data and training of QAEmap further for identifying residues that can be corrected with bCCpred. and improving the prediction accuracy of bCC within the structural correction range.
We also compared bCC with RSCC, which is the correlation between the electron density and its coordinates, over the grids including the residue atoms (Supplementary Fig. 4). RSCC tended to overestimate residues with errors because high B-factors were estimated during refinement, and the electron density tended to appear at places where the atoms were misplaced, particularly at low resolution (Fig. 4c). In principle, this was not the case for bCC, because bCC referred to the electron density of the correct structure. Therefore, the prediction of bCC is expected to solve the model bias problem of X-ray crystallography, and further research is required in this direction.
Evaluation of QAEmap with Actual Experimental Data
An actual low-resolution structure was evaluated using QAEmap. Since test data were obtained by truncating high-resolution data, the signal-to-noise ratio of actual structure factor data would be worse than that of the test data at the same resolution.
The CDK2: Spy1 complex (3.2 Å resolution, 288 residues; ), registered in PDB as 5UQ1, was evaluated, and 2R3F (1.5 Å resolution; ) was used for comparison as a high-resolution structure of CDK2. They were determined using the molecular replacement method with a homologous CDK2 structure as a template and differ in crystal form and conformation; the RMSDs between all corresponding atoms were 3.01 Å.
The bCCpred. of 5UQ1 was predicted by QAEmap, and the bCCact. of 2R3F was calculated; the mean bCC values were 0.592 and 0.593, respectively (Fig. 5a, Supplementary Fig. 5). The individual bCCpred. values were in good agreement with the bCCact. of 2R3F (Fig. 5b). Specifically, the secondary structures of the C-terminal domain and QAEmap predicted that the structure of 5UQ1 is as accurate as that of 2R3F.
When the bCC values were compared locally, the region of amino acid residues 177–179 had bCCact. values of 0.52–0.57 for 2R3F and 0.35–0.39 for 5UQ1. Structural modification mimicking 2R3F and using bCCpred. values as an index improved the bCCpred. of Lys178 and resulted in a more accurate prediction (Fig 5c).
In addition, the bCC values were higher when packed with neighboring molecules than when exposed to solvents (Supplementary Fig. 6), indicating that local differences in structures at different resolutions can be described using bCC.
Application of QAEmap to Compound Bound Structures
Another challenge at low resolution is determining ligand binding and the binding mode. QAEmap can be applied in these instances to evaluate compound binding, as an input box contains the environment around the amino acid of interest. We attempted to assess the binding of compounds using our trained QAEmap. Although atoms belonging to compounds were removed from the current training data, ligand compounds that consisted of carbon, oxygen, nitrogen, and sulfur could be treated as part of the surrounding environment because the channels of QAEmap were designed for these four atom types.
Fig. 6a shows an example of a SET domain protein methyltransferase (PDB: 3F9X) bound with S-adenosylhomocysteine (SAH). For amino acids adjacent to SAH, the bCCact. values were approximately 3%–8% higher in the model with SAH than in the model without SAH (Fig. 6b). If these differences can be predicted by QAEmap, then the binding and docking pose can also be predicted.
To test this assumption, we prepared bound/unbound structures, refined them to make simulated low-resolution structures at 2.0, 3.0, and 4.0 Å resolutions, and predicted bCCpred. using QAEmap. As the electron density of a compound depends on the existence of a compound in the structure, it is arbitrary to determine the presence or absence of the compound from the electron density, especially at resolutions of 3.0 Å and 4.0 Å (Fig. 6c and Supplementary Fig. 7).
On the other hand, when the difference between bound and unbound bCCpred. was calculated, the bound structure was predicted more accurately at all resolutions. This suggests that QAEmap could be used to determine the binding of compounds.
As bCCpred. reflects the docking pose of the compound, an accurate docking pose is required for the determination. In addition, as some atom types and interactions between a compound and a protein are specific, it is necessary to prepare training data that can be used to train QAEmap on the compound binding states for QAEmap to be applied to compound binding.