QAEmap: A Novel Local Quality Assessment Method for Protein Crystal Structures Using Machine Learning

doi:10.21203/rs.3.rs-687363/v1


Article


This work is licensed under a CC BY 4.0 License

Version 1


Low-resolution electron density maps can pose a major obstacle in the determination and use of protein structures. Herein, we describe a novel method, quality assessment based on an electron density map (QAEmap), that evaluates local protein structures determined by X-ray crystallography and corrects structural errors using low-resolution maps. QAEmap uses a three-dimensional deep convolutional neural network with electron density maps and their corresponding coordinates as input and predicts the correlation between the local structure and the putative high-resolution experimental electron density map. This estimates how well the structure fits the high-resolution map. Further, we propose that this method may be applied to evaluate ligand binding, which can be difficult to determine at low resolution.

Artificial Intelligence and Machine Learning

Computational Biology

Bioinformatics

electron density map

QAEmap

protein structures

Protein structures play an important role in understanding biology. For example, in drug discovery, they are used in structure-based drug design, where the binding between a protein and a candidate drug compound is analyzed in detail to improve the compound and deliver a more effective drug [1].

Many protein structures have been determined by X-ray crystallography or cryogenic electron microscopy (cryo-EM). An electron density map for X-ray crystallography or a Coulomb potential map for cryo-EM is calculated from experimental data, and a protein structure is constructed by placing the atoms of the protein according to the map. In X-ray crystallography, after an initial structure is built, the electron density map is calculated using the structure factors and the protein coordinates. During refinement, the electron density map is updated every time the structure is corrected, and the point at which the two agree best determines the final structure and electron density map. In cryo-EM, the Coulomb potential map is calculated from experimental data but is not updated during refinement; as in X-ray crystallography, the structure that fits the map best determines the final coordinates.

The quality of the electron density and Coulomb potential maps is directly affected by the quality of experimental data; that is, higher-resolution data result in a clearer map, while a lower resolution delivers an obscure map (Fig. 1a). In contrast to high-resolution maps, where atomic coordinates can be determined easily, low-resolution maps depend on the experimenter's judgment even when semi-automated assistive tools and prior information are available. However, this may lead to overinterpretation.

In addition, the appearance of maps within a protein molecule can vary. While the secondary structures and the rigid part of the inner molecule provide clear electron density maps, the surrounding or loop regions are often obscured; also, the side chains are often more obscured than the main chain. This is because of the thermal vibration of atoms. To construct a structure from maps that are partially obscured, atoms will either be placed or not based on the quality of the electron density map. However, in all these cases, the experimenter needs to make subjective decisions.

Many attempts have been made to exclude the subjectivity of an experimenter. Refinement programs contain a geometry library of amino acids and peptide bonds and perform refinement using the restraints defined in the library [2–4]. In particular, at low resolution, the hydrogen bonds in the main chain of a secondary structure can be restrained using additional restraint conditions from a homologous, high-resolution structure in the geometry library [5–7]. However, this method cannot be applied to all cases, because a homologous high-resolution structure is not always available or suitable.

To perform structural checks, methods such as the real-space correlation coefficient (RSCC) for crystal structures [8] and MolProbity [9] are available to experimenters. Both methods evaluate the structure on a residue basis. RSCC is a local measure of how well the calculated electron density of an amino acid matches the observed electron density. The RSCC values are large when the electron density map is clear and small for indistinct regions; this correlates well with the temperature factor (that is, B-factor in crystallography), which is an index of the thermal vibration of atoms. After refinement, the atomic position correlates to the electron density, and RSCC would be at its maximum. Thus, RSCC is a measure of structural integrity, rather than a tool for the identification of structural errors, and is not used for the correction of crystal structures.

MolProbity, on the other hand, is software that comprehensively evaluates the geometry of the main and side chains of amino acids and that of their surroundings. The main-chain geometry is evaluated against the Ramachandran plot. The side-chain conformation is evaluated based on whether it is statistically plausible. The position relative to adjacent amino acids is also evaluated, based on atomic radii, to ensure that it is not unnatural.

However, these evaluation methods do not improve the resolution of the electron density map; therefore, if the above-mentioned indicator is judged to be acceptable, no further correction is possible, and the problem of low resolution is not solved. In some extreme low-resolution cases, it may be impossible to determine whether the electron density is attributable to the binding of a compound or noise [10]. For this reason, it is believed that a density map with a resolution of 2.0–2.5 Å or higher is necessary for drug discovery or simulation studies [11]. Unfortunately, these criteria are not always met, and only 40% of the registered entities in the Protein Data Bank (PDB) have a resolution higher than 2 Å (Fig. 1b) [12]. The accuracy of research based on protein structures would be much improved if the coordinates of the structure could be determined accurately regardless of whether high- or low-resolution data are used.

While the use of cryo-EM structures has been increasingly reported in recent years, their resolution is generally lower than that of crystal structures, limiting their application in drug discovery [13]. Recently, several studies have applied machine learning to the evaluation of protein structures [14]. Examples of the application of a three-dimensional deep convolutional neural network (3D-CNN) to protein structures include methods to distinguish the secondary structure in a Coulomb potential map from cryo-EM or to evaluate protein models [15–17]. These studies show that 3D-CNNs can be applied to protein structures and maps.

In this study, we investigate a novel method, quality assessment based on an electron density map (QAEmap), that uses machine learning to solve low-resolution problems in structure determination by X-ray crystallography. We decided to use 3D-CNN and train it by using electron density maps and protein structures as input. In our method, the correct structure of a protein is defined as a high-resolution electron density map. We created a new evaluation index called the box correlation coefficient (bCC), which is the correlation between the coordinate structure to be evaluated and the electron density map of the correct structure. QAEmap can predict the bCC with input data of the coordinates and the electron density map used for determining the structure, even when no high-resolution structure is available. By modifying the coordinate structure to maximize bCC, the coordinates can attain a high resolution regardless of the resolution of the experimental electron density map. This method is also applicable to compound bindings that are unclear. In addition, we compared bCC with RSCC, which is the existing residue-based evaluation method that uses the electron density and coordinates as input.

Definition and Meaning of Objective Variable and Box Correlation Coefficient (bCC)

In this study, we defined bCC as a score used to evaluate local protein structure and used it as an objective variable in machine learning (Fig. 2a).

Since we defined the high-resolution structure as the correct one, it was necessary to quantify the agreement between the structure to be evaluated and its high-resolution counterpart. Specifically, the electron density map of the correct structure — rather than its coordinates — was used for this quantification. This is because the coordinates of a high-resolution structure are only a model used to describe the electron density map; therefore, the electron density map was considered more appropriate for this purpose.

We selected structures determined at a resolution higher than 1.5 Å as the correct structures. The corresponding correct electron density map had a 2mFo-DFc electron density, which is commonly used in crystallography and free from the bias of the structural coordinates used in electron density calculation [18]. Hereinafter, this electron density is referred to as ρ_{correct, obs}.

The coordinates to be evaluated were converted to electron density using the atom scattering factors in X-ray crystallography to compare with ρ_{correct, obs}. The electron density of an atom is distributed depending on the distance from the atom center and a B-factor, according to a Gaussian function [19].

As we wanted to evaluate the coordinates without considering B-factors, we set the B-factor to be isotropic and fixed at 2.0, which is a sufficiently small value. The electron density of an atom is given by [see Methods]

where r is the distance from the center of the atom, a_i and b_i are the atomic scattering factors [20], and B_iso is the isotropic temperature (B-factor) fixed at 2.0.
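The displayed equation is not reproduced in this version of the text. A form consistent with the symbols defined here is the standard isotropic sum-of-Gaussians atomic density, obtained by Fourier-transforming a Gaussian scattering-factor expansion with an added B-factor. The number of Gaussian terms n and the exact convention are assumptions, not taken from the original:

```latex
\rho_{\mathrm{atom,calc}}(r)
  = \sum_{i=1}^{n} a_i
    \left( \frac{4\pi}{b_i + B_{\mathrm{iso}}} \right)^{3/2}
    \exp\!\left( -\frac{4\pi^{2} r^{2}}{b_i + B_{\mathrm{iso}}} \right)
```

With this normalization, each Gaussian term integrates to a_i over all space, so the total density of an atom integrates to the sum of its a_i.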

The electron density ρ_{atom, calc}(r) was calculated for all atoms in the structure to produce the ρ_{model, calc} electron density map. From both the ρ_{correct, obs} and ρ_{model, calc} maps, a cube centered on the center of gravity of the amino acid of interest was extracted. The size and grid of the cubes were arbitrarily determined and were the same for all the residues.

The objective variable is defined as follows with the cubical boxes:

where var is the sample variance and cov is the sample covariance.
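Written out from the definitions given in the text (a correlation coefficient between the two boxes, with var and cov taken over all grid points), the objective variable takes the usual Pearson form. This is a reconstruction under that reading rather than a copy of the original display:

```latex
\mathrm{bCC}
  = \frac{\operatorname{cov}\!\left(\rho_{\mathrm{correct,obs}},\, \rho_{\mathrm{model,calc}}\right)}
         {\sqrt{\operatorname{var}\!\left(\rho_{\mathrm{correct,obs}}\right)\,
                \operatorname{var}\!\left(\rho_{\mathrm{model,calc}}\right)}}
```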

Because bCC is a correlation coefficient, it ranges from −1 to 1, and values closer to 1 indicate that the electron density and coordinates are consistent. The bCC value can be described as follows.

First, bCC would have the maximum value at the location where the coordinates of the correct structure best match the electron density ρ_{correct, obs}. However, the location where the maximum bCC is obtained may vary because the thermal vibrations of the protein could affect its location and electron density noise may also be present (Fig. 2b 1-1, 2-1). Different states of electron density have different possible maxima; therefore, the bCC values have relative implications. The bCC decreases when the evaluated structure deviates further from the correct structure (Fig. 2b 1-2,3,4 and 2-2,3).

By using bCC as an objective variable, we can predict the degree of agreement between the putative high-resolution electron density and the coordinate structure in a resolution-independent manner. Depending on the size of the box, bCC also includes the electron density derived from neighboring amino acids, water, compounds, and noise in the box. The box size used in this study is 12 Å × 12 Å × 12 Å.

Hereafter, the value of bCC calculated from the actual electron density of the correct structure in the training data is referred to as bCC_act. and the value of the predicted bCC is referred to as bCC_pred.

Data Preparation and Overview of QAEmap

We obtained data for 22 protein structures (coordinate and structure factor files) with a resolution higher than 1.5 Å from PDB and set them as the correct structures (Fig. 3a and Supplementary Table 1). Using the coordinates and the structure factor file as a starting point, we created structures containing various errors at various resolutions, together with the corresponding electron density maps, using crystallographic refinement and homology modeling techniques (see Methods). Here, water and other compounds were removed from the initial coordinate files, and only the atoms belonging to the proteins were retained for simplicity. Next, both the electron density map and coordinates were cut into 12 Å cubes for all the amino acids in the structure (Fig. 3b). The coordinates were further divided by atom species and converted into electron density in the same way as in the objective variable calculation. These were used as the input data for training as three-dimensional descriptors. As the created structures corresponded to the correct structures in the initial state, bCC_act., defined earlier, could be calculated.

QAEmap was trained separately for each of the amino acid species. The amount of input data for each species differed depending on the amino acid, but no adjustments were made, and the data for all 263,689 amino acid residues were used (Supplementary Fig. 1).

Our QAEmap 3D-CNN architecture was based on SqueezeNet [21] and implemented using TensorFlow (Release ver. 1.6.0) (Fig. 3c). All training procedures and parameters were the same for all amino acid types, the initial hyperparameter values were used, and the learning rate was set to 1e-05. The three-dimensional input data were rotated to all possible orientations at intervals of 90°, i.e., the model was trained with the objective variable for all 24 rotations. The QAEmap model was trained until convergence, which was reached after 40 epochs.
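The 90° rotation augmentation described above can be sketched as follows. This is a minimal NumPy illustration, not the QAEmap implementation: the function names `permutation_sign` and `cube_rotations` are hypothetical, and a single-channel cubic array is assumed. The 24 proper rotations are built as axis permutations combined with axis flips whose overall determinant is +1.

```python
from itertools import permutations, product

import numpy as np


def permutation_sign(perm):
    """Return +1 for an even permutation and -1 for an odd one."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def cube_rotations(box):
    """Return the 24 axis-aligned proper rotations of a cubic 3D array."""
    if box.ndim != 3 or len(set(box.shape)) != 1:
        raise ValueError("expected a cubic 3D array")
    rotations = []
    for perm in permutations(range(3)):
        for flips in product((False, True), repeat=3):
            # Each flip is a reflection, so the determinant is
            # sign(perm) * (-1)^(number of flips); keep only +1.
            if permutation_sign(perm) * (-1) ** sum(flips) != 1:
                continue
            rotated = np.transpose(box, perm)
            flip_axes = tuple(ax for ax, f in enumerate(flips) if f)
            if flip_axes:
                rotated = np.flip(rotated, axis=flip_axes)
            rotations.append(rotated)
    return rotations


# Toy 3 x 3 x 3 box with distinct values (illustrative data only).
toy_box = np.arange(27).reshape(3, 3, 3)
rotated_boxes = cube_rotations(toy_box)
```

Because a correlation coefficient computed over all grid points does not change when both boxes are rotated together, each rotated copy can reuse the bCC label of the original box.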

Evaluation of QAEmap on Test Data

The trained QAEmap was evaluated on test datasets prepared from Biliverdin Reductase (PDB 1LC0) and SET Domain Methyltransferase (PDB 3F9X, chain D) (Supplementary Fig. 1). QAEmap predicted bCC_pred., and the correlations between bCC_pred. and bCC_act. were observed per amino acid for three different resolutions (Fig. 4a, Supplementary Figs. 2 and 3). The correlation coefficients varied among the amino acids; the maximum and minimum observed correlation coefficients were 0.865 for proline and 0.691 for phenylalanine, respectively. The correlation decreased at lower resolutions; at resolutions lower than 4.5 Å, it decreased to 0.5 or less for some amino acids, such as glutamine and tyrosine.

Subsequently, we compared the correct structure of 3F9X and its simulated low-resolution structure, which was refined with structure factor data truncated at 3.0 Å resolution (Fig. 4b and Supplementary Fig. 4). The two structures were almost identical, with a root-mean-square difference (RMSD) of 0.16 Å for all atoms, and the bCC_act. values exceeded 0.6 for almost all amino acid residues. When bCCs were predicted using the electron density ρ_{model, obs} of each resolution as input data, bCC_pred. correlated well with bCC_act. (for the correct structure, the average difference between bCC_pred. and bCC_act. was −0.016 with a standard deviation of 0.020; for the simulated low-resolution structure, the average difference and standard deviation were 0.041 and 0.029, respectively). This shows that the bCC of the correct coordinate structures could be estimated independently of the resolution.

As examples of incorrect structures, we examined two model structures: 5V2N-templated homology models refined against the structure factors at 1.25 Å and 3.0 Å resolution. No residue exceeded the correct structure’s bCC_pred., except for the two terminal residues, and the structure with the maximum bCC_pred. was the correct structure, as expected.

The model structures had conformational errors at a1-3, and the bCC_act. values were as low as 0.2–0.4 (Fig. 4b and Supplementary Fig. 4). Most bCC_pred. values were 0.4 or less, implying that these residues were correctly predicted to be poorly correlated with the correct electron density. However, the bCC_pred. values of some residues were as high as 0.6, approximately 0.2 higher than bCC_act., so these residues were predicted to be well correlated with the correct structure despite the main chain being incorrect. For example, for Ile154 in the 1.25 Å resolution structure, bCC_act. = 0.395 and bCC_pred. = 0.668, and for Lys150 in the 3.0 Å resolution structure, bCC_act. = 0.315 and bCC_pred. = 0.583 (Supplementary Fig. 4). This problem should be considered and addressed as follows.

When predicting bCC for the correct structure in cases where the main chain is incorrect, it is sufficient to indicate that no correlation exists. For residues where the main chain is almost correct or well-correlated to the correct structure, it is necessary to predict the relative bCCs precisely, that is, to distinguish which state is better correlated to the correct structure. In future work, we will proceed to optimize the training data and training of QAEmap further for identifying residues that can be corrected with bCC_pred. and improving the prediction accuracy of bCC within the structural correction range.

We also compared bCC with RSCC, which is the correlation between the electron density and its coordinates, over the grids including the residue atoms (Supplementary Fig. 4). RSCC tended to overestimate residues with errors because high B-factors were estimated during refinement, and the electron density tended to appear at places where the atoms were misplaced, particularly at low resolution (Fig. 4c). In principle, this was not the case for bCC, because bCC referred to the electron density of the correct structure. Therefore, the prediction of bCC is expected to solve the model bias problem of X-ray crystallography, and further research is required in this direction.

Evaluation of QAEmap with Actual Experimental Data

An actual low-resolution structure was evaluated using QAEmap. Since test data were obtained by truncating high-resolution data, the signal-to-noise ratio of actual structure factor data would be worse than that of the test data at the same resolution.

The CDK2: Spy1 complex (3.2 Å resolution, 288 residues; [22]), registered in PDB as 5UQ1, was evaluated, and 2R3F (1.5 Å resolution; [23]) was used for comparison as a high-resolution structure of CDK2. Both were determined using the molecular replacement method with a homologous CDK2 structure as a template, and they differ in crystal form and conformation; the RMSD between all corresponding atoms was 3.01 Å.

The bCC_pred. of 5UQ1 was predicted by QAEmap, and the bCC_act. of 2R3F was calculated; the mean bCC values were 0.592 and 0.593, respectively (Fig. 5a, Supplementary Fig. 5). The individual bCC_pred. values were in good agreement with the bCC_act. of 2R3F (Fig. 5b). Specifically, for the secondary structures of the C-terminal domain, QAEmap predicted that the structure of 5UQ1 is as accurate as that of 2R3F.

When the bCC values were compared locally, the region of amino acid residues 177–179 had bCC_act. values of 0.52–0.57 for 2R3F and 0.35–0.39 for 5UQ1. Structural modification mimicking 2R3F and using bCC_pred. values as an index improved the bCC_pred. of Lys178 and resulted in a more accurate prediction (Fig. 5c).

In addition, the bCC values were higher when packed with neighboring molecules than when exposed to solvents (Supplementary Fig. 6), indicating that local differences in structures at different resolutions can be described using bCC.

Application of QAEmap to Compound Bound Structures

Another challenge at low resolution is determining ligand binding and the binding mode. QAEmap can be applied in these instances to evaluate compound binding, as an input box contains the environment around the amino acid of interest. We attempted to assess the binding of compounds using our trained QAEmap. Although atoms belonging to compounds were removed from the current training data, ligand compounds that consisted of carbon, oxygen, nitrogen, and sulfur could be treated as part of the surrounding environment because the channels of QAEmap were designed for these four atom types.

Fig. 6a shows an example of a SET domain protein methyltransferase (PDB: 3F9X) bound with S-adenosylhomocysteine (SAH). For amino acids adjacent to SAH, the bCC_act. values were approximately 3%–8% higher in the model with SAH than in the model without SAH (Fig. 6b). If these differences can be predicted by QAEmap, then the binding and docking pose can also be predicted.

To test this assumption, we prepared bound/unbound structures, refined them to make simulated low-resolution structures at 2.0, 3.0, and 4.0 Å resolutions, and predicted bCC_pred. using QAEmap. As the electron density of a compound depends on the existence of a compound in the structure, it is arbitrary to determine the presence or absence of the compound from the electron density, especially at resolutions of 3.0 Å and 4.0 Å (Fig. 6c and Supplementary Fig. 7).

On the other hand, when the difference between bound and unbound bCC_pred. was calculated, the bound structure was predicted more accurately at all resolutions. This suggests that QAEmap could be used to determine the binding of compounds.
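The comparison described above can be sketched as a per-residue difference in predicted bCC between the bound and unbound models. The function name `bcc_gain`, the choice to average over ligand-adjacent residues, the sign test, and all numbers below are illustrative assumptions, not values or a decision rule from this study.

```python
import numpy as np


def bcc_gain(bcc_pred_bound, bcc_pred_unbound):
    """Mean per-residue difference in predicted bCC (bound minus unbound)."""
    bound = np.asarray(bcc_pred_bound, dtype=float)
    unbound = np.asarray(bcc_pred_unbound, dtype=float)
    if bound.shape != unbound.shape:
        raise ValueError("both models must cover the same residues")
    return float(np.mean(bound - unbound))


# Hypothetical predictions for three residues adjacent to the ligand site.
pred_bound = [0.62, 0.58, 0.66]
pred_unbound = [0.59, 0.55, 0.61]

gain = bcc_gain(pred_bound, pred_unbound)
# Assumed rule for this sketch: a positive mean gain favors the bound model.
favors_bound = gain > 0.0
```

In practice, a threshold would need to be calibrated against the spread of bCC_pred. errors reported for the test data rather than fixed at zero.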

As bCC_pred. reflects the docking pose of the compound, an accurate docking pose is required for the determination. In addition, as some atom types and interactions between a compound and a protein are specific, it is necessary to prepare training data that can be used to train QAEmap on the compound binding states for QAEmap to be applied to compound binding.

We developed a method to evaluate the local structure of protein crystals independent of the resolution of experimental data. Further development is ongoing to expand the resolution of prediction, assign water and other compounds, and improve prediction accuracy.

We are also considering the application of QAEmap to cryo-EM. Cryo-EM has fewer entries in PDB than X-ray crystallography, and few have a resolution above 1.5 Å [12]. As the data and their tendency for cryo-EM are different from those for X-ray crystallography, it is necessary to adjust QAEmap for cryo-EM data. However, the basic approach of using maps and coordinates as inputs and bCC as an objective variable is still applicable.

QAEmap is an integrated local structure assessment tool that can be used by structural protein experimenters to confirm structural determination. Further, it can be used by structure users to guide the viewing of the structure and can become a useful tool for the expanding structural biology community.

Preparation of Training Data

Approximately 9,500 entities with resolutions higher than 1.5 Å and containing more than 30 amino acids were extracted from the PDB. They were classified using CATH [24], and 22 entities were randomly selected from each category as correct structures and downloaded along with their structure factor files (Supplementary Table 1). Their homologous proteins were found using a BLAST search [25] based on the amino acid sequences. After sequence alignment, homology models were built using MODELLER [26] with the homologous proteins as templates. The models, superimposed on the original structure in the crystal coordinate system, served as the initial model structures. They were refined against the structure factors at resolutions ranging from the highest available to 5.0 Å in increments of 0.5 Å using DIMPLE in the CCP4 package [27] over 100 cycles with all the default restraint conditions. Given that 8–9 model structures were prepared for each initial model, a total of 1,366 structures were obtained. Their corresponding electron density (ρ_{model, obs}) maps were also calculated from the refined coordinates and the structure factor file. Test data were prepared with the same method from 1LC0 and 3F9X (Supplementary Table 1).

Preparation of Three-dimensional Descriptors

The B-factors in the coordinate files were set to 2.0 using PDBSET (CCP4), and the coordinates were divided based on atom types. The coordinate files of each atom type were then converted into electron density maps using the ATOMMAP mode of SFALL (CCP4) and extended to the unit cell using MAPMASK (CCP4). All the electron density maps, ρ_{model, obs} and ρ_{model, calc}, of each atom type were cut into cubic boxes with sides of 12 Å and a grid size of 0.5 Å by MAPROT (CCP4); they were centered on the center of gravity of an amino acid. There were 315,187 amino acids in total for the training data. Thus, five descriptors for each amino acid were calculated and assigned to the different channels in the QAEmap model.
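The descriptor construction can be illustrated with a small NumPy stand-in for the PDBSET/SFALL/MAPROT steps. This is a sketch under stated assumptions, not the CCP4 pipeline: the function names are hypothetical, the grid is assumed to be cell-centred with 24 points per 12 Å edge, the four atom-type channels (C, N, O, S) follow the atom types named elsewhere in the text, and the Gaussian coefficients in the example are made up rather than taken from scattering-factor tables. The fifth descriptor would presumably be the corresponding ρ_{model, obs} box, which is not generated here.

```python
import numpy as np

BOX_SIDE = 12.0    # box edge in Å (from the text)
GRID_STEP = 0.5    # grid spacing in Å (from the text)
ATOM_TYPES = ("C", "N", "O", "S")   # assumed channel order


def box_grid(center):
    """Cartesian coordinates of the grid points of a box around `center`."""
    n = int(round(BOX_SIDE / GRID_STEP))   # 24 points per edge
    # Assumption: cell-centred sampling, symmetric about the box center.
    offsets = (np.arange(n) + 0.5) * GRID_STEP - BOX_SIDE / 2.0
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1) + np.asarray(center, dtype=float)


def atom_density(r_squared, a, b, b_iso=2.0):
    """Isotropic sum-of-Gaussians density at squared distances r_squared."""
    a = np.asarray(a, dtype=float)
    width = np.asarray(b, dtype=float) + b_iso
    terms = (a * (4.0 * np.pi / width) ** 1.5
             * np.exp(-4.0 * np.pi ** 2 * r_squared[..., None] / width))
    return terms.sum(axis=-1)


def atom_type_channels(coords, elements, center, coeffs):
    """One density box per atom type, stacked as (type, x, y, z)."""
    grid = box_grid(center)
    channels = np.zeros((len(ATOM_TYPES),) + grid.shape[:3])
    for xyz, element in zip(coords, elements):
        if element not in ATOM_TYPES:
            continue   # other elements are ignored in this sketch
        r_squared = np.sum((grid - np.asarray(xyz, dtype=float)) ** 2, axis=-1)
        a, b = coeffs[element]
        channels[ATOM_TYPES.index(element)] += atom_density(r_squared, a, b)
    return channels


# Made-up one-term coefficients for illustration; not real scattering factors.
toy_coeffs = {"C": ([6.0], [10.0]), "N": ([7.0], [10.0]),
              "O": ([8.0], [10.0]), "S": ([16.0], [10.0])}
channels = atom_type_channels(
    coords=[(0.0, 0.0, 0.0)], elements=["C"],
    center=(0.0, 0.0, 0.0), coeffs=toy_coeffs)
```

Keeping each atom type in its own channel lets a convolutional network see element identity directly instead of inferring it from density height.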

Calculation of Objective Variables

The coordinates of all the atom types in the model structures were converted into electron density in the same manner as mentioned above. The electron density (ρ_{model, calc}) and electron density of the correct structure (ρ_{correct, obs}) were cut into boxes, as with the three-dimensional descriptors. The correlation coefficient between the boxes over all grids was calculated as bCC_act.
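The box correlation itself reduces to a Pearson correlation over all grid points of two equally shaped boxes. The following is a minimal sketch; the function name `box_correlation` and the choice to return NaN for a flat box are assumptions, and the random box merely stands in for real density.

```python
import numpy as np


def box_correlation(rho_correct_obs, rho_model_calc):
    """Pearson correlation over all grid points of two equally shaped boxes."""
    x = np.asarray(rho_correct_obs, dtype=float).ravel()
    y = np.asarray(rho_model_calc, dtype=float).ravel()
    if x.shape != y.shape:
        raise ValueError("boxes must contain the same number of grid points")
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    denom = np.sqrt(x.var() * y.var())
    if denom == 0.0:
        return float("nan")   # assumption: undefined for a constant box
    return float(cov / denom)


# Synthetic 24 x 24 x 24 box (12 Å at 0.5 Å spacing); illustrative data only.
rng = np.random.default_rng(0)
reference_box = rng.normal(size=(24, 24, 24))
bcc_same = box_correlation(reference_box, reference_box)
bcc_inverted = box_correlation(reference_box, -reference_box)
```

The same normalization is used for the variance and covariance, so the ratio is identical whether sample or population estimators are chosen.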

Data Preparation for CDK2

5UQ1 (3.2 Å resolution) and 2R3F (1.5 Å resolution) were downloaded from the PDB along with their structure factor files. All water and ligand molecules were removed from the coordinate files. Ten cycles of refinement were performed by REFMAC5 (CCP4) [28]. The R-factor and free R-factor were 0.192 and 0.277 for 5UQ1, and 0.262 and 0.292 for 2R3F, respectively. The values of bCC_act. were calculated with the data of 2R3F as the correct structure. After the structural modification of 5UQ1, the structure was refined and bCC was predicted in the same manner. The modified structure’s R-factor was 0.190, and its free R-factor was 0.276.

Calculation of RSCC

The RSCC of the test data of 3F9X and 5UQ1 was calculated using EDSTATS (CCP4), the ρ_{model, obs} electron density map, and the coordinates after refinement.

Data Preparation for SAH-Bound and Unbound Structures of 3F9X

3F9X (1.25 Å resolution) was downloaded from the PDB, along with its structure factor file. All water molecules were removed from the coordinate files, following which two coordinate files were prepared; one was with an SAH molecule, and the other was without. They were refined against 1.25 Å, 2.0 Å, 3.0 Å, and 4.0 Å-resolution structure factors by REFMAC5 over 10 cycles. The R-factors and free R-factors of the SAH bound structures were 0.276 and 0.293 for 1.25 Å, 0.271 and 0.288 for 2.0 Å, 0.260 and 0.269 for 3.0 Å, and 0.252 and 0.261 for 4.0 Å resolution, respectively. For the SAH unbound structures, the R-factors and free R-factors were 0.278 and 0.300 for 1.25 Å, 0.273 and 0.291 for 2.0 Å, 0.262 and 0.269 for 3.0 Å, 0.256 and 0.259 for 4.0 Å resolution, respectively. The values of bCC_act. were calculated with the data of the 1.25 Å-resolution structures.

Software. The software package PyMOL (The PyMOL Molecular Graphics System, Version 2.3, Schrödinger, LLC, https://www.pymol.org/2/) was used for the visualization of protein structures and maps; MODELLER (v.10.1, University of California San Francisco, https://salilab.org/modeller/) and CCP4 (v.7.0, Collaborative Computational Project No. 4, https://www.ccp4.ac.uk/) were used for creating protein structures and map files. TensorFlow (v.1.x, Google Brain, https://www.tensorflow.org/install/pip) and Python (v.3.6, Python Software Foundation, https://www.python.org/downloads/) were used for the development of QAEmap.

Data availability

The coordinate and structure factor files can be downloaded from the Protein Data Bank (Supplementary Table 1).

Output files from QAEmap for the simulated and experimental data that support the findings of this study are available from the corresponding author upon request.

Code availability

The QAEmap program is freely available for academic use through GitLab (https://gitlab.com/qaemap_products/qaemap).

1. Van Montfort, R. L. M. & Workman, P. Structure-based drug design: aiming for a perfect fit. Essays Biochem. 61, 431–437 (2017).
2. Vagin, A. A. et al. REFMAC5 dictionary: organization of prior chemical knowledge and guidelines for its use. Acta Cryst. D60, 2184–2195 (2004).
3. Headd, J. J. et al. Use of knowledge-based restraints in phenix.refine to improve macromolecular refinement at low resolution. Acta Cryst. D68, 381–390 (2012).
4. Engh, R. A. & Huber, R. Accurate bond and angle parameters for X-ray protein structure refinement. Acta Cryst. A47, 392–400 (1991).
5. Afonine, P. V. et al. Towards automated crystallographic structure refinement with phenix.refine. Acta Cryst. D68, 352–367 (2012).
6. Nicholls, R. A., Fischer, M., McNicholas, S. & Murshudov, G. N. Conformation-independent structural comparison of macromolecules with ProSMART. Acta Cryst. D70, 2487–2499 (2014).
7. Kovalevskiy, O., Nicholls, R. A. & Murshudov, G. N. Automated refinement of macromolecular structures at low resolution using prior information. Acta Cryst. D72, 1149–1161 (2016).
8. Tickle, I. J. Statistical quality indicators for electron-density maps. Acta Cryst. D68, 454–467 (2012).
9. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Cryst. D66, 12–21 (2010).
10. Rupp, B., Wlodawer, A., Minor, W., Helliwell, J. R. & Jaskolski, M. Correcting the record of structural publications requires joint effort of the community and journal editors. FEBS J. 283, 4452–4457 (2016).
11. Ilatovskiy, A. V. & Abagyan, R. Computational Structural Biology for Drug Discovery: Power and Limitations. In Structural Biology in Drug Discovery: Methods, Techniques, and Practices, Chap. 15, 347–361 (John Wiley & Sons, Inc., New Jersey, 2020).
12. wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2019).
13. Wlodawer, A., Li, M. & Dauter, Z. High-resolution cryo-EM maps and models: a crystallographer’s perspective. Structure 25, 1589–1597 (2017).
14. Uziela, K., Hurtado, D. M., Shu, N., Wallner, B. & Elofsson, A. ProQ3D: improved model quality assessments using deep learning. Bioinformatics 33, 1578–1580 (2017).
15. Subramaniya, S. R. M. V., Terashi, G. & Kihara, D. Protein secondary structure detection in intermediate-resolution cryo-EM maps using deep learning. Nature Methods 16, 911–917 (2019).
16. Sato, R. & Ishida, T. Protein model accuracy estimation based on local structure quality assessment using 3D convolutional neural network. PLOS ONE 14(9), e0221347 (2019). https://doi.org/10.1371/journal.pone.0221347
17. Pages, G., Charmettant, B. & Grudinin, S. Protein model quality assessment using 3D oriented convolutional neural networks. Bioinformatics 35, 3313–3319 (2019).
18. Read, R. J. Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149 (1986).
19. Agarwal, A. A new least square refinement technique based on the fast Fourier transform algorithm. Acta Cryst. A34, 791–809 (1978).
20. International tables for crystallography volume C: mathematical, physical and chemical tables. First online edition (Wiley, 2006) ISBN: 978-1-4020-1900-5.
21. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. & Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5MB model size. https://arxiv.org/abs/1602.07360 (2016).
22. McGrath, D. A. et al. Structural basis of divergent cyclin-dependent kinase activation by Spy1/RINGO proteins. EMBO J. 36, 2251–2262 (2017).
23. Fischmann, T. O. et al. Structure-guided discovery of cyclin-dependent kinase inhibitors. Biopolymers 89, 372–379 (2008).
24. Sillitoe, I. et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295 (2016).
25. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
26. Fiser, A. & Sali, A. Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol. 374, 461–491 (2003).
27. Winn, M. D. et al. Overview of the CCP4 suite and current developments. Acta Cryst. D67, 235–242 (2011).
28. Murshudov, G., Vagin, A. & Dodson, E. Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255 (1997).

The authors declare no competing interests.
