MDO: A Computational Protocol for Accurate Prediction of Full Flexibility Receptor-Ligand Binding Mode Structures

Predicting the binding structure of bio-complex is essential for understanding its properties, functions, and mechanisms, but is rather dicult due to the huge sampling space involved. A new computational protocol, MDO, for nding the ligand binding structure is proposed. MDO consists of global sampling via MD simulation and clustering of the receptor congurations, local sampling via molecular docking and clustering of the ligand conformations, and binding structure optimization by the ONIOM (QM/QM) method. MDO is tested on 15 protein-ligand complexes with known accurate structures. The success rate of MDO predictions, with RMSD < 2 Å, is found to be 67%, substantially higher than the 40% success rate of conventional methods. The MDO success rate can be increased to 83% if the ONIOM calculations are applied only for the starting poses with ligands inside the binding cavities. The MDO protocol is a promising tool for the structure based drug design.


Introduction
The principal aim of rational drug design and development is to detect a small molecule that binds robustly to a particular bio-macro-molecule. However, the identi cation of bio-complex binding structures is still a prime challenge for the structure-based drug design (SBDD). The di culty with the structural determinations is simply due to the complicated potential energy landscape of the bio-complex with an essentially in nite number of local minima [1][2][3][4] .
The molecular docking technique is a simple way to obtain the likely binding mode of the bio-complex system and estimate its binding a nity 2,3,5−7 . In this technique, a ligand is rst bound into a receptor in different binding mode conformations, then the conformations are ranked with a scoring function. During the docking process, seeking the ligand conformations for a xed receptor structure, i.e., rigid docking, is very common, while the exploration of receptor exibility or exible docking remains challenging and rare 8 . However, the importance of receptor exibility for SBDD by modifying its binding mode based on the ligand orientation is well documented 9 and the exible docking should be pursued if possible 10,11 .
A diversity of exible docking approaches have been suggested, including the ensemble docking with discrete receptor conformations 12,13 , the induced-t that allows the receptor conformation to change during docking 14 , and the modi cation of receptor-ligand interaction to simplify the potential energy landscape of the bio-complex 15,16 . While some exible dockings focus on the exible receptor side chain, others have limited backbone movements 14,17−19 . As an example of special interest, the recently released AutoDock 4.0 software allows the side chains of macromolecule to be exible. AutoDock 4.0 is also fast and capable of performing 40,000 rigid dockings in a day on one CPU 20 . The docking algorithm is capable of nding diverse binding modes, but fails to nd the correct binding mode through the scoring function. The success rate of docking ranges from 35-60% when judged by the Root Mean Square Deviation (RMSD) of less than 2 A 7,21 .
Generally, the success of molecular docking depends on the positive identi cation of binding modes and accurate estimation of their free energies. Unfortunately, most docking methods employ molecular mechanics (MM) force elds that are inadequate for the description of polarization, charge transfer, bond breaking/forming, etc. Consequently, recourse to use quantum mechanics (QM) based methods, especially the density functional theory (DFT), is desirable for the improvement of molecular docking 4,22,23 . However, the computational cost of QM is too high for the whole bio-complex system. Some compromise between the needs of numerical accuracy and computational e ciency is required, leading to the QM/MM technique. In the QM/MM approach, the ligand, and occasionally important atoms of the receptor, is treated by the QM method, while the rest of the receptor-ligand complex is treated by the MM method [24][25][26] . In particular, ONIOM [27][28][29] (our N-layered Integrated Molecular Orbital and Molecular Mechanics) is a highly successful QM/MM technique for large bio-complex systems. ONIOM uses different levels of theory for different layers of the bio-complex system and generates reliable system geometry and energy in a short time. ONIOM is used in computational protocols like DOX 26 and CSAMP 30 to nd the optimal binding modes of rigid docking.   In this work, each bio-complex system is subjected to MD simulation for 10 ns, using Gromacs 2018 package 31,32 , periodic boundary conditions (PBC), and CHARMM36 force eld 33 . Ligand topologies are generated using SwissParam server 34 that is compatible with the CHARMM force elds.
The bio-complex systems were solvated in a cubic box with a TIP3P 35 model of water molecules as a solvent, with 1 A between the bio-complex and the edge of the box. The solvated systems were naturalized by interchanging sodium or chloride ions with the solvent molecules. The systems were relaxed using the steepest descent method with maximum force convergence of 10 kJ mol − 1 nm − 1 . The LINCS algorithm was applied to constrain all-bonds 36 . The relaxed systems were heated gradually for 100 ps.
The MD runs use the leap-frog integration 37  The MD frames in an interval of 10 ps are saved for analysis. The geometry clustering analysis of the MD trajectories is carried out using the Gromos method 44 , after removing the PBC in the MD runs. The Gromos method uses the alpha carbon (C-α) RMSD to characterize the similarity of MD frames. The Gromos algorithm counts the number of neighbors of the frames using an RMSD threshold, discovers the frame with the largest number of neighbors as the frames centroid, removes the frames from the pool, and repeats the procedure for the remaining frames. While the top clusters represent the highest probabilities of the MD frames, only one frame for each cluster representing the cluster centroid is extracted as a PDB le for use in the molecular docking study. The RMSD threshold is varied between 0.1 nm and 0.15 nm to nd the optimal threshold 45 for each ligand-receptor bio-complex in this study. Clustering of ligand conformations obtained from the docking simulations is performed using the ensemble cluster tool in the UCSF-Chimera software 47 . This tool applies the approach of Kelley et al 48 .
The ensemble of conformations is grouped into clusters of similar conformations. One conformation from each cluster is selected to exemplify the conformations in that cluster. After clustering, the top 10 binding poses for each of the 10 most populated protein conformations, or a total of 100 binding poses, are left to represent the likely ligand-receptor binding modes.

Geometry optimization and ranking of ligand-protein binding poses
The top 100 most populated binding modes obtained above for each ligand-protein complex are subjected to geometry optimization using the ONIOM method. The binding mode structures are then ranked by the nal ONIOM energy. The top ranked binding structure is taken as the best ligand-receptor binding mode structure.
The ONIOM calculations are carried out at the level of wB97X-D 49 /6-31G*: PM6 50 . All amino acid (AA) residues within 6 A of ligand atoms are included in the ONIOM calculation. There are about 700 atoms in the AA residues of the ONIOM system in this study. The ONIOM system is divided into two layers, a highlevel part for the ligand, treated at the wB97X-D/6-31G* level, and a low-level part for the selected AA residues, treated at the PM6 level. The geometry optimizations of the large bio-systems are performed with the "loose 51 " convergence criterion. The ONIOM calculations are conducted with the Gaussian09 suite of software 52 .

Testing set of bio-complex systems
The proposed computational protocol MDO is tested on fteen protein-ligand complexes whose x-rays crystallographic structures are available in the protein data bank (PDB) 53 Table S1 of the Supporting Information. In addition to having accurate 3D crystallographic structures in PDB, the 15 bio-complex systems are bound by non-covalent interactions. When preparing the bio-complexes for testing the MDO protocol, water molecules and ions presented in the crystal structures are detached.
Missing atoms or residues that should exist in the crystallographic structure are added. Each bio-complex is assigned at the physiological conditions for titratable residues using preparation and 3D-protonation module of MOE.
Considering the exible nature of MDO and for more comparison, we also perform exible (induced t) docking with the DOCK software on the bio-complexes, using Charmm27 for the energy minimization.
Like the rigid docking, the bond rotation method is also used in the induced t for the generation of ligand conformations. The induced t uses the settings 'Triangle matcher/London dG' and 'Induced-Fit/force eld/GBVI-WAS dG' for the conformation placement and re nement, respectively. Figure 2 shows the variations of RSMDs with the MD simulations for the 15 bio-complex systems. The PDB crystallographic structures are used as the references for the RMSD evaluations. The RMSDs for a bio-complex re ect implicitly the movements of the backbone, side chains and ligand of the bio-complex. As seen in Fig. 3, all RMSDs level off after ~ 2 ns of MD runs. That is, after relaxation from their crystal xray structures, the 15 bio-complex systems are quite stable against the MD runs under the control conditions. The backbone and C-α RMSDs for most proteins are under 2 A, indicating high similarities between the MD frames and the original crystallographic structures.

MD simulation and clustering
The MD simulations yield a large number of protein con guration snapshots. The MD trajectories are subjected to clustering analysis with the C-α RMSD cut-off of 0.13, 0.11, 0.13, 0.12, 0.11, 0.13, 0.13, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, and 0.14 nm for 1ERB, 1FKI, 1HW9, 2IFB, 1FKG, 1HW8, 3GI5, 1HWI, 2I0A, 3MXD, 2QI3, 1HWJ, 1LF2, 1HWK, and 1LF3, respectively. The RMSD cut-offs are chosen to ensure the diversity of clusters, i.e., as many clusters as possible, while requiring the top 10 clusters to include ~ 90% or more of the MD trajectories 45 . With the above used C-α RMSD cut-offs, the top 10 clusters are highly representative as they account for over 90% of the total MD frames for each of the 15 bio-complexes, as shown in Fig. 3. Details about the dependences of the number of clusters and the ratio of members in the top ten clusters on the C-α RMSD cut-off for each of the 15 bio-complexes can be found in Table S2 of the Supporting Information. Table 1 compares the binding mode RMSDs of different prediction methods. The RMSD score is calculated between the crystal binding pose as the reference and the predicted top scored pose. As seen in Table 1, only 6 out of the 15 top scored binding poses of DOCK rigid docking have RMSDs below 2 Å. The success rate of DOCK rigid docking is quite low at 40%. The performance of the Induced-Fit method is similar, also with 6 of the 15 RMSDs below 2 Å. However, the average RMSD of the Induced-Fit, 2.46 Å, is larger than the rigid docking, 2.10 Å. The smallest and largest RMSDs of the rigid docking are 1.15 Å for 2IFB and 3.49 Å for 1LF3, respectively. In comparison, the smallest and largest RMSDs of the Induced-Fit are respectively 1.20 Å for 2IFB and 5.18 Å for 1LF2.

Binding mode prediction
The quality of the top binding poses is much improved with the MDO prediction. The success rate of MDO is 67%, with 10 of the 15 RMSDs below 2 Å. Comparison of the starting and resulting RMSDs of the MDO binding poses in Table 1 illustrates the power and the critical importance of the ONIOM calculations in determining the best binding poses. Here, the MDO starting pose refers to the starting structure from which the MDO predicted top pose was generated, i.e., the structure of the MDO pose before the ONIOM calculation. As shown in Table 1, only 6 of the 15 MDO starting RMSDs are less than 2 Å. The overall quality of the MDO starting poses, with an average RMSD of 2.54 Å, is the worst among the methods shown in Table 1, worse than that of the DOCK rigid docking and the Induced-Fit. However, the geometry optimization by ONIOM reduces the RMSD for 12 of the 15 MDO starting poses, leading to a rate of 80% for the improvement of the binding structures. The MDO predictions are overall far superior to their starting structures. They are also clearly better than the predictions of the rigid docking and Induced-Fit.   Tables S7, S8, S11 -S15. The need to substantially improve the DOCK scoring function is evident. While the MDO prediction does not necessarily pick the best pose, it is usually among the best ones. In particular, for the 10 successful predictions, namely with RMSD < 2 Å, 7 of the MDO picks are the very best, 2 of them are the second best, and 1 of them is the third best, see Tables S3 -S12. That is, most of the MDO predictions corresponds to the best choices, and all of them are among the top three structures. Moreover, the quality of MDO structure, when it is not the best pick, is not far from the best one. While the overall success of MDO is evident, the three cases of ILF2, 1HWK and 1LF3 shown in Table 1 are quite troubling. The MDO binding poses for the three bio-complexes are worse than their starting poses. To illustrate why the ONIOM calculations fail for the three cases but succeed in all the other cases, the starting MDO poses of 6 representative bio-systems are shown in Fig. 4. As seen in Fig. 4, the MDO starting poses for 1ERB, 2QI3 and 1HWJ are inside the binding cavities of the proteins. That is also the case for all the other studied bio-complexes, except for ILF2, 1HWK and 1LF3. The binding modes with the starting ligand poses inside the protein binding cavities are all improved by the ONIOM calculations.
In the cases of 1LF2, 1HWK and 1LF3, as shown in Fig. 4(d, e, f), the MDO starting poses selected by the ensemble clustering of rigid docking poses of DOCK are out of the protein binding cavities. That is, the ensemble cluster fails to select the right poses. Due to the lack of proper ligand-receptor interactions, the ONIOM geometry optimizations lead to higher RMSDs than the starting ones.
The observations derived from Fig. 4 can be used to improve the success rate of MDO and/or avoid unnecessary ONIOM calculations. The MDO starting poses should be inspected before the ONIOM calculations to see if the ligands are inside the binding cavities. If positive, the starting poses are subjected to geometry optimizations by ONIOM. For the bio-complexes examined in this study, the rate of structural improvement is then increased to 100%, and the rate of successful binding pose determinations is increased to 83%. If negative, the starting poses can be skipped to avoid the unmeaningful ONIOM calculations.
For a bio-complex with poor binding poses selected by the ensemble clustering, the worst case scenario is to say that the MDO methodology is inapplicable to the bio-complex. However, additional efforts may be made to enable the use of MDO for this bio-complex. For example, the step of rigid docking by DOCK, and even the step of MD simulations, may be repeated to yield the suitable MDO starting poses. Once such starting poses are found, the chance of successful identi cation of binding pose is quite high, as inferred from the results for the 12 bio-complexes.

Conclusions
A new computational protocol, MDO, for the prediction of ligand binding structure is proposed. MDO consists of MD simulations for sampling the receptor con gurations and clustering, rigid docking for sampling the ligand conformations and clustering, and ranking of the binding modes by the ONIOM method. MDO is tested on 15 protein-ligand complexes with known accurate structures. The MDO protocol yields RMSD < 2 Å for 10 of the 15 tested cases, or a success rate of 67%. In comparison, the success rates of both the rigid docking and induced-t with DOCK are only 40% for the same 15 bio-complexes. A key element for the success of MDO is the binding structure improvement by the ONIOM calculations. For the starting poses with the ligands inside the binding cavities, the MDO predictions are successful for 10 of the 12 bio-complexes, or a success rate of 83%, while the ONIOM structural improvement is observed for all the 12 bio-complexes. Therefore, the applicability of MDO can be predetermined by inspecting the binding poses obtained by rigid docking. The success rate of MDO predictions for applicable bio-complexes is over 80%, while the rate of structural improvement by the ONIOM calculations is 100%. MDO is a robust technique to identify the binding mode structures that are essential for the deep understanding of properties, functions, and mechanisms of bio-complex systems. MDO is a promising tool for the structure based drug design.
Declarations Figure 1 Schematic of the MDO Protocol.

Figure 2
RMSDs of the protein backbone, C-alpha, side chain, and the ligand during MD simulations. C-α RMSD cut-off of protein MD frames, total number of clusters, and percentage of the MD frames of top ten clusters.