A Deep Learning Model of the Dock-by-Dock Process Significantly Accelerates Docking-based Virtual Screening

Docking-based virtual screening (VS) selects ligands with potential pharmacological activities from millions of molecules using computational docking methods, which can greatly reduce the number of compounds for experimental screening, shorten the research period, and save research costs. However, the majority of compounds have low docking scores, and they consume most of the computational resources. Herein, we report a novel and practical docking-based machine learning method called MLDDM (Machine Learning Docking-by-Docking Models). It is composed of a regression model and a classification model that together simulate the classical docking-by-docking protocol usually applied in many virtual screening projects. MLDDM can quickly eliminate compounds with low docking scores, and the retained compounds with potentially high docking scores can then be passed to a real docking program for further examination. We demonstrate that MLDDM has a good ability to identify active compounds in case studies on 10 specific protein targets. Compared to a pure docking-by-docking based VS protocol, the VS process with MLDDM achieves an over 120-fold speed-up on average, and its consistency rate with the corresponding docking-by-docking VS protocol is above 0.8. Therefore, it is promising for examining ultra-large compound libraries in the current big data era.


1. Introduction
Drug discovery is a costly, complex, and time-consuming process. According to relevant statistics, the cost of developing drugs approved from 2009 to 2018 rose from $300 million to $2.8 billion, depending on the disease and methods.1 Computer-aided drug design (CADD) can significantly accelerate the process of drug discovery. Virtual screening (VS) as a typical CADD technique has been widely applied in the early stage of drug discovery to accelerate lead discovery.2 Docking-based virtual screening aims to computationally place and evaluate small molecule compounds one by one at the binding site of a target protein with a three-dimensional structure. After continuously optimizing the poses of the compounds, molecular docking explicitly recommends potential compounds with high docking scores.
Since the forms of the scoring functions and the sampling methods they use are quite divergent, for a given system different docking programs can yield different docking results. A common remedy, therefore, is to apply two docking programs in tandem.[21][22][23][24][25][26][27][28][29][30][31] In such a docking-by-docking protocol, the first docking process in the workflow can be one with a very fast screening speed, while the second can be more accurate but somewhat slower. In this way, a balance between speed and the quality of the final result is achieved.
Recently, it has been reported that docking-based VS on ultra-large databases can significantly increase the success rate of finding active hits.32 Meanwhile, the sizes of compound databases have also increased significantly.[33][34][35] Marco Capuccini et al.36 carried out a VS process on a large-scale compound library by running existing docking programs on parallel distributed cloud computing resources. This parallel strategy was based on the Apache Spark engine and achieved an overall efficiency of 87%.
However, only a small proportion of the compounds will have high docking scores, and most of the computational time is wasted on the vast majority of 'low-scoring' compounds. Even though improvements in computational resources make it possible to perform VS on an ultra-large compound library, it is desirable to identify high-scoring compounds in advance using reliable machine learning filters, as the computational time and resources spent on docking could then be significantly reduced. P. B. Jayaraj et al.37 developed a classification model based on random forests and applied it to the VS process. It can run accelerated calculations on GPUs, which significantly reduces the running time of the entire VS process. Francesco Gentile et al.38 reported the application of a deep learning framework to train classification models on the docking scores of a subset of compounds; the models were then used to predict compounds that had not been docked. Their method realized rapid VS on a billion-compound library, ZINC-15, which greatly reduced the computational cost of VS on an ultra-large compound library with an acceptable recall rate.
It would be promising to examine whether a workflow of two tandem models could mimic the above docking-by-docking protocol and be applied to the scenario where the absolute number of compounds to be examined increases by one or two orders of magnitude. In this paper, we set up such a machine learning model, MLDDM (Machine Learning Docking-by-Docking Models), by training on the results of two docking programs. With a comprehensive evaluation of the method, MLDDM was shown to have a good ability to identify known active compounds on 10 specific protein targets. Further evaluation on the protein target BRAF showed that the protocol could gain a 120-fold speed increase on average. These results show that MLDDM can quickly filter out compounds with low docking scores with high confidence. It is believed to be a good and practical filter in daily docking-based virtual screening protocols with ultra-large libraries, and after evaluation the models can be used for ultra-large library filtering.

Model Construction and Usage
When filtering an ultra-large library, the regression model is applied first, mimicking the first step of the docking-by-docking protocol, and the compounds with high predicted ranking scores are selected for the second prediction. The compounds flagged as positive by the classification model can then be passed on for further examination.
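The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the paper's code; `predict_rank_pct` and `predict_class` are hypothetical placeholders for the trained Chemprop regression and classification models.

```python
# Minimal sketch of the two-stage MLDDM filter. The two predict_* callables
# stand in for the trained models and are illustrative placeholders only.

def mlddm_filter(smiles_list, predict_rank_pct, predict_class, rank_cutoff=0.2):
    """Return compounds passing both MLDDM stages.

    predict_rank_pct: maps SMILES -> predicted ranking percentile (0.0 = best score)
    predict_class:    maps SMILES -> 1 (predicted high Docking-2 score) or 0
    """
    # Stage 1: keep compounds predicted to rank in the top 20% of Docking-1 scores
    stage1 = [s for s in smiles_list if predict_rank_pct(s) <= rank_cutoff]
    # Stage 2: keep compounds the classifier flags as positive
    return [s for s in stage1 if predict_class(s) == 1]
```

Only the survivors of both stages would then be submitted to a real docking program.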

Programs and parameter settings for molecular docking
Two open-source and freely available docking programs, Vina39 and rDock,16 were selected for the multiple-docking VS process in this research. Vina is a molecular docking program based on an empirical scoring function, developed by the MGL Tools laboratory. Compared with AutoDock 4.0, Vina improves the average accuracy of binding mode prediction, greatly accelerates the search speed by using a simpler scoring function, and still provides reproducibility for systems with about 20 rotatable bonds. rDock is a fast and versatile open-source docking program that can be used to dock small molecules against proteins and nucleic acids. Both programs can be installed on a computation cluster and deployed on an almost unlimited number of CPUs, allowing HTVS campaigns to be carried out in a matter of days.
In our cases, the protein structures were pre-processed using AutoDockTools, and the small molecules were prepared using Open Babel.40 For the parameters used in Vina, the grid center was defined as the native ligand center in the crystal structure, and a cubic grid with dimensions (20, 20, 20) was used. For rDock, the rbcavity program was used to determine the grid parameters from the native ligand with default settings. Vina was selected as the first docking program (Docking-1) and rDock as the second (Docking-2). The docking results of both Docking-1 and Docking-2 were saved as csv files (Supplementary materials 5).
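The Vina box definition above can be illustrated with a short sketch: the center is the geometric center of the native ligand's atoms and the box is a fixed 20 Å cube. This is an assumed helper for illustration, not taken from the paper's scripts.

```python
# Sketch: derive a Vina search box from the native ligand's coordinates.
# Center = geometric center of the ligand atoms; box = 20 x 20 x 20 Angstrom.

def vina_box_from_ligand(coords, size=20.0):
    """coords: list of (x, y, z) atom coordinates of the native ligand."""
    n = len(coords)
    center = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return {"center": center, "size": (size, size, size)}
```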

Deep learning models
An open-source toolkit, Chemprop (http://github.com/chemprop/chemprop), a deep learning based molecular property modeling and prediction toolkit,41 was used to develop our docking-based machine learning models. Both the regression and the classification models were trained on 80% of a random split of the dataset and then validated and tested on the remaining 20%.
The hyperparameters of the D-MPNN, such as hidden_size, depth, dropout, and ffn_num_layers, were optimized by 10-fold cross validation. The optimal hyperparameters were used to construct the VS-based models. The construction of these models was implemented with the PyTorch and RDKit packages.

Datasets and their usages
In this study, several datasets were used for different purposes. Firstly, for the compound library to be docked against each target, a ChemDiv (ChemDiv Inc) library of 1.25 million purchasable compounds was clustered with a Tanimoto similarity threshold of 0.4 to obtain a clustered subset of 287,216 compounds (named the ChemDiv subset; the SMILES file of the compounds is available in Supplementary materials 2). Secondly, the target proteins used to build the MLDDMs are described in Table 1. The numbers of their experimentally validated active compounds, as well as of decoy compounds selected from the DUD-E database,42 are also listed there. These compounds act as validation sets for MLDDM's validity; all of them were docked to the corresponding targets with the two programs and were not used in the training process. (Their SMILES strings are listed in Supplementary materials 3.) Thirdly, we tested MLDDM's performance in the expanded prediction stage by constructing two datasets. For the evaluation of the regression models, a relatively small dataset was constructed from a subset of 500,000 compounds randomly selected from the ChEMBL database, to alleviate the actual docking load while keeping the generality of the result. (Their SMILES strings are listed in Supplementary materials 4.) The consistency of the regression models was evaluated on this set. For the evaluation of the classification models, for each target, the compounds from the ChEMBL database were labeled as active or inactive depending on whether their IC50 values are less than 50 μM (Supplementary materials 5). The whole ChEMBL database (about 19M molecules) was predicted by the regression model, and the compounds passing this filter were used as input to the classification model; the number of active compounds recovered was then evaluated. Finally, the ZINC-15 dataset was used to illustrate the speed of MLDDM on a large dataset, with the target BRAF as an example.
Table 1: Detailed information of the selected targets (from DUD-E) and their active/decoy compound numbers

Regression model evaluation on the ChEMBL subset
We used the regression models in MLDDM to predict the ranking percentage of the compounds in the ChEMBL subset. The compounds with a top 20% ranking prediction were further docked to their targets using Vina to test their real distribution. In Figure 3, we can see that most of the selected ChEMBL compounds are located in the top 20% region of the docked ChemDiv compounds. This suggests that our regression models have a strong ability to enrich compounds with high docking scores; therefore, the models can quickly distinguish compounds with potentially high docking scores for specific targets without docking.

Evaluation of classification models on the DUD-E sets
Using the DUD-E actives/decoys previously docked by Vina in section 3.2.1, we selected the compounds with docking scores above the top 20% percentile of Training Set 1. These compounds were then docked with rDock. The compounds whose rDock docking score was above the top 20% percentile of Training Set 2 were labeled as positive, while the rest were labeled as negative. These compounds were then predicted by the classification model from MLDDM.
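The top-20%-percentile labeling rule can be sketched as below. This is an illustrative reading of the protocol, assuming (as for both Vina and rDock) that lower, more negative docking scores are better; the function names are ours.

```python
# Sketch of the classification-stage labeling: a compound is positive when its
# rDock score falls within the best 20% of Training Set 2 scores.
# Assumes lower (more negative) docking scores are better.

def top_fraction_cutoff(scores, fraction=0.2):
    """Docking score at the boundary of the best `fraction` of `scores`."""
    ranked = sorted(scores)                 # ascending: best scores first
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]

def label_positive(score, cutoff):
    """1 if the score is at least as good as the top-fraction cutoff, else 0."""
    return 1 if score <= cutoff else 0
```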
Table 4 shows the statistics of various metrics of the classification model on the DUD-E external dataset. The results show that, except for the target DPP4, the accuracies of the other 9 targets on the DUD-E external dataset are above 70%. Except for the targets DPP4 and BRAF, the specificity values of the other 8 targets are above 72%. The false positive rates of the targets EGFR, DPP4, and BRAF are higher than those of the other targets, at 24.8%, 46.9%, and 37.9%, respectively; for the remaining targets, the false positive rates are all below 20%. As for the true positive rate, the maximum and minimum values were 89.5% for BRAF and 27.5% for JAK2. Obviously, the performance of MLDDM differs across targets. In general, on the external test sets, the models showed good accuracy and specificity.
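The metrics reported in Table 4 all follow from the confusion matrix; a minimal sketch of their standard definitions (our helper, not the paper's code):

```python
# Standard classification metrics computed from confusion-matrix counts:
# tp = true positives, fp = false positives, tn = true negatives, fn = false negatives.

def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (tn + fp),   # true negative rate
        "tpr":         tp / (tp + fn),   # true positive rate (sensitivity)
        "fpr":         fp / (fp + tn),   # false positive rate = 1 - specificity
    }
```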

Overall performance of MLDDM to retrieve all active compounds
To further verify whether the model can identify active compounds for specific targets after the whole MLDDM process, the active compounds of the 10 targets were selected from ChEMBL for verification. The active compounds for the 10 targets were filtered using a relatively loose threshold: if a compound has an IC50 less than 50 μM, it is kept as a positive sample.
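The activity labeling rule above is a single cutoff; a minimal sketch (our helper, with IC50 assumed to be given in nM so that 50 μM = 50,000 nM):

```python
# Sketch of the positive-sample rule: active when IC50 < 50 uM (50,000 nM).

def label_active(ic50_nM, cutoff_nM=50_000):
    """Return 1 (positive sample) if IC50 is below 50 uM, else 0."""
    return 1 if ic50_nM < cutoff_nM else 0
```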
To explore the models' ability to recall actives, all 19 million molecules from the ChEMBL database were predicted with the regression model from the corresponding MLDDM, and the compounds that passed the top 20% percentile of the training set were selected for the second prediction. After applying the classification model, we counted the number of actives recalled.
Table 5 shows the percentage of recalled compounds among the total positive samples at different activity cutoffs. The statistical results show that most of the models can identify over 50% of the active compounds for specific targets; for example, the proportions for ADRB1, BRAF, EGFR, JAK2, LCK, and VGFR2 were 65.78%, 79.30%, 68.63%, 51.88%, 74.61%, and 52.72%, respectively. That is to say, for most targets, the potential active compounds could be recalled by our models. To test the generalization of our framework, we used the MLDDM trained on the current ChemDiv training set and calculated EF0.04 on ChEMBL. As the current MLDDM retrieves the top 4% (0.2 × 0.2 = 0.04) of the library, the enrichment factor EF0.04 could be calculated and evaluated.
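The enrichment factor used here has the standard virtual-screening definition: the hit rate within the selected subset divided by the hit rate over the whole library. A minimal sketch (our helper, not the paper's code):

```python
# Sketch of the enrichment factor EF_x (Table 6): hit rate in the selected
# subset divided by the overall hit rate. For MLDDM the selected fraction is
# x = 0.04, since 0.2 x 0.2 = 0.04 of the library survives both stages.

def enrichment_factor(n_actives_selected, n_selected, n_actives_total, n_total):
    hit_rate_selected = n_actives_selected / n_selected
    hit_rate_overall = n_actives_total / n_total
    return hit_rate_selected / hit_rate_overall
```

For example, recovering 40 of 100 actives within a 400-compound selection from a 10,000-compound library gives an enrichment factor of 10.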
The enrichment factors for the 10 targets are summarized in Table 6. They range from the lowest, 5.86 (CDK2), to the highest, 17.03 (ADRB1). However, the predictive ability of the model does not show a strong correlation with activity, which may be related to the way our training datasets were acquired.

Computation speed comparison
To demonstrate the speed advantage of our MLDDM models, the target BRAF was selected as an example for a prediction test. The drug-like molecules (981,247,974 compounds) from ZINC-15 were selected for this case. The statistical results in Table 7 show that, with the available computing resources described above, the prediction speed of the machine learning VS model is greatly improved compared to the traditional dock-by-dock process, with an average speed-up of >120 times. The time spent calculating the descriptors on a single-core CPU is included, since it is incurred every time a new compound library is used. This means that the MLDDM built here can greatly increase the speed of the dock-by-dock process. The framework can quickly filter out low-scoring compounds against specific targets while quickly recalling the compounds with high docking scores. Therefore, the calculation time and resources for VS on an ultra-large compound library can be greatly reduced.

Discussion
To explore a reasonable top threshold for obtaining the training dataset, the DUD-E active compounds were used to analyze the relationship between the enrichment ratio of active compounds and the top value of the two docking programs for each of the 10 selected targets. Representative results for 4 targets are shown in Figure 4; the others are shown in Figure S3. The X-axis represents the top value of the compounds sorted by docking score, and the Y-axis represents the proportion of DUD-E active compounds enriched for the corresponding target. The results show a logarithmic growth relationship between the enrichment ratio and the top value, and the enrichment ratio yields the greatest gain at the top 20% for most of the targets. Besides, the active compound enrichment ability of rDock is better than that of Vina overall. Except for the target DPP4, the enrichment ratios of active compounds in the top 20% of compounds from the two docking programs are both >20%. This shows that the two selected docking programs have a certain ability to enrich active compounds. Thus, the top 20% was selected for the preparation of the training dataset.

Figure 4: Enrichment ratio of DUD-E active compounds for the two docking programs
To further explore whether using two docking programs and adjusting their order can yield a higher enrichment ratio, the enrichment ratios of different combinations of the two docking programs are summarized in Figure 4. The red curve in Figure 4 represents the enrichment ratio of the corresponding target with the rDock program, the blue one that of the Vina program, the purple one the re-docking of the top 20% of the compounds with rDock after docking with Vina, and, accordingly, the beige one the enrichment ratio in the order of rDock first and then Vina.
The statistical results show that for most targets, the enrichment ratio of using rDock alone is higher than that of the other three methods. Besides, using the Vina program for the first docking and then the rDock program to re-dock the top 20% further improves the enrichment ratio. This means that the strategy of combining two docking programs can put more active compounds into the training dataset than using Vina alone. Moreover, considering the speed and computing resource consumption of the two docking programs, we chose the combination of using Vina for a quick first-step screening and then using the rDock program to run a more precise multi-conformation search for re-docking. This strategy ensures that there are enough active compounds in the training dataset while reducing the time needed to obtain it.
Due to the availability of the docking software we could access, we only tested the Vina and rDock combination, which tend to be orthogonal to each other. Conceptually, however, this framework should generalize to other combinations of docking programs, making it flexible enough to accommodate the requirements of different projects.

Conclusion
In the current work, we demonstrated a framework, MLDDM, that mimics the dock-by-dock screening process powered by deep learning. We implemented it by utilizing the docking results of each target on a commercially purchasable ChemDiv subset, with the open-access docking programs Vina and rDock as well as the deep learning toolkit Chemprop. Various statistical results showed that the models perform well in the VS process. Moreover, compared with the traditional dock-by-dock process, the MLDDM constructed here can screen an ultra-large compound library over 120 times faster. These models can quickly identify molecules with potentially high docking scores and filter out molecules with low docking scores, greatly reducing the consumption of computing resources in ultra-large compound library virtual screening. In addition, we found that a reasonable combination of the two docking programs can further increase the enrichment ratio of the screening and make the obtained training dataset more reliable.
With the growing application of compound generative models in drug development in recent years, we believe these models could also serve as a rapid scoring metric for generative models, guiding them to generate compounds with high docking scores for a specific target.
Figure S3: Enrichment ratio of DUD-E active compounds for the two docking programs.

Figure 1 :
Figure 1: The workflow of MLDDM construction and usage.
The architecture of Chemprop is based on graph convolutional networks, termed the Directed Message Passing Neural Network (D-MPNN), which treats a molecule as an attributed graph with node features (atoms) and edge features (bonds). The D-MPNN was used to train the regression and classification models with basic RDKit molecular descriptors. The default hyperparameters of Chemprop were used for model training. Besides, the rdkit_2d_normalized and no_features_scaling settings were added to the training process to improve the stability of the models.

Figure 3
Figure 3 shows the superposition of the Vina docking score distributions for four representative targets as examples. The figures for the other 6 targets are shown in Figure S2. The Vina docking scores of the selected compounds from the ChEMBL subset, the whole ChemDiv dataset, and the top 20% scored ChemDiv compounds are illustrated as follows:

Figure 3 :
Figure 3: Vina docking score distribution for D-MPNN regression model selected ChEMBL compounds and ChemDiv compounds for a) BRAF b) LCK c) VGFR2 and d) DPP4.

3.1 Model construction of MLDDM
To reflect the consistency of the models' performance and their generalizability, we summarized the training statistical metrics of the models for the 10 selected targets in Figure S1. The AUC and RMSE on the internal test sets of the D-MPNN models for each target are presented in Table 2.

Table 2 :
Performance of the D-MPNN regression and classification models based on the ChemDiv subset against 10 protein targets.
3.2.1 Regression model evaluation on the DUD-E dataset
Considering that there may be some topology bias in DUD-E,43 we did not intentionally incorporate its compounds into the training set and used it for one-time validation only. Since the DUD-E dataset was also used for the subsequent classification model, we labeled the compounds as follows: the DUD-E actives/decoys were docked by Vina, and compounds with docking scores above the top 20% percentile of Training-Set-1 were selected. The number of compounds selected was compared with the number selected by the regression model. Next, these dock-selected compounds were used for the classification model in 3.3.1. The results are listed in Table 3. Among the 10 targets, except for ACE, the consistency rates between the docking regression and MLVS classification were above 75.0%, with the highest consistency rate of 98% for CDK2. This suggests that the regression models from MLDDM achieve performance comparable to the corresponding docking methods.
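The consistency rate used here is defined under Table 3 as a ratio of selection counts; a minimal sketch (our helper, with compound sets assumed to be collections of identifiers such as SMILES):

```python
# Sketch of the consistency rate (note 1 under Table 3): the number of
# compounds selected by the MLDDM regression model divided by the number
# selected by the Docking-1 top-20% cutoff.

def consistency_rate(mlddm_selected, dock1_top20_selected):
    """Both arguments are collections of compound identifiers."""
    return len(mlddm_selected) / len(dock1_top20_selected)
```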

Table 3 :
Performance of MLDDM regression models and docking results on DUD-E compounds. 1. The consistency rate equals the number of compounds selected by the MLDDM regression model over the number of compounds selected by the Dock-1 top 20% cutoff.

Table 4 :
Statistics of various metrics of the classification model on the DUD-E external data sets

Table 5 :
Active compounds prediction statistics of the model

Table 6 :
Enrichment factor statistics of 10 targets

Table 7 :
Comparison of running time