Improving reliability of the virtual screening process for human dihydroorotate dehydrogenase enzyme by combination of the ensemble and consensus docking approaches

The inconsistencies in the performance of the virtual screening (VS) process, depending on the used software and structural conformation of the protein, is a challenging issue in the drug design and discovery ﬁeld. Varying performance, especially in terms of early recognition of the potential hit compounds, negatively affects the whole process and leads to unnecessary waste of the time and resources. Appropriate application of the ensemble docking and consensus-scoring approaches can signiﬁcantly increase reliability of the VS results. Dihydroorotate dehydrogenase (DHODH) is a key enzyme in the pyrimidine biosynthesis pathway. It is considered as a valuable therapeutic target in cancer, autoimmune and viral diseases. Based on the conducted benchmark study and analysis of the effect of different combinations of the applied methods and approaches, here we suggested a structure-based virtual screening (SBVS) workﬂow that can be used to increase the reliability of VS.


Introduction
The hDHODH is a rate-limiting enzyme of the de novo pyrimidine synthesis pathway. The enzyme catalyzes the oxidation of dihydroorotate to orotate, which is essential for the production of uridine monophosphate 1, 2 ( Fig. 1). Inhibition of hDHODH results in pyrimidine depletion, thereby starving the cell of the essential nucleotides required in the cell cycle. In humans, DHODH has been widely recognized as a promising target for the treatment of cancer and autoimmune diseases 3,4 . Furthermore, inhibitors of hDHODH are of exclusive interest as host-targeting broad-spectrum antiviral agents against a wide variety of DNA and RNA viruses [5][6][7] . Given the medical significance, there are ongoing efforts towards the discovery and development of selective inhibitors for hDHODH. Structure-based virtual screening is an established powerful computational approach that is extensively used in drug design and discovery due to its dramatic time-, cost-and labor-savings 8 . At the time of this publication, the Protein Data Bank 9 (PDB) contained more than 60 crystal structures of hDHODH determined in the holo form, complexed with substrates, or inhibitors. The current structural data sufficiently allows researchers to conduct VS experiments in order to find new selective inhibitors for hDHODH [10][11][12] However, based on the binding pose prediction success rates in the recent benchmark study, hDHODH has been characterized as a "very hard" and challenging molecular target for docking 13 . Furthermore, a number of open-source academic and commercial molecular docking software packages show quite varied performances depending on protein family and also on crystal structure of the protein [13][14][15][16][17][18][19][20] . In this context, the traditional approach in using a single software and structural conformation to screen a large database of molecules may lead to the questionable results. Considering the therapeutic value of hDHODH, we have decided to develop a manual VS workflow, that integrates several known methods and approaches for the enhancement of VS efficiency, in particular ensemble and consensus docking strategies [21][22][23][24][25][26][27][28][29] . To validate this newly developed approach, we conducted a benchmark study. While we developed and tested SBVS workflow for hDHODH, we also tested and discussed its general application and efficiency.

Results and Discussion
Selection of the representative set of the hDHODH ubiquinone binding site structural conformations Information on the interacting amino acid residues of the ubiquinone binding site of hDHODH ( Fig. 2A) was gathered from the 38 selected crystal structures of hDHODH complexed with inhibitors (x-ray resolution < 2 Å). As a result of clusterization based on RMS deviation of the interacting amino acid residues of the ubiquinone binding site of hDHODH, four clusters were selected (cutoff 0.45Å) (Fig. 2B). For each cluster, a representative structure was selected (PDB IDs: 4IGH, 6GK0, 6J3C, 5ZF4).  We performed virtual screening of the benchmarking library for four software against four selected representative crystal  structures of the hDHODH ubiquinone binding site (16 runs, Table S2). Based on these separate 16 runs, the range of the values for AUC was from 0.62 to 0.84, for BEDROC 20 from 0.09 to 0.49 and for EFs (1 %) from 1.49 to 14.93, EF (3%) from 1 to 10.95, EF (5%) from 1.19 to to 8.66, EF (10%) from 1.19 to 6.12 (Table S3). The obtained data confirmed that performance of the SBVS procedure highly depends on the selected molecular docking software and representative structure of the protein.

Virtual screening
Based on the average value among four used structural conformations of hDHODH, AutoDock Vina and ICM demonstrated close and significantly better results than the remaining two software, LeDock and rDock (Fig. 3). Taken together, obtained results showed that the performance of virtual screening may significantly vary depending on the structural conformation of hDHODH. In case of AutoDock Vina, ICM and LeDock screening against 4IGH and 6J3C crystal structures resulted in relatively better performance, compared to other structures, while for rDock -5ZF4 was relatively better structure (Table S3). Ensemble docking and consensus-scoring Another important issue is multiple possible combinations of ensemble docking and consensus-scoring approaches and the difference in their effect on the improvement of the VS procedure. For the combination of ensemble docking and consensusscoring approaches two scenarios were tested: 1) Docking scores were combined among four structures for separate software, then consensus-scoring was used among different software (Fig. 4) and conversely 2) A consensus-scoring approach is implemented among different software for separate structures, then normalized scores were combined among all structures (Fig.  5). In the first case, a combination of ensemble docking approach based on the maximum docking scores (Fig. 4B) with the following consensus-scoring approach based on the average scores (Fig. 4D, "average") demonstrated relatively better among all tested combinations of software. This combination showed the most stable positive effect among all tested different combinations of ensemble and consensus-scoring approaches ( Fig. 4 and Fig.5). In the second case, combinations based on the average score values among software (  Taking into account the importance of early recognition of active compounds for real-life virtual screening application, two aforementioned best combinations in terms of enrichment factor were selected for further analysis. Since performance of VS software significantly varied for the different structural conformations of protein, we also studied the effect after elimination of "undesirable" structural conformations (conformations use of which resulted in relatively worse performance) of hDHODH for the AutoDock Vina and ICM combination of software (Fig. 6). Based on the initial separate VS runs, 6J3C and 4IGH were selected as the relatively best structural conformations for the AutoDock Vina and ICM, respectively.

4/11
We tested two scenarios, using the selected structures, in combination and separately (Fig. 6). As seen in figure 6, compared to the scenarios when all four structures were applied ( Fig.4 and Fig. 5), elimination of undesirable structures has not significantly affected the performance and only in two cases values of EF (1 %) slightly increased: 1) when both 6J3C and 4IGH structures are used for both software and consensus-scoring approach based on the average score values followed by an ensemble approach where maximum score values are applied and 2) when only the 6J3C structure is considered for both software and consensus-scoring based on the average values are applied (Fig. 6). When only 6J3C structure is considered, values of EF (3 %) slightly increased for the first case and decreased for the second case (Fig. 6). Two points are needed to be noted here: 1) minor changes in the performance, after elimination of "undesirable" structures, are justified by the fact that previously applied ensemble and consensus-scoring approaches have already minimized the impact of those structures, and 2) these small differences, observed in cases of using different combinations of two best structures (6J3C and 4IGH), may be highly dependent on the set of active compounds of the dataset used and should not be taken into account for the actual VS procedure. This shows that elimination of "undesirable" structural conformations (Fig. 6) does not negatively affect the outcome of VS and linearly decreases computational time of VS procedure. It follows that, if it is possible to distinguish the reliability of the structures beforehand, it is reasonable to limit the selection of structural conformations to the most appropriate ones.

Consensus binding poses
As the ensemble approach (multiple structures) was implemented, it is reasonable to evaluate success rates of software to predict crystal conformations of all 67 active compounds using selected representative crystal structures of hDHODH (Table 1). Based on our results (Table 1), all software performed reasonably well, with LeDock being most consistent among 4 crystal structures. Therefore, in addition to the methods used, we decided to apply an approach described by Houston and Walkinshaw [25]. These authors recommended to use several docking software to identify consensus binding poses for compounds, which increases reliability of predicted poses and consequently improve hit rates. We have tested the effect of the application of the consensus binding pose approach for the combination of all four software ( Table 2) and combination of the AutoDock Vina and ICM (  Table 3. Effect of the application of consensus binding pose filter in case of using combination of the AutoDock Vina and ICM. As seen in table 2 and table 3, inclusion of the additional software and structures affected the ratio of active to inactive compounds in a positive way for both cases. In case of using all available software for filtering with the CBP approach (Table 2), a number of active compounds in the top of the ranked list, along with inclusion of additional structures, noticeably decreases compared to the initial dataset. In contrast, by using the best combination of software the difference between using all four, two and one structure is less significant compared to the initial dataset and in some cases number of compounds actually increases (Table 3). Therefore, if there is no pre-validated data on software and structures performance and one is not able to distinguish their selection to the most appropriate ones, then it is reasonable not to use multiple structures of protein for the filtering with the CBP approach. Despite the fact that use of multiple structures for the filtering with the CBP approach may increase the percentage of potential hit compounds, unregulated elimination of some active compounds, can negatively affect virtual screening procedure in terms of early recognition. However, when one is able to conduct a manual benchmark study for the investigated protein beforehand, or has pre-validated data on the performance and "suitability" of selected software and structural conformations of the protein, limitation of both software and structures to the most appropriate ones is advantageous. Alongside with the increase in the percentage of potential active compounds in the library, application of such a filter can positively affect VS procedure in terms of early recognition, by additional placement of the active compounds in the top of the ranked list. The main effect of the application of the ensemble and consensus docking approaches is in the increase of the reliability of VS results. As seen in figure 7, even in the case when pre-validated data is not taken into account and all available software and structural conformations of protein are used, the outcome (number of active compounds in top of the list) is similar to that of the best "one software -one structure" performance. However, if pre-validated data is available application of those approaches can significantly improve the performance of VS procedure, especially in terms of early recognition of potential hit compounds 6/11 (Fig. 7).

Figure 7.
Comparison of the number of active compounds in the top 25 (1.25%), 50 (2.5%) and 100 (5%) of the ranked lists after application of the best identified combinations of the ensemble and consensus approaches to the average number of compounds (based on the 16 separate runs) and best "one software -one structure" case (ICM and 4IGH).

Conclusion
Taking into the account inconsistencies in the performance of the VS process, depending on the software and structural conformations of protein, appropriate application of the ensemble and consensus docking approaches significantly increases reliability of the VS results. However, their inappropriate integration can lead to negative outcomes in terms of performance. Based on the conducted benchmark and analysis of the different combinations of the applied methods and approaches, here we suggested a SBVS workflow that can be used to increase the reliability of the VS procedure (Fig. 8). In cases, like ours, when one is able to use pre-validated data, and limit the selection of the software and structures to the most appropriate ones, use of such workflow can increase the performance of the VS procedure in terms of early recognition of the potential hit compounds, which is of paramount importance in real-world screening applications, where researchers test top-ranked molecules in biological assays due to their high costs. Therefore, pre-validation of docking software and protein structures, before the actual VS procedure, is highly recommended.

Methods
Structure selection and preparation 38 crystal structures of hDHODH in complex with inhibitors (resolution < 2 Å) have been selected from the Protein Data Bank (PDB) (Table S1). Ligands, water, metal ions and other co-crystallized molecules were removed, except co-factor of hDHODH -flavin mononucleotide (FMN). Then protein structures were superimposed using ICM-browser 30 . For the selection of a representative set of structures, clusterization method based on root-mean-square deviation (RMSD) of the interacting amino acid residues (Cα atoms) of ubiquinone binding site of hDHODH was performed using bio3D R package 31 . Important amino acid residues of the ubiquinone binding site were selected based on the distribution of the interacting amino acids with the co-crystallized ligands among 38 used crystal structures (based on the annotations of crystal complexes) ( Fig. 2A). Polar hydrogen atoms were added to the structures with the protonation state at pH = 7, using Open Babel 2.4.0 software 32 . Benchmarking dataset and evaluation metrics Directory of Useful Decoys: Enhanced (DUD-E) 33 dataset of decoys for the hDHODH was downloaded, filtered for duplicates and 1933 unique decoys were selected. 67 crystal structures of hDHODH in complexes with inhibitors, including 38 complexes used for the selection of target's representative structures, were downloaded from the PDB (Table S1)  were extracted from corresponding complexes and used as active molecules for the benchmark study. Therefore, the final benchmarking library consisted of 67 active compounds and 1933 decoys (inactive compounds). Several commonly used metrics, such as area under the ROC curve (AUC), Boltzmann-enhanced discrimination of ROC (BEDROC, with α = 20), and enrichment factors (EF) for several percentage were used to evaluate and compare accuracy and performance of the conducted virtual screening procedures and calculated using rocker 34 . Enrichment factor was calculated for 1, 3, 5 and 10Since we applied a filter that changes the ratio of active to inactive compounds of the initial dataset and makes direct comparison of aforementioned metrics inaccurate, as an additional comparative metric we simply compared the numbers of active compounds found in the top 25, 50 and 100 of the ranked lists. Docking software and parameters Four molecular docking software packages were used, including three open-source/freely-distributed (AutoDock Vina 35 , LeDock 36 and rDOCK 37 ) and one commercial (ICM 30 ) packages. Main differences in molecular docking software are search algorithm and scoring functions. LeDock uses evolutionary algorithm adopted in combination with simulated annealing search, which is used to generate the first generation of docking poses and physics and knowledge-based hybrid scoring function 36 . RDock uses combination of stochastic and deterministic search techniques to generate low energy ligand poses and rDock master scoring function, which is a weighted sum of intermolecular, ligand intramolecular, site intramolecular, and external restraint terms 37 . The ICM docking algorithm is a global energy optimization procedure based on Monte Carlo minimization 30 . ICM has 2 scores: "ICM Score" and Potential of Mean Force (PMF) score. Only the ICM score, which is an empirical scoring function based on calculation of physiochemical properties of the receptor-ligand complex, was used in our study. AutoDock Vina uses "Iterated Local Search global optimizer" similar to that of ICM and Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method for the local optimization for generation of docking poses and hybrid scoring function (empirical + knowledge-based) 35 .
The 67 crystal complexes of hDHODH with co-crystallized ligands were superimposed using icm-browser and docking box that covers all 67 ligands, based on their crystal structure conformations, and the amino acid residues of the ubiquinone binding site was used for AutoDock Vina and LeDock runs. In the case of the rDock, cavity mapping method based on the reference ligand was used. In the case of the ICM software, the docking grid was based on the interacting amino acid residues of the ubiquinone binding pocket of hDHODH ( Fig. 2A). For all software packages, corresponding standard protocols were used with exception of rDOCK where the number of docking runs was set to 150. Ensemble docking and consensus-scoring In ensemble docking scores of each compound, obtained from docking runs using the same software but different structural conformers of a target protein, are combined using different methods 23 .
In consensus-scoring scores of each compound, obtained from individual software for the same representative structure of protein, are combined using different methods 27 . Since various docking software are used (i.e., different scoring functions), it is required to perform preceding score normalization. For the normalization of docking scores, from different software simple z-score function: z = (xµ) / σ , where x is a compound score, µ and σ -mean and standard deviation of scores of all compounds in the library, respectively, was used. For, ensemble docking and consensus-scoring approaches, docking scores were combined using two common data fusion methods: based on average and maximum (lowest binding energy) score values 38,39 . We combined ensemble and consensus-scoring approaches by two alternative methods. 1) Docking scores of each compound obtained by one software for four structures were combined. Then, scores obtained by different software were normalized and also combined. 2) Ddocking scores of compounds for each structural conformation of hDHODH obtained by different software were normalized and combined. Then, new normalized scores of each compound were combined among four structures. Consensus binding pose approach (CBP) An approach, based on consensus binding pose prediction and its use as a filter, described by Houston and Walkinshaw 26 , was implemented. The difference in our case is use of multiple structural conformations of the target protein. After molecular docking runs, the top ranked docked conformation of each compound was selected. Those conformations were used for consensus binding poses evaluation. Two poses were regarded as similar (consensus) if the RMSD value was below 2 Å, which is a widely accepted value in docking predictions 26,40,41 . All possible combinations of used docking programs were tested. Compounds, for which consensus binding poses for one structural conformation of protein were predicted by different software, passed into the new lists. Then those lists (one for each structure) were compared with each other, and if a compound was presented in all of the lists, then it was selected into the final filtered list.