High-Quality Conformer Generation with CDPKit/CONFORT: Algorithm and Performance Assessment

doi:10.21203/rs.3.rs-1597257/v1

software

This work is licensed under a CC BY 4.0 License

Version 1

The majority of compounds in the drug-like chemical space show flexibility regarding their three-dimensional structure which, in solution, results in an equilibrium of multiple interconvertible conformational states. Knowledge of the putative bound-state conformation of a molecule is an essential prerequisite for the successful application of many computer-aided drug design methods that aim to assess or predict its capability to bind to a particular target receptor of interest. An established approach to predict bioactive conformers in the absence of receptor structure information is to sample the low-energy conformational space of the investigated molecules and derive representative conformer ensembles which can then be expected to comprise members that closely resemble possible bound-state ligand conformations. The high relevance of such conformer generation functionality led to the development of a wide panel of dedicated commercial and open-source software tools throughout the last decades. Several published benchmarking studies have shown that open-source tools lag behind their commercial competitors in many key aspects like accuracy in reproducing bioactive conformations, speed of processing, output ensemble size, range of applicability, stability and user-friendliness.

In this work, we introduce the novel open-source conformer ensemble generator CONFORT, which builds upon proven concepts and algorithms and aims at delivering state-of-the-art performance for all types of organic molecules in the drug-like chemical space. The ability of CONFORT, and several well-known commercial and open-source conformer ensemble generators, to reproduce experimental 3D structures as well as their computational efficiency and robustness has been assessed thoroughly both for typical drug-like molecules and macrocyclic structures. For small molecules, CONFORT outperforms all other tested open-source conformer generators and is head-to-head with commercial generators, both in terms of processing speed and accuracy in the reproduction of bioactive conformations. In the case of macrocyclic structures, CONFORT is able to reproduce experimental 3D structures with clearly higher accuracy than all other tested generators. To our knowledge, CONFORT is the first open-source conformer ensemble generator that is able to truly keep up with commercial software in this field and thus represents a valuable addition to the open-source software toolbox for computer-aided drug design.

Conformer Generation

Virtual Screening

Computer-aided Drug Design

Free Software

Cheminformatics

Pharmacoinformatics

The vast majority of compounds in the drug-like chemical space comprise one or more rotatable bonds and thus offer some degree of variability regarding their three-dimensional (3D) structure. In solution at room temperature, this structural flexibility manifests in an equilibrated mixture of multiple interconvertible conformational states that correspond to local energetic minima on the potential energy surface. The bioactive conformation of a drug molecule adopted upon binding to the target receptor is usually quite close to one of its low-energy conformers in solution but can also equal a higher energy conformational state if increased structural strain is outweighed by a significant gain in energetically favorable binding site interactions [1].

Computer-aided drug design (CADD) methods that perform an assessment or prediction of the receptor binding capability of molecules rely on the knowledge of their bound-state conformation. Unfortunately, experimental data on the bioactive conformations of small molecules are only available for a small fraction of the chemical space. For the majority of molecules, especially those that exist only ‘virtually’, bioactive conformations have to be predicted as accurately as possible by specialized in silico methods. An established approach for computing potential bioactive ligand conformations without requiring prior knowledge about the receptor structure is to sample the low-energy conformational space of the ligand and derive a representative conformer ensemble. In most cases, some of the low-energy conformers in this ensemble can then be expected to closely resemble bound-state ligand conformations. Since the generation of conformer ensembles is a prerequisite for the application of many CADD techniques like structure-based virtual screening (VS) [2], 3D ligand-based VS [3, 4], ligand-based pharmacophore modeling, pharmacophore-based VS [5, 6], and 3D quantitative structure-activity relationship (QSAR) studies, a plethora of dedicated commercial and open-source software tools has emerged during the last decades. Well-known commercial tools for the purpose of conformer ensemble generation include iCon [7], Omega [8], ConfGen [9], CAESAR [10], Conformator [11], COSMOS [12], Molecular Operating Environment (MOE) LowModeMD [13], MOE Stochastic and MOE Conformation Import [14]. Corresponding open-source tools are, e.g., FROG2 [15, 16], Confab [17], BCL::Conf [18], RDKit [19], Balloon DG and GA [20], and Multiconf-DOCK [21]. Based on the employed search strategy, applied methods for conformational sampling can be divided into two major categories: (i) systematic and (ii) stochastic approaches.

With the first approach, conformers are generated by the systematic alteration of torsion angles at the rotatable bonds of the molecule. A benefit of this approach over stochastic sampling is that the number of conformers that have to be generated for an exhaustive sampling of the conformational space is exactly defined. Furthermore, the computation of conformers is significantly faster for typical drug-like molecules since multiple conformers can be generated from a single starting 3D structure by performing sequences of simple rigid-body rotations. However, an exhaustive systematic sampling of conformers is usually only feasible for smaller numbers of rotatable bonds due to the combinatorial explosion in the number of conformers to generate. Nevertheless, the number of rotor bonds which can be handled by this method is still large enough to process most drug-like molecules without any issues.
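To make the combinatorial explosion concrete, the following minimal sketch counts the torsion-angle combinations an exhaustive systematic search would have to enumerate. The function name and the choice of six torsion angles per bond are illustrative assumptions and are not taken from CONFORT.

```python
from math import prod


def systematic_conformer_count(angles_per_bond):
    """Number of torsion-angle combinations for an exhaustive systematic search.

    `angles_per_bond` holds, for every rotatable bond, the number of torsion
    angles that are tried at that bond (hypothetical input representation).
    """
    return prod(angles_per_bond)


# Assumption for illustration: every rotatable bond is sampled at 6 angles.
for n_bonds in (3, 6, 10):
    print(n_bonds, "rotatable bonds:", systematic_conformer_count([6] * n_bonds))
```

The count grows as the product of the per-bond angle counts, which is why knowledge-based approaches restrict each bond to a small number of preferred torsion angles.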

Stochastic sampling, in contrast, explores the conformational space in a random manner which results in changes of torsion angles at all rotatable bonds at once. Therefore, this sampling strategy is especially suitable for molecules with high numbers of rotatable bonds or flexible macrocyclic ring systems since representative conformer ensembles can be obtained with relatively few iterations. Software tools implementing a stochastic sampling approach [19, 20] often employ distance geometry (DG) [22, 23] to generate random conformers of the molecule. DG is based on the principle that all possible conformations of a molecule can be described by pairwise atom distance and volume constraints. To this end, lower and upper distance bounds for all pairs of atoms in a molecule are specified in a distance bounds matrix. In addition, a set of tetrahedral volume ranges of four-membered atom groups may be specified in order to enforce planarity or particular configurations of stereogenic centers. The specified distance and volume constraints then serve as input for an embedding procedure that generates a set of atom 3D coordinates matching the given constraints. A specification of reasonable atom pair distance bounds is crucial to obtain proper 3D output structures. Therefore, empirical information like ideal bond lengths, bond angles, and torsion angles is often used to construct the distance bounds matrix. Nevertheless, the 3D structures generated by the embedding procedure are still quite raw and may contain a considerable amount of geometric errors. In order to obtain structurally sound low-energy output conformers, the raw conformers have to undergo an additional structure refinement procedure which can be computationally quite expensive for larger molecules (e.g. iterative force field energy minimization). A further issue with stochastic sampling arises from the fact that it is hard to judge whether the conformational space has been sampled thoroughly enough. 
Depending on the flexibility of the processed molecule, pursuing the naive approach of sampling a fixed number of unique conformers may then bear the danger of under- or oversampling.

Implementations of both systematic and stochastic sampling often make use of empirical rules and knowledge as well as structural information derived from experimental data like X-ray structures to increase the speed of conformer generation and/or to improve the quality of the output ensembles. Frequently, empirical data are stored in the form of fragment 3D structure/conformer databases, ring conformer templates or torsion angle libraries which are then used for a fast fragment-based construction of molecule 3D structures or a directed exploration of conformational space employing only torsion angles that e.g. occur predominantly in crystallographic structures. Examples of conformer ensemble generators pursuing a rule-/knowledge-based approach are iCon, Omega, CAESAR, Confab, ConfGen, FROG2, Conformator, COSMOS, BCL::Conf and RDKit [24].

Irrespective of the conformer sampling approach employed, conformer ensemble generators always have to balance three conflicting objectives: accuracy (commonly measured as the minimum heavy-atom root-mean-square deviation (RMSD) between the experimentally determined bioactive conformation and any of the conformers in the generated ensemble), ensemble size, and processing time. Of course, the ultimate goal is to generate small ensembles that accurately reproduce receptor-bound ligand conformations at negligible computational cost. Since these demands usually cannot be met for all objectives at once, varying emphasis may be put on each parameter depending on the targeted application scenario. If accuracy is of utmost importance, the choice of a more thorough but also more computationally expensive sampling algorithm may be adequate. If large numbers of molecules have to be processed (e.g. preparation of databases for virtual screening), smaller ensembles may be preferred in order to reduce the storage consumption of the generated conformers and to speed up subsequent processing steps (e.g. repeated virtual screening runs). If the speed of conformer ensemble generation is essential and possible losses in quality are tolerable, less accurate but computationally more efficient approaches may be given preference.
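The accuracy measure named above can be sketched as follows. This is a simplified illustration only: it assumes that every conformer has already been optimally superposed onto the reference structure and that heavy atoms are listed in the same order, whereas benchmarking protocols normally perform the superposition (and symmetry handling) as part of the RMSD calculation. All names are hypothetical.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return sqrt(squared / len(coords_a))


def min_rmsd(reference, ensemble):
    """Smallest RMSD between the reference and any conformer of the ensemble."""
    return min(rmsd(reference, conformer) for conformer in ensemble)


# Toy data: a two-atom "reference" and an ensemble of two pre-aligned conformers.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0)],
]
print(min_rmsd(reference, ensemble))
```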

Recently, several studies have been published [11, 25, 26] which assessed the performance of well-known commercial and open-source conformer ensemble generators with respect to these objectives (accuracy in the reproduction of bioactive conformations, ensemble size and processing time) using a dataset of 2912 high-quality protein-bound ligand conformations extracted from the Protein Data Bank (PDB) [27]. The benchmarking results have shown that commercial and non-free conformer generators are clearly ahead of open-source tools in all evaluated performance aspects and that substantial improvements on the free software side are required to catch up with leading commercial tools.

Aside from fulfilling the basic quality criteria outlined above, state-of-the-art conformer ensemble generators nowadays have to face an additional challenge imposed by the growing attention flexible macrocyclic systems receive as a new class of promising drug molecules [28–30]. Macrocyclic molecules do not obey Lipinski’s rule of five [31] but exhibit interesting and useful properties which set them apart from the mass of typical drug-like small molecules. These are, e.g., improved metabolic stability [32], the ability to disrupt protein-protein interactions [33], or a higher cellular permeability due to conformational restriction [34, 35]. Furthermore, they are able to bind to proteins which are considered non-druggable targets due to their lack of hydrophobic cavities which can serve as anchor points for functional groups [36, 37]. Drugs based on macrocyclic scaffolds have found widespread clinical application as antibiotics (e.g. rifampicin, vancomycin, macrolides), in cancer therapy [38–41], and in immunology and dermatology [42], to name a few. When it comes to the generation of representative conformer ensembles of macrocyclic structures, algorithms face the challenge of uniformly searching, in an acceptable amount of time, a huge conformational space whose size increases dramatically with the number of (partially) rotatable bonds in the macrocycle [43]. Recently, several studies were published that benchmarked well-known conformer generators regarding their ability to sample macrocycle conformations with enough accuracy and diversity for common CADD applications [44–46]. The methods employed by these programs likewise can be divided into systematic and stochastic approaches. For example, the conformer generators Omega macrocycle [47], MOE LowModeMD [13], MacroModel [48], Balloon [20] and RDKit ETKDG [24] all pursue a stochastic search strategy. 
Implementations of systematic sampling approaches are less common, with Conformator [11] and Prime [49] being two notable examples in this category. A major algorithmic challenge for these programs is the constrained flexibility of rings, which prohibits an independent rotation of individual ring bonds. Commonly, this problem is handled by cutting one or more bonds of the macrocycle in order to obtain an open-ring equivalent [11] or multiple non-connected acyclic parts of the ring [49] which can then be sampled in a straightforward systematic way. Afterwards, the generated open-ring conformers are checked as to whether their geometry is suitable for carrying out ring closures. If so, previously cut bonds are re-introduced and a short 3D structure refinement step is carried out to obtain structurally sound conformers of the original ring system.

In this paper, we introduce the novel conformer ensemble generator CONFORT (Conformer Generation Tool). CONFORT is fully open-source (GNU LGPL) and available as part of the Chemical Data Processing Toolkit (CDPKit) in the form of a versatile command-line tool as well as a set of classes and functions provided by CDPKit’s C++/Python-API. CONFORT builds upon proven concepts and algorithms for conformer ensemble generation and aims at delivering state-of-the-art performance for all types of organic molecules in the drug-like chemical space. For the computationally efficient and accurate sampling of small molecule conformers, CONFORT employs a knowledge-based systematic approach which makes extensive use of pre-generated fragment and torsion angle libraries that were derived from experimental 3D structures. For sampling macrocycle conformers, CONFORT implements a purely stochastic approach based on DG and force field-driven structure refinement. The best suited conformer sampling approach for a processed molecule is either chosen automatically based on the detected presence of a macrocyclic ring system (default behavior) or can be specified by the user in advance for all molecules to process. Furthermore, CONFORT is able to correctly handle multi-component molecules like salts and mixtures by the automatic separate generation and later combination of individual component conformer ensembles. CONFORT’s ability to reproduce experimental 3D structures as well as its computational efficiency and robustness has been assessed thoroughly both for typical drug-like molecules (Platinum Diverse Dataset [25]) and macrocycles (208 macrocyclic structures compiled by Sindhikara et al. [49]). The calculation of various performance metrics and the visual presentation of the obtained results largely follow the established protocol developed by Friedrich et al. [25] with some meaningful extensions for presenting the macrocycle sampling benchmarking outcome. 
For reference, several well-known commercial (iCon [7]; two modes), non-open-source (Conformator [11]; two modes) and open-source conformer generators (Balloon [20], RDKit [24]; two modes/parameterizations) were assessed in addition to CONFORT employing the same benchmarking protocol. The small molecule benchmarking results have shown (see section Results and Discussion) that CONFORT outperforms all other tested open-source conformer ensemble generators and is head-to-head with commercial generators both in terms of speed of processing and accuracy in the reproduction of bioactive conformations. When it comes to the conformer sampling of macrocyclic structures, CONFORT is able to reproduce experimental 3D structures with clearly higher accuracy than all other tested generators with output ensembles that, on average, are only half of the allowed maximum size. To our knowledge, CONFORT is the first open-source conformer ensemble generator that is able to truly keep up with commercial software with regard to accuracy, speed of processing, applicability, robustness and ease of use. Given the relevance of conformer ensemble generation as a mandatory preprocessing step for the application of many modern CADD methods, CONFORT represents a valuable addition to the open-source CADD toolbox that will be highly welcome by the scientific community in the light of the previously present ‘performance gap’ [26] between open-source and commercial software in this field.

The following sections provide insight into the algorithms and methods employed in the implementation of CONFORT and discuss the design decisions taken to achieve a high performance in the reproduction of bioactive conformations and speed of processing for a wide variety of drug-like organic molecules. The sequence of individual processing steps that have to be performed during conformer generation is presented visually by a set of hierarchically linked flowcharts and will be described in detail throughout the text. In general, conformer ensemble generation is a complex endeavor and can fail at nearly all processing stages (e.g. for molecules containing atom types not supported by the force field) or take exceedingly long (e.g. for molecules with many rotatable bonds). Internally, CONFORT thus has to perform a significant number of error and timeout checks which may cause early termination or require intermediate error correction steps. For the sake of simplicity, error and timeout handling have not been incorporated into the flowcharts and the shown program flow is based upon successful execution of every processing step. Furthermore, in the flowcharts and text, a distinction is made between compounds and molecules according to the chemical definition of the two terms: A molecule represents a set of covalently bonded atoms where each atom is reachable from any other atom via one or more bonds. Compounds may consist of just one (the common case) or multiple molecules, as is the case for salts or arbitrary mixtures.

CONFORT has been implemented in C++ and builds on the cheminformatics infrastructure provided by the CDPKit C++ API. For end-users, CONFORT is provided in the form of a command-line tool called ‘confgen’ (Table 1 in Additional File 1 provides an overview of the supported options) that can be found in the application directory of CDPKit. For developers, CONFORT’s functionality is also accessible as a set of classes and functions that are part of CDPKit’s C++/Python-API and allow for a seamless integration of CONFORT into their own CDPKit-based applications.

Top-level Conformer Generation Workflow. Figure 1 shows the top-level conformer generation workflow executed for an uninitialized input compound (e.g. read from an input file) to obtain a structurally diverse, low-energy output conformer ensemble. In the first step, a preprocessing of the input compound takes place where the addition of missing hydrogens is performed and required data for the subsequent processing steps are calculated (see section Compound Preprocessing for details). For compounds comprising only a single molecule, the conformer ensemble generated in the next step already represents the output conformers of the compound. For multi-molecule compounds, CONFORT generates a separate conformer ensemble for every molecule and then arranges combinations of selected molecule conformers in 3D space to obtain the final compound output conformers (see section Multi-Molecule Output Conformer Ensemble Generation).

Compound Preprocessing. Compound preprocessing (Fig. 2) comprises several sub-steps that prepare the raw input compound and calculate various molecular properties required for the generation of proper 3D structures and conformer ensembles. In the first step, the hybridization state of every atom is determined from the atom type, formal charge and the number and order of incident bonds. In the next step, the smallest set of smallest rings (SSSR) is perceived, which, together with the previously determined atomic hybridization states, allows the identification of atoms and bonds that are part of planar aromatic ring systems.

In general, stereo configurations specified for the atoms and bonds of the input compound are retained and will be considered accordingly in the 3D structure generation process. For chiral atoms and asymmetric double bonds with undefined stereochemistry, an attempt is made to calculate missing stereo descriptors from supplied 3D atom coordinates. If 3D atom coordinates are not available, the configuration of chiral atoms is left unspecified, leading to the selection of a configuration that turns out to be energetically favorable in the conformer generation process. For asymmetric double bonds with no predefined stereochemistry, a trans configuration of the sterically most bulky substituents on either side of the double bond is selected. To efficiently estimate the steric bulk of substituents, we use a modified version of the Morgan algorithm [50] that stops after the iterative calculation of atom connectivity values (CV). The higher the CV of an atom, the more complex its neighborhood is, and the more space-consuming the substituent it represents will presumably be.
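The iterative calculation of connectivity values can be sketched as below. This is a minimal illustration of the classic Morgan scheme (initial values equal the number of neighbors, each iteration replaces a value by the sum of the neighbor values, and iteration stops once the number of distinct values no longer increases). The exact modifications made in CONFORT are not described in the text, so the termination rule and the adjacency-list input used here are assumptions.

```python
def morgan_connectivity_values(adjacency):
    """Iteratively refined connectivity values (CVs) for every atom.

    `adjacency[i]` lists the indices of the atoms bonded to atom i
    (hypothetical input representation; hydrogens may be omitted).
    """
    cv = [len(neighbors) for neighbors in adjacency]
    n_classes = len(set(cv))

    while True:
        # New CV of an atom = sum of the current CVs of its neighbors.
        new_cv = [sum(cv[j] for j in neighbors) for neighbors in adjacency]
        new_classes = len(set(new_cv))
        if new_classes <= n_classes:
            # No further discrimination gained: keep the previous values.
            return cv
        cv, n_classes = new_cv, new_classes


# Heavy-atom graph of n-pentane: C0-C1-C2-C3-C4.
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(morgan_connectivity_values(pentane))
```

Under this scheme, the double-bond substituent whose attachment atom carries the higher value would be treated as the bulkier one.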

The next processing step adds hydrogens to heavy atoms that are unsaturated according to their formal charge and the typical valences of the associated chemical element.

CONFORT employs a DG-based approach [22, 23] whenever 3D structures need to be generated from scratch. Here, randomly distributed starting atom positions are iteratively refined until a valid local energetic minimum structure has been obtained. Which of the energetic minimum conformers is eventually obtained heavily depends on the order of atoms during the initial assignment of random atom positions. To guarantee that the generated output conformer ensembles are invariant to the input atom order, the connection tables of input compounds need to be canonicalized before any conformers are generated. This canonicalization step is optional and has to be explicitly enabled by the user if slight output differences for structurally identical input compounds with varying atom orders are not acceptable. For the connection table canonicalization task, CONFORT employs an implementation of the algorithm devised by McKay [51].

After (optional) compound canonicalization, a perception of structurally separated compound components (= molecules) is carried out. Components are identified by performing a depth-first search for all atoms reachable via a bond path from a given start atom. Found reachable atoms and the start atom belong to the same component and get marked accordingly. The search procedure is then repeated for the next not yet visited atom in the atom list until no more unvisited atoms are left.
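The component perception described above can be sketched with an iterative depth-first search. The function name and the representation of the compound as an atom count plus a list of bonded atom-index pairs are assumptions made for illustration and do not reflect CDPKit's data structures.

```python
def perceive_components(num_atoms, bonds):
    """Assign a component index to every atom via depth-first search.

    `bonds` is a list of (atom_index, atom_index) pairs.
    Returns a list where entry i is the component index of atom i.
    """
    adjacency = [[] for _ in range(num_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    component_of = [None] * num_atoms
    n_components = 0

    for start in range(num_atoms):
        if component_of[start] is not None:
            continue  # atom was already reached from an earlier start atom
        component_of[start] = n_components
        stack = [start]
        while stack:
            atom = stack.pop()
            for neighbor in adjacency[atom]:
                if component_of[neighbor] is None:
                    component_of[neighbor] = n_components
                    stack.append(neighbor)
        n_components += 1

    return component_of


# Heavy atoms of sodium acetate: C0-C1(-O2)-O3 plus an unbonded Na4.
print(perceive_components(5, [(0, 1), (1, 2), (1, 3)]))
```

An explicit stack is used instead of recursion so that large molecules do not run into Python's recursion limit.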

In the last compound preprocessing step, a topological distance matrix is generated for every component of the input compound. Topological atom distances are needed to generate DG constraints and for the parameterization of Merck Molecular Force Field 94 (MMFF94) van der Waals interactions [52]. To determine topological distances, we employ a breadth-first search approach where the length of the bond path from a given start atom to a reachable atom is noted in the corresponding cells of the distance matrix. As an alternative, we also implemented the Floyd-Warshall algorithm [53] to determine the topological distances. However, the Floyd-Warshall implementation was significantly slower than the breadth-first search approach and was eventually abandoned in favor of the latter.
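A breadth-first computation of the topological distance matrix can be sketched as follows. The input representation (atom count plus bonded index pairs) and the use of `None` for unreachable atom pairs are assumptions for illustration.

```python
from collections import deque


def topological_distance_matrix(num_atoms, bonds):
    """All-pairs bond-path lengths obtained by one breadth-first search per atom."""
    adjacency = [[] for _ in range(num_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    # None marks pairs that are not connected by any bond path.
    dist = [[None] * num_atoms for _ in range(num_atoms)]

    for start in range(num_atoms):
        row = dist[start]
        row[start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if row[neighbor] is None:
                    row[neighbor] = row[atom] + 1
                    queue.append(neighbor)

    return dist


# Heavy-atom chain of n-butane: C0-C1-C2-C3.
for row in topological_distance_matrix(4, [(0, 1), (1, 2), (2, 3)]):
    print(row)
```

One plausible reason for the observed speed difference is asymptotic cost: a breadth-first search from each of N atoms visits every bond once per start atom, giving O(N·(N + B)) for B bonds, whereas Floyd-Warshall always performs O(N³) updates, and molecular graphs are sparse (B grows roughly linearly with N).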

Molecule Conformer Ensemble Generation. As already noted, a separate molecule conformer ensemble will be generated for every component of the input compound. The molecule conformer generation workflow shown in Fig. 3 may thus be executed more than once depending on the number of components that could be identified in the compound preprocessing stage.

The conformer generation process for a given input molecule starts with the perception and parameterization of MMFF94 interactions. Determined force field interactions and parameters will be required for 3D structure refinement and conformer energy estimation in later processing stages. CONFORT utilizes CDPKit’s full-featured MMFF94 implementation whose correctness has been thoroughly validated in all aspects using the datasets provided by the MMFF94 Validation Suite [54]. In the next step, the ultimate decision is made whether to use a systematic or a stochastic sampling approach for generating the molecule conformers. The systematic approach (see section Systematic Conformer Sampling) performs best for typical drug-like molecules composed of chains and relatively rigid ring systems. Stochastic sampling (see next section) has proven to be advantageous for structures comprising flexible macrocyclic rings where conformers cannot be sampled by means of simple systematic torsion driving due to the present rotational restrictions. By default, CONFORT selects a stochastic sampling approach whenever the SSSR of the molecule contains a ring that incorporates more than ten non-aromatic single bonds. The automatic selection of a suitable sampling method can be overridden by specifying the method to use in advance in the CONFORT settings. Once a pool of output conformer candidates has been generated by the chosen method, a final output conformer ensemble is compiled in the last processing step of the workflow. Which, and how many, of the candidate conformers end up in the output conformer ensemble depends on various user-specified settings which control allowed energy range, desired structural diversity and maximum ensemble size. Details regarding the output conformer selection process can be found in section Molecule Output Conformer Ensemble Compilation.
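The default rule for choosing between the two sampling approaches can be sketched as a simple ring scan. The ring representation used here (each SSSR ring as a list of `(bond_order, is_aromatic)` tuples) and all names are hypothetical; only the threshold of more than ten non-aromatic single bonds is taken from the text.

```python
def select_sampling_method(sssr_rings, max_flexible_ring_bonds=10):
    """Return "stochastic" if any ring has more than the allowed number of
    non-aromatic single bonds, otherwise "systematic"."""
    for ring in sssr_rings:
        flexible = sum(
            1 for bond_order, is_aromatic in ring
            if bond_order == 1 and not is_aromatic
        )
        if flexible > max_flexible_ring_bonds:
            return "stochastic"
    return "systematic"


cyclohexane = [(1, False)] * 6
benzene = [(1, True), (2, True)] * 3
macrocycle_14 = [(1, False)] * 14

print(select_sampling_method([cyclohexane, benzene]))
print(select_sampling_method([benzene, macrocycle_14]))
```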

Stochastic Conformer Sampling. Stochastic conformer sampling is a relatively simple but time-consuming process which is based on the assumption that randomly generated structurally diverse low-energy conformers will be uniformly distributed over the whole conformational space of a molecule and that a sufficiently large set of such samples will represent a large fraction of the energetically most favorable torsion angle combinations. Figure 4 shows the principal steps of the stochastic conformer sampling workflow as implemented in CONFORT. The first step is concerned with initializing the random conformer generation unit. Random low-energy conformers are generated by a DG-based approach, where, in accordance with predefined atom distance and volume constraints, initially randomly distributed atom positions are successively refined until a valid energy-minimized 3D structure of the molecule is obtained (for details see next section). During the sampling process, random conformers are generated iteratively until a specified maximum number of conformers has been sampled (default: 2,000 conformers), the granted sampling time has been exceeded, or the generation of new unique conformers has ceased (convergence reached). Within the sampling loop, each newly generated conformer is first tested as to whether its energy is lower than or equal to the current energy threshold. The energy threshold equals the lowest conformer energy encountered thus far plus the specified energy window size. If the generated conformer passes the energy check, it is added to the working ensemble; otherwise, it is discarded. Next, a sampling convergence check is carried out which decides whether the conformational space has been sampled densely enough and, therefore, the sampling loop may already be exited. In the convergence check, after a certain number of conformer generation cycles (default: 100 cycles) have passed, the number of newly generated unique conformers is determined. 
This is done by first performing an energy-based removal of duplicate conformers (conformers with an energy difference ≤ 0.01 kcal/mol) from the working ensemble and then comparing the number of remaining conformers with the ensemble size obtained after the last duplicate removal step. If the ensemble size has not increased since the last check, the sampling process has converged and can be stopped. After leaving the iterative sampling loop, any conformers in the working ensemble with energies outside the energy window (see above) are discarded and the stochastic sampling process terminates.
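The sampling loop with energy window and convergence check can be sketched as below. This is a simplified illustration, not CONFORT's implementation: the random conformer generator is a caller-supplied stand-in returning `(energy, coordinates)`, the timeout is omitted, the maximum of 2,000 is interpreted as a number of generation cycles, and the energy window default is an arbitrary placeholder. Only the defaults of 2,000 conformers, 100 cycles and 0.01 kcal/mol come from the text.

```python
import random


def remove_energy_duplicates(ensemble, tolerance=0.01):
    """Keep one conformer per group whose energies differ by <= tolerance."""
    unique = []
    for energy, coords in sorted(ensemble, key=lambda item: item[0]):
        if not unique or energy - unique[-1][0] > tolerance:
            unique.append((energy, coords))
    return unique


def sample_stochastically(generate_conformer, energy_window=15.0,
                          max_samples=2000, check_interval=100):
    """Collect random conformers until the sample limit or convergence is hit."""
    ensemble = []
    min_energy = float("inf")
    last_unique_count = 0

    for cycle in range(1, max_samples + 1):
        energy, coords = generate_conformer()
        min_energy = min(min_energy, energy)
        if energy <= min_energy + energy_window:  # energy check
            ensemble.append((energy, coords))

        if cycle % check_interval == 0:  # convergence check
            ensemble = remove_energy_duplicates(ensemble)
            if len(ensemble) <= last_unique_count:
                break  # no new unique conformers since the last check
            last_unique_count = len(ensemble)

    # The threshold may have dropped during sampling: apply the final window.
    return [(e, c) for e, c in ensemble if e <= min_energy + energy_window]


if __name__ == "__main__":
    # Toy stand-in generator drawing from a few discrete "energy levels".
    rng = random.Random(0)
    levels = [0.0, 0.8, 2.5, 4.1, 30.0]
    kept = sample_stochastically(lambda: (rng.choice(levels), None),
                                 energy_window=10.0)
    print(len(kept), "conformers kept")
```

Because the threshold follows the lowest energy seen so far, conformers accepted early can fall outside the window later, which is why the final filtering pass is needed.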

Random Conformer Generation. CONFORT uses the random conformer generation functionality whenever arbitrary low-energy 3D structures of a molecule have to be generated from basic molecular graph information only (e.g. for stochastic conformer sampling, see previous section). The generation of a single random 3D structure comprises the following sequence of steps (Fig. 5):

First, the 3D-coordinates of all heavy atoms and any hydrogens which are connected to stereocenters with defined configuration are initialized with random values lying in a range equaling the atom count times a structure type specific factor (e.g. 0.5 for macrocycle sampling). Hydrogens usually represent the majority of the atoms in typical organic molecules and their exclusion allows for a significant speedup of the subsequent DG-based raw 3D structure generation step. Here, the initial random atom positions are optimized by a coordinate embedding procedure [55] until they meet certain pairwise distance and tetrahedral volume constraints. The geometric distance range constraint assigned to an atom pair p_ij = (a_i, a_j) depends on its topological distance TD_ij: For directly bonded atoms (TD_ij = 1) the upper and lower distance limit is set to the MMFF94 bond length taken from the assigned bond stretching interaction parameters. The upper and lower distance limit for a geminal atom pair (TD_ij = 2) is calculated from the respective MMFF bond lengths and the equilibrium bond angle specified by the assigned bond stretching and angle bending interaction parameters. For vicinal atom pairs (TD_ij = 3), where the central bond is not a double bond with a defined configuration, the lower limit is set to the calculated distance of the atoms in a coplanar and the upper limit to the distance in an anti-coplanar arrangement taking into account the MMFF94 bond lengths and angles of the involved bonds. Vicinal atoms connected to a central double bond with a defined configuration are assigned a fixed distance corresponding to the spacing of the atoms in coplanar arrangement if the central bond’s configuration is cis (with regard to the atom pair), and the corresponding distance in anti-coplanar arrangement if the configuration is trans. 
All remaining atom pairs (TD_ij > 3) obtain a lower distance limit which is calculated as the sum of the covalent atom radii plus an additional safety spacing of 1.5 Å, and an upper limit equal to the total sum of all MMFF94 bond lengths. Volume constraints are generated for atoms bonded to tetrahedral stereogenic centers and for groups with known planar atom arrangements. For stereogenic centers, the sign of the volume spanned by the neighbor atoms depends on the specified configuration and is enforced by setting the corresponding volume range limits to ± 0.5 and ± 1000 Å³ (empirical values), respectively. Planar groups are formed, for example, by amide nitrogens and by sp²-hybridized or aromatic atoms with three incident bonds, as well as by atoms connected to amide, aromatic and double bonds. To enforce a planar arrangement of such atom groups, the upper and lower volume range limits are set to zero.
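The geometric relations behind the geminal and vicinal distance limits can be illustrated with a minimal Python sketch. The function names are hypothetical and the bond lengths and angles passed in are plain arguments, not values read from the MMFF94 parameter tables; the sketch only shows the trigonometry, not the CDPKit implementation.

```python
import math

def geminal_distance(r12: float, r23: float, angle_deg: float) -> float:
    """1,3-distance from two bond lengths and the enclosed bond angle (law of cosines)."""
    a = math.radians(angle_deg)
    return math.sqrt(r12 ** 2 + r23 ** 2 - 2.0 * r12 * r23 * math.cos(a))

def vicinal_distance(r12, r23, r34, angle123_deg, angle234_deg, torsion_deg):
    """1,4-distance of a four-atom chain 1-2-3-4 for a given torsion angle."""
    a1 = math.radians(angle123_deg)
    a2 = math.radians(angle234_deg)
    t = math.radians(torsion_deg)
    # Atom 2 sits at the origin and atom 3 on the positive x-axis.
    p1 = (r12 * math.cos(a1), r12 * math.sin(a1), 0.0)
    # Atom 4 is rotated about the 2-3 axis by the torsion angle (0 = coplanar, same side).
    p4 = (r23 - r34 * math.cos(a2),
          r34 * math.sin(a2) * math.cos(t),
          r34 * math.sin(a2) * math.sin(t))
    return math.dist(p1, p4)

def vicinal_bounds(r12, r23, r34, angle123_deg, angle234_deg):
    """Lower limit: coplanar (torsion 0); upper limit: anti-coplanar (torsion 180)."""
    lower = vicinal_distance(r12, r23, r34, angle123_deg, angle234_deg, 0.0)
    upper = vicinal_distance(r12, r23, r34, angle123_deg, angle234_deg, 180.0)
    return lower, upper

# Illustrative values only (roughly an sp3 carbon chain).
print(geminal_distance(1.52, 1.52, 109.5))
print(vicinal_bounds(1.52, 1.52, 1.52, 109.5, 109.5))
```

For a double bond with defined configuration, the same `vicinal_distance` value at 0° (cis) or 180° (trans) would serve as both the lower and the upper limit.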

Once a raw 3D structure has been obtained from the embedding process, a first check is made whether the configurations of defined stereogenic centers are correct with respect to the generated coordinates. If the check fails for at least one stereocenter, a new attempt is made to generate a valid raw structure starting from a different set of random atom positions. If no valid structure has been obtained after a certain number of trials (default: 10), the random conformer generation procedure terminates and reports an error. Otherwise, processing continues with the next step, where the coordinates of the removed hydrogens are calculated according to the hybridization state of the connected heavy atoms and the already assigned atom positions. After that, the geometry of the hydrogen-complete raw 3D structure is further refined by iterative minimization of its MMFF94 energy until the energy gradient norm or the change of energy is below a certain target threshold (the stopping criterion and threshold value have to be specified by the caller and are context-dependent). For energy minimization, the algorithm devised by Broyden, Fletcher, Goldfarb and Shanno (BFGS) [56–59] is used. The CDPKit implementation of the BFGS algorithm is based on code that has been taken from the GNU Scientific Library (GSL) [60]. If the energy-driven structure refinement procedure fails, the random conformer generation procedure terminates immediately and reports an error. In the concluding step of the workflow, all defined stereocenters are once again checked for the correctness of their configurations. If no errors are found, the refined set of atom coordinates is output and the workflow terminates with success. Otherwise, as described previously, a new attempt to obtain a valid 3D structure will be made.
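The stereocenter check and the retry logic can be sketched as follows. This is a simplified illustration under stated assumptions: `signed_volume`, `generate_valid_structure`, `embed` and `is_valid` are hypothetical names, the hydrogen placement and MMFF94 refinement steps are omitted, and the sign convention of the volume is chosen arbitrarily.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Vec = Tuple[float, float, float]
T = TypeVar("T")

def signed_volume(a: Vec, b: Vec, c: Vec, d: Vec) -> float:
    """Signed volume of the tetrahedron (a, b, c, d); its sign encodes handedness."""
    ax, ay, az = a[0] - d[0], a[1] - d[1], a[2] - d[2]
    bx, by, bz = b[0] - d[0], b[1] - d[1], b[2] - d[2]
    cx, cy, cz = c[0] - d[0], c[1] - d[1], c[2] - d[2]
    # Scalar triple product (a - d) . ((b - d) x (c - d)), divided by 6.
    return (ax * (by * cz - bz * cy)
            - ay * (bx * cz - bz * cx)
            + az * (bx * cy - by * cx)) / 6.0

def generate_valid_structure(embed: Callable[[], T],
                             is_valid: Callable[[T], bool],
                             max_trials: int = 10) -> Optional[T]:
    """Repeat the embedding from new random positions until the check passes."""
    for _ in range(max_trials):
        coords = embed()
        if is_valid(coords):
            return coords
    return None  # the caller reports an error

# Swapping two neighbors flips the sign of the volume.
v1 = signed_volume((1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0))
v2 = signed_volume((0, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0))
print(v1 > 0, v2 < 0)
```

In such a scheme, `is_valid` would compare the sign of `signed_volume` for each defined stereocenter with the sign expected for its configuration.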

Systematic Conformer Sampling. CONFORT performs systematic conformer sampling by applying different combinations of torsion angles at the rotatable bonds of the processed molecule to previously generated conformer 3D structure templates. Compared with stochastic sampling, the systematic approach usually demands much less processing time for small drug-like molecules since all conformers of a given 3D structure template can be generated by relatively fast rigid body transformations instead of having to repeatedly perform a time-consuming ab initio generation of energy-minimized random conformers.

For the construction of 3D structure templates, CONFORT employs a fragment-based approach in which the overall 3D structure is assembled from smaller structural building blocks like chain fragments and ring systems. The idea behind this approach is to pre-generate reasonable 3D structures of frequently occurring molecular fragments by an external program (‘genfraglib’, which can be found in the application directory of CDPKit) and store the resulting atom 3D coordinate sets in a permanent library for later use. This strategy considerably speeds up the 3D structure template buildup process since a time-consuming ab initio generation of 3D coordinates (see section Random Conformer Generation) can be circumvented for many of the fragments found in typical drug-like organic molecules.

Figure 6 illustrates the major processing steps performed for a given input molecule in the implemented systematic conformer sampling workflow. In the first step, the provided molecule is broken down into smaller structural building blocks by cutting specific bonds that have been identified by a set of fragmentation rules (details on molecule fragmentation can be found in the next section). Afterward a conformer ensemble is generated for each obtained fragment, which, depending on conformational flexibility, may consist only of a single 3D structure or multiple conformers. For rigid fragments like purely aromatic ring systems and chains/moieties lacking rotatable bonds, a single 3D structure is generated. Chains comprising rotatable bonds and flexible ring systems are sampled more thoroughly until a representative set of low-energy conformers has been obtained. As pointed out before, for each processed fragment, an attempt is made to obtain pre-generated 3D coordinates from an on-disk fragment library to speed up the overall conformer generation process. If a suitable library entry is not available, conformers will be generated on-the-fly using CONFORT’s random conformer generation (see previous section) and torsion driving functionality. More details on fragment library lookup, on-the-fly conformer generation and fragment conformer post-processing can be found in section Fragment Conformer Ensemble Generation.

In the next step, a set of fragment conformer combinations (FCC) is generated. FCCs are represented by N-tuples of fragment conformer indices where N corresponds to the total number of fragments obtained from the input molecule. The energy of an FCC is calculated as the sum of the MMFF94 energies of the fragment conformers referenced by the index-tuple. To prevent exhaustion of main memory and excessive processing times caused by very high numbers of possible FCCs (e.g. for larger peptides), the FCC generation process stops once a certain number of stored FCCs has been reached (100,000 in the current implementation). Furthermore, only FCCs with energies lower than or equal to the calculated energy threshold are stored for further processing. The energy threshold corresponds to the minimum FCC energy plus the specified energy window size times the safety factor 1.5.
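A minimal sketch of the FCC generation step is given below. It assumes that fragment conformer energies are available as plain lists and that index tuples are enumerated in simple lexicographic order; the actual enumeration order used by CONFORT is not specified here, and `generate_fccs` is a hypothetical name. Because FCC energies are additive, the minimum FCC energy equals the sum of the per-fragment minima.

```python
import itertools
from typing import List, Sequence, Tuple

def generate_fccs(fragment_energies: Sequence[Sequence[float]],
                  energy_window: float,
                  safety_factor: float = 1.5,
                  max_fccs: int = 100_000) -> List[Tuple[float, Tuple[int, ...]]]:
    """Return (energy, index tuple) pairs ordered by increasing energy."""
    min_energy = sum(min(energies) for energies in fragment_energies)
    threshold = min_energy + energy_window * safety_factor
    fccs = []
    index_ranges = (range(len(energies)) for energies in fragment_energies)
    for indices in itertools.product(*index_ranges):
        energy = sum(e[i] for e, i in zip(fragment_energies, indices))
        if energy <= threshold:
            fccs.append((energy, indices))
            if len(fccs) >= max_fccs:
                break  # cap on the number of stored FCCs
    fccs.sort()
    return fccs

# Two fragments with 2 and 3 conformers; window 4.0 gives a threshold of 6.0.
for energy, indices in generate_fccs([[0.0, 1.0], [0.0, 5.0, 20.0]], energy_window=4.0):
    print(energy, indices)
```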

In the final step of the workflow, each FCC serves as a 3D structure template for the generation of derived molecule conformers by systematic torsion driving at rotatable bonds linking the fragments. By means of this divide-and-conquer strategy, it is possible to sample a representative share of the lowest energy conformers in an acceptable amount of time, even in the case of large and flexible molecules. Further details on the assignment of torsion angles to rotatable bonds and the implementation of the torsion driving process can be found in the section Fragment Conformer Combination Torsion Driving.

Molecule Fragmentation. As described in the previous section, the systematic conformer sampling procedure generates molecule conformers by assembling selected pre-computed conformers of molecular fragments. These fragments are derived from the molecular graph by cutting specific carbon-heteroatom bonds and bonds to ring system substituents. The pseudo-code function isCutBond() shown in Listing 1 implements the rules by which the bonds to be cut are identified:

bool isCutBond(Bond b):
    if isInRing(b) or connectsHydrogenAtom(b):
        return false
    if degree(b.atom1) < 2 or degree(b.atom2) < 2:
        return false
    if isInRing(b.atom1) or isInRing(b.atom2):
        return true
    if order(b) != 1:
        return false
    if element(b.atom1) == 'C':
        if element(b.atom2) not in { 'N', 'O', 'S', 'P', 'Se' }:
            return false
    elif element(b.atom2) == 'C':
        if element(b.atom1) not in { 'N', 'O', 'S', 'P', 'Se' }:
            return false
    else:
        return false
    if hasNonSingleBondTo(b.atom1, { 'N', 'O', 'S' }) or
       hasNonSingleBondTo(b.atom2, { 'N', 'O', 'S' }):
        return false
    return true

Listing 1. Pseudo-code implementing rules for the identification of bonds to cut in the molecule fragmentation process.
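For readers who wish to experiment with the rules of Listing 1, the following runnable Python transcription operates on a toy molecular graph. The `Atom` and `Bond` classes are hypothetical stand-ins and not part of the CDPKit API. As an assumption, the atom degree is taken as the number of explicit bonds in the toy graph (hydrogens are simply left out of the example), and ring membership is supplied as a flag rather than perceived.

```python
from dataclasses import dataclass, field
from typing import List, Set

HETERO = {"N", "O", "S", "P", "Se"}

@dataclass(eq=False)
class Atom:
    element: str
    in_ring: bool = False
    bonds: List["Bond"] = field(default_factory=list, repr=False)

@dataclass(eq=False)
class Bond:
    atom1: Atom
    atom2: Atom
    order: int = 1
    in_ring: bool = False

    def __post_init__(self):
        # Register the bond with both of its atoms.
        self.atom1.bonds.append(self)
        self.atom2.bonds.append(self)

def has_non_single_bond_to(atom: Atom, elements: Set[str]) -> bool:
    for bond in atom.bonds:
        neighbor = bond.atom2 if bond.atom1 is atom else bond.atom1
        if bond.order != 1 and neighbor.element in elements:
            return True
    return False

def is_cut_bond(b: Bond) -> bool:
    a1, a2 = b.atom1, b.atom2
    if b.in_ring or "H" in (a1.element, a2.element):
        return False
    if len(a1.bonds) < 2 or len(a2.bonds) < 2:
        return False  # bonds to terminal atoms are never cut
    if a1.in_ring or a2.in_ring:
        return True   # bond to a ring system substituent
    if b.order != 1:
        return False
    if a1.element == "C":
        if a2.element not in HETERO:
            return False
    elif a2.element == "C":
        if a1.element not in HETERO:
            return False
    else:
        return False
    if has_non_single_bond_to(a1, {"N", "O", "S"}) or \
       has_non_single_bond_to(a2, {"N", "O", "S"}):
        return False  # e.g. amide and ester C-N/C-O bonds stay intact
    return True

# Heavy-atom graph of ethyl methyl ether: C-C-O-C
c1, c2, o, c3 = Atom("C"), Atom("C"), Atom("O"), Atom("C")
bonds = [Bond(c1, c2), Bond(c2, o), Bond(o, c3)]
print([is_cut_bond(b) for b in bonds])
```

In this toy graph only the central C-O bond qualifies, since the other two bonds involve terminal atoms.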

In the resulting fragments, bonds to previously connected fragments of the parent molecule are preserved. Their atoms are replaced by corresponding pseudo atoms encoding chemical element, hybridization state, formal charge and membership in aromatic rings. These pseudo atoms provide enough local structural context for a later library lookup/on-the-fly generation of proper fragment 3D structures. Figure 7 shows an example of the fragmentation of a typical drug-like organic molecule and the list of resulting molecular fragments obtained by cutting bonds identified with the rules mentioned above.

Fragment Conformer Ensemble Generation. Depending on specified settings, data availability, and structural type, fragment conformer ensembles are either derived from provided input 3D coordinates, retrieved/derived from an entry in the built-in or user-specified fragment library, taken from the runtime cache, or have to be generated from scratch.

Figure 8 outlines the conformer ensemble generation workflow for a given input fragment. The workflow starts with checking whether the conformers should (by default, input coordinates are discarded) and can be derived from the provided atom 3D coordinates. Supplied atom 3D coordinates are only considered if they are present for at least all heavy atoms of the fragment, and if the fragment is not a flexible ring system or ring conformer enumeration has been disabled. If so, all present 3D coordinates are extracted and a calculation of missing hydrogen coordinates is performed (the hydrogen 3D coordinates calculation procedure is briefly described in section Random Conformer Generation). Afterward, the resulting complete fragment 3D structure is forwarded to the conformer post-processing step (see next section).

If input coordinates should not or cannot be considered according to the initial checks, the workflow proceeds with the fragment canonicalization step that comprises three sub-steps: In the first step, terminal heavy atoms connected to atoms of aromatic rings are replaced by hydrogen. Fragments differing only in their aromatic ring system substituent pattern thus converge into the same canonical fragment structure. This measure helps to increase the fragment library/runtime cache hit rate for quite common aromatic fragments like substituted phenyl rings. The resulting incorrect lengths of bonds between aromatic ring and replaced substituent atoms are corrected later in the conformer post-processing step (see next section).

In the next step, free valences of pseudo atoms (see previous section) are compensated by adding explicit hydrogens and a calculation of canonical atom labels using CDPKit’s implementation of the McKay algorithm [51] is performed.

Finally, a binary representation of the fragment’s connection table with atom and bond lists ordered according to the previously calculated canonical atom labels is generated. The calculated SHA1 hash code [61] of the connection table data is then converted to a 64 bit integer key which serves as a unique fragment identifier (ID) in subsequent processing steps.
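The derivation of a 64 bit key from a SHA1 hash can be sketched as shown below. How CONFORT reduces the 160 bit digest to 64 bits is not detailed here; taking the first eight bytes in big-endian order is an assumption made for illustration, and `fragment_id` is a hypothetical name.

```python
import hashlib

def fragment_id(connection_table: bytes) -> int:
    """Map canonical connection table data to a 64 bit integer key."""
    digest = hashlib.sha1(connection_table).digest()  # 20 bytes
    # Assumption: keep only the leading 8 bytes of the digest.
    return int.from_bytes(digest[:8], "big")

key = fragment_id(b"canonical connection table bytes")
print(hex(key))
```

Since the connection table is written in canonical atom order, identical fragments yield identical byte strings and therefore identical keys.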

In the canonical atom labeling and fragment ID calculation procedure, stereodescriptors of defined stereocenters are only considered if the processed fragment represents a ring system. Stereochemistry is disregarded for acyclic fragments since configurations of tetrahedral stereocenters and double bonds can be corrected easily by fast geometric operations in the conformer post-processing step (see next section). Furthermore, for each acyclic fragment only a single random 3D structure is stored in the fragment library and runtime cache to reduce per-fragment memory consumption and thus allow for a higher number of entries. Complete conformer ensembles are then generated in the post-processing step by performing torsion driving on the saved 3D structure. The increased post-processing effort resulting from these measures is accepted in order to attain higher library and runtime cache hit rates for this structurally diverse fragment type. High hit rates are key for achieving low average molecule processing times by avoiding the relatively slow DG-based random conformer generation procedure for as many encountered fragments as possible.

After the fragment canonicalization procedure, the calculated 64 bit fragment ID is used as a unique key for querying the loaded fragment libraries. Aside from the built-in fragment library which gets loaded on program startup, CONFORT also supports multiple user-specified external fragment libraries which are searched one by one for an entry matching the supplied fragment ID. If one of the performed library lookups is successful, the fragment conformers deposited in the library entry are extracted and forwarded to the final post-processing step after which the workflow terminates.

Should all library lookups fail, an attempt is made to find a matching entry in the dynamic runtime cache. The runtime cache is implemented as a Least Recently Used (LRU) list (limited to 10,000 entries in the current implementation) and stores conformers of previously processed fragments that likewise could not be found in any of the searched libraries. An identified matching cache entry is then processed in the same way as a corresponding library hit.
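The lookup order (libraries first, then the runtime cache) and the LRU behavior of the cache can be sketched as follows. `FragmentCache` and `lookup` are hypothetical names, fragment libraries are modeled as plain dictionaries, and the sketch is not the CDPKit implementation.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional, Sequence

class FragmentCache:
    """Least Recently Used cache keyed by the 64 bit fragment ID."""

    def __init__(self, max_entries: int = 10_000):
        self._entries: "OrderedDict[int, Any]" = OrderedDict()
        self._max_entries = max_entries

    def get(self, frag_id: int) -> Optional[Any]:
        if frag_id not in self._entries:
            return None
        self._entries.move_to_end(frag_id)  # mark as most recently used
        return self._entries[frag_id]

    def put(self, frag_id: int, conformers: Any) -> None:
        self._entries[frag_id] = conformers
        self._entries.move_to_end(frag_id)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used entry

def lookup(frag_id: int,
           libraries: Sequence[Dict[int, Any]],
           cache: FragmentCache) -> Optional[Any]:
    """Search the libraries one by one, then fall back to the runtime cache."""
    for library in libraries:
        if frag_id in library:
            return library[frag_id]
    return cache.get(frag_id)

cache = FragmentCache(max_entries=2)
cache.put(1, "conformers of fragment 1")
cache.put(2, "conformers of fragment 2")
cache.get(1)                      # fragment 1 becomes most recently used
cache.put(3, "conformers of fragment 3")
print(cache.get(2))               # fragment 2 has been evicted
```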

If library and cache lookups both fail, the requested fragment conformers are generated from scratch based on the information provided by the molecular graph of the derived canonical fragment. 3D structures are generated by means of the previously described random conformer generation functionality (see section Random Conformer Generation). For acyclic fragments and fragments representing purely aromatic ring systems, just a single output 3D structure is generated. Conformers of fragments representing flexible ring systems are sampled stochastically using a procedure similar to the one described in section Stochastic Conformer Sampling. Depending on actual ring system flexibility, this usually leads to the output of multiple structurally diverse (default ring atom RMSD = 0.1 Å) low-energy 3D structures which cover the conformational space of the fragment in the effective energy window (default values: small ring systems 8 kcal/mol, macrocycles 25 kcal/mol). Afterward, the generated 3D structures are deposited in the runtime cache for potential later reuse and then get post-processed in the workflow's final step.

Fragment Conformer Post-Processing. Fragment conformers resulting from the procedure described in the previous section need to undergo further checks, corrections and modifications before they can be used as building blocks for molecule 3D structure templates (see section Systematic Conformer Sampling). Depending on fragment type and source of the input conformers, required post-processing steps comprise: Correction of aromatic ring substituent bond lengths, inversion of stereocenters for the correction of detected configuration errors, enumeration of invertible nitrogen states and generation of additional conformers by torsion driving. The workflow of the executed post-processing procedure is shown in Fig. 9. Herein, the first two processing steps are concerned with required structural corrections resulting from molecular graph modifications and the disregard of stereochemistry for acyclic fragments in the fragment canonicalization procedure (see previous section).

In the first correction step, for every input conformer, the lengths of bonds between aromatic ring atoms and atoms of exocyclic substituents (which were replaced by hydrogen in the fragment canonicalization procedure, see previous section) are scaled to match the corresponding MMFF94 equilibrium bond lengths in the structural context of the parent molecule. In the second step, atom and bond stereocenters of acyclic fragments whose calculated configuration does not match the desired output configuration are corrected by exchanging stereocenter substituent positions and applying geometric transformations on the involved atom 3D coordinates.

Since the systematic errors introduced in the fragment canonicalization procedure do not apply to fragment conformers extracted from a supplied 3D structure of the parent molecule (see Fig. 8), the two correction steps can be bypassed for such conformers. This is realized by a dedicated check at the beginning of the post-processing workflow.

The next processing step performs a systematic enumeration of all possible invertible nitrogen configuration combinations and generates corresponding 3D structures derived from the supplied input conformers. For a fragment with N invertible nitrogen atoms this will lead to a multiplication of the number of fragment conformers by a factor of 2^N. In the current implementation a non-planar nitrogen atom is considered invertible if it has three single-bonded neighbors, is not a member of more than one ring and is connected to at least two heavy atoms. CONFORT supports three nitrogen configuration enumeration modes: i) no enumeration - if specified, the enumeration procedure will be skipped and the fragment conformers are left unaltered, ii) only invertible nitrogens with undefined configuration are considered (default setting) and iii) all invertible nitrogens. Algorithmically, the configuration of a nitrogen is inverted by rotating the 3D coordinates of the atoms of a selected substituent to the exact opposite position on the other side of the plane spanned by the bonds to the remaining two substituents. If the inverted nitrogen atom is a ring member, the exocyclic substituent will be chosen for rotation. If the nitrogen is acyclic, the substituent comprising the smallest number of atoms is selected.
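The combinatorial part of the nitrogen enumeration can be illustrated with a short sketch. The geometric inversion itself is abstracted into a caller-supplied function `invert`, which is a hypothetical placeholder; the example represents a conformer merely by the set of nitrogens that have been inverted.

```python
import itertools
from typing import Callable, List, Sequence, TypeVar

Conf = TypeVar("Conf")

def enumerate_nitrogen_states(conformers: Sequence[Conf],
                              invertible_nitrogens: Sequence[int],
                              invert: Callable[[Conf, int], Conf]) -> List[Conf]:
    """Generate all 2^N inversion state combinations for every input conformer."""
    result = []
    for conf in conformers:
        for flags in itertools.product((False, True), repeat=len(invertible_nitrogens)):
            new_conf = conf
            for atom_index, flip in zip(invertible_nitrogens, flags):
                if flip:
                    new_conf = invert(new_conf, atom_index)
            result.append(new_conf)
    return result

# Toy model: a conformer is the frozenset of inverted nitrogen atom indices.
states = enumerate_nitrogen_states([frozenset()], [3, 7], lambda c, n: c ^ {n})
print(len(states))  # 4 = 2^2
```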

If the fragment contains rotatable acyclic bonds, each conformer in the current set will be subjected to torsion driving, further expanding the current fragment conformer ensemble by systematic sampling. Usually, torsion driving only needs to be carried out for flexible acyclic fragments for which just a single low-energy conformer is handed over by the parent workflow (see previous section). Conceptually, the employed torsion driving workflow is largely similar to the one described for the generation of molecule conformers from a set of FCCs (see next section) and, therefore, will not be outlined in more detail.

After torsion driving has finished, generated conformers which are out of the specified molecule conformer energy window get discarded and the remaining conformers are then forwarded to the last workflow step.

If no rotatable fragment bonds are present, the MMFF94 energy of each conformer is calculated using the force field parameterization of the parent molecule (the torsion driving procedure performs this step internally).

In the last post-processing step, the list of final fragment conformers is ordered by increasing energy and, if necessary, reduced to the maximum allowed output ensemble size (default setting: 10,000 conformers) by removing the required number of high-energy conformers from the end of the list.

Fragment Conformer Combination Torsion Driving. This processing step represents the heart of the systematic conformer sampling workflow (see section Systematic Conformer Sampling) and produces an ensemble of output conformers from a set of energy ordered input FCCs. 3D structures of the output conformers are generated by a torsion driving procedure that aligns individual fragment conformers specified by the processed input FCCs along rotatable bonds linking the fragments (see section Molecule Fragmentation). The relative spatial orientation of the aligned fragment conformers is dictated by rotatable bond-specific dihedral angles which are retrieved from matching torsion library entries. Depending on the number of distinct torsion angle combinations possible for the present set of rotor bonds, a corresponding number of output conformers will usually be generated for each processed FCC. For big and quite flexible molecules, the total number of possible conformers can quickly reach rather high figures and an attempt to generate all of these conformers will fail due to excessive main memory and processing time consumption. However, carrying out the processing of FCCs in order of increasing energy and the early pruning of high-energy conformers during torsion driving ensures that also in cases where the conformational space cannot be explored exhaustively, most of the low-energy conformers will be captured and end up in the generated output ensembles (see section Results and Discussion).

The first step of the overall torsion driving workflow (Fig. 10) is concerned with the setup of the fragment tree data structure. This data structure represents the in-memory foundation of the torsion driving algorithm and comprises a set of linked nodes organized in a tree-like manner. Each non-leaf node of the tree references a particular rotatable bond and provides storage for conformers that can be generated by aligning all possible pairs of child node conformers along the referenced bond applying the dihedral angles specified by an assigned torsion library entry. The leaf nodes are associated with the fragments constituting the parent structure (root node) and each stores a particular fragment conformer as specified by the currently processed FCC. For the sake of better understanding, Fig. 11 shows an example of a typical fragment tree constructed by the setup procedure for the molecule shown in the root node.

Once the fragment tree data structure has been set up, an assignment of proper dihedral angles for the rotatable bonds referenced by the non-leaf nodes is carried out. Here, for each rotatable bond, a lookup for a matching entry in the loaded torsion libraries will be performed. Torsion library entries specify matching rotatable bonds by a single SMARTS pattern [62] which describes a linear path of three or four atoms and provides lists of preferred dihedral angles and associated tolerance intervals. The torsion library used by CONFORT has been derived from the collection of rotatable bond patterns and corresponding dihedral angles compiled by Schärfer et al. [63] and Guba et al. [64] which resulted from statistical analyses of torsion angle preferences in the drug-like chemical space. Based on the set of rotatable bond patterns in the most recent version of the library by Guba et al., we performed our own torsion angle analysis using a database of experimentally-determined receptor-bound ligand 3D structures extracted from the PDB [65]. Based on the obtained results, several new library entries have been added, and various modifications to the angle and tolerance lists of existing entries in the library by Guba et al. were made. Aside from the built-in library, CONFORT also supports multiple user-specified external torsion libraries which are included in the search for matching entries. These will take precedence over the entries in the built-in library and thus enable customization of the dihedral angles used in the torsion driving core procedure.

Depending on the conformer generation settings in effect and the presence of local symmetry, dihedral angles provided by the looked-up library entries may undergo further processing before they are assigned to the corresponding tree nodes: i) In the case of rotatable bonds which represent an axis of 1-, 2- or 3-fold rotational symmetry of at least one of the connected fragments, all redundant angles that would lead to the generation of conformer duplicates are removed. Frequently occurring fragments with known rotational symmetry (e.g. trifluoromethyl, phenyl, t-butyl, …) have been tabulated as a list of SMARTS patterns and are identified at runtime by substructure searching. ii) If torsion angle tolerance range sampling has been enabled (default: not enabled), two additional angles per listed dihedral angle will be calculated that mark the beginning and end of the dihedral angle’s first tolerance range (see ref. [64] for the definition of the first tolerance range).
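The removal of symmetry-redundant torsion angles can be illustrated with a small sketch. It assumes that two angles are redundant if they differ by a multiple of 360°/n for an n-fold symmetric fragment and that the reduced angle (modulo 360°/n) is kept as the representative; both the function name and these details are assumptions, not a description of the CDPKit code.

```python
from typing import List, Sequence

def remove_symmetry_redundant_angles(angles: Sequence[float],
                                     symmetry_order: int,
                                     tol: float = 1e-6) -> List[float]:
    """Keep one representative per set of symmetry-equivalent dihedral angles."""
    period = 360.0 / symmetry_order
    unique: List[float] = []
    for angle in angles:
        reduced = angle % period
        # Compare on the circle of circumference `period`.
        if all(min(abs(reduced - u), period - abs(reduced - u)) > tol for u in unique):
            unique.append(reduced)
    return unique

staggered_and_eclipsed = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
print(remove_symmetry_redundant_angles(staggered_and_eclipsed, 3))  # 3-fold, e.g. a CF3 rotor
print(remove_symmetry_redundant_angles(staggered_and_eclipsed, 2))  # 2-fold, e.g. a phenyl rotor
```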

After the torsion angle assignment step, the workflow enters a loop in which a set of output conformer candidates will be generated for each specified input FCC. The loop is exited either after all FCCs have been processed or if one of the two early exit conditions is fulfilled. The first early exit check is carried out before entering the torsion driving core procedure and evaluates whether the energy of the currently processed FCC is above a certain upper energy limit. The energy limit is calculated as the sum of the energy of the FCC that resulted in the lowest energy output conformer generated so far and the specified energy window size. A generation of novel low-energy output conformers from FCCs above this energy limit is very unlikely and the processing of additional input FCCs will have little impact on the final output conformer ensemble. Hence, the main loop can already be exited at this point with, on average, only little losses in output conformer ensemble quality.

If the FCC passes the energy limit check, the torsion driving core procedure will be entered. Herein, the leaf nodes of the fragment tree are first initialized with the fragment conformers specified by the newly processed FCC and then a recursive generation of conformers at the non-leaf nodes is carried out. The conformers of a non-leaf node are generated by overlaying pairs of left and right child node conformers at the referenced rotatable bond and then applying rotations to the coordinates of one child conformer according to the torsion angles assigned during setup. This process is repeated for all possible child node conformer pair and torsion angle combinations and finally results in a set of conformers which represents the conformational space of the molecule substructure covered by the fragments referenced in the subtree rooting at the currently processed node.

The MMFF94 energy E_A−B of a conformer A-B that was generated from two child node conformers A and B for the torsion angle 𝚯 can be calculated as (Eq. 1):

\(E_{A-B}(\Theta) = E_{A} + E_{B} + E_{BS} + E_{T}(\Theta) + E_{C}(\Theta) + E_{VdW}(\Theta)\) Eq. 1

E_A and E_B denote the MMFF94 energies of the child node conformers A and B, respectively, E_BS represents the bond stretching interaction energy of the rotatable bond, E_T represents the sum of the energies of torsion interactions involving the rotatable bond, E_C denotes the sum of the energies of electrostatic interactions between the child node substructures, and E_VdW denotes the sum of the corresponding van der Waals interaction energies. As can be seen from Eq. 1, E_A, E_B and E_BS are torsion angle independent constants whose values, once calculated, can be reused in the calculation of other energy terms they are involved in. The torsion driving procedure exploits this fact and is thus able to perform a fast energy calculation of newly generated conformers since every parameterized force field interaction energy term, in the worst case, needs to be evaluated only once per generated conformer.

For each newly generated conformer a first check is made to determine whether its energy exceeds the current energy threshold, which equals the sum of the minimum conformer energy encountered so far and the specified energy window size. If so, the generated conformer is discarded. Otherwise, it will be added to the node’s current working set. After all possible conformers of a node have been generated, each saved conformer is checked again and discarded if its energy is not within the allowed energy window. Finally, the remaining conformers are ordered by increasing energy and, if the number of conformers exceeds the maximum allowed pool size (default value: 10,000 conformers), the number of conformers is reduced by pruning excess high-energy conformers. These post-processing steps ensure that potential high-energy molecule conformers get identified and removed from the processing pipeline early on and that, in the case of molecules comprising many fragments and/or high numbers of torsion angles per rotor bond, a combinatorial explosion of live conformers is avoided.
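The conformer generation at a single non-leaf node, including the energy bookkeeping of Eq. 1 and the two pruning stages, can be sketched in a simplified form. Conformers are reduced to (energy, label) pairs, and the torsion angle dependent terms E_T + E_C + E_VdW are supplied by a hypothetical callable `interaction`; no coordinates are transformed in this sketch.

```python
from typing import Any, Callable, List, Sequence, Tuple

Conformer = Tuple[float, Any]  # (energy, label)

def combine_children(left: Sequence[Conformer],
                     right: Sequence[Conformer],
                     angles: Sequence[float],
                     e_bs: float,
                     interaction: Callable[[Any, Any, float], float],
                     energy_window: float,
                     max_pool_size: int = 10_000) -> List[Conformer]:
    working: List[Conformer] = []
    e_min = float("inf")
    for e_a, a in left:
        for e_b, b in right:
            for theta in angles:
                # Eq. 1: E_A, E_B and E_BS are constants, only the rest depends on theta.
                energy = e_a + e_b + e_bs + interaction(a, b, theta)
                if energy > e_min + energy_window:
                    continue  # first check against the running threshold
                e_min = min(e_min, energy)
                working.append((energy, (a, b, theta)))
    # Second check against the final minimum, then ordering and pool size limit.
    kept = [c for c in working if c[0] <= e_min + energy_window]
    kept.sort(key=lambda c: c[0])
    return kept[:max_pool_size]

node = combine_children(
    left=[(0.0, "A0"), (2.0, "A1")],
    right=[(0.0, "B0")],
    angles=[60.0, 180.0],
    e_bs=1.0,
    interaction=lambda a, b, theta: 0.0 if theta == 180.0 else 3.0,
    energy_window=2.5,
)
print(node)
```

The second filtering pass is needed because conformers accepted early may fall outside the window once a lower minimum energy has been found.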

After the bottom-up generation of conformers has finished at the root node of the fragment tree, the torsion driving procedure terminates. The root node now stores a set of final low-energy molecule conformers derived from the currently processed FCC. For each conformer a check is then made whether its energy is above an upper limit calculated as the sum of the minimum conformer energy encountered so far for all processed FCCs and the specified energy window size. If the conformer’s energy is above the threshold, the conformer gets discarded. Otherwise, it is added to the output conformer working set. If, during the processing of the root node conformer ensemble, a new energetic minimum conformer has been encountered, the energetic minimum gets adapted and all conformers in the current working set which are now above the new upper energy limit are discarded.

After all conformers have been processed, a check is made whether the second early loop exit condition is met. The condition will be fulfilled if the number of conformers in the working set is greater than or equal to a specified non-zero maximum pool size (default value: 10,000 conformers). If so, the loop will be exited and the torsion driving workflow terminates. Otherwise, the next input FCC (if available) will be processed as described above.
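The main loop over the input FCCs with its two early exit conditions can be summarized in the following sketch. The torsion driving core procedure is passed in as a hypothetical callable `torsion_drive` that returns (energy, conformer) pairs for a given FCC; the function and parameter names are illustrative assumptions.

```python
from typing import Any, Callable, List, Sequence, Tuple

def process_fccs(fccs: Sequence[Tuple[float, Any]],
                 torsion_drive: Callable[[Any], Sequence[Tuple[float, Any]]],
                 energy_window: float,
                 max_pool_size: int = 10_000) -> List[Tuple[float, Any]]:
    """fccs must be ordered by increasing FCC energy."""
    working: List[Tuple[float, Any]] = []
    e_min = float("inf")      # lowest output conformer energy so far
    best_fcc_energy = None    # energy of the FCC that produced that conformer
    for fcc_energy, fcc in fccs:
        # Early exit 1: the FCC lies above the upper energy limit.
        if best_fcc_energy is not None and fcc_energy > best_fcc_energy + energy_window:
            break
        for energy, conformer in torsion_drive(fcc):
            if energy > e_min + energy_window:
                continue
            working.append((energy, conformer))
            if energy < e_min:
                # New energetic minimum: adapt the limit and re-filter the working set.
                e_min = energy
                best_fcc_energy = fcc_energy
                working = [c for c in working if c[0] <= e_min + energy_window]
        # Early exit 2: the working set has reached the maximum pool size.
        if max_pool_size > 0 and len(working) >= max_pool_size:
            break
    return working

outputs = {"f0": [(5.0, "c0")], "f1": [(2.0, "c1"), (9.0, "c2")], "f2": [(0.0, "c3")]}
fccs = [(0.0, "f0"), (1.0, "f1"), (10.0, "f2")]
print(process_fccs(fccs, lambda fcc: outputs[fcc], energy_window=3.0))
```

In the example, the third FCC is never processed because its energy exceeds the limit derived from the FCC that produced the current minimum energy conformer.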

Molecule Output Conformer Ensemble Compilation. Sets of low-energy molecule conformers obtained from the systematic or stochastic sampling procedures usually do not yet fulfill all user-specified output ensemble characteristics (structural diversity, max. output ensemble size and energy window size) and may contain a significant number of duplicates or clusters of structurally too similar conformers. Figure 12 shows the workflow of the post-processing procedure which takes care of compiling a final output conformer ensemble with the desired characteristics from a set of supplied input conformers. The procedure starts with ranking the supplied input conformers by increasing energy to ensure a preference for low-energy conformers in the later output conformer picking stage. Afterward, a check is made whether a 3D structure that was supplied together with the input molecule shall be added to the output conformer ensemble (default setting: not included). If so, the input 3D structure gets extracted and, after calculating its MMFF94 energy, it is added to the output conformer working set.

Processing then enters the main section of the workflow where final output conformers will be selected iteratively from the energy ordered set of input conformers based on mutual structural dissimilarity. The picking loop exits if either all input conformers have been processed, the maximum output ensemble size has been reached or the energy of the currently processed conformer exceeds the energy limit which is calculated as the sum of the minimum conformer energy (= energy of the first conformer) and the specified energy window size. Within the loop a processed input conformer is selected as an output conformer only if the heavy atom RMSD between the current conformer and each of the previously selected output conformers is not below a specified threshold value (default setting: 0.5 Å). The RMSD of an evaluated conformer pair is calculated by performing RMSD minimizing 3D alignments using CDPKit’s implementation of the Kabsch algorithm [66, 67] for all possible homomorphic atom mappings that might result from the presence of topological symmetry. To avoid excessive memory and processing time consumption in the case of highly symmetric molecules, the number of processed mappings has been limited to a hardcoded maximum value of 131,072. If this limit is exceeded, further processing of the current molecule will be suspended and a corresponding error is reported.

When the RMSD value resulting from one of the performed 3D alignments falls below the specified threshold, further processing stops and the evaluated input conformer gets rejected. If, otherwise, all performed pairwise RMSD checks have been passed successfully, the conformer is added to the output conformer ensemble and processing continues with the next input conformer in line.
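
The picking loop described above can be illustrated with a minimal Python sketch. It is not the CDPKit implementation: the function names, the tuple layout (energy, coordinates) and the stand-in `plain_rmsd` function (no 3D alignment, no symmetry mappings) are assumptions made for illustration, and the optional inclusion of the input 3D structure is omitted.

```python
from math import sqrt

def plain_rmsd(a, b):
    """Stand-in RMSD without alignment or symmetry handling (illustration only)."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return sqrt(sq / len(a))

def pick_output_conformers(conformers, max_size, e_window,
                           rmsd_threshold=0.5, rmsd=plain_rmsd):
    """conformers: list of (energy, coords) tuples; returns the selected subset."""
    ranked = sorted(conformers, key=lambda c: c[0])  # rank by increasing energy
    if not ranked:
        return []
    e_limit = ranked[0][0] + e_window  # minimum energy + energy window size
    selected = []
    for energy, coords in ranked:
        # exit conditions: ensemble full or energy limit exceeded
        if len(selected) >= max_size or energy > e_limit:
            break
        # all() stops at the first pair whose RMSD falls below the threshold
        if all(rmsd(coords, kept) >= rmsd_threshold for _, kept in selected):
            selected.append((energy, coords))
    return selected
```

Because the input is energy ordered, the first conformer is always accepted and each cluster of similar structures is represented by its lowest-energy member.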

Although the implemented output conformer selection algorithm is quite simple, the obtained ensembles nevertheless fulfill all major quality criteria, as demonstrated by the benchmarking results presented in the Results and Discussion section.

Multi-Molecule Output Conformer Ensemble Generation. For compounds that consist of multiple components an additional processing step is required which merges the separately generated molecule conformer ensembles into a single compound output conformer ensemble (see also section Top-level Conformer Generation Workflow). The compound conformer generation method implemented in CONFORT builds composite conformers from sets of selected molecule conformers by lining them up along the X-axis of the coordinate system. For each placed molecule conformer, an axis-aligned bounding box (AABB) is first calculated which specifies the minimum [X_min, Y_min, Z_min] and maximum [X_max, Y_max, Z_max] values of the atom coordinates in each dimension of space. The position [X_min, (Y_min + Y_max) * 0.5, (Z_min + Z_max) * 0.5] then serves as an anchor point for calculating a translation vector to a particular location on the X-axis. The placement X-position starts at 0 and is incremented by X_max - X_min + 4.0 after each conformer has been placed. The constant 4.0 (in Å) represents an additional safety distance and makes sure that no atom Van der Waals sphere clashes occur between successively placed conformers. For the generation of the ith compound conformer, the molecule conformers at index i of the respective ensembles serve as input and its energy represents the total of the molecule conformer energies. If a conformer at index i does not exist, the molecule conformer with the highest index will be selected instead. Compound conformers are generated until all conformers of the largest molecule ensemble(s) have been consumed. A schematic representation of the compound conformer generation workflow can be found in Fig. 13.
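
The placement scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the CONFORT source code: the function names and the data layout (one list of (energy, coordinates) tuples per component) are hypothetical, and the target location on the X-axis is assumed to be (x_pos, 0, 0).

```python
CLEARANCE = 4.0  # safety distance in Å between successively placed conformers

def bounding_box(coords):
    """Axis-aligned bounding box as ((x_min, y_min, z_min), (x_max, y_max, z_max))."""
    xs, ys, zs = zip(*coords)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def build_compound_conformer(ensembles, i):
    """Combine the i-th conformer of every component ensemble."""
    x_pos = 0.0
    total_energy = 0.0
    placed = []
    for ensemble in ensembles:
        # fall back to the highest available index if conformer i does not exist
        energy, coords = ensemble[min(i, len(ensemble) - 1)]
        (x_min, y_min, z_min), (x_max, y_max, z_max) = bounding_box(coords)
        anchor = (x_min, (y_min + y_max) * 0.5, (z_min + z_max) * 0.5)
        # translation that moves the anchor point to (x_pos, 0, 0)
        shift = (x_pos - anchor[0], -anchor[1], -anchor[2])
        placed.extend((x + shift[0], y + shift[1], z + shift[2])
                      for x, y, z in coords)
        total_energy += energy
        x_pos += x_max - x_min + CLEARANCE
    return total_energy, placed

def build_compound_ensemble(ensembles):
    """Generate compound conformers until the largest ensemble is consumed."""
    n = max(len(e) for e in ensembles)
    return [build_compound_conformer(ensembles, i) for i in range(n)]
```

For a two-component compound with ensembles of sizes 2 and 1, two compound conformers result, and the single conformer of the smaller ensemble is reused for the second one.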

The ability of CONFORT to reproduce experimental 3D structures as well as its computational efficiency and robustness has been thoroughly assessed both for typical drug-like molecules and macrocyclic structures. The calculation of various performance metrics and the visual presentation of the obtained results largely follows the established protocol developed by Friedrich et al. [25] with some meaningful additions for the presentation of the macrocycle sampling benchmarks. For reference, several well-known commercial (iCon [7]; two modes), non-open-source (Conformator [11]; two modes) and open-source conformer generators (Balloon [20], RDKit [24]; two modes/parameterizations) were assessed in addition to CONFORT employing the same benchmarking protocol. The corresponding results will be presented and discussed along the ones obtained for CONFORT in the next sections.

Small Molecule Dataset. For assessing the small molecule conformer sampling performance of the herein evaluated generators the Platinum Diverse Dataset [25] was used. The dataset consists of 2,912 high-quality protein-bound ligand conformations extracted from X-ray structural data provided by the PDB and has been designed and compiled especially for the benchmarking of conformer generators [11, 25, 26].

Macrocycle Dataset. To evaluate the conformational sampling of macrocycles, we used a dataset consisting of 208 macrocyclic structures taken from the Cambridge Structural Database (CSD, 130 structures) [68], the PDB (60 structures) and the Biologically Interesting Molecule Reference Dictionary (BIRD) dataset (18 structures) [69]. The dataset (from now on referred to as Prime dataset) was compiled by Sindhikara et al. for benchmarking the conformer generator Prime [49] (later also used for benchmarking other generators [44]) and contains structurally diverse, challenging macrocyclic structures (featuring disulfide bridges, cross-linking amide bonds, polycyclic rings - including cyclodextrins, polyglycines, cycloalkanes and peptidic macrocycles) with high crystallographic quality (low temperature factors and/or resolutions). More details about the composition of the full dataset can be found in the supporting information of [49].

Dataset Preparation. In order to prevent any evaluated conformer generator from taking advantage of the 3D structural information present in the original dataset files, we converted both datasets into isomeric SMILES format by means of a small Python script using functionality provided by CDPKit. The thus generated SMILES files then served as input for the conformer generators in all performed benchmarking runs.

Benchmarking Software and Hardware. The benchmarking runs for each dataset were performed in a fully automated manner using a set of Bash and Python scripts on a dedicated workstation equipped with an Intel Xeon E5-2630 V3 CPU (8 x 2.4 GHz) and 64 GB of DDR3 RAM running CentOS 8 Linux. User interaction was only required for starting the conformer ensemble generation process and, once finished, for analyzing the produced conformer generator output. In each run, all evaluated conformer generators were executed in sequence, without user interaction and without time gaps of indefinite length in between. The source code of the developed benchmarking suite has been licensed under the GNU GPL V2 and is available for download (see Availability of data and materials).

Processing Time Measurement and RMSD Calculation. Measured per-molecule processing times and total program execution times were averaged over five runs to level out runtime differences caused by background system activity and first-time program startup. Furthermore, data was read and written only from/to local hard disk drives. Thus, any influence on I/O-speed and processing times resulting from changes in network latency has been excluded. Depending on availability, per-molecule processing times were either directly extracted from the saved conformer generator log-files in a post-processing step or have been determined indirectly by first time-stamping each line of log-output generated during program execution and then calculating individual molecule processing times from the time-stamp differences of characteristic generator specific log-file lines. In order to eliminate systematic processing time errors due to internal multi-threaded execution (e.g. performed in iCon), single-threaded execution was enforced for all conformer generators by tying started generator processes to a single CPU-core using the Linux command ‘taskset’.
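
The indirect timing approach can be illustrated as follows. The sketch assumes a hypothetical log format in which each molecule is announced by a characteristic line (here one containing "Processing molecule") and in which the last log line marks the end of the last molecule. Neither assumption is taken from a specific generator, and the function names are illustrative.

```python
from statistics import mean

def molecule_times(stamped_lines, is_start):
    """Per-molecule times from (timestamp in s, log line) pairs.

    is_start is a predicate flagging the characteristic line that opens
    the processing of a new molecule (generator specific, hypothetical here).
    """
    if not stamped_lines:
        return []
    starts = [t for t, text in stamped_lines if is_start(text)]
    end = stamped_lines[-1][0]  # time stamp of the final log line
    # each molecule lasts until the next start line (or the end of the log)
    return [b - a for a, b in zip(starts, starts[1:] + [end])]

def average_over_runs(runs):
    """Average the per-molecule times of several runs position by position."""
    return [mean(times) for times in zip(*runs)]
```

Averaging position by position presumes that all runs processed the same molecules in the same order, which holds when the same input file is used for every run.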

The RMSD of generated molecule conformers from the reference 3D structure in the processed dataset has been calculated after performing an RMSD minimizing 3D alignment using CDPKit’s implementation of the Kabsch algorithm [66, 67]. For 3D alignment and RMSD calculation only heavy atoms were considered and all possible topological symmetry mappings have been taken into account. The lowest RMSD that could be obtained in this way among all conformers of an evaluated ensemble was then reported as ‘best RMSD’. Since we encountered slight differences between the conformer ensembles generated by iCon in the five performed runs, calculated best ensemble RMSDs and ensemble sizes were averaged over the five runs for all assessed conformer generators.
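
For illustration, the 'best RMSD' calculation can be sketched with NumPy as shown below. This is a generic formulation of the Kabsch algorithm and not CDPKit's implementation. The function names are illustrative, the coordinates are assumed to contain heavy atoms only, and the symmetry mappings are assumed to be supplied as index tuples in which entry k gives the conformer atom matched to reference atom k.

```python
import numpy as np

def kabsch_rmsd(mobile, target):
    """RMSD after optimally superimposing mobile onto target (N x 3 arrays)."""
    p = np.asarray(mobile, dtype=float)
    q = np.asarray(target, dtype=float)
    p = p - p.mean(axis=0)  # remove the translational component
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)  # SVD of the 3 x 3 covariance matrix
    # correct for a possible reflection so that a proper rotation is obtained
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))

def best_rmsd(ensemble, reference, mappings):
    """Lowest RMSD over all conformers and all supplied symmetry mappings."""
    ref = np.asarray(reference, dtype=float)
    return min(
        kabsch_rmsd(np.asarray(conf, dtype=float)[list(m)], ref)
        for conf in ensemble
        for m in mappings
    )
```

Without the correct symmetry mapping, a conformer that is identical to the reference up to a permutation of equivalent atoms would be assigned a spuriously high RMSD, which is why all mappings are evaluated.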

Assessed Generators. Aside from CONFORT, we benchmarked four additional well-known conformer ensemble generators - Conformator, iCon, Balloon, and RDKit - and compared their conformer sampling performance with the results obtained for CONFORT.

All assessed generators support different conformer sampling modes, methods, or parameterizations that have an impact on the size and quality of the generated conformer ensembles as well as on sampling speed. Hence, we also considered different modes of operation in the performed benchmarks: Conformator and iCon were both run with two sampling quality/speed settings (best and fast) in the small molecule and macrocycle dataset benchmarks. For Balloon, we evaluated two sampling algorithms and the RDKit conformer generator was assessed using two different built-in parameter set versions in all performed benchmarks. In the Platinum Diverse benchmarks CONFORT was run in systematic sampling mode with a) default settings for all other sampling-related parameters ('CONFORT Systematic Default') and b) enabled exhaustive torsion sampling and defaults for the remaining settings ('CONFORT Systematic Best'). In the Prime dataset benchmarks stochastic sampling and default settings for all other sampling parameters were used. A comprehensive overview of all evaluated conformer generators, the applied settings and other relevant information can be found in Table 1.

Notes on Omega

The commercial conformer ensemble generator Omega [8] by OpenEye Scientific Software - which has shown exceptional performance in several previously published benchmarks [7, 26] - would have been a good additional candidate for inclusion in our benchmarking studies. Unfortunately, our academic software license does not allow for publishing Omega benchmarking results without OpenEye’s explicit permission [70]. To enable at least a partial comparison of CONFORT with Omega in the Platinum Diverse benchmarks, we instead provide the benchmarking results (see Table 2) recently published by Friedrich et al. [11]. In our studies we used the same dataset and RMSD calculation method as Friedrich et al. and the results we obtained internally for Omega differed only insignificantly (e.g. deviations between the calculated mean accuracies were 0.012 Å for max. ensemble size = 50 and 0.009 Å for max. ensemble size = 250, respectively).

Table 1

Conformer Generators and associated Settings employed in the performed Benchmarking Studies
Generator	Settings	Clustering^a	Forcefield	Prog. Version
Balloon	DG^b	RMSD	MMFF94	1.7.0
Balloon	GA^c	RMSD	MMFF94	1.7.0
CONFORT	Systematic Best^d	RMSD	MMFF94s_RTOR_NO_ESTAT^m	CDPKit source 2021/02/27
CONFORT	Systematic Default^e	RMSD	MMFF94s_RTOR_NO_ESTAT^m
CONFORT	Stochastic^f	RMSD	MMFF94s_RTOR^l
Conformator	Best^g	RMSD	- / MCOS^j	1.1.0
Conformator	Fast^g	RMSD	- / MCOS^j	1.1.0
iCon	Best^g	RMSD	MMFF94s^k	4.4.7
iCon	Fast^g	RMSD	MMFF94s^k	4.4.7
RDKit	KDG^h	-	UFFⁿ	RDKit source 2020/09/2
RDKit	ETKDGv3ⁱ	-	UFFⁿ	RDKit source 2020/09/2

^aConformer similarity measure used in the compilation of diverse output ensembles. ^bGenetic algorithm (GA) disabled (--noGA flag), conformers generated via DG only. ^cGA enabled (default). ^dEnforced systematic conformer sampling mode with larger energy window (20 kcal/mol), lower RMSD-threshold (0.3 Å) and enabled torsion angle tolerance range sampling. ^eEnforced systematic conformer sampling mode using default settings for energy window (15 kcal/mol) and RMSD-threshold (0.5 Å). ^fEnforced stochastic conformer sampling using default parameter settings. ^gPredefined conformer generation modes provided by the developers. ^hConformer generation using DG embedding parameters for the KDG method [71]. ⁱConformer generation using DG embedding parameters for the ETKDGv3 method including improvements for small rings [24, 71]. ^jMacrocycle Optimization Score (MCOS), only used for macrocycle optimization [11]. ^kMMFF94 parameter set variant which enforces planarity of delocalized trigonal Nitrogen atoms [72]. ^lMMFF94s using a refined torsion interaction parameter set [73]. ^mMMFF94s_RTOR excluding electrostatic interaction terms. ⁿUniversal Force Field (UFF) [74].

Small Molecule Conformer Sampling Performance.

As shown in Table 2, our new conformer generator CONFORT is able to retrieve the bioactive conformations of the input molecules more accurately on average than all other generators we tested. This is achieved while beating them regarding runtime, most methods even by more than an order of magnitude. Conformator is able to find better conformations when only the best conformers of its produced ensembles are considered, but as the reference conformation would not be available in practical applications, these conformers could not be reliably identified. Conformator also produces smaller ensembles when run with its 'Fast' presets, which may be generally desirable, but as the user of a given method will have to continue working with the generated conformations, the results being worse on average is a significant downside. The performance benefits of CONFORT over Conformator are also large enough to have real-world impact, with a 21.8-fold speedup in mean per-molecule processing time (maximum ensemble size 50) when comparing CONFORT’s 'Best' to Conformator's 'Fast' variant, which is the worst-case comparison for our method. The runtimes of our benchmarks also show that no other tested method can compete with our new implementation regarding time spent, with the best competitor, the commercially available iCon conformer generator with 'Fast' presets, still requiring 3.2 times the mean processing time of our method.

Table 2

Conformer Generator Performance Comparison for the Platinum Diverse Dataset
Generator	Maximum ensemble size 50				Maximum ensemble size 250
Generator	Mean	Median	Min.	Max.	Mean	Median	Min.	Max.
RMSD (Å)
Balloon DG	0.967	0.788	0.027	4.564	0.806	0.613	0.025	3.996
Balloon GA	0.989	0.858	0.027	4.606	0.833	0.711	0.027	4.336
CONFORT Systematic Best	0.669	0.486	0.030	3.919	0.559	0.416	0.030	3.675
CONFORT Systematic Default	0.683	0.548	0.036	3.140	0.616	0.524	0.036	2.793
Conformator Best	0.680	0.589	0.020	3.255	0.568	0.474	0.020	2.931
Conformator Fast	0.747	0.654	0.020	3.511	0.637	0.538	0.020	3.329
iCon Best	0.701	0.588	0.030	3.662	0.639	0.569	0.030	3.662
iCon Fast	0.719	0.535	0.030	3.921	0.598	0.474	0.030	3.662
RDKit ETKDGv3	0.711	0.590	0.038	4.386	0.604	0.518	0.038	3.367
RDKit KDG	0.738	0.596	0.038	4.156	0.621	0.515	0.038	3.638
Conformator Best^a	0.68	0.58	-	-	0.57	0.47	-	-
Conformator Fast^a	0.75	0.66	-	-	0.64	0.53	-	-
Omega^a	0.67	0.51	-	-	0.57	0.46	-	-
Ensemble Size
Balloon DG	49.09	50	-	-	241.55	250	-	-
Balloon GA	38.19	43	-	-	180.33	210	-	-
CONFORT Systematic Best	39.05	50	-	-	149.26	212	-	-
CONFORT Systematic Default	28.80	29	-	-	82.07	29	-	-
Conformator Best	38.60	42	-	-	167.37	189	-	-
Conformator Fast	20.53	19	-	-	71.46	55	-	-
iCon Best	28.96	32	-	-	71.99	32	-	-
iCon Fast	35.07	50	-	-	122.29	89	-	-
RDKit ETKDGv3	50.00	50	-	-	249.99	250	-	-
RDKit KDG	50.00	50	-	-	250.00	250	-	-
Conformator Best^a	38	42	-	-	166	187	-	-
Conformator Fast^a	20	19	-	-	70	54	-	-
Omega^a	34	50	-	-	118	74	-	-
Processing Time (s)
Balloon DG	16.939	14.432	0.711	89.403	81.763	70.543	1.285	461.502
Balloon GA	12.596	10.982	0.042	49.986	66.587	58.269	0.472	262.118
CONFORT Systematic Best	0.164	0.021	0.001	14.295	0.334	0.07	0.001	16.431
CONFORT Systematic Default	0.102	0.012	0.001	19.769	0.224	0.014	0.001	19.942
Conformator Best	4.028	0.344	0.009	268.220	6.579	2.160	0.009	353.579
Conformator Fast	3.581	0.220	0.008	262.060	4.240	0.516	0.008	293.616
iCon Best	0.652	0.199	0.002	64.742	0.755	0.275	0.002	65.130
iCon Fast	0.527	0.174	0.002	30.975	0.652	0.269	0.002	31.132
RDKit ETKDGv3	5.539	2.962	0.102	932.943	27.377	14.903	0.541	3657.464
RDKit KDG	4.075	2.910	0.077	26.434	20.590	14.664	0.423	132.641

The best values for RMSD, ensemble size, and molecule processing time obtained by any assessed generator are written in bold letters. ^aThe values listed for Conformator were taken from [11] and included for reference to demonstrate the correctness of our benchmarking code by the close reproduction of the previously published results for this generator. As a side effect, this allows us to also include the results for Omega published in [11], which we could not assess in our study due to licensing reasons.

It is also important to note that most other conformer generators were not able to handle all of the presented molecules and failed to produce any results for some molecules. The overall program execution times and number of molecules for which processing failed are listed in Table 3. Only iCon was also able to compute conformers for all molecules regardless of ensemble size, while again, all competitors were significantly slower than our method. The computation of the conformers for the whole dataset took the more thorough CONFORT 'Best' variant under 9 minutes while the commercially available iCon took 26 minutes with the 'Fast' variant. Freely available algorithms required at least 2 hours and 50 minutes (Conformator Fast) with many of them taking significantly longer than even that method. This comparison nicely visualizes the real impact of these runtimes on a researcher whose work depends on the generation of conformers.

Table 3

Total Program Execution Times and Molecule Processing Failures recorded for the Platinum Diverse Dataset
Generator	Maximum ensemble size 50		Maximum ensemble size 250
Generator	Total execution time (hh:mm:ss)^a	Number of failed molecules	Total execution time (hh:mm:ss)^a	Number of failed molecules
Balloon DG	13:26:45	0	64:55:43	1
Balloon GA	10:18:53	3	53:12:55	3
CONFORT Systematic Best	00:08:55	0	00:20:23	0
CONFORT Systematic Default	00:05:43	0	00:13:14	0
Conformator Best	03:11:41	4	05:13:06	4
Conformator Fast	02:50:25	4	03:21:48	4
iCon Best	00:31:59	0	00:37:46	0
iCon Fast	00:26:07	0	00:33:44	0
RDKit ETKDGv3	04:36:43	1	22:48:01	1
RDKit KDG	03:34:20	1	18:03:37	1

^aWall time elapsed between program start and termination.

As can be seen in Table 4, the largest improvement of CONFORT regarding accuracy can be identified in the number of molecules for which conformer generation produces especially similar results to the true bioactive conformation (RMSD < 0.5 Å). CONFORT Systematic Best is thereby able to produce more conformers close to this ground truth, while still performing similarly to the competitors for larger RMSD accuracy ranges.

Table 4

Percentage of Platinum Diverse Dataset Structures successfully reproduced below specified RMSD thresholds
Generator	Maximum ensemble size 50				Maximum ensemble size 250
	RMSD threshold (Å)
	0.5	1.0	1.5	2.0	0.5	1.0	1.5	2.0
Balloon DG	32.4	61.1	78.5	89.6	41.8	69.7	84.5	93.6
Balloon GA	26.6	58.2	79.2	91.1	34.2	67.6	86.6	94.7
CONFORT Systematic Best	51.4	78.8	90.3	96.4	59.5	86.4	95.0	98.3
CONFORT Systematic Default	44.5	80.0	92.2	97.2	47.1	85.2	95.6	99.0
Conformator Best	41.8	78.0	93.2	98.4	52.9	85.9	96.5	99.0
Conformator Fast	36.8	73.8	91.8	98.0	45.9	83.2	95.5	98.8
iCon Best	36.5	80.8	93.1	97.9	38.0	86.6	96.2	99.0
iCon Fast	46.1	76.5	89.5	96.5	54.3	85.3	95.1	98.4
RDKit ETKDGv3	40.9	75.7	92.0	97.8	47.6	84.9	96.5	99.4
RDKit KDG	40.7	75.0	90.3	96.6	48.5	83.3	95.3	98.5
Conformator Best^a	42	78	94	98	53	86	97	99
Conformator Fast^a	37	73	91	98	46	83	95	99
Omega^a	49	80	92	97	56	87	96	99

The best value for each RMSD threshold obtained by any assessed generator is written in bold letters. ^aThe values listed for Conformator were taken from [11] and included for reference to demonstrate the correctness of our benchmarking code by the close reproduction of the previously published results for this generator. As a side effect, this allows us to also include the results for Omega published in [11], which we could not assess in our study due to licensing reasons.
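
The reproduction rates in the table above follow directly from the per-molecule best RMSD values. A minimal sketch of this calculation is given below, assuming that a structure counts as reproduced when its best RMSD lies strictly below the threshold and that all molecules in the supplied list enter the denominator. The function name is illustrative and the treatment of failed molecules is not covered.

```python
def success_rates(best_rmsds, thresholds=(0.5, 1.0, 1.5, 2.0)):
    """Percentage of structures whose best RMSD (Å) lies below each threshold."""
    n = len(best_rmsds)
    return {t: 100.0 * sum(r < t for r in best_rmsds) / n for t in thresholds}
```

For example, four best RMSD values of 0.3, 0.7, 1.2 and 2.5 Å yield rates of 25, 50, 75 and 75 percent for the four default thresholds.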

In practice, the ensemble size produced during conformer generation is also an important factor. If a large number of conformers is produced, it is more likely that one of them is close to the bioactive conformation, as a fundamental goal of conformer generation is to produce a sufficiently diverse set of results. Still, this increase in the number of conformations comes with the downside of having to search through this larger solution space for the conformer(s) desirable for a given use-case and results in a larger uncertainty of whether or not the correct one was chosen. For this reason, we visualized in Fig. 14 how the accuracy, ensemble size and processing times compare between the different conformer generation methods. We limited the ensemble sizes to 50 and 250 for the different measurements, and it can be seen that CONFORT lies approximately in the middle regarding how much of the ensemble size limit it uses, with the 'Default' variant producing smaller ensembles than the 'Best' variant (see also Table 2). Generally, all methods would be allowed to output maximally large ensembles (exactly the maximum allowed number for each molecule), but, with the exception of RDKit, they all discard solutions that are too similar. Even though CONFORT produces similarly sized ensembles, its runtime is unparalleled by the other methods and the accuracy is nearly always better. This shows that the aforementioned benefits in accuracy and especially runtime are not an artifact of larger ensembles.

The size of ensembles usually relates to the degrees of freedom that can be sampled during conformer generation. Rotatable bonds are thereby the largest contributing factor as the 3D position of molecule substructures connected to each of these bonds can vary, leading to a combinatorial problem when assembling the final conformer. Figure 15 shows this property nicely, as most conformer generators produce larger ensembles with an increasing number of rotatable bonds. Exceptions to this are the variants of Balloon, which seem to be agnostic to this number, and Conformator’s 'Best' variant, which does not use the full amount of allowed conformers, even for complex molecules.

The molecules are also likely to contain more atoms with an increasing number of rotatable bonds. This introduces more possibilities of errors in the relative position of the atoms compared to the crystal structure, leading to a higher expected RMSD than for small molecules when using the same conformer generation method and, therefore, lower accuracy. With increasing molecule size, this inevitably leads to a decreasing number of molecules retrieved with any fixed RMSD precision. Figure 16 visualizes this relationship for the different benchmarked methods and shows how the different algorithms are affected by this phenomenon. While all methods retrieve fewer molecules within a fixed threshold with an increasing number of rotatable bonds, our CONFORT variants are usually able to handle this increase better than the competitors. Not only is the retrieval of molecules with few rotatable bonds more accurate, larger ones with more rotatable bonds also experience less of an accuracy loss compared to the other benchmarked methods. Other methods we could identify to handle larger numbers of rotatable bonds well include 'iCon Best' (1 Å), 'Conformator Best' (1.5-2 Å) and 'RDKit ETKDGv3' (2.5 Å) which, as can be seen in Table 4, show good reproduction rates with increasing RMSD thresholds. For the percentage of molecules retrieved within the accuracy threshold of RMSD ≤ 0.5 Å, 'CONFORT Systematic Best' outperformed all other methods.

Macrocycle Conformer Sampling Performance.

Macrocyclic compounds usually pose an even larger challenge for conformer generators due to the large number of (partially) rotatable bonds contributed by flexible side chains and the macrocyclic ring system itself. For this reason, we separately compare the performance of the different conformer generation methods to our CONFORT Stochastic variant, which has been specifically designed for macrocycles and large/complex compounds.

Table 5 shows the performed measurements. CONFORT Stochastic achieves on average higher accuracy than any other benchmarked method. When only measuring the accuracy for atoms constituting the macrocyclic ring, it can be observed that the relative improvement over other methods is even larger. Still, this improved conformer quality comes at the cost of longer runtimes compared to other methods. As the quality of the results is significantly better and the relevance of macrocyclic compounds for drug development is continuously rising, our new method will provide drastic improvements when working with such molecules. The only notable competitor in this domain is the RDKit variant ETKDGv3. This method is able to outperform CONFORT Stochastic regarding runtime while producing conformers that closely resemble experimental 3D structures. Still, the conformers produced by RDKit are more difficult to use in practice as the method always generates conformers up to the maximum allowed ensemble size. This larger ensemble size, compared to our method’s 255.66 on average, may also be part of the reason why this competitor produces more sensible results than the other tested methods.

Table 5

Conformer Generator Performance Comparison for the Prime Macrocycle Dataset
Generator	Mean	Median	Min.	Max.
Overall RMSD (Å)^a
Balloon DG	1.782	1.434	0.043	7.637
Balloon GA	1.851	1.592	0.032	5.829
CONFORT Stochastic	1.153	0.942	0.042	5.249
Conformator Best	1.563	1.187	0.180	6.478
Conformator Fast	1.615	1.252	0.221	6.468
iCon Best	1.472	1.115	0.052	6.372
iCon Fast	1.479	1.173	0.052	5.928
RDKit ETKDGv3	1.247	0.978	0.052	4.842
RDKit KDG	1.439	1.305	0.049	4.543
Macrocycle RMSD (Å)^b
Balloon DG	0.938	0.736	0.029	3.855
Balloon GA	1.140	0.917	0.025	4.528
CONFORT Stochastic	0.586	0.473	0.028	3.155
Conformator Best	0.974	0.808	0.126	5.193
Conformator Fast	1.026	0.857	0.126	3.841
iCon Best	0.913	0.714	0.035	4.753
iCon Fast	0.950	0.762	0.035	5.240
RDKit ETKDGv3	0.672	0.584	0.027	3.141
RDKit KDG	0.833	0.747	0.027	3.379
Ensemble Size
Balloon DG	321.20	500	-	-
Balloon GA	145.28	64	-	-
CONFORT Stochastic	255.66	242	-	-
Conformator Best	236.02	233	-	-
Conformator Fast	99.53	59	-	-
iCon Best	69.22	43.8	-	-
iCon Fast	130.72	69	-	-
RDKit ETKDGv3	500.00	500	-	-
RDKit KDG	500.00	500	-	-
Processing Time (s)
Balloon DG	262.62	174.45	9.87	1453.09
Balloon GA	494.88	343.06	40.58	2602.07
CONFORT Stochastic	977.88	432.91	12.10	10609.27
Conformator Best	471.40	169.61	7.19	43358.54
Conformator Fast	186.90	111.53	1.54	5615.60
iCon Best	28.53	30.17	0.72	492.15
iCon Fast	26.05	24.91	0.68	574.71
RDKit ETKDGv3	677.82	159.89	7.25	34714.72
RDKit KDG	312.48	148.09	7.37	2962.38

The best values for RMSD, ensemble size, and molecule processing time obtained by any assessed generator are written in bold letters. ^aRMSD taking into account all heavy atoms of the molecule. ^bRMSD taking into account only the heavy atoms constituting the macrocyclic ring system.

The difficulty of this task can be seen once again in Table 6 where the number of molecules for which the conformer generation failed is shown. Besides CONFORT Stochastic, only RDKit was able to complete the computations without failure, while the other methods could not produce any results for at least one molecule. Notably, Conformator even crashed completely with a segmentation fault when presented with six of the molecules in the dataset (VEVHAF, POTTEY, PRD_000785, 1YND_SFA, 1NMK_SFM, 3I6O_GR6).

Table 6

Total Program Execution Times and Molecule Processing Failures recorded for the Prime Macrocycle Dataset
Generator	Total execution time (hh:mm:ss)^a	Number of failed molecules
Balloon DG	14:31:35	9
Balloon GA	27:20:21	11
CONFORT Stochastic	56:30:54	0
Conformator Best	26:29:35	7
Conformator Fast	10:46:42	9
iCon Best	01:48:45	1
iCon Fast	01:40:27	1
RDKit ETKDGv3	39:09:47	0
RDKit KDG	18:03:17	0

^aWall time elapsed between program start and termination.

Just as for the previous benchmark, we computed the percentages of molecules for which conformers within a certain RMSD threshold have been found by the individual methods. Table 7 shows the resulting values, with CONFORT Stochastic achieving the highest reproduction rate at every presented threshold (at 4.0 Å shared with both RDKit variants). The next best values were measured for RDKit ETKDGv3.

Table 7

Percentage of Prime Macrocycle Dataset Structures successfully reproduced below specified RMSD thresholds
Generator	RMSD threshold (Å)
Generator	0.5	1.0	1.5	2.0	2.5	3.0	3.5	4.0
Balloon DG	16.3	35.6	49.0	61.1	70.7	77.9	82.7	87.5
Balloon GA	5.8	23.6	43.8	59.6	71.6	79.3	85.6	90.4
CONFORT Stochastic	25.5	52.4	75.0	81.7	92.3	95.7	98.6	98.6
Conformator Best	11.5	39.9	60.6	70.7	80.3	86.5	89.4	92.3
Conformator Fast	7.2	36.1	55.3	69.2	78.4	84.6	88.9	90.9
iCon Best	19.2	45.7	60.6	72.6	80.8	90.4	93.7	95.7
iCon Fast	25.0	44.7	60.6	70.7	80.8	88.0	93.8	95.7
RDKit ETKDGv3	22.6	51.4	67.8	79.8	89.4	94.7	97.6	98.6
RDKit KDG	12.5	38.9	61.5	74.0	87.0	94.2	97.6	98.6

The best value for each RMSD threshold obtained by any of the assessed generators is written in bold letters.

The accuracy benefits of CONFORT can, once again, be seen in Fig. 17. The produced ensembles are of average size compared to our competitors, while the method needs more time to produce the results than others. This increase in computation time should be more than made up for by the increase in quality over the other algorithms.

The quality of CONFORT Stochastic’s results is visualized cumulatively over the number of rotatable bonds and size of the macrocycle in Fig. 18. There, the plots show that for nearly all possible thresholds for those two properties, our method drastically outperforms any other algorithm regarding the accuracy of generated conformers.

We introduced and described the implementation of the novel conformer ensemble generator CONFORT which has been designed and developed to deliver top-level performance for all types of organic molecules in the drug-like chemical space. CONFORT is fully open-source and available as part of the Chemical Data Processing Toolkit in the form of a versatile command-line tool that accepts a wide panel of input data formats and a set of classes and functions provided by CDPKit’s C++/Python-API. CONFORT’s implementation is based on established concepts and algorithms which have proven their power with respect to conformer ensemble generation in other well-known software tools for this purpose. For the computationally efficient and accurate conformational sampling of drug-like small molecules a knowledge-based systematic approach is employed, which makes extensive use of pre-generated fragment and torsion angle libraries that were derived from experimental 3D structures. For the sampling of macrocycle conformers, CONFORT implements a purely stochastic approach based on a combination of DG and 3D structure refinement by iterative MMFF94 energy minimization. Irrespective of the sampling approach, CONFORT does not require any input atom 3D coordinates and is able to generate conformer ensembles solely from molecular graph connection table information. The conformer sampling approach best suited for a processed input molecule is either chosen automatically or can be specified by the user in advance for all molecules to process. Furthermore, CONFORT correctly handles compounds consisting of multiple molecules like salts and mixtures by a separate generation and later combination of individual component conformer ensembles.

CONFORT’s capability to reproduce experimental 3D structures, as well as its computational efficiency and robustness, has been assessed for typical drug-like organic molecules using the Platinum Diverse Dataset and for macrocyclic systems using a dataset of 208 molecules compiled by Sindhikara et al. [49]. The calculation of performance metrics and the visual presentation of the results largely followed the established protocol developed by Friedrich et al. [25], with extensions for the presentation of macrocycle sampling results. For comparison, several well-known commercial (iCon; two modes), non-open-source (Conformator; two modes) and open-source conformer generators (Balloon, RDKit; two modes/parameterizations) were benchmarked alongside CONFORT using the same protocol. For the Platinum Diverse Dataset benchmarks, two runs with maximum output ensemble sizes (MES) of 50 and 250 representative conformers were performed; for testing the macrocycle sampling performance, an MES of 500 was chosen. In the Platinum Diverse Dataset benchmarks, CONFORT achieved a median accuracy in the reproduction of bioactive conformations of 0.486 Å (MES = 50) and 0.416 Å (MES = 250), which were the best values among all tested generators (see Table 2). At the same time, CONFORT had the lowest mean processing time per molecule (12 ms for MES = 50 and 14 ms for MES = 250), the lowest mean output ensemble size (29 conformers) for an MES of 250, and the second-lowest mean output ensemble size for an MES of 50. The mean accuracy of 1.153 Å achieved by CONFORT for the macrocycle dataset was again the best among all benchmarked conformer ensemble generators (see Table 5). However, CONFORT also showed the highest mean processing time per molecule (432.91 s) and produced output ensembles with a mean size of 242 conformers (see Table 5). These results indicate that there is still room for improvement in macrocycle conformer sampling and that, for macrocycles, CONFORT is currently best suited for low-throughput applications that favor accuracy over speed.
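To make the reported accuracy figures concrete, the following Python sketch shows how such numbers are typically obtained. It assumes that "accuracy" denotes the lowest RMSD between any conformer of the output ensemble and the experimental structure, and that the mean or median of these per-molecule values is then reported. The function names are hypothetical, and the sketch further assumes that all coordinates are already optimally superposed and share the same atom ordering; a real benchmark would additionally perform the superposition and account for molecular symmetry.

```python
import math
from statistics import mean, median


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in the input unit, e.g. Å) of two coordinate lists.

    Assumption: both structures are pre-superposed and use identical atom ordering.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_rmsd(reference, ensemble):
    """Lowest RMSD between the experimental structure and any ensemble member."""
    return min(rmsd(reference, conformer) for conformer in ensemble)


def summarize(best_values):
    """Aggregate per-molecule best RMSD values over a whole dataset."""
    return {"mean": mean(best_values), "median": median(best_values)}


# Toy example: a two-atom "molecule" with an ensemble of two conformers.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)],  # second atom displaced by 1 along z
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to the reference
]
best = best_rmsd(reference, ensemble)
```

Because only the best-matching ensemble member counts, larger ensembles tend to yield lower values, which is why accuracy is reported together with the ensemble size limit (MES) and the mean output ensemble size.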

In summary, the presented open-source conformer ensemble generator CONFORT has proven able to deliver performance at the highest level in all aspects of relevance. To our knowledge, CONFORT is the first open-source conformer generator that can compete with market-leading commercial software in this field. It provides the scientific community with a truly free, high-quality alternative that facilitates open research because the use of the generated data is not subject to any restrictions. As part of the CDPKit project, CONFORT will be actively maintained and developed further. Planned future improvements include speed optimizations for stochastic sampling, support for other general-purpose force fields such as the Open Force Field [75], and functionality for restricting conformer sampling to user-specified parts of the processed molecule.

Availability and Requirements

Project name: CDPKit - Chemical Data Processing Toolkit

Project home page: CDPKit source code repository at https://github.com/aglanger/CDPKit

Operating systems: Linux, Windows, Mac OS X

Programming language: C++11, Python V3

Other requirements: CMake V3.17 or higher, Boost C++ libraries V1.52 or higher

License: GNU LGPL V2.1

Availability of data and materials

The source code of the developed benchmarking suite (see Results and Discussion), both datasets in SDF and SMILES format, the conformer ensembles generated by CONFORT and the benchmarking result files for all generators (CSV-files and figures) can be downloaded from https://phaidra.univie.ac.at/o:1433151. The CDPKit source code in the version at the time of benchmarking is available at https://phaidra.univie.ac.at/o:1246931. An installer for the corresponding CDPKit binaries (compiled for RHEL 8.x based systems) can be downloaded from https://phaidra.univie.ac.at/o:1246933.

Competing interests

The authors declare that they have no competing interests of any kind.

Funding

The authors gratefully acknowledge funding by the University of Vienna Research Platform NeGeMac (Next Generation Macrocycles to Address Challenging Protein Interfaces).

Authors' contributions

TS implemented CONFORT and parts of the benchmarking code, is the main developer of CDPKit, carried out and supervised the conformer generator benchmarks, analyzed the results and wrote the manuscript. CP is the main developer of the benchmarking code, analyzed benchmarking results and contributed to the writing of the manuscript. SK, OW and TL contributed to the writing and proofreading of the manuscript.

Supplementary Information

Additional file 1. Document providing a table listing all options supported by CONFORT’s command-line interface.

References

Perola E, Charifson PS (2004) Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J Med Chem 47:2499–2510. https://doi.org/10.1021/jm030563w
Lyne PD (2002) Structure-based virtual screening: an overview. Drug Discov Today 7:1047–1055. https://doi.org/10.1016/s1359-6446(02)02483-2
Venkatraman V, Pérez-Nueno VI, Mavridis L, Ritchie DW (2010) Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J Chem Inf Model 50:2079–2093. https://doi.org/10.1021/ci100263p
Hu G, Kuang G, Xiao W, et al (2012) Performance evaluation of 2D fingerprint and 3D shape similarity methods in virtual screening. J Chem Inf Model 52:1103–1113. https://doi.org/10.1021/ci300030u
Langer T, Wolber G (2004) Pharmacophore definition and 3D searches. Drug Discov Today Technol 1:203–207. https://doi.org/10.1016/j.ddtec.2004.11.015
Seidel T, Ibis G, Bendix F, et al (2010) Strategies for 3D pharmacophore-based virtual screening. Drug Discov Today Technol 7:e203–70. https://doi.org/10.1016/j.ddtec.2010.11.004
Poli G, Seidel T, Langer T (2018) Conformational Sampling of Small Molecules With iCon: Performance Assessment in Comparison With OMEGA. Front Chem 6:229. https://doi.org/10.3389/fchem.2018.00229
Hawkins PCD, Skillman AG, Warren GL, et al (2010) Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J Chem Inf Model 50:572–584. https://doi.org/10.1021/ci100031x
Watts KS, Dalal P, Murphy RB, et al (2010) ConfGen: a conformational search method for efficient generation of bioactive conformers. J Chem Inf Model 50:534–546. https://doi.org/10.1021/ci100015j
Li J, Ehlers T, Sutter J, et al (2007) CAESAR: a new conformer generation algorithm based on recursive buildup and local rotational symmetry consideration. J Chem Inf Model 47:1923–1932. https://doi.org/10.1021/ci700136x
Friedrich N-O, Flachsenberg F, Meyder A, et al (2019) Conformator: A Novel Method for the Generation of Conformer Ensembles. J Chem Inf Model 59:731–742. https://doi.org/10.1021/acs.jcim.8b00704
Sadowski P, Baldi P (2013) Small-molecule 3D structure prediction using open crystallography data. J Chem Inf Model 53:3127–3130. https://doi.org/10.1021/ci4005282
Labute P (2010) LowModeMD—Implicit Low-Mode Velocity Filtering Applied to Conformational Search of Macrocycles and Protein Loops. Journal of Chemical Information and Modeling 50:792–800
Molecular operating environment (MOE). https://www.chemcomp.com/Products.htm. Accessed 30 Mar 2022
Leite TB, Gomes D, Miteva MA, et al (2007) Frog: a FRee Online druG 3D conformation generator. Nucleic Acids Res 35:W568–72. https://doi.org/10.1093/nar/gkm289
Miteva MA, Guyon F, Tufféry P (2010) Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic Acids Res 38:W622–7. https://doi.org/10.1093/nar/gkq325
O’Boyle NM, Vandermeersch T, Flynn CJ, et al (2011) Confab - Systematic generation of diverse low-energy conformers. J Cheminform 3:8. https://doi.org/10.1186/1758-2946-3-8
Kothiwale S, Mendenhall JL, Meiler J (2015) BCL::Conf: small molecule conformational sampling using a knowledge based rotamer library. J Cheminform 7:47. https://doi.org/10.1186/s13321-015-0095-1
RDKit Homepage. http://www.rdkit.org. Accessed 24 Mar 2022
Vainio MJ, Johnson MS (2007) Generating conformer ensembles using a multiobjective genetic algorithm. J Chem Inf Model 47:2462–2474. https://doi.org/10.1021/ci6005646
Sauton N, Lagorce D, Villoutreix BO, Miteva MA (2008) MS-DOCK: accurate multiple conformation generator and rigid docking protocol for multi-step virtual ligand screening. BMC Bioinformatics 9:184. https://doi.org/10.1186/1471-2105-9-184
Blaney JM, Dixon JS (2007) Distance geometry in molecular modeling. In: Reviews in Computational Chemistry. John Wiley & Sons, Inc., Hoboken, NJ, USA, pp 299–335
Crippen GM, Havel TF (1988) Distance geometry and molecular conformation
Wang S, Witek J, Landrum GA, Riniker S (2020) Improving Conformer Generation for Small Rings and Macrocycles Based on Distance Geometry and Experimental Torsional-Angle Preferences. J Chem Inf Model 60:2044–2058. https://doi.org/10.1021/acs.jcim.0c00025
Friedrich N-O, Meyder A, de Bruyn Kops C, et al (2017) High-Quality Dataset of Protein-Bound Ligand Conformations and Its Application to Benchmarking Conformer Ensemble Generators. J Chem Inf Model 57:529–539
Friedrich N-O, de Bruyn Kops C, Flachsenberg F, et al (2017) Benchmarking commercial conformer ensemble generators. J Chem Inf Model 57:2719–2728. https://doi.org/10.1021/acs.jcim.7b00505
Berman HM, Westbrook J, Feng Z, et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
Yudin AK (2015) Macrocycles: lessons from the distant past, recent developments, and future directions. Chem Sci 6:30–49. https://doi.org/10.1039/c4sc03089c
Marsault E, Peterson ML (2011) Macrocycles are great cycles: applications, opportunities, and challenges of synthetic macrocycles in drug discovery. J Med Chem 54:1961–2004. https://doi.org/10.1021/jm1012374
Mallinson J, Collins I (2012) Macrocycles in new drug discovery. Future Med Chem 4:1409–1438. https://doi.org/10.4155/fmc.12.93
Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 46:3–26. https://doi.org/10.1016/S0169-409X(00)00129-0
Bell IM, Gallicchio SN, Abrams M, et al (2002) 3-Aminopyrrolidinone Farnesyltransferase Inhibitors: Design of Macrocyclic Compounds with Improved Pharmacokinetics and Excellent Cell Potency. J Med Chem 45:2388–2409
Dougherty PG, Qian Z, Pei D (2017) Macrocycles as protein-protein interaction inhibitors. Biochem J 474:1109–1125. https://doi.org/10.1042/BCJ20160619
Giordanetto F, Kihlberg J (2014) Macrocyclic drugs and clinical candidates: what can medicinal chemists learn from their properties? J Med Chem 57:278–295. https://doi.org/10.1021/jm400887j
Rezai T, Bock JE, Zhou MV, et al (2006) Conformational Flexibility, Internal Hydrogen Bonding, and Passive Membrane Permeability: Successful in Silico Prediction of the Relative Permeabilities of Cyclic Peptides. J Am Chem Soc 128:14073–14080
Dömling A (2008) Small molecular weight protein–protein interaction antagonists—an insurmountable challenge? Current Opinion in Chemical Biology 12:281–291
Doak BC, Over B, Giordanetto F, Kihlberg J (2014) Oral Druggable Space beyond the Rule of 5: Insights from Drugs and Clinical Candidates. Chem Biol 21:1115–1142
Kwitkowski VE, Prowell TM, Ibrahim A, et al (2010) FDA approval summary: temsirolimus as treatment for advanced renal cell carcinoma. Oncologist 15:428–435. https://doi.org/10.1634/theoncologist.2009-0178
Raymond E, Alexandre J, Faivre S, et al (2004) Safety and pharmacokinetics of escalated doses of weekly intravenous infusion of CCI-779, a novel mTOR inhibitor, in patients with cancer. J Clin Oncol 22:2336–2347
Goodin S (2008) Novel cytotoxic agents: epothilones. Am J Health Syst Pharm 65:S10–5. https://doi.org/10.2146/ajhp080089
Goodin S (2008) Ixabepilone: a novel microtubule-stabilizing agent for the treatment of metastatic breast cancer. Am J Health Syst Pharm 65:2017–2026. https://doi.org/10.2146/ajhp070628
Marsault E, Peterson ML (2017) Practical Medicinal Chemistry with Macrocycles: Design, Synthesis, and Case Studies. John Wiley & Sons
Hawkins PCD (2017) Conformation Generation: The State of the Art. J Chem Inf Model 57:1747–1756. https://doi.org/10.1021/acs.jcim.7b00221
Reyes Romero A, Ruiz-Moreno AJ, Groves MR, et al (2020) Benchmark of Generic Shapes for Macrocycles. J Chem Inf Model 60:6298–6313. https://doi.org/10.1021/acs.jcim.0c01038
Olanders G, Alogheli H, Brandt P, Karlén A (2020) Conformational analysis of macrocycles: comparing general and specialized methods. J Comput Aided Mol Des 34:231–252
Poongavanam V, Danelius E, Peintner S, et al (2018) Conformational Sampling of Macrocyclic Drugs in Different Environments: Can We Find the Relevant Conformations? ACS Omega 3:11742–11757. https://doi.org/10.1021/acsomega.8b01379
Omega Theory Manual — Macrocycle conformations. https://docs.eyesopen.com/applications/omega/theory/macrocycle_theory.html. Accessed 8 Apr 2022
Watts KS, Dalal P, et al (2014) Macrocycle Conformational Sampling with MacroModel. J Chem Inf Model 54:2680–2696
Sindhikara D, Spronk SA, Day T, et al (2017) Improving Accuracy, Diversity, and Speed with Prime Macrocycle Conformational Sampling. J Chem Inf Model 57:1881–1894. https://doi.org/10.1021/acs.jcim.7b00052
Morgan HL (1965) The generation of a unique machine description for chemical structures-A technique developed at chemical abstracts service. J Chem Doc 5:107–113. https://doi.org/10.1021/c160017a018
McKay B (1981) Practical graph isomorphism, Numerical mathematics and computing, Proc. 10th Manitoba Conf., Winnipeg/Manitoba 1980
Halgren TA (1996) Merck molecular force field. II. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem 17:520–552. https://doi.org/10.1002/(sici)1096-987x(199604)17:5/6<520::aid-jcc2>3.0.co;2-w
Floyd RW (1962) Algorithm 97: Shortest path. Commun ACM 5:345. https://doi.org/10.1145/367766.368168
Kearsley S (1999) MMFF94 Validation Suite. http://www.ccl.net/cca/data/MMFF94/ver.98.05.22/index.html. Accessed 4 Nov 2021
Agrafiotis DK (2003) Stochastic proximity embedding. J Comput Chem 24:1215–1221. https://doi.org/10.1002/jcc.10234
Broyden CG (1970) The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. IMA J Appl Math 6:76–90. https://doi.org/10.1093/imamat/6.1.76
Fletcher R (1970) A new approach to variable metric algorithms. Comput J 13:317–322. https://doi.org/10.1093/comjnl/13.3.317
Goldfarb D (1970) A family of variable-metric methods derived by variational means. Math Comput 24:23–26. https://doi.org/10.1090/s0025-5718-1970-0258249-6
Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput 24:647–656. https://doi.org/10.1090/s0025-5718-1970-0274029-x
Galassi M, Davies J, Theiler J, et al (2007) The GNU Scientific Library reference manual. http://www.gnu.org/software/gsl
Eastlake D, Jones P (2001) RFC3174: US Secure Hash Algorithm 1 (SHA1), Internet Engineering Task Force
James CA, Weininger D, Delany J (2000) SMARTS Theory. Daylight Theory Manual. Daylight Chemical Information Systems, Laguna Niguel, CA
Schärfer C, Schulz-Gasch T, Ehrlich H-C, et al (2013) Torsion angle preferences in druglike chemical space: a comprehensive guide. J Med Chem 56:2016–2028. https://doi.org/10.1021/jm3016816
Guba W, Meyder A, Rarey M, Hert J (2016) Torsion Library Reloaded: A New Version of Expert-Derived SMARTS Rules for Assessing Conformations of Small Molecules. J Chem Inf Model 56:1–5. https://doi.org/10.1021/acs.jcim.5b00522
Westbrook JD, Shao C, Feng Z, et al (2015) The chemical component dictionary: complete descriptions of constituent molecules in experimentally determined 3D macromolecules in the Protein Data Bank. Bioinformatics 31:1274–1278. https://doi.org/10.1093/bioinformatics/btu789
Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 32:922–923. https://doi.org/10.1107/S0567739476001873
Kabsch W (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 34:827–828. https://doi.org/10.1107/S0567739478001680
Groom CR, Bruno IJ, Lightfoot MP, Ward SC (2016) The Cambridge Structural Database. Acta Crystallogr B Struct Sci Cryst Eng Mater 72:171–179. https://doi.org/10.1107/S2052520616003954
The biologically interesting molecule reference dictionary (BIRD). RCSB Protein Data Bank. https://doi.org/10.2210/wwpdb/doc_bird. Accessed 28 Apr 2021
OpenEye Scientific Software Academic license agreement preface. https://www.eyesopen.com/academic-license-preface. Accessed 11 Oct 2021
rdkit.Chem.rdDistGeom module — The RDKit 2021.03.1 documentation. https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html. Accessed 28 Apr 2021
Halgren TA (1999) MMFF VI. MMFF94s option for energy minimization studies. J Comput Chem 20:720–729. https://doi.org/10.1002/(sici)1096-987x(199905)20:7<720::aid-jcc7>3.0.co;2-x
Wahl J, Freyss J, von Korff M, Sander T (2019) Accuracy evaluation and addition of improved dihedral parameters for the MMFF94s. J Cheminform 11:53. https://doi.org/10.1186/s13321-019-0371-6
Rappe AK, Casewit CJ, Colwell KS, et al (1992) UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 114:10024–10035. https://doi.org/10.1021/ja00051a040
Qiu Y, Smith DGA, Boothroyd S, et al (2021) Development and Benchmarking of Open Force Field v1.0.0-the Parsley Small-Molecule Force Field. J Chem Theory Comput 17:6262–6280. https://doi.org/10.1021/acs.jctc.1c00571
