The following sections provide insight into the algorithms and methods employed in the implementation of CONFORT and discuss the design decisions taken to achieve a high performance in the reproduction of bioactive conformations and speed of processing for a wide variety of drug-like organic molecules. The sequence of individual processing steps that have to be performed during conformer generation is presented visually by a set of hierarchically linked flowcharts and will be described in detail throughout the text. In general, conformer ensemble generation is a complex endeavor and can fail at nearly all processing stages (e.g. for molecules containing atom types not supported by the forcefield) or take exceedingly long (e.g. for molecules with many rotatable bonds). Internally, CONFORT thus has to perform a significant amount of error and timeout checks which may cause early termination or require intermediate error correction steps. For the sake of simplicity, error and timeout handling have not been incorporated into the flowcharts and the shown program flow is based upon successful execution of every processing step. Furthermore, in the flowcharts and text, a distinction is made between compounds and molecules according to the chemical definition of the two terms: A molecule represents a set of covalently bonded atoms where each atom is reachable from any other atom via one or more bonds. Compounds may consist of just one (the common case) or multiple molecules, as is the case for salts or arbitrary mixtures.
CONFORT has been implemented in C + + and builds on the cheminformatics infrastructure provided by the CDPKit C + + API. For end-users, CONFORT is provided in the form of a command-line tool called ‘confgen’ (Table 1 in Additional File 1 provides an overview of the supported options) that can be found in the application directory of CDPKit. For developers, CONFORT’s functionality is also accessible as a set of classes and functions that are part of CDPKit’s C++/Python-API and allow for a seamless integration of CONFORT into own CDPKit-based applications.
Top-level Conformer Generation Workflow. Figure 1 shows the top-level conformer generation workflow executed for an uninitialized input compound (e.g. read from an input file) to obtain a structurally diverse, low-energy output conformer ensemble. In the first step, a preprocessing of the input compound takes place where the addition of missing hydrogens is performed and required data for the subsequent processing steps are calculated (see section Compound Preprocessing for details). For compounds comprising only a single molecule, the conformer ensemble generated in the next step already represents the output conformers of the compound. For multi-molecule compounds, CONFORT generates a separate conformer ensemble for every molecule and then arranges combinations of selected molecule conformers in 3D space to obtain the final compound output conformers (see section Multi-Molecule Output Conformer Ensemble Generation).
Compound Preprocessing. Compound preprocessing (Fig. 2) comprises several sub-steps that prepare the raw input compound and calculate various molecular properties required for the generation of proper 3D structures and conformer ensembles. In the first step, the hybridization state of every atom is determined from the atom type, formal charge and the number and order of incident bonds. In the next step, the smallest set of smallest rings (SSSR) is perceived, which, together with the previously determined atomic hybridization states, allows to identify atoms and bonds that are part of planar aromatics ring systems.
In general, stereo configurations specified for the atoms and bonds of the input compound are retained and will be considered accordingly in the 3D structure generation process. For chiral atoms and asymmetric double bonds with undefined stereochemistry, an attempt is made to calculate missing stereo descriptors from supplied 3D atom coordinates. If 3D atom coordinates are not available, the configuration of chiral atoms is left unspecified, leading to the selection of a configuration that turns out to be energetically favorable in the conformer generation process. For asymmetric double bonds with no predefined stereochemistry, a trans configuration of the sterically most bulky substituents on either side of the double bond is selected. To efficiently estimate the steric bulk of substituents, we use a modified version of the Morgan algorithm [50] that stops after the iterative calculation of atom connectivity values (CV). The higher the CV of an atom, the more complex its neighborhood is, and the more space-consuming the substituent it represents will presumably be.
The next processing step adds hydrogens onto heavy atoms which are unsaturated according to the associated chemical element's formal charge and typical valances.
CONFORT employs a DG-based approach [22, 23] whenever 3D structures need to be generated from scratch. Here, randomly distributed starting atom positions are iteratively refined until a valid local energetic minimum structure has been obtained. Which of the energetic minimum conformers is eventually obtained heavily depends on the order of atoms during the initial assignment of random atom positions. To guarantee an input atom order invariant generation of output conformer ensembles, the connection tables of input compounds need to be canonicalized before any conformers are generated. This canonicalization step is optional and has to be explicitly enabled by the user if possible slight output differences for structurally identical input compounds with varying atom orders are not acceptable. For the connection table canonicalization task, CONFORT employs an implementation of the algorithm devised by McKay [51].
After (optional) compound canonicalization, a perception of structurally separated compound components (= molecules) is carried out. Components are identified by performing a depth-first search for all atoms reachable via a bond path from a given start atom. Found reachable atoms and the start atom belong to the same component and get marked accordingly. The search procedure is then repeated for the next not yet visited atom in the atom list until no more unvisited atoms are left.
In the last compound preprocessing step, a topological distance matrix is generated for every component of the input compound. Topological atom distances are needed to generate DG constraints and for the parameterization of Merck Molecular Force Field 94 (MMFF94) Van der Waals interactions [52]. To determine topological distances, we employ a breadth-first search approach where the length of the bond path from a given start atom to a reachable atom is noted in the corresponding cells of the distance matrix. Alternatively, we also implemented the Floyd-Warshall algorithm [53] to determine the topological distances. However, the Floyd-Warshall implementation was significantly slower than the breadth-first search approach and eventually was given up in favor of the latter.
Molecule Conformer Ensemble Generation. As already noted, a separate molecule conformer ensemble will be generated for every component of the input compound. The molecule conformer generation workflow shown in Fig. 3 may thus be executed more than once depending on the number of components that could be identified in the compound preprocessing stage.
The conformer generation process for a given input molecule starts with the perception and parameterization of MMFF94 interactions. Determined force field interactions and parameters will be required for 3D structure refinement and conformer energy estimation in later processing stages. CONFORT utilizes CDPKit’s full-featured MMFF94 implementation whose correctness has been thoroughly validated in all aspects using the datasets provided by the MMFF94 Validation Suite [54]. In the next step, the ultimate decision is made whether to use a systematic or a stochastic sampling approach for generating the molecule conformers. The systematic approach (see section Systematic Conformer Sampling) performs best for typical drug-like molecules composed of chains and relatively rigid ring systems. Stochastic sampling (see next section) has proven to be advantageous for structures comprising flexible macrocyclic rings where conformers cannot be sampled by means of simple systematic torsion driving due to the present rotational restrictions. By default, CONFORT selects a stochastic sampling approach whenever the SSSR of the molecule contains a ring that incorporates more than ten non-aromatic single bonds. The automatic selection of a suitable sampling method can be overridden by specifying the method to use in advance in the CONFORT settings. Once a pool of output conformer candidates has been generated by the chosen method, a final output conformer ensemble is compiled in the last processing step of the workflow. Which, and how many, of the candidate conformers end up in the output conformer ensemble depends on various user-specified settings which control allowed energy range, desired structural diversity and maximum ensemble size. Details regarding the output conformer selection process can be found in section Molecule Output Conformer Ensemble Compilation.
Stochastic Conformer Sampling. Stochastic conformer sampling is a relatively simple but time consuming process which is based on the assumption that randomly generated structurally diverse low-energy conformers will be uniformly distributed over the whole conformational space of a molecule and that a sufficiently large set of such samples will represent a large fraction of the energetically most favorable torsion angle combinations. Figure 4 shows the principal steps of the stochastic conformer sampling workflow as implemented in CONFORT. The first step is concerned with initializing the random conformer generation unit. Random low-energy conformers are generated by a DG-based approach, where, in accordance with predefined atom distance and volume constraints, initially randomly distributed atom positions are successively refined until a valid energy-minimized 3D structure of the molecule is obtained (for details see next section). During the sampling process, random conformers are generated iteratively until a specified maximum number of conformers was sampled (default: 2,000 conformers), the granted sampling time has been exceeded, or the generation of new unique conformers ceased (convergence reached). Within the sampling loop, each newly generated conformer is first tested whether its energy is lower or equal to the current energy threshold. The energy threshold equals the lowest conformer energy encountered thus far plus the specified energy window size. If the generated conformer passes the energy check, it is added to the working ensemble and gets discarded, otherwise. Next, a sampling convergence check is carried out which decides whether the conformational space has been sampled densely enough and, therefore, the sampling loop may already be exited. In the convergence check, after a certain number of conformer generation cycles (default: 100 cycles) have passed, the amount of newly generated unique conformers is determined. This is done by first performing an energy-based removal of duplicate conformers (conformers with an energy difference < = 0.01 kcal/mol) from the working ensemble and then comparing the number of remaining conformers with the ensemble size obtained after the last duplicate removal step. If the ensemble size did not increase since the last check, the sampling process has converged and can be stopped. After leaving the iterative sampling loop, any conformers in the working ensemble with energies outside the energy window (see above) get discarded and the stochastic sampling process terminates.
Random Conformer Generation. CONFORT uses the random conformer generation functionality whenever arbitrary low-energy 3D structures of a molecule have to be generated from basic molecular graph information only (e.g. for stochastic conformer sampling, see previous section). The generation of a single random 3D structure comprises the following sequence of steps (Fig. 5):
First, the 3D-coordinates of all heavy atoms and any hydrogens which are connected to stereocenters with defined configuration are initialized with random values lying in a range equaling the atom count times a structure type specific factor (e.g. 0.5 for macrocycle sampling). Hydrogens usually represent the majority of the atoms in typical organic molecules and their exclusion allows for a significant speedup of the subsequent DG-based raw 3D structure generation step. Here, the initial random atom positions are optimized by a coordinate embedding procedure [55] until they meet certain pairwise distance and tetrahedral volume constraints. The geometric distance range constraint assigned to an atom pair pij = (ai, aj) depends on its topological distance TDij: For directly bonded atoms (TDij = 1) the upper and lower distance limit is set to the MMFF94 bond length taken from the assigned bond stretching interaction parameters. The upper and lower distance limit for a geminal atom pair (TDij = 2) is calculated from the respective MMFF bond lengths and the equilibrium bond angle specified by the assigned bond stretching and angle bending interaction parameters. For vicinal atom pairs (TDij = 3), where the central bond is not a double bond with a defined configuration, the lower limit is set to the calculated distance of the atoms in a coplanar and the upper limit to the distance in an anti-coplanar arrangement taking into account the MMFF94 bond lengths and angles of the involved bonds. Vicinal atoms connected to a central double bond with a defined configuration are assigned a fixed distance corresponding to the spacing of the atoms in coplanar arrangement if the central bond’s configuration is cis (with regard to the atom pair), and the corresponding distance in anti-coplanar arrangement if the configuration is trans. All remaining atom pairs (TDij > 3) obtain a lower distance limit which is calculated as the sum of the covalent atom radii plus an additional safety spacing of 1.5 Å, and an upper limit equal to the total sum of all MMFF94 bond lengths. Volume constraints are generated for atoms bonded to tetrahedral stereogenic centers and for groups with known planar atom arrangements. For stereogenic centers, the sign of the volume spanned by the neighbor atoms depends on the specified configuration and is enforced by setting the corresponding volume range limits to ± 0.5 and ± 1000 ų (empirical values), respectively. Planar groups are e.g. formed by amide Nitrogens, sp2-hybridized or aromatic atoms with three incident bonds. Furthermore, by atoms connected to amide, aromatic and double bonds. To enforce a planar arrangement of such atom groups, the upper and lower volume range limits are set to zero.
Once a raw 3D structure has been obtained from the embedding process, a first check is made whether the configurations of defined stereogenic centers are correct with respect to the generated coordinates. If the check fails for at least one stereocenter, a new attempt is made to generate a valid raw structure starting from a different set of random atom positions. If, after a certain number of trials (default: 10), still no valid structure could be obtained, the random conformer generation procedure terminates and reports an error. Otherwise, processing continues with the next step, where the coordinates of removed hydrogens are calculated according to the hybridization state of the connected heavy atoms and already assigned atom positions. After that, the geometry of the hydrogen complete raw 3D structure is further refined by iterative minimization of its MMFF94 energy until the energy gradient norm or the change of energy is below a certain target threshold (the stopping criterion and threshold value have to be specified by the caller and are context-dependent). For energy minimization, the algorithm devised by Broyden, Fletcher, Goldfarb and Shanno (BFGS) [56–59] is used. The CDPKit-implementation of the BFGS-algorithm is based on code that has been taken from the GNU Scientific Library (GSL) [60]. If the energy-driven structure refinement procedure fails, the random conformer generation procedure terminates immediately and reports an error. In the concluding step of the workflow, all defined stereocenters are once again checked for the correctness of their configurations. If no errors are found, the refined set of atom coordinates is output and the workflow terminates with success. Otherwise, as described previously, a new attempt to obtain a valid 3D structure will be made.
Systematic Conformer Sampling. CONFORT performs systematic conformer sampling by applying different combinations of torsion angles at the rotatable bonds of the processed molecule on beforehand generated conformer 3D structure templates. Compared with stochastic sampling, the systematic approach usually demands much less processing time for small drug-like molecules since all conformers of a given 3D structure template can be generated by relatively fast rigid body transformations instead of having to repeatedly perform a time-consuming ab initio generation of energy-minimized random conformers.
For the construction of 3D structure templates, CONFORT employs a fragment-based approach in which the overall 3D structure is assembled from smaller structural building blocks like chain fragments and ring systems. The idea behind this approach is to pre-generate reasonable 3D structures of frequently occurring molecular fragments by an external program (‘genfraglib’, can be found in the application directory of CDPKit) and store the resulting atom 3D coordinate sets in a permanent library for later use. Via this strategy, a considerable speedup of the 3D structure template buildup process can be achieved since a time consuming ab initio generation of 3D coordinates (see section Random Conformer Generation) can be circumvented for many of the fragments found in typical drug-like organic molecules.
Figure 6 illustrates the major processing steps performed for a given input molecule in the implemented systematic conformer sampling workflow. In the first step, the provided molecule is broken down into smaller structural building blocks by cutting specific bonds that have been identified by a set of fragmentation rules (details on molecule fragmentation can be found in the next section). Afterward a conformer ensemble is generated for each obtained fragment, which, depending on conformational flexibility, may consist only of a single 3D structure or multiple conformers. For rigid fragments like purely aromatic ring systems and chains/moieties lacking rotatable bonds, a single 3D structure is generated. Chains comprising rotatable bonds and flexible ring systems are sampled more thoroughly until a representative set of low-energy conformers has been obtained. As pointed out before, for each processed fragment, an attempt is made to obtain pre-generated 3D coordinates from an on-disk fragment library to speed up the overall conformer generation process. If a suitable library entry is not available, conformers will be generated on-the-fly using CONFORT’s random conformer generation (see previous section) and torsion driving functionality. More details on fragment library lookup, on-the-fly conformer generation and fragment conformer post-processing can be found in section Fragment Conformer Ensemble Generation.
In the next step, a set of fragment conformer combinations (FCC) is generated. FCCs are represented by N-tuples of fragment conformer indices where N corresponds to the total number of resulting input molecule fragments. The energy of an FCC is calculated as the sum of the MMFF94 energies of the fragment conformers referenced by the index-tuple. To prevent the exhaustive consumption of main memory and an exaggerated processing time by very high numbers of possible FCCs (e.g. for larger peptides), the FCC generation process stops once a certain amount of stored FCCs has been reached (100,000 in the current implementation). Furthermore, only FCCs with energies lower or equal to the calculated energy threshold are stored for further processing. The energy threshold corresponds to the minimum FCC energy plus the specified energy window size times the safety factor 1.5.
In the final step of the workflow, each FCC serves as a 3D structure template for the generation of derived molecule conformers by systematic torsion driving at rotatable bonds linking the fragments. By means of this divide-and-conquer strategy, it is possible to sample a representative share of the lowest energy conformers in an acceptable amount of time, even in the case of large and flexible molecules. Further details on the assignment of torsion angles to rotatable bonds and the implementation of the torsion driving process can be found in the section Fragment Conformer Combination Torsion Driving.
Molecule Fragmentation. As described in the previous section, the systematic conformer sampling procedure generates molecule conformers by assembling selected pre-computed conformers of molecular fragments. These fragments are derived from the molecular graph by cutting specific carbon-heteroatom bonds and bonds to ring system substituents. The pseudo-code function isCutBond() shown in Listing 1, implements the rules by which the bonds being cut are identified:
bool isCutBond(Bond b):
if isInRing(b) or connectsHydrogenAtom(b):
return false
if degree(b.atom1) < 2 or degree(b.atom2) < 2:
return false
if isInRing(b.atom1) or isInRing(b.atom2):
return true
if order(b) != 1:
return false
if element(b.atom1) == ‘C’:
if element(b.atom2) not in { 'N', 'O', 'S', 'P', 'Se' }:
return false
elif element(b.atom2) == ‘C’:
if element(b.atom1) not in { 'N', 'O', 'S', 'P', 'Se' }:
return false
else:
return false
if hasNonSingleBondTo(b.atom1, { 'N', 'O', 'S' }) or
hasNonSingleBondTo(b.atom2, { 'N', 'O', 'S' }):
return false
return true
Listing 1. Pseudo-code implementing rules for the identification of bonds to cut in the molecule fragmentation process.
In the resulting fragments, bonds to previously connected fragments of the parent molecule are preserved. Their atoms are replaced by corresponding pseudo atoms encoding chemical element, hybridization state, formal charge and membership in aromatic rings. These pseudo atoms provide enough local structural context for a later library lookup/on-the-fly generation of proper fragment 3D structures. Figure 7 shows an example of the fragmentation of a typical drug-like organic molecule and the list of resulting molecular fragments obtained by cutting bonds identified with the rules mentioned above.
Fragment Conformer Ensemble Generation. Depending on specified settings, data availability, and structural type, fragment conformer ensembles are either derived from provided input 3D coordinates, retrieved/derived from an entry in the built-in or user-specified fragment library, taken from the runtime cache, or have to be generated from scratch.
Figure 8 outlines the conformer ensemble generation workflow for a given input fragment. The workflow starts with checking whether the conformers should (by default, input coordinates are discarded) and can be derived from the provided atom 3D coordinates. Supplied atom 3D coordinates are only considered if they are present for at least all heavy atoms of the fragment, and if the fragment is not a flexible ring system or ring conformer enumeration has been disabled. If so, all present 3D coordinates are extracted and a calculation of missing hydrogen coordinates is performed (the hydrogen 3D coordinates calculation procedure is briefly described in section Random Conformer Generation). Afterward, the resulting complete fragment 3D structure is forwarded to the conformer post-processing step (see next section).
If input coordinates should not or cannot be considered according to the initial checks, the workflow proceeds with the fragment canonicalization step that comprises three sub-steps: In the first step, terminal heavy atoms connected to atoms of aromatic rings are replaced by hydrogen. Fragments differing only in their aromatic ring system substituent pattern thus converge into the same canonical fragment structure. This measure helps to increase the fragment library/runtime cache hit rate for quite common aromatic fragments like substituted phenyl rings. The resulting incorrect lengths of bonds between aromatic ring and replaced substituent atoms are corrected later in the conformer post-processing step (see next section).
In the next step, free valences of pseudo atoms (see previous section) are compensated by adding explicit hydrogens and a calculation of canonical atom labels using CDPKit’s implementation of the McKay algorithm [51] is performed.
Finally, a binary representation of the fragment’s connection table with atom and bond lists ordered according to the previously calculated canonical atom labels is generated. The calculated SHA1 hash code [61] of the connection table data is then converted to a 64 bit integer key which serves as a unique fragment identifier (ID) in subsequent processing steps.
In the canonical atom labeling and fragment ID calculation procedure, stereodescriptors of defined stereocenters are only considered if the processed fragment represents a ring system. Stereochemistry is disregarded for acyclic fragments since configurations of tetrahedral stereocenters and double bonds can be corrected easily by fast geometric operations in the conformer post-processing step (see next section). Furthermore, for each acyclic fragment only a single random 3D structure is stored in the fragment library and runtime cache to reduce per-fragment memory consumption and thus allow for a higher number of entries. Complete conformer ensembles are then generated in the post-processing step by performing torsion driving on the saved 3D structure. The increased post-processing effort resulting from these measures is condoned to attain higher library and runtime cache hit rates for this structurally diverse fragment type. High hit rates are key for achieving low average molecule processing times by avoiding the relatively slow DG-based random conformer generation procedure for as many encountered fragments as possible.
After the fragment canonicalization procedure, the calculated 64 bit fragment ID is used as a unique key for querying the loaded fragment libraries. Aside from the built-in fragment library which gets loaded on program startup, CONFORT also supports multiple user-specified external fragment libraries which are searched one by one for an entry matching the supplied fragment ID. If one of the performed library lookups is successful, the fragment conformers deposited in the library entry are extracted and forwarded to the final post-processing step after which the workflow terminates.
Should all library lookups fail, an attempt is made to find a matching entry in the dynamic runtime cache. The runtime cache is implemented as a Least Recently Used (LRU) list (limited to 10,000 entries in the current implementation) and stores conformers of previously processed fragments that likewise could not be found in any of the searched libraries. An identified matching cache entry is then processed in the same way as a corresponding library hit.
If library and cache lookups both fail, the requested fragment conformers are generated from scratch based on the information provided by the molecular graph of the derived canonical fragment. 3D structures are generated by means of the previously described random conformer generation functionality (see section Random Conformer Generation). For acyclic fragments and fragments representing purely aromatic ring systems, just a single output 3D structure is generated. Conformers of fragments representing flexible ring systems are sampled stochastically using a procedure similar to the one described in section Stochastic Conformer Sampling. Depending on actual ring system flexibility, this usually leads to the output of multiple structurally diverse (default ring atom RMSD = 0.1 Å) low-energy 3D structures which cover the conformational space of the fragment in the effective energy window (default values: small ring systems 8 kcal/mol, macrocycles 25 kcal/mol). Afterward, the generated 3D structures are deposited in the runtime cache for potential later reuse and then get post-processed in the workflow's final step.
Fragment Conformer Post-Processing. Fragment conformers resulting from the procedure described in the previous section need to undergo further checks, corrections and modifications before they can be used as building blocks for molecule 3D structure templates (see section Systematic Conformer Sampling). Depending on fragment type and source of the input conformers, required post-processing steps comprise: Correction of aromatic ring substituent bond lengths, inversion of stereocenters for the correction of detected configuration errors, enumeration of invertible nitrogen states and generation of additional conformers by torsion driving. The workflow of the executed post-processing procedure is shown in Fig. 9. Herein, the first two processing steps are concerned with required structural corrections resulting from molecular graph modifications and the disregard of stereochemistry for acyclic fragments in the fragment canonicalization procedure (see previous section).
In the first correction step, for every input conformer, the lengths of bonds between aromatic ring atoms and atoms of exocyclic substituents (which were replaced by hydrogen in the fragment canonicalization procedure, see previous section) are scaled to match the corresponding MMFF94 equilibrium bond lengths in the structural context of the parent molecule. In the second step, atom and bond stereocenters of acyclic fragments whose calculated configuration does not match the desired output configuration are corrected by exchanging stereocenter substituent positions and applying geometric transformations on the involved atom 3D coordinates.
Since systematic errors introduced in the fragment canonical procedure do not apply to fragment conformers extracted from a supplied 3D structure of the parent molecule (see Fig. 8), the two correction steps can be bypassed for this sort of conformers which is realized by a dedicated check at the beginning of the post-processing workflow.
The next processing step performs a systematic enumeration of all possible invertible nitrogen configuration combinations and generates corresponding 3D structures derived from the supplied input conformers. For a fragment with N invertible nitrogen atoms this will lead to a multiplication of the number of fragment conformers by a factor of 2N. In the current implementation a non-planar nitrogen atom is considered invertible if it has three single-bonded neighbors, is not a member of more than one ring and is connected to at least two heavy atoms. CONFORT supports three nitrogen configuration enumeration modes: i) no enumeration - if specified, the enumeration procedure will be skipped and the fragment conformers are left unaltered, ii) only invertible nitrogens with undefined configuration are considered (default setting) and iii) all invertible nitrogens. Algorithmically, the configuration of a nitrogen is inverted by rotating the 3D coordinates of the atoms of a selected substituent to the exact opposite position on the other side of the plane spanned by the bonds to the remaining two substituents. If the inverted nitrogen atom is a ring member, the exocyclic substituent will be chosen for rotation. If the nitrogen is acyclic, the substituent comprising the smallest number of atoms is selected.
If the fragment contains rotatable acyclic bonds, each conformer in the current set will be subjected to torsion driving, further expanding the current fragment conformer ensemble by systematic sampling. Usually, torsion driving only needs to be carried out for flexible acyclic fragments for which just a single low-energy conformer is handed over by the parent workflow (see previous section). Conceptually, the employed torsion driving workflow is largely similar to the one described for the generation of molecule conformers from a set of FCCs (see next section) and, therefore, will not be outlined in more detail.
After torsion driving has finished, generated conformers which are out of the specified molecule conformer energy window get discarded and the remaining conformers are then forwarded to the last workflow step.
If no rotatable fragment bonds are present, the MMFF94 energy of each conformer using the force field parameterization of the parent molecule is calculated (the torsion driving procedure performs this step internally).
In the last post-processing step, the list of final fragment conformers is ordered by increasing energy and, if necessary, reduced to the maximum allowed output ensemble size (default setting: 10,000 conformers) by removing the necessary amount of high-energy conformers from the end of the list.
Fragment Conformer Combination Torsion Driving. This processing step represents the heart of the systematic conformer sampling workflow (see section Systematic Conformer Sampling) and produces an ensemble of output conformers from a set of energy ordered input FCCs. 3D structures of the output conformers are generated by a torsion driving procedure that aligns individual fragment conformers specified by the processed input FCCs along rotatable bonds linking the fragments (see section Molecule Fragmentation). The relative spatial orientation of the aligned fragment conformers is dictated by rotatable bond-specific dihedral angles which are retrieved from matching torsion library entries. Depending on the number of distinct torsion angle combinations possible for the present set of rotor bonds, a corresponding number of output conformers will usually be generated for each processed FCC. For big and quite flexible molecules, the total number of possible conformers can quickly reach rather high figures and an attempt to generate all of these conformers will fail due to excessive main memory and processing time consumption. However, carrying out the processing of FCCs in order of increasing energy and the early pruning of high-energy conformers during torsion driving ensures that also in cases where the conformational space cannot be explored exhaustively, most of the low-energy conformers will be captured and end up in the generated output ensembles (see section Results and Discussion).
The first step of the overall torsion driving workflow (Fig. 10) is concerned with the setup of the fragment tree data structure. This data structure represents the in-memory foundation of the torsion driving algorithm and comprises a set of linked nodes organized in a tree-like manner. Each non-leaf node of the tree references a particular rotatable bond and provides storage for conformers that can be generated by aligning all possible pairs of child node conformers along the referenced bond applying the dihedral angles specified by an assigned torsion library entry. The leaf nodes are associated with the fragments constituting the parent structure (root node) and each stores a particular fragment conformer as specified by the currently processed FCC. For the sake of better understanding, Fig. 11 shows an example of a typical fragment tree constructed by the setup procedure for the molecule shown in the root node.
Once the fragment tree data structure has been set up, an assignment of proper dihedral angles for the rotatable bonds referenced by the non-leaf nodes is carried out. Here, for each rotatable bond, a lookup for a matching entry in the loaded torsion libraries will be performed. Torsion library entries specify matching rotatable bonds by a single SMARTS pattern [62] which describes a linear path of three or four atoms and provides lists of preferred dihedral angles and associated tolerance intervals. The torsion library used by CONFORT has been derived from the collection of rotatable bond patterns and corresponding dihedral angles compiled by Schärfer et al. [63] and Guba et al. [64] which resulted from statistical analyses of torsion angle preferences in the drug-like chemical space. Based on the set of rotatable bond patterns in the most recent version of the library by Guba et al., we performed our own torsion angle analysis using a database of experimentally-determined receptor-bound ligand 3D structures extracted from the PDB [65]. Based on the obtained results, several new library entries have been added, and various modifications to the angle and tolerance lists of existing entries in the library by Guba et al. were made. Aside from the built-in library, CONFORT also supports multiple user-specified external torsion libraries which are included in the search for matching entries. These will take precedence over the entries in the built-in library and thus enable customization of the dihedral angles used in the torsion driving core procedure.
Depending on the conformer generation settings in effect and the presence of local symmetry, dihedral angles provided by the looked-up library entries may undergo further processing before they are assigned to the corresponding tree nodes: i) In the case of rotatable bonds which represent an axis of 1-, 2- or 3-fold rotational symmetry of at least one of the connected fragments, all redundant angles that would lead to the generation of conformer duplicates are removed. Often occurring fragments with known rotational symmetry (e.g. Trifluormethyl, Phenyl, t-Butyl, …) have been tabulated as a list of SMARTS-patterns and are identified at runtime by substructure searching. ii) If torsion angle tolerance range sampling has been enabled (default: not enabled), two additional angles per listed dihedral angle will be calculated that mark the beginning and end of the dihedral angle’s first tolerance range (see ref. [64] for the definition of the first tolerance range).
After the torsion angle assignment step, the workflow enters a loop in which a set of output conformer candidates will be generated for each specified input FCC. The loop is exited either after all FCCs have been processed or if one of the two early exit conditions is fulfilled. The first early exit check is carried out before entering the torsion driving core procedure and evaluates whether the energy of the currently processed FCC is above a certain upper energy limit. The energy limit is calculated as the sum of the energy of the FCC that resulted in the lowest energy output conformer generated so far and the specified energy window size. A generation of novel low-energy output conformers from FCCs above this energy limit is very unlikely and the processing of additional input FCCs will have little impact on the final output conformer ensemble. Hence, the main loop can already be exited at this point with, on average, only little losses in output conformer ensemble quality.
If the FCC passes the energy limit check, the torsion driving core procedure will be entered. Herein, the leaf nodes of the fragment tree are first initialized with the fragment conformers specified by the newly processed FCC and then a recursive generation of conformers at the non-leaf nodes is carried out. The conformers of a non-leaf node are generated by overlaying pairs of left and right child node conformers at the referenced rotatable bond and then applying rotations to the coordinates of one child conformer according to the torsion angles assigned during setup. This process is repeated for all possible child node conformer pair and torsion angle combinations and finally results in a set of conformers which represents the conformational space of the molecule substructure covered by the fragments referenced in the subtree rooting at the currently processed node.
The MMFF94 energy EA−B of a conformer A-B that was generated from two child node conformers A and B for the torsion angle 𝚯 can be calculated as (Eq. 1):
\({E}_{A-B}\left(\varTheta \right)={E}_{A}+{E}_{B}+{E}_{BS}+{E}_{T}\left(\varTheta \right)+{E}_{C}\left(\theta \right)+{E}_{VdW}\left(\theta \right)\) Eq. 1
E A and EB denote the MMFF94 energies of the child node conformers A and B, respectively, EBS represents the bond stretching interaction energy of the rotatable bond, ET represents the sum of the energies of torsion interactions involving the rotatable bond, EC denotes the sum of the energies of electrostatic interactions between the child node substructures, and EVdW denotes the sum of corresponding Van der Waals interaction energies. As can be seen from Eq. 1, EA, EB as well as EBS are torsion angle independent constants whose values, once calculated, can be reused in the calculation of other energy terms they are involved in. The torsion driving procedure exploits this fact and is thus able to perform a fast energy calculation of newly generated conformers since every parameterized forcefield interaction energy term, in the worst case, needs to be evaluated only once per generated conformer.
For each newly generated conformer a first check is made to determine whether its energy is exceeding the current energy threshold which equals the sum of the minimum conformer energy encountered so far and the specified energy window size. If so, the generated conformer is discarded. Otherwise, it will be added to the node’s current working set. After all possible conformers of a node have been generated, each saved conformer is again checked whether its energy is within the allowed energy window and if not, gets discarded. Finally, the remaining conformers are ordered by increasing energy and, if the number of conformers exceeds the maximum allowed pool size (default value: 10,000 conformers), the amount of conformers is reduced by pruning excessive high-energy conformers. These post-processing steps ensure that potential high-energy molecule conformers get identified and removed from the processing pipeline already early on and, in case of molecules comprising many fragments and/or high numbers of torsion angles per rotor bond, a combinatorial explosion of live conformers will be avoided.
After the bottom-up generation of conformers has finished at the root node of the fragment tree, the torsion driving procedure terminates. The root node now stores a set of final low-energy molecule conformers derived from the currently processed FCC. For each conformer a check is then made whether its energy is above an upper limit calculated as the sum of the minimum conformer energy encountered so far for all processed FCCs and the specified energy window size. If the conformer’s energy is above the threshold, the conformer gets discarded. Otherwise, it is added to the output conformer working set. If, during the processing of the root node conformer ensemble, a new energetic minimum conformer has been encountered, the energetic minimum gets adapted and all conformers in the current working set which are now above the new upper energy limit are discarded.
After all conformers have been processed, a check is made whether the second early loop exit condition is met. The condition will be fulfilled if the number of conformers in the working set is greater or equal to a specified non-zero maximum pool size (default value: 10,000 conformers). If so, the loop will be exited and the torsion driving workflow terminates. Otherwise, the next input FCC (if available) will be processed as described above.
Molecule Output Conformer Ensemble Compilation. Sets of low-energy molecule conformers obtained from the systematic or stochastic sampling procedures usually do not yet fulfill all user-specified output ensemble characteristics (structural diversity, max. output ensemble size and energy window size) and may contain a significant amount of duplicates or clusters of structurally too similar conformers. Figure 12 shows the workflow of the post-processing procedure which takes care of compiling a final output conformer ensemble with desired characteristics from a set of supplied input conformers. The procedure starts with ranking the supplied input conformers by increasing energy to ensure a preference of low-energy conformers in the later output conformer picking stage. Afterward, a check is made whether a present 3D structure of the molecule that has been provided with the processed molecule on input shall be added to the output conformer ensemble (default setting: not included). If so, the input 3D structure gets extracted and, after calculating its MMFF94 energy, it is added to the output conformer working set.
Processing then enters the main section of the workflow where final output conformers will be selected iteratively from the energy ordered set of input conformers based on mutual structural dissimilarity. The picking loop exits if either all input conformers have been processed, the maximum output ensemble size has been reached or the energy of the currently processed conformer exceeds the energy limit which is calculated as the sum of the minimum conformer energy (= energy of the first conformer) and the specified energy window size. Within the loop a processed input conformer is selected as an output conformer only if the heavy atom RMSD between the current conformer and any of the previously selected output conformers is not below a specified threshold value (default setting: 0.5 Å). The RMSD of an evaluated conformer pair is calculated by performing RMSD minimizing 3D alignments using CDPKit’s implementation of the Kabsch algorithm [66, 67] for all possible homomorphic atom mappings that might result from the presence of topological symmetry. To avoid an excessive memory and processing time consumption in case of highly symmetric molecules, the number of processed mappings has been limited to a hardcoded maximum value of 131,072. If this limit is exceeded, further processing of the current molecule will be suspended and a corresponding error is reported.
When the RMSD value resulting from one of the performed 3D alignments falls below the specified threshold, further processing stops and the evaluated input conformer gets rejected. If, otherwise, all performed pairwise RMSD checks have been passed successfully, the conformer is added to the output conformer ensemble and processing continues with the next input conformer in line.
Although the implemented output conformer selection algorithm is quite simple, the obtained ensembles nevertheless fulfill all major quality criteria, which is proven by the benchmarking results presented in the Results and Discussion section.
Multi-Molecule Output Conformer Ensemble Generation. For compounds that consist of multiple components an additional processing step is required which merges the separately generated molecule conformer ensembles into a single compound output conformer ensemble (see also section Top-level Conformer Generation Workflow). The compound conformer generation method implemented in CONFORT builds composite conformers from sets of selected molecule conformers by lining them up along the X-axis of the coordinate system. For each placed molecule conformer, an axis-aligned bounding box (AABB) is first calculated which specifies the minimum [Xmin, Ymin, Zmin] and maximum [Xmax, Ymax, Zmax] values of the atom coordinates in each dimension of space. The position [Xmin, (Ymin + Ymax) * 0.5, (Zmin + Zmax) * 0.5] then serves as an anchor point for calculating a translation vector to a particular location on the X-axis. The placement X-position starts at 0 and is incremented by Xmax - Xmin + 4.0 after each conformer has been placed. The constant 4.0 (in Å) represents an additional safety distance and makes sure that no atom Van der Waals sphere clashes occur between successively placed conformers. For the generation of the ith compound conformer, the molecule conformers at index i of the respective ensembles serve as input and its energy represents the total of the molecule conformer energies. If a conformer at index i does not exist, the molecule conformer with the highest index will be selected instead. Compound conformers are generated until all conformers of the largest molecule ensemble(s) have been consumed. A schematic representation of the compound conformer generation workflow can be found in Fig. 13.