Three new components have been developed using the RDKit toolkit. Two of these components (Standardizer and GetParent) have been rewritten and adapted from rules originally implemented using a commercial software toolkit. In contrast, the Checker component was developed more recently in an attempt to identify problem structures before they were added to the ChEMBL database.
Hence the new ChEMBL curation pipeline comprises three processes:
- Checker: validates structures and identifies problems before they are added to the database
- Standardizer: processes (standardises) chemical structures according to a set of predefined rules
- GetParent: generates parent structures based on a set of rules and defined lists of salts and solvents
Checker Component
The Checker component validates structures before the compounds are loaded into ChEMBL. If an error or problem is detected in the structure, a score is reported for the molecule; the score increases with the severity of the perceived problem. In the majority of cases compounds are loaded into the database even if a warning flag is raised. The scores are recorded, but at this point errors are not corrected. Instead, they are prioritised and subjected to subsequent manual curation, as time and degree of seriousness permit. The structure checks performed and the resultant penalty scores are summarised in Table 1.
Table 1
Penalty scores and annotation that are output from the Checker module. 7 is the most serious penalty score and 2 the least important.
| Penalty Score | Penalty Explanation |
| --- | --- |
| 7 | Error -9986 (Cannot process aromatic bonds) |
| 7 | Illegal input |
| 7 | InChI: Unknown element(s) |
| 6 | all atoms have zero coordinates |
| 6 | InChI: Accepted unusual valence(s) |
| 6 | InChI: Empty structure |
| 6 | molecule has 3D coordinates |
| 6 | molecule has a radical that is not found in the known list |
| 6 | molecule has six (or more) atoms with exactly the same coordinates |
| 6 | number of atoms less than 1 |
| 6 | polymer information in mol file |
| 6 | V3000 mol file |
| 5 | InChI_RDKit/Mol stereo mismatch |
| 5 | Mol/Inchi/RDKit stereo mismatch |
| 5 | RDKit_Mol/InChI stereo mismatch |
| 5 | molecule has a bond with an illegal stereo flag |
| 5 | molecule has a bond with an illegal type |
| 5 | molecule has a crossed bond in a ring |
| 5 | molecule has two (or more) atoms with exactly the same coordinates |
| 2 | InChI_Mol/RDKit stereo mismatch |
| 2 | molecule has a stereo bond in a ring |
| 2 | molecule has an atom with multiple stereo bonds |
| 2 | molecule has a stereo bond to a stereocenter |
| 2 | molecule has the 3D flag set for a 2D conformer |
| 2 | Other InChI Warnings |
How molecules that return specific penalty scores are handled is the individual user’s choice. For ChEMBL, a penalty score of 7 is considered to be a fatal error and the molfile is not loaded into the database. Examples of illegal input include unknown elements in the molfile and molfiles that RDKit cannot read because their aromatic bonds cannot be processed. Molecules with a penalty score of 6 are loaded into the ChEMBL database but without a molfile, as these are considered to have a significant issue with the structure, and it is preferable to fix the problem than to have a badly formed molecule in the database. Most of the issues that give rise to a penalty score of 6 are self-explanatory and are described in Table 1. If the penalty score is 5 or 2, the molecule is loaded but is also prioritised for manual curation. Again, many of the 5 and 2 penalty scores are self-explanatory, but the stereo mismatch errors perhaps need further explanation. These are reported when the numbers of stereocentres perceived by the following calculation methods differ:
- Mol: number of atoms where a wedged bond starts
- InChI: number of tetrahedral stereocentres
- RDKit: number of atomic stereocentres remaining after calling Chem.AssignStereochemistry()
Hence the “InChi_RDKit/Mol stereo mismatch” warning message indicates that the InChI and RDKit algorithms perceive the number of stereocentres to be the same but different from the molfile. “Mol/Inchi/RDKit stereo mismatch” means that all three methods perceive different stereocentre counts. The majority of these issues occur in complex molecules such as bridged bicyclic molecules that are badly drawn. As Standard InChI is derived from the molfile, the errors where the molfile and InChI differ in their stereocentre counts are given a higher penalty score (5) than when they are the same but different from the RDKit stereo count (2).
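The mapping from the three stereocentre counts to the warning messages and penalty scores in Table 1 can be sketched as follows. This is a minimal illustration, not the Checker's actual code: the function name `stereo_mismatch` and its signature are hypothetical, and the three counts are assumed to have been computed already by the methods listed above.

```python
from typing import Optional, Tuple


def stereo_mismatch(mol_count: int, inchi_count: int,
                    rdkit_count: int) -> Optional[Tuple[int, str]]:
    """Return (penalty, message) when the stereocentre counts disagree.

    Hypothetical helper: the messages and penalties are taken from Table 1,
    the function itself is only an illustration of the decision logic.
    """
    if mol_count == inchi_count == rdkit_count:
        return None  # all three methods agree, nothing to report
    if inchi_count == rdkit_count:
        # InChI and RDKit agree with each other but not with the molfile
        return (5, "InChI_RDKit/Mol stereo mismatch")
    if rdkit_count == mol_count:
        # RDKit and the molfile agree, InChI differs
        return (5, "RDKit_Mol/InChI stereo mismatch")
    if inchi_count == mol_count:
        # InChI and the molfile agree, only RDKit differs: lower penalty
        return (2, "InChI_Mol/RDKit stereo mismatch")
    # all three counts are different
    return (5, "Mol/Inchi/RDKit stereo mismatch")
```

Only the case where the molfile and InChI agree receives the lower score, which mirrors the reasoning given above that Standard InChI is derived from the molfile.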
The InChI software may give a number of warnings. These are also reported by the ChEMBL Checker module. Some of these are considered important, but others such as “InChI: Omitted undefined stereo”, “InChI: Charges were rearranged”, “InChI: Ambiguous stereo”, “InChI: Proton(s) added/removed” and “InChI: Not chiral” are generated for large numbers of molecules. These either reflect the reorganisation of atoms in order to generate the InChI or are related to stereochemical ambiguity arising, for example, from the fact that the compound is a racemate; these are not considered issues for a database such as ChEMBL. Therefore, these are given a low penalty score (2). However, in other contexts they might be more relevant and so are reported in the Checker output.
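The ChEMBL loading policy described earlier in this section (reject at 7, load without a molfile at 6, load and prioritise for curation at 5 or 2) could be expressed roughly as below. The function name and the returned labels are hypothetical, and taking the highest score among all reported issues as the deciding value is an assumption made for this sketch.

```python
from typing import Iterable, Tuple


def chembl_loading_action(issues: Iterable[Tuple[int, str]]) -> str:
    """Map Checker output, given as (penalty, message) pairs, to an action.

    Assumption: the most severe penalty among the issues decides the outcome.
    """
    worst = max((score for score, _ in issues), default=0)
    if worst >= 7:
        return "reject"                      # fatal error, molfile not loaded
    if worst == 6:
        return "load_without_molfile"        # compound loaded, structure withheld
    if worst > 0:
        return "load_and_flag_for_curation"  # penalty 5 or 2
    return "load"                            # no issues reported


# Usage: a molecule with one low-severity and one high-severity issue
action = chembl_loading_action([
    (2, "molecule has a stereo bond in a ring"),
    (6, "molecule has 3D coordinates"),
])
```

Other users of the Checker may prefer different thresholds, since the choice of action for each score is left to the individual user.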
Standardizer Component
The standardisation rules implemented in the ChEMBL database are based largely on the FDA/IUPAC guidelines (31, 32). Whilst the aim is to adhere to these rules as closely as possible, the practical reality is that submitted compounds are sometimes drawn imperfectly or the structures are ambiguously defined in the original publication or by the depositor. An automated standardiser can only safely correct some of the potential issues and the standardisation rules, currently encoded in the Standardizer component, are outlined here.
For certain compound types, particularly organometallics and those with a large number of boron atoms, a flag is set (exclude flag) and no attempt is made to standardise them. This is largely because the V2000 molfile format used by ChEMBL cannot accurately represent coordination bonds. For this reason, although the bioactivity data on these compounds is available in ChEMBL, the chemical structures are neither curated nor provided in the release version of the database.
The first step in the standardisation process is therefore to exclude molecules if they contain more than 7 boron atoms or any of the following atoms: [Sc], [Ti], [V], [Cr], [Mn], [Fe], [Co], [Ni], [Cu], [Ga], [Y], [Zr], [Nb], [Mo], [Tc], [Ru], [Rh], [Pd], [Cd], [In], [Sn], [La], [Hf], [Ta], [W], [Re], [Os], [Ir], [Pt], [Au], [Hg], [Tl], [Pb], [Bi], [Po], [Ac], [Ce], [Pr], [Nd], [Pm], [Sm], [Eu], [Gd], [Tb], [Dy], [Ho], [Er], [Tm], [Yb], [Lu], [Th], [Pa], [U], [Np], [Pu], [Am], [Cm], [Bk], [Cf], [Es], [Fm], [Md], [No], [Lr], [Ge], [Sb].
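The exclusion rule just stated can be sketched in a few lines. This is an illustrative sketch only: the function name `exclude_flag` is hypothetical, and it assumes the element symbols of the molecule have already been extracted (with RDKit this could be done by calling `GetSymbol()` on each atom).

```python
from collections import Counter
from typing import Iterable

# Element symbols taken from the exclusion list given in the text
EXCLUDED_ELEMENTS = frozenset(
    "Sc Ti V Cr Mn Fe Co Ni Cu Ga Y Zr Nb Mo Tc Ru Rh Pd Cd In Sn "
    "La Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po Ac Ce Pr Nd Pm Sm Eu "
    "Gd Tb Dy Ho Er Tm Yb Lu Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No "
    "Lr Ge Sb".split()
)
MAX_BORON_ATOMS = 7  # molecules with more boron atoms than this are excluded


def exclude_flag(symbols: Iterable[str]) -> bool:
    """Return True if a molecule with these atoms should skip standardisation."""
    counts = Counter(symbols)
    if counts["B"] > MAX_BORON_ATOMS:
        return True
    return any(symbol in EXCLUDED_ELEMENTS for symbol in counts)
```

For instance, `exclude_flag(["N", "N", "Cl", "Cl", "Pt"])` returns `True` because platinum is in the list, whereas a molecule with exactly seven boron atoms and no listed element is not excluded.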
The following standardisations are then made to the molecule (where they occur):
1. Standardise unknown stereochemistry:
   a. Change “wiggly” bonds on sp3 carbons denoting unknown stereo to show no stereo
   b. Set either or unknown cis/trans bonds to crossed bonds instead of showing them as “wiggly” bonds
   [Figure: structure depictions before and after standardisation for (a) and (b)]
2. Clear S Group data from the molfile
3. Generate a kekulé form of the structure
4. Remove explicit H atoms except:
   a. Hs where an isotope of hydrogen has been specifically set
   b. Hs that have a wedged or dashed bond to them
   c. Hs bonded to atoms with tetrahedral stereochemistry set ("Chiral Hs")
   d. Hs bonded to atoms in a non-default valence state that are not simply protonated. An example is phosphinic acid
5. Normalise structure:
   a. Fix hypervalent nitro groups
   b. Convert covalently drawn alkali metals connected to O or N to ionic forms (e.g. NaO to Na+ O-)
   c. Fix incorrect amide tautomers, e.g. N=COH to HNC(=O)
   d. Standardise sulphoxides to charge-separated form
   e. Standardise diazonium N to N+
   f. Ensure quaternary N is charged
   g. Ensure trivalent O is charged
   h. Ensure trivalent S is charged
   i. Ensure halogen not bonded to a neighbouring atom is charged
6. Ensure molecule is neutralised, if possible, by:
   a. Moving Hs from one atom to another (including between components). Note that if the Hs could be added to more than one atom, an arbitrary choice is made, but this is done canonically so the result will always be the same for a given molecule
7. Normalise (straighten) triple bonds and allenes
In the context of the ChEMBL database, it is the molfiles standardised according to these rules that are stored in the database and which are in turn the structures made available to the database users.
GetParent Component
Many compound registration systems, including the ChEMBL database, identify compounds that are related by virtue of being a salt form of a common parent structure. Therefore, as part of the ChEMBL compound curation pipeline, molecules are identified where the molfile contains more than one connected component as well as molecules containing atoms with specified isotopes.
The GetParent module is applied only to those compounds that match one or both of these criteria. All information about isotopes is removed, as are any solvents and salts present in the molfile that match components in the defined salt and solvent lists. Once all salts have been removed (e.g. Na+ that might be included to neutralise a carboxyl group), the resulting molecule is neutralised and a new molfile is created as the “parent” molecule. Compounds containing more than one component that are genuine mixtures (i.e., all of the components are absent from the salt and solvent lists) have, in the context of the ChEMBL database, their parent registered as the identical mixture. For cases such as sodium chloride and sodium citrate, where both components are in the salt list, the GetParent module does not remove any component and the parent remains the same as the salt. Here again, the parent is registered as the multicomponent mixture.
Compounds containing any of the excluded atoms described above have their isotopes and solvents removed and then parents created, so that bioactivity data can subsequently be aggregated. For example, the antimony-containing compound sodium stibogluconate has two versions in ChEMBL 26, both with bioactivity data: CHEMBL3754364 is a version with water of crystallisation and CHEMBL3764926 is a version without. These are annotated as related forms so that the bioactivity data can be seen aggregated in the database. Cyanocobalamin is a cobalt-containing compound which is recorded as a parent and three different isotopic forms in the database (CHEMBL2110563, CHEMBL2104118, CHEMBL2104381 and CHEMBL2096655). Again, the GetParent module enables their data aggregation. Organometallic compounds do not, however, have salts removed, owing to the complexity of how they are often represented in the deposited molfile. They are often drawn as disconnected components, as is the case for transplatin (CHEMBL1386), which was deposited into the database as N.N.[Cl-].[Cl-].[Pt+2]. Removal of the chloride and ammonia components would incorrectly result in a platinum ion as the parent.
The list of salts used in ChEMBL is based on the USAN Council’s list of pharmacological salts (33). Additional entities have been added to this list where a significant number of examples have been present in ChEMBL datasets. The GetParent module will remove salts regardless of: i) the charge status (e.g. acetic acid or acetate, Cl- or HCl); ii) whether or not stereochemistry is depicted (e.g. tartaric acid); iii) cis/trans isomers (e.g. maleic and fumaric acid). The salts and solvents files are available in the GitHub repository (28). Currently, these files contain 162 salts and 9 solvents respectively. This list will be maintained and extended if additional salts and solvents are identified.
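The component-stripping behaviour described in this section can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the function name, the tiny `SALTS` and `SOLVENTS` sets (stand-ins for the real files in the repository), and above all the use of exact SMILES string comparison, whereas the real module compares structures and ignores charge state, stereochemistry and cis/trans isomerism. Isotope removal, neutralisation and the special handling of organometallic compounds are not covered.

```python
# Hypothetical, tiny stand-ins for the real salt and solvent lists
SALTS = {"[Na+]", "[Cl-]", "Cl", "CC(=O)O", "CC(=O)[O-]"}
SOLVENTS = {"O", "CCO"}


def strip_salts_and_solvents(smiles: str) -> str:
    """Remove listed salt and solvent components from a multi-component SMILES.

    Simplification: components are matched by exact string comparison.
    """
    components = smiles.split(".")
    if len(components) == 1:
        return smiles  # single component, nothing to strip
    kept = [c for c in components if c not in SALTS and c not in SOLVENTS]
    if not kept:
        # Every component is listed (e.g. sodium chloride): remove nothing,
        # so the parent stays identical to the registered form
        return smiles
    return ".".join(kept)


# Usage: ethylamine hydrochloride monohydrate
parent = strip_salts_and_solvents("CCN.Cl.O")
```

A genuine mixture, where no component appears in either list, passes through unchanged, which matches the rule that its parent is registered as the identical mixture.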
For the avoidance of doubt, although parents, salts, solvents, isotopes and mixtures are all identified using the process just described, the bioactivity data recorded in ChEMBL is registered against the form on which it was measured. The aggregation by parent structure is undertaken to make it easier to identify all the data for salts and isotopes of a common parent. For example, paroxetine (CHEMBL490) has bioactivity data determined for the parent molecule, two salts, one salt/solvent mixture and two different isotopes, and an additional salt is registered as an FDA approved drug. Another example is amphetamine (CHEMBL405), which has bioactivity data in ChEMBL for eight different salts in addition to the parent amphetamine, and a further two salts that are recorded in drug sources such as the FDA Orange Book (34). The parent aggregation process makes this data easy to identify and group. This is illustrated in Fig. 1 for these two compounds.
Availability of Structure Curation Pipeline
The code for the pipeline has been developed entirely with the RDKit toolkit (version 2019.09.2.0). It is open source and publicly available on GitHub (28), currently as version 1.0.0. A conda package is also available to facilitate installation (35). The Standardizer, Checker and GetParent functions are also integrated in the ChEMBL Beaker webservices and can be used via the ‘check’, ‘getParent’ and ‘standardize’ endpoints (36). Any new features developed by the ChEMBL group will be added to the repository, and comments and suggestions from others are welcomed.
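As a rough usage sketch, the three components might be chained from Python as shown below. It assumes the `chembl_structure_pipeline` package is installed and that the function names `check_molblock`, `standardize_molblock` and `get_parent_molblock` match the package README for the installed version; the wrapper `run_pipeline` and the hand-written ethanol molblock are hypothetical and serve only as illustration. If the package cannot be imported, the wrapper returns `None`.

```python
from typing import Optional

# Hand-written V2000 molblock for ethanol (illustrative input only)
ETHANOL_MOLBLOCK = "\n".join([
    "",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.5981    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "",
])


def run_pipeline(molblock: str) -> Optional[dict]:
    """Check, standardise and derive the parent of one molblock."""
    try:
        from chembl_structure_pipeline import checker, standardizer
    except ImportError:
        return None  # package (or RDKit) not available in this environment
    issues = checker.check_molblock(molblock)
    std_molblock = standardizer.standardize_molblock(molblock)
    # Assumed to return the parent molblock together with the exclude flag
    parent_molblock, exclude = standardizer.get_parent_molblock(std_molblock)
    return {
        "issues": issues,
        "standardized": std_molblock,
        "parent": parent_molblock,
        "exclude": exclude,
    }


result = run_pipeline(ETHANOL_MOLBLOCK)
```

Running the Checker on the original molblock rather than the standardised one reflects the order described above, in which structures are validated before they are standardised and loaded.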