Chemical Annotation Score – A novel distance metric with application in de novo design

doi:10.21203/rs.3.rs-2213551/v1

The molecule generation process in de novo design can produce many hundreds of thousands of compounds from an input pool of one or more known compounds. Property filtering and other fast scoring methods can reduce this set in accordance with project goals. However, there may still be many tens of thousands of compounds to select from. The question we pose here is how to identify the most relevant compounds from a set for further analysis and selection. We define relevant compounds as those most suitable for exploring the SAR. Whilst similarity methods can be used to identify the compounds most similar to the input set, we show that they are not well suited to this more general task. We introduce the Chemical Annotation Score as a novel method for calculating chemical distance. We show its superiority over fingerprint-based similarity methods for identifying the most relevant set of compounds from a large pool for further exploration, given an input query or queries.

De novo molecular design, a computational method for generating novel molecules, has been known since the 1980s[1]. In recent years, the availability of large public databases such as ZINC[2], PubChem[3] and ChEMBL[4] has led to the application of machine learning techniques to de novo ligand-based molecular design. A large variety of methods can be used for molecular generation: evolutionary algorithms [5–7], heuristic approaches by molecule growing [8–13], retrosynthetic or fragmentation rules for disassembly and reconstruction [6, 14, 15], reaction-based approaches [16–18] and AI-based molecular generators [19–21]. It is possible to generate hundreds of thousands or even millions of virtual molecules with these methods, presenting a challenge for ranking and scoring. In particular, understanding what similarity means in this context is important. Similarity is a key concept in drug discovery and is routinely used in its early stages [22, 23]. As formulated by Johnson and Maggiora, “Similarity like beauty is more or less in the eye of the beholder” [24]; it will vary from person to person, algorithm to algorithm and use case to use case. This is probably the reason why there is a plethora of similarity descriptors and metrics that can be used for ranking or clustering compounds. Most use cases are focussed on choosing compounds similar to a query (or set of molecules). The use case that we are concerned with in this paper is subtly different and directly related to de novo design, where the aim is to design novel molecules around a starting pool. In the context of Design-Make-Test-Analyse (DMTA) cycles, for example as implemented in our automated design platform BRADSHAW[26], ranking and prioritising compounds for further consideration requires deciding whether a de novo designed compound is a sensible change or is too far from the starting pool. Without a suitable chemical distance measure the pool can be populated with many compounds that are of little interest to the project. As we show below, standard similarity methods are not well suited to this task.

Choosing the most appropriate similarity method for a particular use case requires a compromise between simplicity, accuracy and efficiency. Methods to estimate similarity or dissimilarity between pairs of molecules are generally composed of three main components: 1) the global and/or local representations of the molecular structure, 2) the descriptors describing molecular representations, and, finally, 3) the metric used to calculate the similarity or dissimilarity quantitatively[27].

For example, the Tanimoto metric was found to be the gold-standard measure to identify closely similar compounds using structure based descriptors[28]. The most common approach for similarity searching in larger databases is the Tanimoto metric with 2D hashed fingerprint descriptors such as Daylight/path-based or Morgan/ECFP[29, 30] circular fingerprints. The Tanimoto metric is fast and easy to implement and is therefore well suited to large data sets [24]. In the context of filtering large numbers of generated molecules, molecules only distantly related or unrelated to the input compounds might appear. These molecules should be filtered out from the dataset. In the current paper we aim to develop a similarity score specifically targeted at this problem. The conventional 2D fingerprints used in virtual screening have their own drawbacks:

1. Quantization issue: The more bits set in the query molecule, the higher is the average Tanimoto coefficient of all other molecules in the database[31, 32] (as a consequence, the feature count of small molecules is small and a different similarity cut-off is needed for them than for larger molecules).

2. Multiple unique features issue: The hashed fingerprints (e.g. Daylight, ECFP4) often contain large numbers of unique features; however, only a very small proportion of these occur in multiple compounds[24]. Therefore, it is not surprising that some similar molecules can contain a large number of unique features and will have a low similarity value (this effect is particularly pronounced for changes in the central parts of the molecules). A good example showing this issue is Fig. 3 in reference [31].

3. Feature repetition issue: Although there are fingerprint methods which can handle feature counts[33], most 2D hashed fingerprints used for virtual screening are compressed to smaller bit counts (e.g. 1024 or 2048 bits) and often do not include a feature count[34]. This means that molecules with repeated fragments will have closely similar Tanimoto coefficients to those molecules which have the fragment present only once (cf. Figure 2 in [31]).

4. Disconnection issue: By definition, hashed fingerprints are disconnected from the initial molecular structure. This implies that global features of the structure are not kept, which is a concern in the context of scaffolds[35, 36].

5. As the similarity drops below the well-established thresholds[37], the quality of the ranking decreases (there are many ways for compounds to be “dissimilar”). This poses a particular problem when trying to identify novel molecules.
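The feature repetition issue (point 3) can be reproduced with a toy example. The sketch below is not tied to any real fingerprint: the feature names are hypothetical placeholders, and `binary_tanimoto` and `count_tanimoto` are illustrative helpers, the latter using the common min/max generalisation of the Tanimoto coefficient to counts.

```python
from collections import Counter


def binary_tanimoto(a, b):
    """Tanimoto similarity on presence/absence of features (bit-like)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def count_tanimoto(a, b):
    """Tanimoto similarity on feature counts (min/max generalisation)."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    shared = sum(min(ca[k], cb[k]) for k in keys)
    total = sum(max(ca[k], cb[k]) for k in keys)
    return shared / total if total else 1.0


# Hypothetical feature lists: the second "molecule" repeats one fragment.
single = ["amide", "phenyl", "pyridine"]
repeated = ["amide", "phenyl", "phenyl", "phenyl", "pyridine"]

sim_binary = binary_tanimoto(single, repeated)  # repetition is invisible
sim_count = count_tanimoto(single, repeated)    # repetition lowers similarity
```

With presence/absence features the two lists are indistinguishable, whereas the count-based variant separates them; this is the behaviour the constitutional count descriptors introduced later are meant to recover.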

In the case of de novo design, it is expected that there will be a large number of molecules similar to the initial queries and standard similarity measures will identify these. However, we are particularly concerned with the edges of this space, for example, where a new ring is added or a fragment replacement is made in the core of the molecule. We want to identify these compounds whilst removing those that are truly “dissimilar”. In this regard, the issues described in points 1–5 above should be addressed. Two recently reported articles [29, 38] provide interesting approaches to categorizing chemical structures based on their common scaffold. Both articles clustered compounds by chemical series. Kruger et al.[29] defined the chemical series from human-defined series of Novartis projects and, on this basis, found that a hierarchical clustering method (UPGMA) performed best. We aimed to use an algorithm suitable for 100K-2M structures, whereas these two articles dealt with smaller numbers (40k compounds were used in reference [38]). Clustering (particularly hierarchical clustering) can have high costs for such large sets and, as mentioned by the authors, causes issues for compounds that could belong to more than one cluster. In this paper, we develop the Chemical Annotation Score (CAScore), a new multi-parametric dissimilarity score that can rank compounds based on “similarity” to the initial query set. CAScore addresses the issues raised above by combining different representations of molecular structure and multiple descriptions into a single score. In order to train CAScore we use an approach similar to that employed by Kruger et al.[29], using patent claims information. Since the compounds in patent claims belong to the same Markush structure, we assume that they can be good proxies for the concept of chemical series. The various descriptor contributions in CAScore have been optimized to successfully separate different chemical series (different patents belonging to the same target) from each other. The same score can then be used to rank output from de novo design algorithms in a chemically sensible, reproducible manner without tuneable parameters. In the following sections, we describe the development of CAScore and present results on several de novo design tasks, demonstrating the advantages over standard similarity methods.

Data Sets

Reference data sets (for training and validation) belonging to the same Markush structure[33] were created from patents belonging to the same biological target using the method described in a previous paper[39], with the only difference that the number of required compounds per patent was decreased to 250 for the training set, whereas for the validation no such constraint was used. For the validation set, it was intended to keep some more unusual chemistry (macrocycles, spiro structures). Compounds belonging to one patent were considered to form one chemical series. The initial targets (and the first patents connected to these targets) were chosen from our previous study for transfer learning[39], and additional patents belonging to the same biological target were selected. For the validation set 4 other biological targets were chosen with their corresponding patents (Table 1).

A total of 35,640 unique compounds over 87 patents and 12 biological targets were considered in this work. For the score development (training), 26,116 unique compounds over 43 patents and 8 biological targets were considered. The total set of patents for the training and validation is given in the Supporting Information (Table ST1).

Table 1. The list of biological protein targets for which patents were collected. The number of molecules refers to all molecules found in the patents. The detailed list of the patents can be found in Supporting Information (Table ST1).

Set	Uniprot ID	Name	Gene ID	Num. Patents	Num. Mol.
Training	Q9NY46	Sodium channel protein type 3 subunit alpha	SCN3A	2	1550
Training	P10275	Androgen receptor	AR	3	1449
Training	P43490	Nicotinamide phosphoribosyltransferase	NAMPT	3	2517
Training	P52333, O60674, P23458	Tyrosine-protein kinase JAK	JAK1,2,3	4	2423
Training	O00763	Acetyl-CoA carboxylase 2	ACACB	5	3552
Training	P43405	Tyrosine-protein kinase SYK	SYK	6	4125
Training	P42345	Serine/threonine-protein kinase mTOR	mTOR	7	4022
Training	Q15858	Sodium channel protein type 9 subunit alpha	SCN9A	13	6478
Validation	Q07820	Induced myeloid leukemia cell differentiation protein Mcl-1	MCL1	3	1753
Validation	Q9BUB5	MAP kinase-interacting serine/threonine-protein kinase 1	MKNK1	11	2073
Validation	Q86V67	Phosphodiesterase isozyme 4	PDE4	15	2343
Validation	P11309	Serine/threonine-protein kinase pim-1	PIM1	15	3355

The SureChemBL public database[40] was used to compare the CAScore with fingerprint-based similarity (RDK7 and MFP2, vide infra), which were used in building the CAScore. A version of SureChemBL from 15 July 2021 was downloaded and the compounds were deduplicated. The filtering and processing were done using BIOVIA Pipeline Pilot[41]. During the processing, Pipeline Pilot ECFP4 fingerprints[42] and ChemAxon JChem chemical hashed fingerprints[43] were used. REOS[44] and PAINS A[45] filters were applied to the data set and any molecule hitting those was removed. All compounds having a molecular weight larger than 650 g.mol^-1 were also removed. The remaining compounds were clustered using sphere exclusion clustering[46, 47] with 1024-bit ChemAxon chemical hashed fingerprints at a Tanimoto threshold of 0.70. Clusters having sizes between 10 and 1500 were kept and the cluster centroids were calculated using K-means fingerprint clustering based on the 1024-bit Pipeline Pilot ECFP4 fingerprint. In total, 7,634,938 molecules were collected in 113,171 clusters. CAScore dissimilarity, RDKit[48] MFP2 and RDKit RDK7 Tanimoto dissimilarities were calculated between the centroid and all the compounds of each cluster using 90 CPUs in 2 hours and 6 minutes.
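For readers unfamiliar with sphere exclusion clustering, a minimal pure-Python sketch of one common greedy variant follows. It is not the Pipeline Pilot or ChemAxon implementation used here: fingerprints are represented as sets of on-bit indices, the first unassigned compound is taken as each new centre, and the 0.70 threshold is assumed to be a Tanimoto similarity cut-off. The function names and toy data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sphere_exclusion(fingerprints, threshold=0.70):
    """Greedy sphere exclusion: the first unassigned item becomes a centre and
    every unassigned item with similarity >= threshold joins its cluster."""
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        centre = unassigned[0]
        members = [i for i in unassigned
                   if tanimoto(fingerprints[centre], fingerprints[i]) >= threshold]
        clusters.append(members)
        taken = set(members)
        unassigned = [i for i in unassigned if i not in taken]
    return clusters


def keep_by_size(clusters, smallest=10, largest=1500):
    """Keep only clusters whose size lies within [smallest, largest]."""
    return [c for c in clusters if smallest <= len(c) <= largest]


# Toy fingerprints (hypothetical data, far smaller than 1024 bits).
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3}, {7, 8, 9}, {7, 8, 9, 10}]
clusters = sphere_exclusion(fps, threshold=0.70)
```

Production implementations typically differ in how centres are chosen (for example by neighbour count rather than input order), so cluster membership from this sketch should not be expected to match the tools named above.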

ChEMBL31[4, 49] was used to create a case study for the application of CAScore to the identification of compounds similar to betrixaban in relation to coagulation factor Xa (fXa) inhibition. CAScore dissimilarity, RDKit MFP2 and RDKit RDK7 Tanimoto dissimilarities were calculated between betrixaban and the 2,304,837 compounds available in ChEMBL31 using 90 CPUs in 17 minutes.

Representations, Descriptors and Dissimilarity metrics

RDKit (version 2020.09.5) [48] was used to fragment the structures and generate most of the molecular representations, descriptors and metrics. Two metrics were calculated separately: the maximum common edge subgraph (referred to as CAMCS) was calculated using JChem version 18.16[43], and Reduced Graphs [50] were calculated using in-house code implemented in JChem. Seven different molecular representations are considered:

Global representations:

Molecule – Denotes the entire structure, including pharmacophoric representations that describe groups and atoms.

ScaffoldAB – Bemis-Murcko framework with explicit atom type and bonds specification.

ScaffoldABexo – Bemis-Murcko framework with atoms and bonds specification and ‘*’ atoms representing the attachment points to the framework.

Local representations:

Ring – Containing one or more ring systems.

Linker – Non-ring systems usually connecting rings. Linkers should have at least two attachment points.

RingSubstitution – All substituents attached to a Ring, which are not Linkers are RingSubstitutions.

LinkerSubstitution – Non-cyclic side chains branched from a Linker.

For each input molecule, the generated fragments are merged to generate one object with several fragments. The presentation of the hierarchical fragmentation is shown in Figure 1.

Molecular descriptors can be calculated for each representation and can be separated into three categories:

2D hashed fingerprints – All representations were described by two hashed fingerprints: Morgan (or circular) fingerprints and a path-based fingerprint as implemented in RDKit. In our experience, circular fingerprints tend to favour changes in ring connections, whereas the path-based fingerprint is better at capturing changes in chains and framework extensions.
1. MFP – 1024 bits with a radius of 2 was used.
2. RDK – 1024 bits with a maximal path length of 7, each fragment encoded into maximally 3 bits, branchedPaths option set to False.
Constitutional descriptors (0D and 1D) – These are simple descriptors expressing feature counts. These descriptors were incorporated to address some of the drawbacks of 2D fingerprints.
1. NumAromatic and NumAliphatic – Used in case of Ring representation only. Number of aromatic and aliphatic rings, respectively.
2. NumFused and NumSpiroCenter – Used in case of Ring representation only. Number of fused and spiro rings.
3. NumDoubleBond – Used in all local representations. Number of Z/E double bonds.
4. Counts of N, O, S, P, C atoms – Used in all local representations, but carbon count is only used to describe structural modifications and is not used in Ring, to reduce the dissimilarity between the compounds for which an enlargement/reduction or modification of ring is observed.
5. NumCharge – Used for Linker, LinkerSubstitution and RingSubstitution. Gives the count of charged atoms (anions/cations).
6. NumChiral – Used for Molecule. Gives the number of chiral centers.
7. NumTripleBond – Used for Linker, LinkerSubstitution and RingSubstitution. Gives the number of triple bonds.
8. Counts of halogens F, Cl, Br, I – Used in all local representations.
9. NumFragments – Number of all fragments as used in local representation (see Figure 1, one descriptor for the total molecule).
Graph-based descriptors – These are used for the full Molecule representation.
1. Reduced Graph (RG) – Reduced Graph is a 2D pharmacophoric representation of the molecule, where functional groups are replaced with superatoms representing the 2D pharmacophoric features [50]. Here we use the fingerprint derived from the RG representation.
2. Maximum common edge subgraph (MCS) between the query and target molecule – calculated using ChemAxon.

Similarity metrics use the molecular descriptors to give a final value (coefficient) of dissimilarity between two structures as 1 - similarity. A dissimilarity coefficient of 0 means identity for the descriptor and metric used, though not necessarily structural identity (e.g. as in the case of nitrogen count). For fingerprints, the Tanimoto coefficient provides a whole-molecule dissimilarity (0 to 1). The more general Tversky metric provides a subset logic, where a dissimilarity coefficient of 0 means that one of the structures is fully contained in the other. Other descriptors will give a distance that, theoretically at least, is unbounded at the upper end. The summary of all the metrics considered is given in Table 2.

Table 2

List of different metrics to calculate dissimilarity from the molecular descriptors.
Name	Equation^*	Description
$AMCS$	${AMCS}_{(q,t)}=1-\frac{{NAt}_{qt}}{{NAt}_{q}}$	Atom based maximum common substructure (AMCS) metric. A value of 0 is obtained when the query structure is entirely contained in target. A value of 1 is observed if query structure is not present in target compound.
$SAD$	${SAD}_{(q,t)}=\sum _{i=1}^{n}\left|{x}_{i,t}-{x}_{i,q}\right|$	Sum of Absolute Differences metric: It can be calculated for the constitutional descriptors. Distance is not normalized (it does not have an upper bound) and takes a value of 0 if two compounds have exactly the same descriptor values.
$CAMCS$	${CAMCS}_{(q,t)}=1-\frac{{NAt}_{qt}}{{NAt}_{q}+{NAt}_{t}-{NAt}_{qt}}$	ChemAxon based maximum common substructure (CAMCS) metric. A value of 0 means that the two compounds are identical, whereas 1 means a complete dissimilarity.
$Tc$	${Tc}_{(q,t)}=1-\frac{QT}{\left(Q+T-QT\right)}$	Tanimoto metric – calculated for the RDK, MFP and RG fingerprints. It is 0 if the target compound is very similar (identical in the descriptor space) to the query compound and 1 if compounds are completely dissimilar.
$Tv$	${Tv}_{(q,t)}=1-\frac{QT}{\left(\alpha Q+\beta T-QT\right)}$	Tversky metric – calculated for the MFP and RDK fingerprints. It depends on the α and β coefficients, which determine the weights of the query and target structures. If α = 1 and β = 0, the metric is a substructure-like dissimilarity; if α = 0 and β = 1, it is a superstructure-like dissimilarity. The final value of the metric is bounded between 0 (similar) and 1 (dissimilar)[51].
^* - NAt: number of atoms; q: query; t: target; qt: common part between query and target; Q: number of on bits of the query; T: number of on bits of the target; QT : the number of on bits in both query and target; x_i: i^th constitutional descriptor; N: the size of the fingerprint; α,β: coefficients used for Tversky metric
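The metrics of Table 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in this work: fingerprints are represented as sets of on-bit indices, the atom counts for AMCS and CAMCS are supplied by the caller (no substructure search is performed), and the Tversky helper uses the textbook set-difference form QT / (α(Q − QT) + β(T − QT) + QT). That form is bounded between 0 and 1 and reduces to the Tanimoto expression for α = β = 1; whether it matches the implementation behind Table 2 term by term is an assumption. All function names are hypothetical.

```python
def tanimoto_dissimilarity(q_bits, t_bits):
    """Tc: 1 - QT / (Q + T - QT) on sets of on-bit indices."""
    common = len(q_bits & t_bits)
    denom = len(q_bits) + len(t_bits) - common
    return 1.0 - common / denom if denom else 0.0


def tversky_dissimilarity(q_bits, t_bits, alpha, beta):
    """Tv in the textbook form (assumed): 1 - QT / (a*(Q-QT) + b*(T-QT) + QT)."""
    common = len(q_bits & t_bits)
    denom = (alpha * (len(q_bits) - common)
             + beta * (len(t_bits) - common)
             + common)
    return 1.0 - common / denom if denom else 0.0


def sad(q_desc, t_desc):
    """SAD: sum of absolute differences between constitutional descriptors."""
    return sum(abs(t - q) for q, t in zip(q_desc, t_desc))


def amcs(n_common, n_query):
    """AMCS: 1 - NAt_qt / NAt_q (0 when the query is fully contained)."""
    return 1.0 - n_common / n_query


def camcs(n_common, n_query, n_target):
    """CAMCS: 1 - NAt_qt / (NAt_q + NAt_t - NAt_qt)."""
    return 1.0 - n_common / (n_query + n_target - n_common)


# Toy query fully contained in a larger target (hypothetical data).
query, target = {1, 2, 3}, {1, 2, 3, 4, 5, 6}
tc = tanimoto_dissimilarity(query, target)
tv_sub = tversky_dissimilarity(query, target, alpha=1, beta=0)
tv_super = tversky_dissimilarity(query, target, alpha=0, beta=1)
sad_value = sad((2, 1, 0), (3, 1, 2))
```

In this toy case the substructure-like Tversky setting returns 0 because every query bit is present in the target, while the Tanimoto and superstructure-like values remain at 0.5.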

All possible combinations of molecular representations, descriptors and metrics were generated as presented in Table 3. This way 35 dissimilarity coefficients could be obtained for each compound pair.

Table 3

The total list of calculated dissimilarity coefficients.
Molecule	ScaffoldAB	ScaffoldABexo	Ring	Linker	RingSubstitution	LinkerSubstitution
RDK7_Tc	RG	RDK7_Tc	RDK7_Tc	RDK7_Tc	RDK7_Tc	RDK7_Tc
RDK7_Tv	AMCS	RDK7_Tv	RDK7_Tv	RDK7_Tv	RDK7_Tv	RDK7_Tv
MFP2_Tc	CAMCS	MFP2_Tc	MFP2_Tc	MFP2_Tc	MFP2_Tc	MFP2_Tc
MFP2_Tv		MFP2_Tv	MFP2_Tv	MFP2_Tv	MFP2_Tv	MFP2_Tv
SAD			SAD	SAD	SAD	SAD
RG
AMCS
CAMCS
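As a cross-check of the count of 35, the combinations in Table 3 can be enumerated programmatically. The sketch below simply transcribes the table columns; the `Representation_Coefficient` naming is an assumption made for illustration.

```python
FINGERPRINT_COEFFS = ["RDK7_Tc", "RDK7_Tv", "MFP2_Tc", "MFP2_Tv"]
GRAPH_COEFFS = ["RG", "AMCS", "CAMCS"]
LOCAL_REPRESENTATIONS = ["Ring", "Linker", "RingSubstitution", "LinkerSubstitution"]

# One entry per column of Table 3.
coefficients = {
    "Molecule": FINGERPRINT_COEFFS + ["SAD"] + GRAPH_COEFFS,
    "ScaffoldAB": GRAPH_COEFFS,
    "ScaffoldABexo": FINGERPRINT_COEFFS,
}
for representation in LOCAL_REPRESENTATIONS:
    coefficients[representation] = FINGERPRINT_COEFFS + ["SAD"]

# Flatten to names such as "Ring_RDK7_Tc" (naming pattern assumed).
names = [f"{rep}_{coeff}" for rep, coeffs in coefficients.items() for coeff in coeffs]
n_coefficients = len(names)
```

The per-column counts are 8 (Molecule), 3 (ScaffoldAB), 4 (ScaffoldABexo) and 5 for each of the four local representations.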

Score definition

A score, named Chemical Annotation Score (CAScore), was defined as a linear combination of the 35 dissimilarity descriptors. Initially all dissimilarity descriptors were assigned a weight of 1, so that the raw score is the sum of all descriptors. A representative structure was taken from each patent belonging to the same biological target and the CAScore was calculated between the representative structure and all the patent structures belonging to the same target. The weights of the descriptors were then optimized to increase the separation of the compounds coming from the same patent as the representative structure (“own” patent) and the compounds from other patents belonging to the same target (“foreign” patents, Fig. 2).

Three types of representative query structures were defined for each patent in order to check the influence of the query selection on score generation:

1. Centroid: closest compound to the patent centroid using a K-means fingerprint clustering based on the 1024-bit MFP2 fingerprint.

2. Active: the most active compound according to patent activity data

3. Random: a randomly picked compound

All dissimilarity coefficients (Table 3) were calculated between each query and all patent molecules belonging to the same target. To optimize the weights of each dissimilarity coefficient, a classification model was constructed using a Random Forest (RF) as implemented in the scikit-learn Python library. In our current approach RF is not used for classical machine learning prediction, but to optimize the weights of the dissimilarity descriptors to achieve good separation between the patents. The optimization should lead to a score which is able to separate the compounds from their own patent and the compounds coming from the foreign patents for the same target. The AUCC[52] metric was used for determining the quality of the separation between the two categories and was used as the target function for the RF to maximise. Eight data sets and 43 × 3 queries (training patents × types of representative structures) were used. As an example, for target gene AR there were 3 patents included with 1449 total structures. Taking the centroid of patent 1 (US 2010016279 A1, Table ST1 in Supporting Information), all compounds from this patent were tagged as own patent (847 compounds) and all compounds coming from the other two patents were tagged as foreign patent (602 compounds). The sum of own and foreign patent compounds should be equal to the number of total structures for each target. The same process was applied to the most active and random query structures of patent 1. The process was repeated for patent 2 (foreign patents being patents 1 and 3) and patent 3 (foreign patents being patents 1 and 2). The data were combined and used as the training set for the RF. In this data set, each patent will appear 3 times as own patent (3 different query types) and 3 × (n − 1) times as foreign patent, where n is the number of patents within the target. The same logic was applied to all the other targets to generate the full training set.
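The own/foreign tagging for a single target can be sketched as follows. The function `label_training_rows` and the compound identifiers are hypothetical; the patent sizes reproduce the AR example (847 own and 602 foreign compounds for patent 1), but the split of the 602 foreign compounds between patents 2 and 3 is made up for illustration, since it is not stated here.

```python
def label_training_rows(patents, query_types=("centroid", "active", "random")):
    """Build (query patent, query type, compound, label) rows for one target.

    `patents` maps a patent id to the list of its compound ids.
    """
    rows = []
    for own_patent in patents:
        for query_type in query_types:
            for patent, compounds in patents.items():
                label = "own" if patent == own_patent else "foreign"
                for compound in compounds:
                    rows.append((own_patent, query_type, compound, label))
    return rows


# AR target: 1449 structures over 3 patents (sizes of patents 2 and 3 assumed).
ar_patents = {
    "patent_1": [f"p1_{i}" for i in range(847)],
    "patent_2": [f"p2_{i}" for i in range(300)],
    "patent_3": [f"p3_{i}" for i in range(302)],
}
rows = label_training_rows(ar_patents)

centroid_p1 = [r for r in rows if r[0] == "patent_1" and r[1] == "centroid"]
own = sum(1 for r in centroid_p1 if r[3] == "own")
foreign = sum(1 for r in centroid_p1 if r[3] == "foreign")
```

Each (query patent, query type) pair contributes one full pass over the 1449 structures of the target, so a patent is seen 3 times as own and 3 × (n − 1) times as foreign, as stated above.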
At each step of the optimization, new coefficient weights define a new CAScore. Compounds belonging to the same target are ranked according to this changed score and the AUCC is calculated to check the separation of the patents. In the next step the weights are modified in order to increase the AUCC. An RF with 100 decision trees was used. The patents contain different numbers of compounds and therefore provide an imbalanced data set. A balanced class weight was applied to reduce the impact of this imbalance. This procedure was repeated 1000 times using 40% of the training dataset as the out-of-bag fraction to assess the variability of the generated weights (Supporting Information Figure S1). The weights were extracted from the RF optimization. Some dissimilarity coefficients either had very low weights after the optimization or were highly correlated with each other. Therefore, the dissimilarity coefficients were ranked by weight and gradually removed to estimate the impact on the AUCC. In this manner, 22 of the initial 35 components were removed, leaving 13 components whilst preserving the AUCC on overall patent separation. The final CAScore provides a suitable generalisation of the existing trends between the local and global dissimilarity coefficients (Eq. 1).

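$$
\begin{aligned}
\mathrm{CAScore} ={}& 0.172 \cdot \mathrm{ScaffoldABexo\_RDK7\_Tc} + 0.149 \cdot \mathrm{Molecule\_RDK7\_Tc} + 0.132 \cdot \mathrm{Molecule\_MFP2\_Tc} \\
&+ 0.093 \cdot \mathrm{ScaffoldABexo\_RDK7\_Tv} + 0.090 \cdot \mathrm{Ring\_RDK7\_Tc} + 0.069 \cdot \mathrm{ScaffoldABexo\_MFP2\_Tc} \\
&+ 0.051 \cdot \mathrm{ScaffoldAB\_AMCS} + 0.020 \cdot \mathrm{ScaffoldAB\_RG} + 0.011 \cdot \mathrm{Linker\_SAD} + 0.009 \cdot \mathrm{Molecule\_SAD} \\
&+ 0.008 \cdot \mathrm{Ring\_SAD} + 0.002 \cdot \mathrm{RingSubstitution\_SAD} + 0.001 \cdot \mathrm{LinkerSubstitution\_SAD}
\end{aligned}
$$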

Equation 1: Weighted CAscore equation. CAscore ranges between 0 and infinity.
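Equation 1 is a plain weighted sum and can be evaluated as sketched below, assuming the 13 dissimilarity coefficients have already been computed and are supplied in a dictionary. The key names follow a `Representation_Coefficient` pattern chosen for illustration, and the `cascore` function is a hypothetical helper, not the released implementation.

```python
# Weights transcribed from Equation 1.
CASCORE_WEIGHTS = {
    "ScaffoldABexo_RDK7_Tc": 0.172,
    "Molecule_RDK7_Tc": 0.149,
    "Molecule_MFP2_Tc": 0.132,
    "ScaffoldABexo_RDK7_Tv": 0.093,
    "Ring_RDK7_Tc": 0.090,
    "ScaffoldABexo_MFP2_Tc": 0.069,
    "ScaffoldAB_AMCS": 0.051,
    "ScaffoldAB_RG": 0.020,
    "Linker_SAD": 0.011,
    "Molecule_SAD": 0.009,
    "Ring_SAD": 0.008,
    "RingSubstitution_SAD": 0.002,
    "LinkerSubstitution_SAD": 0.001,
}


def cascore(coefficients, weights=CASCORE_WEIGHTS):
    """Weighted sum of precomputed dissimilarity coefficients (Eq. 1)."""
    missing = set(weights) - set(coefficients)
    if missing:
        raise KeyError(f"missing coefficients: {sorted(missing)}")
    return sum(w * coefficients[name] for name, w in weights.items())


# Two limiting inputs: identical descriptors, and every term set to 1.
identical = {name: 0.0 for name in CASCORE_WEIGHTS}
all_ones = {name: 1.0 for name in CASCORE_WEIGHTS}
score_identical = cascore(identical)
score_all_ones = cascore(all_ones)

# Contribution ceiling of the eight bounded (fingerprint and graph) terms.
bounded_ceiling = sum(w for name, w in CASCORE_WEIGHTS.items()
                      if not name.endswith("_SAD"))
```

The eight fingerprint and graph terms are each bounded by 1 and can therefore contribute at most 0.776 in total; the score has no upper bound only because the five SAD terms are unbounded.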

CAscore validation

Figure 3 shows the AUCC values before and after the RF optimization both for the patents used in the training (43 patents) and a set of 44 patents which were not used in the training. The figure also depicts the differences in performance for the three different ways of selecting the representative query structures.

Initially, 7 and 9 patents showed an AUCC higher than 0.99 for the training set and the validation set, respectively. This improved to 20 and 29 after incorporating weights into the CAScore equation. The optimization of the CAScore weights increased the AUCC by 0.1 on average over all patents. All patents in the validation set could be separated well, showing that the weighted score is able to identify compounds from the same chemical series and differentiate them from the others. As noted in the Data Sets section, the validation set contained structures with more unusual chemistry (macrocycles, spiro structures); good separation performance on this set therefore shows that there is some transferability to other chemotypes. One positive (good separation, Figure 4) and one negative (bad separation, Figure 5) example from the validation patent set are shown here. For both examples, chemical series of low diversity were selected, which would challenge any score in terms of separation. The figures show the centroid query structure and the top 15 scaffolds, using the ScaffoldABexo representation (scaffold with attachment points marked), for the own and foreign patents. All query structures and the related results about patent separation are available in the Supporting Information (Figures S2-S256).

Patent WO 2014/022752 A1 from the PIM1 target was well separated for all three query types (AUCC ≥ 0.969; Figure 4 shows only the centroid query case). Figure 4A shows the query structure, Figure 4B shows the scaffolds (each scaffold is numbered) present in the patent from which the query structure came (own patent), Figure 4C shows the density distributions of CAScore for own patent and foreign patents and Figure 4D shows the scaffolds from foreign patents (the same layout applies to Figure 5). Both own patent and foreign patents contain a quinoline and triazolo pyridine containing core. If we compare the closest structures to the centroid query (Figure 4A) in the own patent (structure B1 in Figure 4B) and the foreign patents (structure D1 in Figure 4D) we see that the main difference between the two is that the structure in the foreign patent contains another attachment with a pyrrolidine group. Although these two scaffolds are well separated by the MFP2 fingerprint, MFP2 would mix some of the own and foreign patent scaffolds at larger distances. Structures B8, B13 and B14 have almost the same MFP2 dissimilarity value, whereas structures B6, B11, B12 and B15 have larger dissimilarity than D1; this would eventually cause a mixing of the scaffolds coming from the different patents if MFP2 were used. CAScore only mixes some compounds from B9 and B11 with D1, since these have similar scaffolds and differ from the query mostly by the addition of a larger aliphatic ring (which in the cases of B9 and B11 is even larger than for D1).

A poor separation for all three query types was observed for patent WO 2015/004024 A1 from the MKNK1 target with an AUCC of 0.72 (Figure 5). This arises from the fact that the own patent and foreign patents contain compounds with a highly conserved scaffold. The score could still separate the scaffold belonging to the query (query: Figure 5A, its scaffold is B1 in Figure 5B) from the other scaffolds, with a CAScore range between 0 and 0.05. However, there is already an overlap between scaffolds B2 and D1 (Figure 5B and Figure 5D, respectively). Scaffolds B2 and B3 show small modifications by exploring the position of the pyridine nitrogen, or by changing the pyridine to a pyridone while preserving the connectivity, stereochemistry and shape observed in the query structure. By comparison, D1 and D2 provide similar types of small transformations by replacing the pyridine by a benzene, or even an indazole by a pyrazolopyridine. Since the changes from the query are of similar magnitude for B2 and D1 (a core-ring nitrogen is moved by one position vs. the query in the case of B2 and the same nitrogen is changed to carbon in D1), it would be a challenge to separate those completely. B11 and B2 have the same MFP2 dissimilarity of 0.31 compared to the query. CAScore identifies B11 as less similar to the query than the B2, B3, D1 and D2 scaffolds, because it changes both one sulphur to a nitrogen and the attachment position. Scaffolds B4 to B10 and B12 to B15 suggest scaffold extensions from small and simple to large and complex modifications such as the addition of an azetidine, piperidine, pyrrolidine, piperazine, morpholine or even an oxa-azaspiro ring. Although B7 and B9 provide the same piperazine extension, CAScore is able to distinguish and prioritise the B7 scaffold, which has the same amide stereochemistry as the query. Scaffolds D3 to D15 also suggest a scaffold extension of structure D1 and are observed in the CAScore range of 0.1 to 0.3. The poor separation is hardly surprising given the proximity of the structures in these patents. In fact, merging into one larger group can be seen as an advantage for the primary use case of CAScore in ranking de novo design compounds.

As mentioned in the introduction, the two most commonly used fingerprints for similarity searching are path-based and circular fingerprints, here represented by the RDK7 and MFP2 fingerprints. To illustrate the differences between CAScore and these fingerprints, we have taken a filtered set of the SureChemBL public database (see the Data Sets section) and computed the circa 7.6M MFP2 and RDK7 Tanimoto dissimilarity values within the clusters. These are plotted as a function of CAScore in Figure 6. Overall, CAScore and RDK7 have less skewed distributions than MFP2, which is shifted to higher dissimilarity values, mostly falling within the interval 0.31-0.86. However, care must be taken not to overinterpret the RDK7 plot, as the clusters were generated with the highly correlated ChemAxon path-based fingerprint. Thus, the highly dissimilar area (>0.8) for RDK7 is not populated (right hand side of Figure 6). Note also that on the plot there is a significant number of structures where the dissimilarity coefficient equals 1.0 for both MFP2 and RDK7. In many cases these are atypical structures that the REOS and PAINS filters were not able to remove (Supporting Information).

Table 4 shows selected cases from the SureChEMBL database where CAScore and the two hashed fingerprints do not agree well in terms of similarity. As highlighted above, the fingerprint methods are bounded between 0 and 1, whilst CAScore has no upper bound on dissimilarity, so an exact comparison is difficult. For guidance, we have found a CAScore of ~0.4 to be a reasonable threshold for compounds belonging to the same chemical series. The corresponding dissimilarity thresholds for MFP2 and RDK7 are around 0.3 and 0.2, respectively [37]. These structures show the drawbacks of the hashed fingerprints discussed in the introduction (vide supra). Example 1 shows a pair of smaller molecules for which the MFP2 Tanimoto coefficient gives a large dissimilarity (a good example of the quantization issue). CAScore correctly determines a small difference between these two compounds, whereas the path-based RDK7 considers the two structures to be the same. Example 2 shows a pair of compounds in which a couple of slightly different substituents appear with different connectivity. Both CAScore and RDK7 give reasonable coefficients, but MFP2 gives a very low similarity for the pair, mainly due to the multiple unique features issue and the different connectivity, to which MFP2 is sensitive. Example 3 is an excellent case of the feature repetition issue. Although SCHEMBL13286927 is almost twice as large as SCHEMBL2914696, both the MFP2 and the RDK7 dissimilarity coefficients are relatively small, suggesting closely similar structures, whereas CAScore gives a larger dissimilarity between them. Examples 4 and 5 show the disconnection issue for MFP2 and RDK7, respectively. MFP2 is not able to distinguish the structures in example 4, since a radius of 2 finds the same fragments with the same neighbours in both, whereas in example 5 it is RDK7 that is unable to distinguish them. Finally, example 6 shows the dissimilarity between the same structure in its different tautomeric forms. Like the fingerprints, CAScore is sensitive to the tautomeric form. Therefore, the use of CAScore (as well as RDK7 and MFP2) requires tautomer standardization in such cases.
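The feature repetition issue in example 3 can be reproduced with a toy calculation. The sketch below is illustrative only: the feature lists are hypothetical labels rather than real RDK7 or MFP2 bits, and the count-based coefficient is shown purely as a contrast, not as a component of CAScore.

```python
from collections import Counter


def binary_tanimoto_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| computed on the sets of distinct features."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty feature sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def count_tanimoto_dissimilarity(a, b):
    """Count-aware variant: 1 - sum of minimum counts / sum of maximum counts."""
    count_a, count_b = Counter(a), Counter(b)
    keys = set(count_a) | set(count_b)
    shared = sum(min(count_a[k], count_b[k]) for k in keys)
    total = sum(max(count_a[k], count_b[k]) for k in keys)
    if total == 0:
        return 0.0
    return 1.0 - shared / total


# Hypothetical feature labels: the "dimer" repeats every feature of the monomer.
monomer = ["amide", "para-benzene", "pyridine"]
dimer = monomer * 2

binary_d = binary_tanimoto_dissimilarity(monomer, dimer)  # presence/absence only
count_d = count_tanimoto_dissimilarity(monomer, dimer)    # sensitive to repetition
```

With presence/absence features the two molecules are indistinguishable, although one is twice the size of the other; only a measure that accounts for repetition or size separates them.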

The primary driver for developing CAScore was to address the problem of identifying the compounds within a large set of de novo generated molecules that would be suitable for further analysis and selection. To illustrate this use case and compare it with the standard practice of using path-based and circular fingerprints, we present a case study exploring the chemical space of the ChEMBL database near Betrixaban, a factor Xa (fXa) inhibitor. The fXa compounds were removed from the ChEMBL database to form an independent fXa dataset. The recovery of compounds from the fXa dataset versus the ChEMBL database is shown in Figure 7. CAScore is the only dissimilarity measure that consistently recovers a higher percentage of the fXa dataset than of the ChEMBL database. For comparison, at 20% recovery of the fXa dataset CAScore includes only 8% of the ChEMBL database, while the MFP2 and RDK7 fingerprints include more than 57% and 55% of the ChEMBL database, respectively. With the standard fingerprint methods, recovery of the fXa dataset therefore comes at the expense of covering much of the chemical space of the ChEMBL database, which dilutes the compounds of interest for factor Xa when MFP2 and RDK7 are used.
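The comparison behind Figure 7 can be sketched as follows: choose the smallest distance threshold that recovers a given fraction of the fXa dataset, then measure how much of the background database falls within the same threshold. The function name and the distance values are hypothetical and serve only to show the arithmetic; this is not the authors' workflow.

```python
import math


def background_fraction_at_recovery(active_distances, background_distances, recovery):
    """Return (threshold, background fraction) for a target recovery of actives.

    The threshold is the smallest distance at which at least `recovery`
    (a fraction between 0 and 1) of the actives lie at or below it.
    """
    if not active_distances or not background_distances:
        raise ValueError("both distance lists must be non-empty")
    ranked = sorted(active_distances)
    n_needed = max(1, math.ceil(recovery * len(ranked)))
    threshold = ranked[n_needed - 1]
    included = sum(1 for d in background_distances if d <= threshold)
    return threshold, included / len(background_distances)


# Hypothetical distances to the query for ten actives and ten background compounds.
fxa = [0.05, 0.10, 0.20, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90, 1.20]
chembl = [0.08, 0.40, 0.55, 0.65, 0.75, 0.85, 0.95, 1.10, 1.30, 1.50]

threshold, fraction = background_fraction_at_recovery(fxa, chembl, 0.20)
```

A metric that is well suited to the task gives a small background fraction at a given recovery; a large fraction corresponds to the dilution described above.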

CAScore extracts 178 molecules from the fXa dataset, whereas MFP2 and RDK7 fingerprints prioritise only 24 and 22 molecules respectively from the same dataset. Figure 8 illustrates the diversity of scaffolds selected by each approach at the above dissimilarity thresholds. A comparison of structural modifications against the original Betrixaban scaffold was explored to see the quality of the ranking.

The A1 scaffold selected by CAScore is equivalent to the B2 and C1 scaffolds selected by the MFP2 and RDK7 fingerprints. CAScore identified and prioritised 104 molecules with this scaffold, while MFP2 and RDK7 identified only 22 and 21 molecules, respectively. The exploration of the near SAR space is therefore more exhaustive with CAScore than with the reference methods. Scaffold B1 is prioritised by the MFP2 fingerprint as the most similar to Betrixaban. This illustrates the feature repetition issue outlined above: MFP2 is insensitive to the repetition of the para-substituted benzene, so the scaffold appears highly similar to Betrixaban for this fingerprint. In comparison, the same scaffold is selected by CAScore as A4 and is not selected at all by the RDK7 fingerprint. The number of molecules identified by CAScore for this scaffold (24) is also higher than for MFP2 (1). This indicates that MFP2 is very sensitive to local modifications of compounds, which reduces the number of compounds selected to exemplify a scaffold. CAScore preferentially identifies scaffolds A2 and A3 over A4 (the latter is preferred by MFP2, vide supra). A2 shows a benzene to pyridine modification, whereas A3 shows an ortho- to meta-pyridine change, both being ring-level changes. These changes to the scaffold are penalized by the standard hashed fingerprints, whereas CAScore still places them at relatively close similarity.

Scaffold A7 (equivalent to B3 and C2) is selected by all metrics as an interesting extension of the scaffold. CAScore prioritizes the ring-addition scaffolds (A5, A6, A7, A9 and A10) based on their ring size and the presence of heteroatoms within the added ring systems (the added rings are azetidine, pyrrolidine, piperidine, hexamethyleneimine and morpholine, respectively). A8 is an interesting modification of the central region of the scaffold by a thiophene. A further 17 minor modifications, from A11 to A28, suggest more extensive transformations of the scaffold. All of these structures, however, conserve a significant part of the original scaffold and have a CAScore between 0.26 and 0.30. This exercise showed that CAScore can retain more relevant structures at the cutoff given above than the other two fingerprints used.
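The scaffold-level comparison above amounts to filtering compounds by a distance cutoff, counting how many compounds exemplify each scaffold, and ordering the scaffolds by their closest member. A minimal sketch is given below; the scaffold labels, distances and the 0.30 cutoff are hypothetical values chosen for illustration.

```python
def summarise_scaffolds(records, cutoff):
    """Summarise (scaffold, distance) records that fall within the cutoff.

    Returns a list of (scaffold, n_compounds, closest_distance) tuples,
    ordered by the closest distance to the query.
    """
    summary = {}
    for scaffold, distance in records:
        if distance > cutoff:
            continue  # compound is too far from the query to be selected
        count, best = summary.get(scaffold, (0, float("inf")))
        summary[scaffold] = (count + 1, min(best, distance))
    rows = [(name, count, best) for name, (count, best) in summary.items()]
    return sorted(rows, key=lambda row: row[2])


# Hypothetical records: (scaffold label, distance of one compound to the query).
records = [
    ("core", 0.05),
    ("core", 0.12),
    ("aza-core", 0.18),
    ("ring-added", 0.28),
    ("unrelated", 0.45),
    ("core", 0.31),
]

summary_rows = summarise_scaffolds(records, cutoff=0.30)
```

Running the same summary with distances from different metrics gives a direct view of how many compounds each metric retains per scaffold.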

In general, CAScore provides a good quality ranking for categorising compounds that contain both side-chain and core changes compared with the query. The new score can incorporate more compounds than the investigated fingerprints, particularly those with small changes in the reference scaffold. For example, scaffolds A2, A3, A14 and A18 are very similar to the original scaffold, but all of them contain small modifications to it and are penalized by MFP2 and RDK7. This penalization is particularly strong in the case of A14. Therefore, CAScore can prioritize compounds of interest for SAR exploration which would not be selected by the fingerprint-based methods.

In this paper we have proposed the Chemical Annotation Score (CAScore) as a novel metric for computing chemical distance. The CAScore is composed of terms from robust cheminformatics methods and can be readily understood. The terms of the score have been derived from an analysis of patents by using a Random Forest to optimise the AUCC for grouping the compounds from the same patent relative to other patents for the same biological target.
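For readers unfamiliar with the AUCC, the sketch below shows one way to evaluate it for a given distance, assuming the pairwise formulation of Jaskowiak et al.: every compound pair is labelled as same-group or different-group (here, same patent or not), and the score is the probability that a same-group pair is closer than a different-group pair. It covers only the evaluation step, not the Random Forest optimisation, and the matrix and patent labels are hypothetical.

```python
from itertools import combinations


def aucc(distance_matrix, groups):
    """Area under the ROC curve over compound pairs.

    distance_matrix: square, symmetric list of lists of pairwise distances.
    groups: group label (e.g. patent identifier) for each compound.
    Returns the fraction of (same-group pair, different-group pair)
    comparisons in which the same-group pair has the smaller distance;
    ties count as one half.
    """
    same, different = [], []
    for i, j in combinations(range(len(groups)), 2):
        bucket = same if groups[i] == groups[j] else different
        bucket.append(distance_matrix[i][j])
    if not same or not different:
        raise ValueError("need at least one same-group and one different-group pair")
    wins = 0.0
    for d_same in same:
        for d_diff in different:
            if d_same < d_diff:
                wins += 1.0
            elif d_same == d_diff:
                wins += 0.5
    return wins / (len(same) * len(different))


# Hypothetical example: four compounds from two patents.
patents = ["P1", "P1", "P2", "P2"]
distances = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.4, 0.8],
    [0.6, 0.4, 0.0, 0.5],
    [0.7, 0.8, 0.5, 0.0],
]

score = aucc(distances, patents)
```

A value of 1.0 means every same-patent pair is closer than every cross-patent pair, and 0.5 corresponds to a distance with no discriminating power. The double loop is quadratic in the number of pairs, so a rank-based AUC would be preferable for large sets.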

The application of CAScore is focussed on the problem of identifying, from a large pool of de novo designed compounds, the most relevant chemical space for SAR generation. To illustrate this use case we have applied CAScore to an example data set from ChEMBL and compared its performance with standard fingerprint-based similarity methods. The results show that fingerprint-based methods are not well suited to this task. CAScore was able to annotate eight times as many compounds as relevant compared to fingerprint-based methods.

In summary, CAScore is a novel metric for chemical distance that has been parameterised to identify relevant compounds for the purposes of SAR Generation in an automated design setting where many thousands of compounds may be generated.

Ethical Approval

Not applicable

Competing interests

SDP, PP and BC are employees of GlaxoSmithKline.

Authors’ contributions

SDP and BC conceived the approach. BC and PP implemented the method and ran the calculations. BC, PP and SDP wrote the manuscript. All authors approved the final manuscript.

Funding

Not applicable

Availability of data and materials

All datasets used in this analysis were extracted from the ChEMBL and SureChEMBL on-line resources.

Schneider, G. and U. Fechner, Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery, 2005. 4(8): p. 649-663.
Sterling, T. and J.J. Irwin, ZINC 15 – Ligand Discovery for Everyone. Journal of Chemical Information and Modeling, 2015. 55(11): p. 2324-2337.
Kim, S., et al., PubChem Substance and Compound databases. Nucleic Acids Research, 2015. 44(D1): p. D1202-D1213.
Gaulton, A., et al., The ChEMBL database in 2017. Nucleic Acids Research, 2016. 45(D1): p. D945-D954.
Globus, A., J. Lawton, and T. Wipke, Automatic molecular design using evolutionary techniques. Nanotechnology, 1999. 10(3): p. 290-299.
Schneider, G., et al., De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. Journal of Computer-Aided Molecular Design, 2000. 14(5): p. 487-494.
Brown, N., et al., A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. Journal of Chemical Information and Computer Sciences, 2004. 44(3): p. 1079-1087.
Luo, Z., R. Wang, and L. Lai, RASSE: A New Method for Structure-Based Drug Design. Journal of Chemical Information and Computer Sciences, 1996. 36(6): p. 1187-1194.
Gillet, V., et al., SPROUT: A program for structure generation. Journal of Computer-Aided Molecular Design, 1993. 7(2): p. 127-153.
Gillet, V.J., et al., SPROUT: Recent developments in the de novo design of molecules. Journal of Chemical Information and Computer Sciences, 1994. 34(1): p. 207-217.
Mata, P., et al., SPROUT: 3D Structure Generation Using Templates. Journal of Chemical Information and Computer Sciences, 1995. 35(3): p. 479-493.
Nishibata, Y. and A. Itai, Automatic creation of drug candidate structures based on receptor structure. Starting point for artificial lead generation. Tetrahedron, 1991. 47(43): p. 8985-8990.
Pearlman, D.A. and M.A. Murcko, CONCEPTS: New dynamic algorithm for de novo drug suggestion. Journal of Computational Chemistry, 1993. 14(10): p. 1184-1193.
Degen, J., et al., On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem, 2008. 3(10): p. 1503-1507.
Hussain, J. and C. Rea, Computationally Efficient Algorithm to Identify Matched Molecular Pairs (MMPs) in Large Data Sets. Journal of Chemical Information and Modeling, 2010. 50(3): p. 339-348.
Boda, K., T. Seidel, and J. Gasteiger, Structure and reaction based evaluation of synthetic accessibility. Journal of Computer-Aided Molecular Design, 2007. 21(6): p. 311-325.
Vinkers, H.M., et al., SYNOPSIS: SYNthesize and OPtimize System in Silico. Journal of Medicinal Chemistry, 2003. 46(13): p. 2765-2773.
Hartenfeller, M., et al., DOGS: Reaction-Driven de novo Design of Bioactive Compounds. PLOS Computational Biology, 2012. 8(2): p. e1002380.
Blaschke, T., et al., Application of Generative Autoencoder in De Novo Molecular Design. Molecular Informatics, 2018. 37(1-2): p. 1700123.
Segler, M.H.S., et al., Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science, 2018. 4(1): p. 120-131.
Arús-Pous, J., et al., SMILES-based deep generative scaffold decorator for de-novo drug design. Journal of Cheminformatics, 2020. 12(1): p. 38.
Bender, A. and R.C. Glen, Molecular similarity: a key technique in molecular informatics. Organic & Biomolecular Chemistry, 2004. 2(22): p. 3204-3218.
Kubinyi, H., Similarity and Dissimilarity: A Medicinal Chemist's View. Perspectives in Drug Discovery and Design, 1998. 9(0): p. 225-252.
Maggiora, G., et al., Molecular Similarity in Medicinal Chemistry. Journal of Medicinal Chemistry, 2014. 57(8): p. 3186-3204.
Medina-Franco, J.L. and G.M. Maggiora, MOLECULAR SIMILARITY ANALYSIS, in Chemoinformatics for Drug Discovery. 2013. p. 343-399.
Green, D.V.S., et al., BRADSHAW: a system for automated molecular design. Journal of Computer-Aided Molecular Design, 2020. 34(7): p. 747-765.
Holliday, J.D., et al., Analysis and Display of the Size Dependence of Chemical Similarity Coefficients. Journal of Chemical Information and Computer Sciences, 2003. 43(3): p. 819-828.
Willett, P., V. Winterman, and D. Bawden, Implementation of nearest-neighbor searching in an online chemical structure search system. Journal of Chemical Information and Computer Sciences, 1986. 26(1): p. 36-41.
Kruger, F., N. Fechner, and N. Stiefl, Automated Identification of Chemical Series: Classifying like a Medicinal Chemist. Journal of Chemical Information and Modeling, 2020. 60(6): p. 2888-2902.
Bajusz, D., A. Rácz, and K. Héberger, Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics, 2015. 7(1): p. 20.
Flower, D.R., On the Properties of Bit String-Based Measures of Chemical Similarity. Journal of Chemical Information and Computer Sciences, 1998. 38(3): p. 379-386.
Dixon, S.L. and R.T. Koehler, The Hidden Component of Size in Two-Dimensional Fragment Descriptors: Side Effects on Sampling in Bioactive Libraries. Journal of Medicinal Chemistry, 1999. 42(15): p. 2887-2900.
Arif, S.M., J.D. Holliday, and P. Willett, Analysis and use of fragment-occurrence data in similarity-based virtual screening. Journal of Computer-Aided Molecular Design, 2009. 23(9): p. 655.
Liu, P., et al., Accelerating Chemical Database Searching Using Graphics Processing Units. Journal of Chemical Information and Modeling, 2011. 51(8): p. 1807-1816.
Vogt, M., et al., Scaffold Hopping Using Two-Dimensional Fingerprints: True Potential, Black Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. Journal of Medicinal Chemistry, 2010. 53(15): p. 5707-5715.
Willett, P., J.M. Barnard, and G.M. Downs, Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 1998. 38(6): p. 983-996.
Papadatos, G., et al., Analysis of Neighborhood Behavior in Lead Optimization and Array Design. Journal of Chemical Information and Modeling, 2009. 49(2): p. 195-208.
Bandyopadhyay, D., et al., Scaffold-Based Analytics: Enabling Hit-to-Lead Decisions by Visualizing Chemical Series Linked across Large Datasets. Journal of Chemical Information and Modeling, 2019. 59(11): p. 4880-4892.
Amabilino, S., et al., Guidelines for Recurrent Neural Network Transfer Learning-Based Molecular Generation of Focused Libraries. Journal of Chemical Information and Modeling, 2020. 60(12): p. 5699-5713.
Papadatos, G., et al., SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Research, 2015. 44(D1): p. D1220-D1228.
Dassault Systèmes BIOVIA. BIOVIA Pipeline Pilot 22.1.0.2935. Release 2021; San Diego: Dassault Systèmes, 2021.
Rogers, D. and M. Hahn, Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 2010. 50(5): p. 742-754.
JChem. 18.16.0; ChemAxon. (accessed August 31, 2022).
Walters, W.P., M.T. Stahl, and M.A. Murcko, Virtual screening—an overview. Drug Discovery Today, 1998. 3(4): p. 160-178.
Baell, J.B. and G.A. Holloway, New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. Journal of Medicinal Chemistry, 2010. 53(7): p. 2719-2740.
Taylor, R., Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals. Journal of Chemical Information and Computer Sciences, 1995. 35(1): p. 59-67.
Butina, D., Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. Journal of Chemical Information and Computer Sciences, 1999. 39(4): p. 747-750.
Greg Landrum, P.T., Brian Kelley, Serena Riniker, Riccardo Vianello, Nadine Schneider, Andrew Dalke, Eisuke Kawashima, Brian Cole, Samo Turk, Matt Swain, Alexander Savelyev, David Cosgrove, Alain Vaucher, Maciej Wójcikowski, Daniel Probst, Guillaume Godin, Gareth Jones, Vincent F. Scalfani, Axel Pahl, Francois Berenger, J L Varjo, Doliath Gavid, Gianluca Sforna, Jan Holst Jensen. RDKit 2020.09.5. 2020; Available from: .
Gaulton, A., et al., ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 2011. 40(D1): p. D1100-D1107.
Harper, G., et al., The Reduced Graph Descriptor in Virtual Screening and Data-Driven Clustering of High-Throughput Screening Data. Journal of Chemical Information and Computer Sciences, 2004. 44(6): p. 2145-2156.
Kunimoto, R., M. Vogt, and J. Bajorath, Maximum common substructure-based Tversky index: an asymmetric hybrid similarity measure. Journal of Computer-Aided Molecular Design, 2016. 30(7): p. 523-531.
Jaskowiak, P.A., I.G. Costa, and R.J.G.B. Campello, The area under the ROC curve as a measure of clustering quality. Data Mining and Knowledge Discovery, 2022. 36(3): p. 1219-1245.

Table 4 is available in the Supplementary Files section


Name	Equation*	Description
AMCS	\(\mathrm{AMCS}_{(q,t)} = 1 - \frac{\mathrm{NAt}_{qt}}{\mathrm{NAt}_{q}}\)	Atom-based maximum common substructure (AMCS) metric. A value of 0 is obtained when the query structure is entirely contained in the target. A value of 1 is obtained if the query structure is not present in the target compound.
SAD	\(\mathrm{SAD}_{(q,t)} = \sum_{i=1}^{n} \lvert x_{i,t} - x_{i,q} \rvert\)	Sum of Absolute Differences metric, calculated for the constitutional descriptors. The distance is not normalized (it has no upper bound) and takes a value of 0 if the two compounds have exactly the same descriptor values.
CAMCS	\(\mathrm{CAMCS}_{(q,t)} = 1 - \frac{\mathrm{NAt}_{qt}}{\mathrm{NAt}_{q} + \mathrm{NAt}_{t} - \mathrm{NAt}_{qt}}\)	ChemAxon-based maximum common substructure (CAMCS) metric. A value of 0 means that the two compounds are identical, whereas 1 means complete dissimilarity.
Tc	\(\mathrm{Tc}_{(q,t)} = 1 - \frac{QT}{Q + T - QT}\)	Tanimoto metric, calculated for the RDK, MFP and RG fingerprints. It is 0 if the target compound is identical to the query compound in the descriptor space and 1 if the compounds are completely dissimilar.
Tv	\(\mathrm{Tv}_{(q,t)} = 1 - \frac{QT}{\alpha Q + \beta T - QT}\)	Tversky metric, calculated for the MFP and RDK fingerprints. It depends on the α and β coefficients, which determine the weights of the query and target structures. If α = 1 and β = 0, the metric is a substructure-like dissimilarity; if α = 0 and β = 1, it is a superstructure-like dissimilarity. The final value of the metric is bounded between 0 (similar) and 1 (dissimilar) [51].
* NAt: number of atoms; q: query; t: target; qt: common part between query and target; Q: number of on bits of the query; T: number of on bits of the target; QT: number of on bits in both query and target; x_i: i-th constitutional descriptor; N: the size of the fingerprint; α, β: coefficients used for the Tversky metric
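The AMCS, CAMCS, SAD and Tanimoto entries of the table translate directly into code once the atom counts, descriptor values and on-bit sets are available. The sketch below assumes those inputs have already been computed by a cheminformatics toolkit; the function names and example numbers are hypothetical, and the Tversky variant is left out.

```python
def amcs(n_common_atoms, n_query_atoms):
    """AMCS = 1 - NAt_qt / NAt_q (0 when the query is fully contained in the target)."""
    return 1.0 - n_common_atoms / n_query_atoms


def camcs(n_common_atoms, n_query_atoms, n_target_atoms):
    """CAMCS = 1 - NAt_qt / (NAt_q + NAt_t - NAt_qt)."""
    return 1.0 - n_common_atoms / (n_query_atoms + n_target_atoms - n_common_atoms)


def sad(query_descriptors, target_descriptors):
    """Sum of absolute differences between matching constitutional descriptors."""
    if len(query_descriptors) != len(target_descriptors):
        raise ValueError("descriptor vectors must have the same length")
    return sum(abs(t - q) for q, t in zip(query_descriptors, target_descriptors))


def tanimoto_dissimilarity(query_bits, target_bits):
    """Tc = 1 - QT / (Q + T - QT), computed on collections of on-bit indices."""
    q, t = set(query_bits), set(target_bits)
    common = len(q & t)
    denominator = len(q) + len(t) - common
    if denominator == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - common / denominator


# Hypothetical inputs: a 20-atom query and a 25-atom target sharing 15 atoms.
amcs_value = amcs(15, 20)
camcs_value = camcs(15, 20, 25)
sad_value = sad([2, 1, 3], [3, 1, 1])
tc_value = tanimoto_dissimilarity({1, 2, 3, 4}, {3, 4, 5, 6})
```

In this example AMCS (0.25) is smaller than CAMCS (0.5) because AMCS normalises by the query size only, which makes it asymmetric, whereas CAMCS is symmetric in query and target.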
