CAScore validation
Figure 3 shows the AUCC values before and after the RF optimization both for the patents used in the training (43 patents) and a set of 44 patents which were not used in the training. The figure also depicts the differences in performance for the three different ways of selecting the representative query structures.
Initially, 7 and 9 patents had an AUCC higher than 0.99 in the training set and the validation set, respectively. Incorporating weights into the CAScore equation raised these numbers to 20 and 29. Optimizing the CAScore weights increased the AUCC by 0.1 on average across all patents. All patents in the validation set could be separated well, showing that the weighted score is able to identify compounds from the same chemical series and differentiate them from the others. As noted in the Data Sets section, the validation set contained structures with more unusual chemistry (macrocycles, spiro structures); the good separation performance of the score therefore indicates some transferability to other chemotypes. One positive example (good separation, Figure 4) and one negative example (poor separation, Figure 5) from the validation patent set are shown here. Both examples were chosen from chemical series of low diversity, which are challenging for any score to separate. The figures show the centroid query structure and the top 15 scaffolds, in the ScaffoldABexo representation (scaffold with attachment points marked), for the own and foreign patents. All query structures and the related patent separation results are available in the Supporting Information (Figures S2-S256).
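To make the notion of "separation" concrete, the following minimal sketch ranks own-patent compounds against foreign-patent compounds by a dissimilarity score. It is an illustration only: the exact AUCC definition is not reproduced in this section, so a rank-based area under the ROC curve is used here as a stand-in, and the function name `separation_auc` and all score values are hypothetical.

```python
def separation_auc(own_scores, foreign_scores):
    """Rank-based AUC used as a stand-in for AUCC (assumption, not the paper's formula).

    Returns the probability that a randomly chosen own-patent compound has a
    lower dissimilarity to the query than a randomly chosen foreign-patent
    compound. Ties count as half a win.
    """
    if not own_scores or not foreign_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for own in own_scores:
        for foreign in foreign_scores:
            if own < foreign:
                wins += 1.0
            elif own == foreign:
                wins += 0.5
    return wins / (len(own_scores) * len(foreign_scores))


# Hypothetical dissimilarity scores (lower = closer to the query structure).
own = [0.00, 0.03, 0.05, 0.12, 0.20]
foreign = [0.10, 0.15, 0.25, 0.30]

auc = separation_auc(own, foreign)
print(f"separation AUC: {auc:.2f}")
```

A value of 1.0 would mean that every own-patent compound is ranked ahead of every foreign-patent compound, while values near 0.5 would indicate that the two groups are essentially mixed.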
Patent WO 2014/022752 A1 from the PIM1 target was well separated for all three query types (AUCC ≥ 0.969; Figure 4 shows only the centroid query case). Figure 4A shows the query structure, Figure 4B shows the numbered scaffolds present in the patent from which the query structure was taken (own patent), Figure 4C shows the density distributions of CAScore for the own patent and the foreign patents, and Figure 4D shows the scaffolds from the foreign patents (the same layout applies to Figure 5). Both the own patent and the foreign patents contain a core with a quinoline and a triazolopyridine. If we compare the structures closest to the centroid query (Figure 4A) in the own patent (structure B1 in Figure 4B) and in the foreign patents (structure D1 in Figure 4D), the main difference is that the foreign-patent structure carries an additional attachment with a pyrrolidine group. Although these two scaffolds are well separated by the MFP2 fingerprint, MFP2 would mix some of the own and foreign patent scaffolds at larger distances. Structures B8, B13 and B14 have almost the same MFP2 dissimilarity value, whereas structures B6, B11, B12 and B15 have a larger dissimilarity than D1; using MFP2 would therefore eventually mix scaffolds from the different patents. CAScore only mixes some compounds from B9 and B11 with D1, since these have similar scaffolds and differ from the query mostly by the addition of a larger aliphatic ring (which for B9 and B11 is even larger than for D1).
Poor separation was observed for all three query types for patent WO 2015/004024 A1 from the MKNK1 target, with an AUCC of 0.72 (Figure 5). This arises because the own patent and the foreign patents contain compounds with a highly conserved scaffold. The score could still separate the scaffold belonging to the query (query: Figure 5A; its scaffold is B1 in Figure 5B) from the other scaffolds, with a CAScore range between 0 and 0.05. However, there is already an overlap between scaffolds B2 and D1 (Figure 5B and Figure 5D, respectively). Scaffolds B2 and B3 show small modifications that explore the position of the pyridine nitrogen or change the pyridine to a pyridone, while preserving the connectivity, stereochemistry and shape observed in the query structure. By comparison, D1 and D2 provide a similar type of small transformation by replacing the pyridine with a benzene, or even an indazole with a pyrazolopyridine. Since the changes relative to the query are of similar magnitude for B2 and D1 (a core-ring nitrogen is moved by one position in B2, and the same nitrogen is changed to carbon in D1), it would be a challenge to separate them completely. B11 and B2 have the same MFP2 dissimilarity of 0.31 to the query. CAScore identifies B11 as less similar to the query than the B2, B3, D1 and D2 scaffolds, because it changes both one sulphur to a nitrogen and the attachment position. Scaffolds B4 to B10 and B12 to B15 suggest scaffold extensions ranging from small and simple to large and complex modifications, such as the addition of an azetidine, piperidine, pyrrolidine, piperazine, morpholine or even an oxa-azaspiro ring. Although B7 and B9 provide the same piperazine extension, CAScore is able to distinguish and prioritise the B7 scaffold, which has the same amide stereochemistry as the query. Scaffolds D3 to D15 also suggest scaffold extensions of structure D1 and are observed in the CAScore range of 0.1 to 0.3.
The poor separation is hardly surprising given the proximity of the structures in these patents. In fact, merging them into one larger group can be seen as an advantage for the primary use case of CAScore, which is ranking de novo design compounds.
As mentioned in the introduction, the two most commonly used fingerprints for similarity searching are path-based and circular fingerprints, here represented by the RDK7 and MFP2 fingerprints. To illustrate the differences between CAScore and these fingerprints, we took a filtered set of the SureChemBL public database (see the Data Sets section) and computed the circa 7.6M MFP2 and RDK7 Tanimoto dissimilarity values within the clusters. These are plotted as a function of CAScore in Figure 6. Overall, CAScore and RDK7 have less skewed distributions than MFP2, which is shifted to higher dissimilarity values, mostly falling within the interval 0.31-0.86. However, care must be taken not to overinterpret the RDK7 plot, as the clusters were generated with the highly correlated ChemAxon path-based fingerprint. Thus, the highly dissimilar area (>0.8) for RDK7 is not populated (right-hand side of Figure 6). Note also that the plot contains a significant number of structure pairs for which the dissimilarity coefficient equals 1.0 for both MFP2 and RDK7. In many cases these involve atypical structures that the REOS and PAINS filters were not able to remove (Supporting Information).
Table 4 shows selected cases from the SureChemBL database where CAScore and the two hashed fingerprints do not agree well in terms of similarity. As highlighted above, the fingerprint methods are bounded between 0 and 1, whilst CAScore has no upper bound on dissimilarity. Thus, an exact comparison is difficult. For guidance, we have found a CAScore of ~0.4 to be a reasonable threshold for compounds belonging to the same chemical series. The corresponding dissimilarity thresholds for MFP2 and RDK7 are around 0.3 and 0.2, respectively [37]. These structures show the drawbacks of the hashed fingerprints discussed in the introduction. Example 1 shows a pair of smaller molecules for which the MFP2 Tanimoto coefficient gives a large dissimilarity (a good example of the quantization issue). CAScore correctly determines a small difference between these two compounds, whereas the path-based RDK7 considers the two structures to be the same. Example 2 shows a pair of compounds with a couple of slightly different substituents and different connectivity. Both CAScore and RDK7 give reasonable coefficients; MFP2, however, gives a very low similarity for the two, mainly due to the multiple unique features issue and the different connectivity, to which MFP2 is sensitive. Example 3 is an excellent case of the feature repetition issue. Although SCHEMBL13286927 is almost twice as large as SCHEMBL2914696, both their MFP2 and RDK7 dissimilarity coefficients are relatively small, suggesting closely similar structures, whereas CAScore gives a larger dissimilarity between them. Examples 4 and 5 show the disconnection issue for MFP2 and RDK7, respectively. MFP2 is not able to distinguish the structures in example 4, since a radius of 2 captures the same fragments with the same neighbours in both, whereas in example 5 RDK7 is unable to distinguish them. Finally, example 6 shows the dissimilarity between different tautomeric forms of the same structure.
Like the fingerprints, CAScore is sensitive to the tautomeric form. Using CAScore (as well as RDK7 and MFP2) therefore requires tautomer standardization in such cases.
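The feature repetition issue can be reproduced with a toy calculation. The sketch below compares a presence/absence (binary) Tanimoto dissimilarity with a count-aware variant on hypothetical feature lists. The feature names are invented for illustration and do not come from MFP2 or RDK7, and the count-aware variant is shown only for contrast; it is not the CAScore formula.

```python
from collections import Counter


def binary_tanimoto_dissimilarity(features_a, features_b):
    """1 - |A ∩ B| / |A ∪ B| on the sets of distinct features (counts ignored)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def count_tanimoto_dissimilarity(features_a, features_b):
    """1 - sum(min counts) / sum(max counts), so repeated features matter."""
    a, b = Counter(features_a), Counter(features_b)
    keys = set(a) | set(b)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 0.0
    numerator = sum(min(a[k], b[k]) for k in keys)
    return 1.0 - numerator / denominator


# Hypothetical features: the larger molecule repeats every feature of the smaller one.
small = ["benzene", "amide", "benzene-amide"]
large = small * 2

binary_d = binary_tanimoto_dissimilarity(small, large)
count_d = count_tanimoto_dissimilarity(small, large)
print(f"binary: {binary_d:.2f}, count-aware: {count_d:.2f}")
```

In this toy case the binary representation sees two identical feature sets, even though one molecule is twice the size of the other, which mirrors the behaviour described for example 3.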
The primary driver for developing CAScore was the problem of identifying, within a large set of de novo generated molecules, the compounds that are suitable for further analysis and selection. To illustrate this use case and to compare with the standard practice of using path-based and circular fingerprints, we present a case study exploring the chemical space of the ChEMBL database near Betrixaban, a factor Xa (fXa) inhibitor. The fXa compounds were removed from the ChEMBL database to form an independent fXa dataset. The recovery of compounds from the fXa dataset versus the ChEMBL database is shown in Figure 7. CAScore is the only dissimilarity measure that ensures a consistently higher percentage of fXa dataset enrichment over the ChEMBL database. For a 20% fXa dataset enrichment, CAScore includes only 8% of the ChEMBL database, while the MFP2 and RDK7 fingerprints incorporate more than 57% and 55% of the ChEMBL database, respectively. Dataset enrichment with the standard fingerprint methods thus comes at the expense of the chemical space covered by the ChEMBL database, diluting the compounds of interest for factor Xa when MFP2 and RDK7 are used.
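The comparison above can be sketched as a small calculation: find the dissimilarity threshold at which a given fraction of the target dataset is recovered, then count how much of the background database falls under the same threshold. This is a minimal sketch under the assumption that "enrichment" here means the fraction of the fXa dataset recovered; the function name and all score values are hypothetical and are not the data behind Figure 7.

```python
import math


def background_fraction_at_recovery(target_scores, background_scores, recovery):
    """Return (threshold, fraction of background selected) at a given target recovery.

    The threshold is the smallest dissimilarity that recovers at least
    `recovery` (0 < recovery <= 1) of the target compounds.
    """
    if not 0.0 < recovery <= 1.0:
        raise ValueError("recovery must be in (0, 1]")
    if not target_scores or not background_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(target_scores)
    # round() guards against floating-point noise before taking the ceiling
    n_needed = math.ceil(round(recovery * len(ranked), 9))
    threshold = ranked[n_needed - 1]
    selected = sum(1 for score in background_scores if score <= threshold)
    return threshold, selected / len(background_scores)


# Hypothetical dissimilarities to the query (lower = more similar).
fxa = [0.05, 0.10, 0.22, 0.30, 0.41, 0.50, 0.62, 0.70, 0.85, 0.90]
chembl = [0.08, 0.35, 0.45, 0.55, 0.60, 0.75, 0.80, 0.95, 1.10, 1.30]

threshold, fraction = background_fraction_at_recovery(fxa, chembl, recovery=0.2)
print(f"threshold {threshold:.2f} selects {fraction:.0%} of the background")
```

A score that discriminates well keeps the background fraction low at a given recovery level, whereas a poorly discriminating score pulls in a large part of the background before reaching the same recovery.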
CAScore extracts 178 molecules from the fXa dataset, whereas MFP2 and RDK7 fingerprints prioritise only 24 and 22 molecules respectively from the same dataset. Figure 8 illustrates the diversity of scaffolds selected by each approach at the above dissimilarity thresholds. A comparison of structural modifications against the original Betrixaban scaffold was explored to see the quality of the ranking.
The A1 scaffold selected by CAScore is equivalent to the B2 and C1 scaffolds selected by the MFP2 and RDK7 fingerprints. CAScore identified and prioritised 104 molecules with this scaffold, while MFP2 and RDK7 identified only 22 and 21 molecules, respectively. The exploration of the near SAR space is therefore more exhaustive with CAScore than with the reference methods. Scaffold B1 is prioritised by the MFP2 fingerprint as the most similar to Betrixaban. This illustrates the feature repetition issue outlined above: MFP2 is insensitive to the repetition of the para-substituted benzene, so this scaffold appears highly similar to Betrixaban for this fingerprint. In comparison, the same scaffold is selected by CAScore as A4 and is not selected at all by the RDK7 fingerprint. The number of molecules identified by CAScore for this scaffold (24) is also higher than for MFP2 (1). This indicates that MFP2 is very sensitive to local modifications of compounds, which reduces the number of compounds selected to exemplify a scaffold. CAScore preferentially identifies scaffolds A2 and A3 over A4 (the latter is preferred by MFP2, vide supra). A2 shows a benzene to pyridine modification, whereas A3 shows an ortho- to meta-pyridine change, both being ring-level changes. These changes to the scaffold are penalized by standard hashed fingerprints, whereas CAScore still keeps them at relatively close similarity.
Scaffold A7 (equivalent to B3 and C2) is selected by all metrics as an interesting extension of the scaffold. CAScore prioritizes the five ring-addition scaffolds (A5, A6, A7, A9 and A10) based on their ring size and the presence of heteroatoms within the added ring systems (the added rings are azetidine, pyrrolidine, piperidine, hexamethyleneimine and morpholine, respectively). A8 is an interesting modification of the central region of the scaffold by a thiophene. A further 17 minor modifications are intended to suggest more extensive transformations of the scaffold from A11 to A28. All of these structures, however, allow for significant conservation of the original scaffold and present a CAScore between 0.26 and 0.30. This exercise showed that, at the cutoff given above, CAScore can retain more relevant structures than the other two fingerprints used.
In general, CAScore provides a good-quality ranking for categorising compounds that contain both side-chain and core changes compared with the query. Compared with the investigated fingerprints, this new score can incorporate more compounds, particularly those with small changes in the reference scaffold. For example, scaffolds A2, A3, A14 and A18 are very similar to the original scaffold, but all of them contain small modifications to it and all are penalized by MFP2 and RDK7. This penalization is particularly strong in the case of A14. Therefore, CAScore can prioritize compounds that are interesting for SAR exploration and would not be selected by the fingerprint-based methods.