Comparison between common scaffold representations and “Molecular Anatomy” to perform SAR analysis
As shown in the Methods section for the COX-2 inhibitor dataset, scaffold representations with a high level of abstraction, shown in Fig. 1a-b for Polmacoxib, generally perform better than the others at identifying relevant chemotypes. Table 1 summarizes the results obtained for each representation in terms of the number of clusters generated, starting, on the one hand, from all 819 COX-2 inhibitors in preclinical development or a later phase and, on the other hand, from the subset of COX-2 inhibitors matching the MDL substructure reported in Fig. 2, the most common COX-2 inhibitor moiety. In particular, the table specifies the number of clusters containing molecules that match the common substructure and have either exactly 3 rings or more than 3 rings.
Representation 1a clusters together most of the well-known marketed drugs, such as valdecoxib and celecoxib, as well as many other leads and experimental drugs, and collapses all 142 active molecules with exactly 3 rings into a single cluster. This cluster likely also includes several inactive molecules. Interestingly, even when this representation is used, almost 40% of the structural scaffold information, corresponding to the molecules with additional rings, would still be lost in unrelated clusters, impairing the identification of the most relevant additional structural information.
Using the less abstracted representation 1d, we can retrieve and distinguish the most diverse COX-2 inhibitor scaffolds, although this information is distributed over 84 clusters when those with exactly 3 rings and those with more than 3 rings are counted together. An intermediate representation such as 1b, in which only the atom type information is removed, allows a more effective clustering of the relevant structural information, identifying only 11 different frameworks containing molecules with exactly 3 rings instead of 31; however, the two representations generate almost the same number of clusters containing molecules with more than 3 rings (48 instead of 53).
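The effect of the abstraction level on the number of clusters can be illustrated with a minimal sketch. It assumes that a scaffold key has already been computed for each molecule at each abstraction level; the record layout, the molecule identifiers and the key strings ("W1", "F1", ...) are hypothetical placeholders and do not come from the COX-2 dataset.

```python
from collections import defaultdict

# Hypothetical records: each molecule carries a precomputed scaffold key
# for two abstraction levels (1a = most abstracted, 1d = less abstracted).
molecules = [
    {"id": "mol1", "1a": "W1", "1d": "F1"},
    {"id": "mol2", "1a": "W1", "1d": "F2"},
    {"id": "mol3", "1a": "W1", "1d": "F2"},
    {"id": "mol4", "1a": "W2", "1d": "F3"},
]


def cluster_by(records, level):
    """Group molecule ids by their scaffold key at one abstraction level."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[level]].append(rec["id"])
    return dict(clusters)


coarse = cluster_by(molecules, "1a")  # fewer, larger clusters
fine = cluster_by(molecules, "1d")    # more, smaller clusters
print(len(coarse), len(fine))
```

In this toy example the more abstracted key merges three molecules into one cluster, while the less abstracted key splits them into two, mirroring the trade-off discussed above.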
This example on COX-2 inhibitors clearly shows how strongly this kind of analysis depends on the nature of the dataset: each scaffold abstraction in Fig. 1 provides some important structural information, but none of them is sufficient on its own to capture the complexity of a heterogeneous ensemble of molecules. Only the integration of the information captured by the different scaffold abstractions, in a Multi-Dimensional Hierarchical Scaffold Analysis, makes it possible to map effectively the entire chemical space of multi-scaffold libraries. Furthermore, the combination of the “Molecular Anatomy” approach, the fragmentation rules and the network representation makes it possible to focus immediately on the most interesting and useful structural information, easily navigating among several structural clusters and moving from one molecular framework to another on the basis of their hierarchy and according to the SAR.
Attempts to identify more relevant chemical moieties have been presented in the past, for example the rule-based decompositions proposed by Schuffenhauer et al. [29], schematized in Fig. 7 for three COX-2 inhibitor scaffolds. However, a clear limitation lies in the difficulty of defining a priori a set of rules that remains generally consistent with SAR information.
The method that we propose, involving the combination of correlated molecular frameworks and fragments, is able to efficiently identify relevant chemical moieties, and to cluster together different molecular species showing similar biological activity (also in the nanomolar range) within HTS campaigns, capturing most of the SAR information.
To fully exploit the hierarchical correlation among the molecular frameworks and to generate a full graphical representation of the analyzed dataset, we also propose a network visualization. Indeed, the combination of the MF approach with a network representation provides a more convenient tool for SAR evaluation and visualization [30–33], guiding the user from one molecular framework to another on the basis of their hierarchy, in the direction of increasing or decreasing level of abstraction, and according to the SAR.
Figure 8 shows the complete network obtained for the dataset of 819 COX-2 inhibitors. As reported in the list of statistical parameters (Fig. 8b), 280 connected components were generated, corresponding to the clusters obtained using the most abstracted (basic wireframe) representation. The biggest cluster is clearly visible at the top of Fig. 8a and corresponds to the 142 molecules with exactly 3 rings (Table 1), all sharing the basic wireframe 1a. Figure 8c reports the hierarchical visualization of a smaller cluster, to further show that this graphical representation of the data matrix consists of an oriented (directed) network in which the nodes are, in general, molecular frameworks and the direction of each edge follows the increasing abstraction level of the molecular representations.
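The link between connected components and basic wireframe clusters can be sketched in a few lines. The example below is only an illustration under stated assumptions: node names are hypothetical, each edge points from a less abstracted node to a more abstracted one, and components are counted ignoring edge direction with a simple union-find rather than with the network software used for the figures.

```python
def count_components(nodes, edges):
    """Count connected components, ignoring edge direction (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        parent[find(src)] = find(dst)
    return len({find(n) for n in nodes})


# Hypothetical toy network: molecules -> framework -> basic wireframe.
nodes = ["molA", "molB", "frameAB", "wireAB", "molC", "wireC"]
edges = [
    ("molA", "frameAB"),
    ("molB", "frameAB"),
    ("frameAB", "wireAB"),
    ("molC", "wireC"),
]
n_components = count_components(nodes, edges)
print(n_components)  # 2: one component per basic wireframe
```

Because every molecule and intermediate framework in this toy network ultimately points to exactly one basic wireframe, the number of components equals the number of basic wireframes.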
Furthermore, it is possible to retrieve the relationships among the diverse representations within this cluster and, focusing on the most interconnected frameworks, to identify the structural characteristics representative of the active molecules, as shown in Fig. 9. On the other hand, the network visualization clearly shows the high number of singletons that would be dispersed if only representation 1a were considered. Here, thanks to the fragmentation, these singletons can be related to each other if they contain the same fragments, making it easy to verify whether they share characteristics with relevant clusters of actives.
Focusing on the fragments related to the basic wireframe representation, all the clusters identified in Fig. 8a can be connected to each other in a single network, as shown in Figure S1.
Furthermore, Figure S2 shows the two fragments, cyclohexane and cyclopentane, with the highest indegree values, that is, the highest number of fragments connected to them within the network in Figure S1.
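As a minimal sketch of how the indegree ranking can be obtained, the example below counts incoming edges per node in a small directed edge list. The edge list is invented for illustration, and it assumes that each edge points from a larger fragment to a smaller fragment it contains; only the names cyclohexane and cyclopentane are taken from the text.

```python
from collections import Counter

# Hypothetical directed edges (source fragment -> contained fragment).
edges = [
    ("fragment_1", "cyclohexane"),
    ("fragment_2", "cyclohexane"),
    ("fragment_3", "cyclohexane"),
    ("fragment_2", "cyclopentane"),
    ("fragment_4", "cyclopentane"),
    ("fragment_4", "fragment_1"),
]

# Indegree of a node = number of edges pointing to it.
indegree = Counter(dst for _, dst in edges)
top_two = [name for name, _ in indegree.most_common(2)]
print(top_two)  # ['cyclohexane', 'cyclopentane']
```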
Some qualitative considerations can be made about the networks obtained. First, it is reasonable that highly connected singletons tend to be small fragments shared by a large number of molecules in the library (as shown in Figure S2). Conversely, low molecular weight singletons involved in a small number of connections represent potentially interesting decorations of a specific group of the original molecules. If this group is enriched in a specific activity of interest, the corresponding singleton fragments connecting all the molecules in the group could represent a pharmacophore. Second, high molecular weight singleton fragments connecting clusters of molecules with enriched activity could represent chemical scaffolds or the “minimal chemical entity” that confers the selected activity on the cluster. Third, the meaning of the singletons constituting the networks may change according to the fragmentation rules used. While the approach suggested here consists of a purely informatics-based fragmentation procedure, an alternative method is possible in which singletons consist of reaction intermediates derived by applying retrosynthetic rules to the original molecules. In other words, in this case the network would contain, as “fragments”, the precursors used to synthesize larger molecules and, as pathways connecting pairs of singletons, possible synthetic strategies for attaching a specific interesting low molecular weight singleton to another one representing, for example, a scaffold.
In our experience, the “Molecular Anatomy” approach makes it easier to decipher the connections between chemotypes. In particular, filtering by EF and ranking each cluster by its number of connections focuses the analysis on the highly connected singletons. These frameworks are highly relevant because they connect different chemotypes without overlapping fragments and could therefore suggest the most significant parts of active molecules, the fragments that could be exchanged, and the bond orders and atom types relevant for SAR derivation. This approach also makes it possible to include in the SAR analysis molecules that are usually underestimated because they are singletons, or compounds with low ligand efficacy, which are here connected to relevant clusters corresponding to specific series of compounds. In this way, valuable information could be added to the SAR of the major hit series by connecting them to additional latent ones [34]. This method could be considered an extension of the previously proposed compound set enrichment [17, 35, 36], based on the implementation of a higher level of abstraction and potentially able to identify new hit series connected with the conventional one.
Case study: SAR analysis of an HTS campaign on HDAC7
To better illustrate the molecular scaffold representations and the fragmentation rules that we introduced, and to clarify the advantages of using the proposed network visualization for SAR evaluation, we present as a case study the SAR analysis of an HTS campaign on HDAC7 performed on 26092 compounds.
First, the set of nine molecular frameworks at different abstraction levels was generated for the entire dataset. For each of the nine frameworks, the EF was calculated (according to the formula provided in the SI) from the inhibition data of the corresponding molecules; molecules were considered active if they belonged to the moderate, strong or very strong activity classes (Table S1).
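The EF formula itself is given in the SI and is not reproduced in this section. Purely as an illustration, the sketch below assumes the commonly used definition of an enrichment factor, namely the fraction of actives in a cluster divided by the fraction of actives in the whole dataset; this definition, the function name and the "inactive" label are assumptions, while the three active class names follow the text.

```python
# Activity classes treated as active, as stated in the text.
ACTIVE_CLASSES = {"moderate", "strong", "very strong"}


def enrichment_factor(cluster_classes, n_actives_total, n_total):
    """Assumed EF: (actives in cluster / cluster size) / (actives / total)."""
    if not cluster_classes or n_actives_total == 0:
        return 0.0
    n_active = sum(1 for c in cluster_classes if c in ACTIVE_CLASSES)
    hit_rate_cluster = n_active / len(cluster_classes)
    hit_rate_overall = n_actives_total / n_total
    return hit_rate_cluster / hit_rate_overall


# Hypothetical cluster of four molecules in a dataset of 100 with 10 actives.
cluster = ["strong", "inactive", "moderate", "inactive"]
ef = enrichment_factor(cluster, n_actives_total=10, n_total=100)
print(round(ef, 2))  # 5.0: the cluster is five times richer in actives
```

Under this assumed definition, a framework whose molecules are all inactive has an EF of 0, and an EF above 1 indicates enrichment relative to the dataset as a whole.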
Figure 10 shows the complete network obtained for this dataset with Cytoscape, as described in the Methods section. This dataset is clearly a more complex case study than the previous one and was therefore chosen to show the potential of our approach. In total, 3061 connected components were generated, corresponding to the clusters obtained using the most abstracted (basic wireframe) representation.
The most interesting basic wireframes in terms of SAR evaluation were selected (Figure S3) by filtering for the highest values of EF and number of connected active molecules, in order to focus the analysis on the abstracted scaffolds accounting for the most actives.
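A selection of this kind can be sketched as a simple filter followed by a sort. The records, identifiers, values and the cut-off of two wireframes below are hypothetical, and the ordering (EF first, number of connected actives as a tie-breaker) is one possible choice rather than the exact criterion used for Figure S3.

```python
# Hypothetical summary records for four basic wireframes.
wireframes = [
    {"id": "wf_A", "ef": 4.2, "n_actives": 12},
    {"id": "wf_B", "ef": 0.0, "n_actives": 0},
    {"id": "wf_C", "ef": 4.2, "n_actives": 30},
    {"id": "wf_D", "ef": 1.1, "n_actives": 5},
]

# Keep wireframes with at least some enrichment, then rank by EF and,
# for equal EF, by the number of connected active molecules.
ranked = sorted(
    (w for w in wireframes if w["ef"] > 0),
    key=lambda w: (w["ef"], w["n_actives"]),
    reverse=True,
)
top_ids = [w["id"] for w in ranked[:2]]
print(top_ids)  # ['wf_C', 'wf_A']
```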
Figure 11a reports the network corresponding to one of these selected clusters, using a hierarchical layout for better visualization. The complexity of this specific network is due to the high number of nodes corresponding to all the molecules (on top, in light blue) and the related molecular frameworks (all other nodes) matching the basic wireframe reported in Fig. 11c. This complex network can, however, be considerably simplified by removing the nodes with an EF value equal to 0, that is, the nodes connected only to inactive molecules. Applying this filter yields a clearer and more useful oriented network (Fig. 11b) that retains only the most relevant information in the dataset. In this way, it is possible to easily extract only the pathways that are interesting in terms of SAR analysis, starting from a huge number of connections that ensure a complete evaluation of the structural information.
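The filtering step can be sketched as follows, assuming that an EF value is already available for each framework node. Node names and EF values are hypothetical; an edge is kept only if both of its end nodes survive the filter.

```python
def drop_zero_ef(ef_by_node, edges):
    """Remove nodes with EF equal to 0 and every edge touching them."""
    kept = {node for node, ef in ef_by_node.items() if ef > 0}
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, kept_edges


# Hypothetical sub-network below one basic wireframe ("wire_1").
ef_by_node = {
    "wire_1": 2.0,
    "decwire_1a": 3.5,
    "decwire_1b": 0.0,
    "frame_1a": 4.0,
    "frame_1b": 0.0,
}
edges = [
    ("decwire_1a", "wire_1"),
    ("decwire_1b", "wire_1"),
    ("frame_1a", "decwire_1a"),
    ("frame_1b", "decwire_1b"),
]
kept_nodes, kept_edges = drop_zero_ef(ef_by_node, edges)
print(sorted(kept_nodes))
print(kept_edges)
```

In this toy case the branch through "decwire_1b" disappears entirely, leaving a single path from the most detailed framework to the basic wireframe.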
In more detail, starting from the selected basic wireframe (Fig. 11c), the network visualization makes it possible to identify two more interesting sub-clusters, corresponding to the decorated wireframes (definition in Fig. 3) reported in Fig. 12. The EF values of these decorated wireframes are higher than that of the basic wireframe they share, meaning that this approach allows us to focus on specific characteristics of the active compounds. Furthermore, it is also possible to move to the less abstracted representation within the network, the decorated frameworks also reported in Fig. 12, which provide information about the bond order characteristics common to the active compounds. Continuing in this way, moving back through the network toward the lowest abstraction level, it is possible to visualize the original molecules.
A first interesting consideration about these results concerns the introduction of decorations in our scaffold representations: defining a description level in which protruding bonds are added to the basic scaffold makes it possible to better identify and distinguish the requirements essential for activity. This point is clearly shown in Figs. 11 and 12, where, moving from the basic wireframe to the decorated wireframes with higher values of EF and number of connections, it is possible to retrieve all the clusters containing the active molecules. On the other hand, 12 decorated wireframes and 37 decorated frameworks are found to be in common with inactive molecules, which is further useful information for rationalizing which scaffold decorations are responsible for a decrease or even loss of activity.
Finally, we want to show how the most useful SAR information can be obtained by extending the analysis and the network to the fragments. When the fragmentation rules are applied to the dataset, the network visualization of the fragmented library interconnects all the molecular frameworks containing the same fragment, and the EF can be recalculated for each fragment according to the activity data of all the molecules connected via the corresponding molecular frameworks.
In particular, focusing on the interesting structures identified above, Fig. 13 reports the same scheme as Fig. 12, with the EF values recalculated considering all the clusters identified by molecular frameworks that are superscaffolds of the scaffolds shown (superframeworks).
To better explain this step, Fig. 14 reports, as an example, one decorated wireframe from Fig. 13 and the five corresponding decorated wireframes retrieved in the fragmented library that contain it as a fragment. For each of these decorated wireframes the EF value is reported, and that of the central wireframe, here treated as a fragment, is recalculated by adding the contribution of the other five. Comparing Figs. 12 and 13, it is possible to identify the molecular frameworks whose EF increases when they are considered as fragments and which therefore contain relevant structural characteristics of active molecules.
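The recalculation can be illustrated with a small sketch. It assumes, as a simplification, that the EF of a fragment is obtained by pooling the molecules of the central framework with those of its superframeworks and applying the common enrichment factor definition (fraction of actives in the pooled set divided by the fraction of actives in the dataset). Both the pooling rule and the EF definition are assumptions made for illustration, and all identifiers are hypothetical.

```python
def pooled_ef(clusters, actives, n_total):
    """Assumed EF computed over the union of several sets of molecule ids."""
    pooled = set().union(*clusters)  # a molecule is counted only once
    if not pooled or not actives:
        return 0.0
    hit_rate_pooled = len(pooled & actives) / len(pooled)
    hit_rate_overall = len(actives) / n_total
    return hit_rate_pooled / hit_rate_overall


# Hypothetical dataset of 20 molecules, 4 of which are active.
actives = {"m1", "m2", "m5", "m6"}
central = {"m1", "m2", "m3", "m4"}   # molecules of the central wireframe
superframework = {"m5", "m6"}        # molecules of one superframework

ef_alone = pooled_ef([central], actives, n_total=20)
ef_as_fragment = pooled_ef([central, superframework], actives, n_total=20)
print(round(ef_alone, 2), round(ef_as_fragment, 2))
```

In this toy case the EF of the central wireframe increases once the molecules of its superframework are included, which is the situation highlighted by the comparison of Figs. 12 and 13.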
We observe that, among the nine molecular frameworks, the decorated wireframe turned out to be the most useful representation for obtaining SAR information in this particular case study. More generally, we conclude that integrating all molecular frameworks and fragments in the network visualization is crucial for capturing the most relevant information in the analysis of compound libraries.