Application of deep metric learning to molecular similarity

Graph-based methods are increasingly important in chemistry and drug discovery, with applications ranging from QSAR to molecular generation. Combining graph neural networks with deep metric learning concepts, we present a framework for quantifying molecular similarity based on learned embeddings, independent of any particular endpoint. Using a minimal definition of similarity and data from the ZINC database of public compounds, this work demonstrates the properties of the embedding and its suitability for a range of applications, among them a novel reconstruction loss method for training deep molecular auto-encoders. We also compare the performance of the embedding against standard practices, with a focus on known failure points and edge cases.


Introduction
Quantifying the similarity of chemical structures has been a widely used tool in drug discovery for decades [1], and has often been adopted as a design principle for lead optimization [2,3], under the assumption that similar molecules have a higher probability of exhibiting similar properties than dissimilar ones [4,5,6]. Indeed, the successful use of bioisosterism in drug development makes heavy use of the concept [7,8], to the point that similarity is sometimes defined as a consequence of the properties rather than the cause [9]. Most benchmarks for chemical structure similarity rely on this definition to compare methods [10,11,12], driven in part by the availability of public activity datasets [13]. Yet pitfalls such as so-called "activity cliffs" [14,15,16] should moderate confidence in the underlying principle. Furthermore, other use cases of similarity exist that are not captured by the similar-properties paradigm: patent mining and infringement prediction [17], building block selection for synthesis, retrosynthesis and scaffold hopping [18,19,20], molecular generation evaluation [21], etc. A "good" measure of similarity should ideally show equal performance across all these applications, never relying too much on any one definition or type of benchmark. On the practical side, similarity can be understood more generally as the combination of a molecular representation and an appropriate metric [3]. Today, the combination of two-dimensional molecular circular fingerprints [22,23]

Data preparation
Most of the processing after this point was done using BIOVIA Pipeline Pilot [47]. All compounds belonging to a GFRG cluster with fewer than 4 members were removed.
In the case of compounds belonging to GFRG clusters with more than 10k members, DFRG clusters were used in place of GFRG. For DFRG clusters, a maximum size of 20k members was established, with random subsampling performed on clusters above this limit.
For triplet construction, the negative example must be less similar to the reference than the positive. Selecting a very different compound is not optimal, since the size of the chemical space increases towards larger dissimilarities. Thus, while it would be correct to choose a negative control from any different cluster, choosing a compound that shares some features with the reference is more valuable to the training process. We therefore randomly selected the negative control from a different cluster than that of the reference, while requiring that their Reduced Graphs be identical. In this way, 12'361'633 triplets were created. A detailed schema of the data preparation can be seen in Figure 2.
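The cluster-based triplet selection described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the compound records, cluster labels, and Reduced Graph identifiers are hypothetical stand-ins for the GFRG/DFRG clustering output.

```python
import random
from collections import defaultdict

# Hypothetical records: (compound_id, cluster_label, reduced_graph_id).
# These are illustrative stand-ins for the GFRG clusters and Reduced Graphs.
compounds = [
    ("mol_a", 0, "RG1"), ("mol_b", 0, "RG1"), ("mol_c", 0, "RG1"), ("mol_d", 0, "RG1"),
    ("mol_e", 1, "RG1"), ("mol_f", 1, "RG1"), ("mol_g", 1, "RG2"), ("mol_h", 1, "RG2"),
]

def build_triplets(compounds, min_cluster_size=4, seed=42):
    """Anchor and positive come from the same cluster; the negative comes
    from a different cluster but shares the anchor's Reduced Graph."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    by_rg = defaultdict(list)
    for cid, cluster, rg in compounds:
        by_cluster[cluster].append((cid, rg))
        by_rg[rg].append((cid, cluster))
    triplets = []
    for cluster, members in by_cluster.items():
        if len(members) < min_cluster_size:
            continue  # clusters with fewer than 4 members are removed
        for anchor, rg in members:
            positives = [c for c, _ in members if c != anchor]
            negatives = [c for c, cl in by_rg[rg] if cl != cluster]
            if positives and negatives:
                triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets

triplets = build_triplets(compounds)
```

Note that anchors whose Reduced Graph appears in no other cluster (here `mol_g` and `mol_h`) yield no triplet, mirroring the constraint that a valid negative must exist outside the anchor's cluster.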

Model training
For all training and benchmarking purposes, the random seed was fixed at 42 for repeatability, and the hyperparameters were kept at their unoptimized default values to prevent bias. We used the DGL-LifeSci open-source framework for computations on graphs, and its message passing neural network implementation (MPNNPredictor) [48] as the model architecture. This type of model repeatedly accumulates bond as well as node information based on connectivity, and has been used to great effect in state-of-the-art QSAR applications [49]. For more details, hyperparameters, and training curves, please refer to the project's GitHub page.
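The triplet margin objective used during training can be written down compactly. Below is a plain-Python sketch of the loss (a minimal stand-in for a framework implementation such as `torch.nn.TripletMarginLoss`); the embedding vectors are illustrative lists, not outputs of the actual MPNN.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Penalizes the anchor being closer to the negative than to the
    positive by less than the margin: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# A well-separated triplet incurs no loss:
loss = triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])  # -> 0.0
```

Minimizing this loss over many triplets pulls same-cluster compounds together and pushes cross-cluster compounds apart by at least the margin, which is what shapes the embedding space.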

Benchmarks choice
The benchmarks for the present use case should optimally measure a number of things:
• The performance on popular applications; here, activity classification tasks such as the ones described in Riniker et al. [12].
Additionally, desired properties of an encoding come from its coupling with a metric. In particular, using a Euclidean distance metric on a well-defined Euclidean vector space gives rise to a number of interesting properties:
• Very fast querying and operations.
• Similarity can be defined with respect to geometric elements: around a barycentre, along a path between molecules, within a cone, etc.
• The space and metric together are unbounded in value for dissimilarity: there are many more ways of being dissimilar than similar, and the distance distribution can reflect that.
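The geometric operations above are trivial in a Euclidean embedding space. The sketch below illustrates a barycentre query; the 2-D vectors and compound names are hypothetical placeholders, not outputs of the trained model.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def barycentre(points):
    """Component-wise mean of a set of embedding vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

# Hypothetical embeddings of three reference molecules
refs = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
centre = barycentre(refs)  # [1.0, 1.0]

# Rank a candidate set by distance to the barycentre of the references
candidates = {"cand_a": [1.0, 1.5], "cand_b": [5.0, 5.0]}
ranked = sorted(candidates, key=lambda k: euclidean(candidates[k], centre))
```

The same pattern extends to the other queries mentioned: points along the segment between two molecules, or within a cone defined by a direction vector.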

Activity prediction tasks benchmarking
While an imperfect measure of fitness for any new chemical embedding, the dominance of benchmarking platforms built on a variety of activity prediction datasets makes them an obligatory step in evaluating any new contribution. In particular, this enables two separate conclusions to be reached:
1 Whether the information contained in the embedding is sufficient to fit models successfully, regardless of comparative performance.
2 Whether these models are statistically different from references, demonstrating the originality of the embedding.
To answer the second question, it is necessary to benchmark models on a suitably high number of instances for each class. For this purpose, a dataset of IC50 activities was extracted from the ChEMBL28 database. All targets with a unique structure
The resulting performance is shown in Figure 4, and answers our first point to our satisfaction.
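The second question can be addressed with a paired test over per-target scores. The sketch below uses an exact two-sided sign test; the per-target AUC differences are illustrative values, not the paper's results.

```python
from math import comb

def sign_test_p(deltas):
    """Exact two-sided sign test on paired score differences (ties dropped).
    Under H0, wins and losses are equally likely: Binomial(n, 0.5)."""
    wins = sum(1 for d in deltas if d > 0)
    losses = sum(1 for d in deltas if d < 0)
    n = wins + losses
    k = min(wins, losses)
    # Two-sided p-value: 2 * P(X <= k) under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative per-target AUC differences (embedding model minus reference)
deltas = [0.02, -0.01, 0.03, 0.01, 0.02, -0.02, 0.01, 0.04]
p = sign_test_p(deltas)
```

A small p-value would indicate that the embedding-based models rank targets systematically differently from the reference fingerprints, which is the "originality" criterion above; in practice a rank-based test such as the Wilcoxon signed-rank test is a common stronger alternative.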

Failure points of circular fingerprints
One noted effect of bit-string fingerprints is the skewing effect of molecule size on the distribution of similarities, as illustrated in Figure 6 of Flowers et al. [26]. Applying the same reference set of compounds for comparison over a diverse set of molecules using the MPNN learned embedding leads to much better-shaped distributions.
While the largest molecule has a more chaotic similarity profile (probably because the larger a structure, the more ways there are for something to be similar to it), the distribution otherwise appears independent of molecule size. This is shown in Figure 5.
Another point where fingerprints fail to accurately describe molecular similarity is the case of molecules with repeated motifs. When using Tanimoto similarity of circular fingerprints in bit-string form, the similarity quickly tapers off to a fixed nonzero value; the learned embedding is immune to this effect. Likewise, the insertion of moieties within a scaffold has an unduly small effect when it does not perturb the fragmentation of the structure by the fingerprints, but is correctly shown to matter a great deal by the embedding. In addition, the embedding retains the concepts of fragments and aromaticity.
Another critical desired property for a novel molecular distance measure is the ability to correctly compare partial and chemically invalid molecular graphs and to provide gradient information. This leads to the important fact that trained embeddings are essentially a differentiable reconstruction loss with a quadratic energy surface, with widespread potential applications. For example:
• Accelerated training of reconstruction-based molecular generators such as variational auto-encoders.
• Additional information in tasks such as missing edge and node prediction.
• Chemical subspace constraints for conditional molecular generators.
These tasks are deeply unsuitable for traditional fingerprint- or property-based similarity: for most of the training process, the molecular graphs on which computation happens are completely invalid, as the chemical information on what constitutes a molecule is still being accrued. Yet a learned embedding, as shown in Figure 8, is very robust to node and edge deletion, demonstrating a quasi-linear distance relationship with the number of deleted elements. This is an exciting property, and we look forward to seeing it explored further.
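The deletion-robustness probe can be sketched as a simple loop: delete graph elements one at a time and track the embedding distance to the intact structure. The `toy_embed` function below is a deliberately simplistic stand-in (a linear map of atom counts) used only to make the loop runnable; the actual experiment uses the trained MPNN embedding, and its quasi-linear behaviour is the empirical result reported in Figure 8, not a consequence of this toy.

```python
import math

def toy_embed(atoms):
    """Toy stand-in for a trained graph embedding: a fixed linear map of
    atom counts. The real probe would call the MPNN embedding instead."""
    counts = {a: atoms.count(a) for a in {"C", "N", "O"}}
    return [counts["C"] * 1.0, counts["N"] * 0.5, counts["O"] * 0.25]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Delete atoms one at a time and track distance to the intact structure.
# Even a chemically invalid partial graph still embeds without error.
atoms = list("CCCCCCNNOO")
reference = toy_embed(atoms)
distances = [
    euclidean(toy_embed(atoms[:-n_deleted]), reference)
    for n_deleted in range(1, 5)
]
```

Because the embedding distance is defined for any partial graph and increases smoothly with each deletion, it can serve as the differentiable reconstruction signal described above.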
Finally, a critical property of the embedding is its ability to be used in conjunction with transfer learning [54,55], and to be retrained on particular subsets of the chemical space according to tailored similarities obtained from SAR, Matched Molecular Pairs [56], or a more complex multi-parameter function. Such a retrained model would retain the general concepts of molecular graph similarity while quickly converging to a more appropriate representation of the problem at hand, thus sparing resources in training and data gathering.

Conclusions
We have shown that using the triplet margin loss jointly with molecular graph based

Declarations

Availability of data and materials
All code and data are available at https://github.com/DCoupry/ChemDist under an Apache 2 license (GlaxoSmithKline copyright) and are sufficient to reproduce our conclusions and graphs.

Figure 2
The process diagram of data preparation.

Figure 3
The architecture of the triplet loss embedding during training.

Figure 4
Performance in activity classification tasks from ChEMBL28.

Figure 5
Distribution of embedding distances of 5 reference compounds to a diverse set of 120k compounds from the ZINC database.

Figure 8
Effect of random element deletion on embedding distance. No comparison with ECFP4 could be obtained due to the overwhelming rate of invalidity of the resulting structures.