Accurate prediction of molecular targets using a self-supervised image representation learning framework

The clinical efficacy and safety of a drug are determined by its molecular targets in the human proteome. However, proteome-wide evaluation of all compounds in humans, or even in animal models, is challenging. In this study, we present an unsupervised pre-training deep learning framework, termed ImageMol, pre-trained on 8.5 million unlabeled drug-like molecules, to predict molecular targets of candidate compounds. The ImageMol framework is designed to pretrain chemical representations from unlabeled molecular images based on the local and global structural characteristics of molecules at the pixel level. We demonstrate the high performance of ImageMol in evaluating molecular properties (i.e., a drug's metabolism, brain penetration and toxicity) and molecular target profiles (i.e., human immunodeficiency virus) across 10 benchmark datasets. ImageMol shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS), and we re-prioritized candidate clinical 3CL inhibitors for potential treatment of COVID-19. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.


A.1 Datasets
We used three different types of datasets for the molecular property prediction, drug metabolism prediction, and anti-viral activity prediction tasks.

Datasets of Molecular Property Prediction
The statistical information of the MPP datasets is described in Table S1; they include the BBBP, Tox21, HIV, ClinTox and BACE datasets: the BBBP (Blood-Brain Barrier Penetration) dataset includes binary-classification records of barrier permeability between blood and brain for more than 2,000 compounds.

Tox21
(Toxicology in the 21st Century) is a compound toxicity dataset, including qualitative toxicity measurements for about 8,000 compounds on 12 different targets.
The HIV (Human Immunodeficiency Virus) dataset contains more than 40,000 records of whether a compound inhibits HIV replication, for binary classification between active and inactive.
ClinTox (Clinical trial Toxicity) dataset includes 1491 drug compounds with known chemical structures for the binary classification between clinical trial toxicity (or absence of toxicity) and FDA approval status.

Datasets of Drug Metabolism Prediction
The PubChem Data Set I (Training Set) and PubChem Data Set II (Validation Set) are used as the DMP (Drug Metabolism Prediction) datasets, which include binary-classification records of whether compounds are cytochrome P450 inhibitors. The five major CYP isoforms of cytochrome P450 are used, namely 1A2, 2C9, 2C19, 2D6 and 3A4. The statistical information is described in Table S2.

Datasets of Anti-Viral Activity Prediction
The statistical information of the processed datasets is shown in Table S6. In addition, for a fair comparison with other methods, we also used the SARS-CoV-2 datasets in [1] to train some models of anti-SARS-CoV-2 activities; their statistical information is shown in Table S13.

A.2 Selection of K in K-means
In the clustering pseudo-label classification task, we set the values of K to 100, 1000 and 10000, respectively. To determine the values of K in the K-means method, we first used different K values, ranging from 1 to 14,000, to cluster the dataset and calculated the sum of squared distances for each.
Then, we plotted a curve with K on the x-axis and the sum of squared distances on the y-axis. Finally, a knee point detection algorithm [2] was used to find the knee point of this curve. As shown in Figure S15, the dotted line indicates the K value corresponding to the "elbow" point. Obviously, the larger the K value, the more difficult it is for ImageMol to perform the clustering pseudo-label classification task. Therefore, we selected two values (K = 100 and K = 1000) on the left side of the "elbow" point and one value (K = 10000) on the right side of the "elbow" point.
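As an illustration, the selection procedure above can be sketched as follows. This is a simplified stand-in for the knee point detection algorithm of [2] (maximum deviation from the chord joining the curve's endpoints, after normalizing both axes), and the synthetic SSE curve is purely illustrative, not the paper's data:

```python
import numpy as np

def find_knee(k_values, sse):
    """Return the index of the 'elbow': the point of the normalized
    SSE-vs-K curve farthest from the straight line between its endpoints."""
    x = np.asarray(k_values, dtype=float)
    y = np.asarray(sse, dtype=float)
    x = (x - x[0]) / (x[-1] - x[0])   # normalize K to [0, 1]
    y = (y - y[0]) / (y[-1] - y[0])   # normalize SSE to [0, 1] (flips a falling curve)
    return int(np.argmax(np.abs(y - x)))  # max deviation from the chord y = x

# Toy falling SSE curve with a sharp bend at small K (illustrative only).
ks = np.arange(1, 21)
sse = 1000.0 / ks**2
knee_k = ks[find_knee(ks, sse)]  # candidate K values are then picked around knee_k
```

Values of K on either side of the detected knee can then be chosen, as done above with K = 100, 1000 and 10000.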

A.3 Hyperparameters of pre-training and fine-tuning
The hyperparameters of the pre-training and fine-tuning process are shown in

B.1 Molecular image and fingerprint generation
In this study, we use images as the molecular representation. To obtain molecular images, RDKit (https://github.com/rdkit/rdkit) is used to generate a standard and unique image [3] for each molecule. Unlike a molecular graph, a molecular image is composed of pixels rather than vertices and edges. In detail, we first filter out molecules without SMILES information in the original dataset. Second, we transform the SMILES sequences into molecular images using RDKit and set the image size to 224 × 224. Finally, these molecular images of the same size are used as the initial dataset of our method. Considering that molecular fingerprints are easy to obtain and can express some prior knowledge about molecules, we chose MACCS keys to assist our pre-training process so that our model learns molecule-related prior knowledge. The MACCS keys are one of the commonly used structural molecular fingerprints [4], containing 166 keys related to molecular structure. In our work, we used RDKit to generate a distinct 166-D molecular fingerprint for each molecule.
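A minimal sketch of this preprocessing step, using RDKit's standard API (the helper function name is ours; note that RDKit's MACCS vector is 167 bits long, with bit 0 unused, corresponding to the 166 keys mentioned above):

```python
from rdkit import Chem
from rdkit.Chem import Draw, MACCSkeys

def molecule_to_image_and_fp(smiles, size=(224, 224)):
    """Render a SMILES string as a fixed-size 2D depiction and compute
    its MACCS fingerprint, mirroring the preprocessing described above."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:            # filter out invalid SMILES, as in the first step
        return None, None
    img = Draw.MolToImage(mol, size=size)   # PIL image, 224 x 224 pixels
    fp = MACCSkeys.GenMACCSKeys(mol)        # 167-bit vector; bit 0 is unused
    return img, list(fp)

img, fp = molecule_to_image_and_fp("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```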

B.2 Pre-task details in pre-training
This section describes the pre-training details of ImageMol with its five pretasks. The overall data flow of the ImageMol framework during training is shown in Figure S17. In general, the original input images x are processed into three different datasets. Augmented images x_aug are obtained by applying data augmentation to x, including RandomHorizontalFlip(), RandomGrayscale(p=0.2) and RandomRotation(degrees=360) from torchvision.
Shuffled images x_shu are obtained by performing a jigsaw puzzle on x_aug. The puzzle rule uses "permutations 100" from [5]. Then, a batch of data is randomly selected from these three datasets and input into a ResNet18 without its classification layer to extract the 512-D latent features of the original, augmented and shuffled images. Finally, these latent features are input into the sub-network of each task for further processing.
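The jigsaw operation above can be sketched in NumPy as follows. This is an illustrative version using a single random permutation of the 3 × 3 grid; the fixed set of 100 permutations from [5] is not reproduced here:

```python
import numpy as np

def jigsaw_shuffle(image, grid=3, rng=None):
    """Decompose an (H, W, C) image into grid x grid patches and permute
    them, producing a shuffled view like those fed to the jigsaw task.
    Returns the shuffled image and the permutation used as its label."""
    rng = np.random.default_rng(rng)
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i*h:(i+1)*h, j*w:(j+1)*w]
               for i in range(grid) for j in range(grid)]
    perm = rng.permutation(grid * grid)      # pseudo-label for the JPP task
    rows = [np.concatenate([patches[perm[r*grid + c]] for c in range(grid)], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm

# Tiny 6 x 6 toy image so each of the 9 patches is 2 x 2.
img = np.arange(36).reshape(6, 6, 1)
shuffled, perm = jigsaw_shuffle(img, rng=0)
```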
Specifically, Figures S1–S5 show the architecture of each pre-training strategy.
In the multi-granularity chemical clusters classification (MG3C) task (Figure S1), chemical fingerprints are first extracted from SMILES and input into unsupervised K-means with different K values to produce clusters of different structural granularity. Then, these clusters are treated as pseudo-labels of the molecular images. Finally, the molecular encoder and structural classifier are jointly used to predict the labels of molecular images, optimizing the loss between pseudo-labels and predicted labels during pre-training. The structural classifier is a multi-task learner that receives 512-dimensional features as input and forward-propagates them to three fully connected layers with different numbers of neurons (100, 1000 and 10000) for classifying the different clustering granularities.
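A minimal PyTorch sketch of this structural classifier (the class name is ours; only the 512-D input and the three head widths come from the text above):

```python
import torch
import torch.nn as nn

class StructuralClassifier(nn.Module):
    """Multi-task head for the MG3C task: one fully connected layer per
    clustering granularity (K = 100, 1000, 10000), all reading the same
    512-D latent feature from the molecular encoder."""
    def __init__(self, feat_dim=512, ks=(100, 1000, 10000)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, k) for k in ks)

    def forward(self, latent):
        # One logit vector per granularity; each head is trained against
        # the pseudo-labels produced by K-means at the matching K.
        return [head(latent) for head in self.heads]

clf = StructuralClassifier()
logits = clf(torch.randn(4, 512))   # batch of 4 encoder features
```

During pre-training, each head would contribute a cross-entropy term against its own K-means pseudo-labels, and the three losses are summed.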
In the molecular rationality discrimination (MRD) task (Figure S2), we first disrupt the molecular structure to construct an irrational molecular image: a 3 × 3 grid decomposes the molecular image into 9 patches, which are randomly shuffled to form an irrational image. The original images are viewed as rational molecular images. Then, these rational and irrational molecules are input to the molecular encoder to extract 512-D features. Finally, these features are forward-propagated to the rationality classifier for rationality judgment. The rationality classifier is a simple MLP that takes a 512-dimensional feature as input and directly outputs a 2-dimensional result (rational or irrational).
In the jigsaw puzzle prediction task (Figure S3), the molecular encoder is used to extract features of the rearranged images, which are subsequently input into the jigsaw classifier to predict the permutation (a 100-way classification). The jigsaw classifier is a simple MLP consisting of a 512-dimensional input layer and a 100-dimensional output layer.
In the MASK-based contrastive learning (MCL) task (Figure S4), we randomly mask a 16 × 16 region of the molecular image and fill it with the mean of the image (masked examples are shown in Figure S16). Subsequently, the image pairs (original image, masked image) are fed into the molecular encoder to extract features and maximize their similarity. Here, the Euclidean distance is used to constrain the similarity between the two features; minimizing the Euclidean distance enforces greater similarity.
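The masking operation and the similarity constraint can be sketched as follows (a NumPy illustration; the function names are ours, and the feature vectors stand in for encoder outputs):

```python
import numpy as np

def mask_region(image, top, left, size=16):
    """Fill a size x size square with the image mean, as in the MCL task."""
    masked = image.copy().astype(float)
    masked[top:top+size, left:left+size] = image.mean()
    return masked

def euclidean_loss(feat_a, feat_b):
    """Similarity constraint: the Euclidean distance between the features
    of the original and masked images, which pre-training minimizes."""
    return float(np.linalg.norm(feat_a - feat_b))

img = np.random.default_rng(0).random((224, 224))
masked = mask_region(img, top=50, left=60)   # one randomly chosen 16 x 16 region
```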
In the molecular image reconstruction (MIR) task (Figure S5), we build our GAN model based on context encoders [6]. The details of the GAN model are described in Figure S5. In the generator, the latent features are first forwarded to a single-hidden-layer MLP, which accepts a 512-D input and produces a 128-D output.
Subsequently, four ConvTranspose2D layers with BatchNorm2D and ReLU are used. For ConvTranspose2D, the numbers represent the input channels, output channels, kernel size and stride, respectively. Finally, a ConvTranspose2D layer with a Tanh activation function is used to generate 64 × 64 images. In the discriminator, the input image is first resized to 64 × 64. Then the resized image and the generated image are input to a Conv2d with LeakyReLU and three Conv2d layers with BatchNorm2D and LeakyReLU (negative slope 0.2). For Conv2d, the numbers have the same meaning as for ConvTranspose2D. Finally, a Conv2d is used to discriminate whether the input images are real or fake.
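An illustrative PyTorch sketch of such a generator. Only the 512-D latent, the 128-D MLP output, the four BatchNorm+ReLU transposed convolutions, the final Tanh layer and the 64 × 64 output come from the text above; the channel widths are our assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def up(cin, cout, k, s, p):
    """ConvTranspose2D + BatchNorm2D + ReLU block, as described above."""
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, k, s, p),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Generator(nn.Module):
    """Context-encoder-style generator: 512-D latent -> 128-D via an MLP,
    then transposed convolutions up to a 64 x 64 RGB image."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(512, 128), nn.ReLU(inplace=True))
        self.deconv = nn.Sequential(
            up(128, 256, 4, 1, 0),               # 1x1  -> 4x4
            up(256, 128, 4, 2, 1),               # 4x4  -> 8x8
            up(128, 64, 4, 2, 1),                # 8x8  -> 16x16
            up(64, 32, 4, 2, 1),                 # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1),  # 32x32 -> 64x64
            nn.Tanh())                           # pixel values in [-1, 1]

    def forward(self, latent):
        h = self.mlp(latent).view(-1, 128, 1, 1)  # reshape MLP output to a 1x1 map
        return self.deconv(h)

gen = Generator()
fake = gen(torch.randn(2, 512))   # two 512-D latent features -> two 64x64 images
```

The discriminator mirrors this structure with strided Conv2d layers and LeakyReLU, as described above.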

C.1 Results on the pre-training
Figure S18 shows the details of the loss changes of ImageMol during pre-training. We did not show the training details of the image reconstruction task because its loss is adversarial. In general, the loss of ImageMol on the remaining four pretasks shows a decreasing trend and gradually converges, which shows that ImageMol can learn different kinds of information about molecular images from these pretasks.

C.2 Performance comparison with existing approaches
We compare ImageMol with four different types of models: fingerprint-based, sequence-based, graph-based and image-based models.
Fingerprint-based models. ImageMol outperforms the fingerprint-based methods (Table S3), except for the accuracy metric on the CYP2C9 isoform, which is only 0.8% lower; the other metrics are improved by 1.0%–4.4%, which shows that the features extracted by ImageMol are richer than manual features.
Sequence-based models. Owing to the simplicity and efficiency of the Simplified Molecular-Input Line-Entry System (SMILES) sequence, it has become one of the most popular molecular representations [8,9]. We compared ImageMol with several popular pre-training models (RNNS2S [10], SMILES Transformer [11], ChemBERTa [12], MolVec [13] and N-GRAM [14]) on molecular target, blood-brain barrier penetration and drug toxicity benchmarks. ImageMol outperforms the state-of-the-art sequence-based pre-training models on all datasets (Figure 2.d or Table S4, and Figure 2.e or Table S5), an absolute improvement in ROC-AUC ranging from 0.7% to 15.3% over the other methods. This shows that image-based molecular representation has obvious advantages over sequence-based representation, because sequence models can only learn 1D sequence information and lack 2D structural information.
Graph-based models. Considering that molecules can be naturally represented as graph structures, some graph-based methods [15,16] have recently emerged to learn the 2D topological structure information of molecules. We compared ImageMol with several graph-based pre-training models (Jure's GNN) [15]. The performance of ImageMol comprehensively exceeds the state-of-the-art graph-based pre-training models on all datasets (Figure 2.e and Table S5), ranging from 3.4% to 12.6%, which shows the advantage of using molecular images as a representation. Although both molecular graphs and images are 2D representations, they differ significantly in representation type. The molecular graph focuses on topological information at the atomic level, while the molecular image focuses on spatial structure information at the pixel level. The spatial structure contains richer information, such as the shape of molecules, the angles of chemical bonds and the relative distances between atoms.
Image-based models. We selected several recent molecular image-based models as comparison methods, namely Chemception, ADMET-CNN and QSAR-CNN. We find that ImageMol outperforms these state-of-the-art methods with a large performance gap, ranging from 6.8% to 37.1% (Figure 2.b, Figure 2.c and Figure S7). Specifically, we observed similar performance between ImageMol_NonPretrain and Chemception: 73.2% vs 72.2% and 73.4% vs 75.2% on the HIV and Tox21 datasets, respectively (Figure S6). However, after pre-training on 8M molecular images, ImageMol showed a significant improvement on HIV (an increase of 9.9%) and Tox21 (an increase of 7.2%), with increases ranging from -0.4% to 8.6%, which proves the effectiveness and superiority of our pre-training strategies for molecular images.

C.3 Results on anti-viral activities across SARS-CoV-2 targets
Table S7 and Table S8 describe the experimental results of Jure's GNN, ImageMol and REDIAL-2020 on the anti-SARS-CoV-2 activity estimation task.
In Table S7, we obtained the results of Jure's GNN by running its public source code (https://github.com/snap-stanford/pretrain-gnns) on our SARS-CoV-2 dataset. In Table S8, REDIAL-2020 announced the dataset they used. The confidence distribution of the predicted 3CL inhibitors is depicted in Figure S8: the x-axis represents the confidence that a drug is a 3CL inhibitor, the y-axis represents the number of drugs within a given confidence interval, and the confidence ranges from 0 to 1; the higher the confidence, the more likely the corresponding drug is a 3CL inhibitor. Additionally, to further validate the effectiveness of our approach, we also screened drugs for SARS-CoV-2 from approved drugs. We used the HEK293 model for virtual screening because it is trained on a larger data volume and has good performance. The screening results are shown in Table S11. 13 of the top 20 drugs have been verified by the literature, demonstrating the great potential of ImageMol. We also tested the accuracy of the model on the external validation set (Table S12), which is provided by [17] and contains 122 inhibitors of SARS-CoV-2 (listed at https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-04482-x/MediaObjects/41586_2022_4482_MOESM1_ESM.pdf). Since we focus on virtual screening of small-molecule drugs, we took the intersection of these 122 inhibitors with the drugs in DrugBank and finally obtained 70 small molecules for testing. Of these 70 drugs, we successfully predicted 47 potential drugs, demonstrating the potential of ImageMol as a novel drug discovery tool.

Supplementary Figures
All meaningful structural regions are highlighted by ImageMol.
Figure S11: Heat maps of attention to local structural information, highlighted by Gradient-weighted Class Activation Mapping (Grad-CAM) [21]. The warmer the color, the higher the attention; the cooler the color, the lower the attention. ImageMol captures more local regions for inference.
Figure S12: Quantitative information about the molecular structure areas focused on by ImageMol. The coarse-grained hit rate represents the proportion of the molecular structure in each molecular image that was noticed by ImageMol, and the fine-grained hit rate represents the proportion of the molecular structure area noticed by ImageMol relative to the total molecular structure area.

Figure S13:
The performance of ImageMol pre-trained under the data scale of 0M (no pre-training), 0.2M, 0.6M, 1M and 8M (our final ImageMol) with scaffold split on four property prediction datasets. "Mean" represents the average performance of ImageMol pre-trained with different data scales on four datasets.

Figure S14:
The ablation study of the pretext tasks in ImageMol, using ROC-AUC evaluation with scaffold split on four property prediction datasets. "Mean" represents the average performance of ImageMol pre-trained with different pretasks on the four datasets.
Figure S15: We find the "elbow", the K value corresponding to the point of maximum curvature in an elbow curve, by using a knee point detection algorithm [2].
"ImageMol_NonPretrain" denotes ImageMol without pre-training, and "ImageMol" denotes ImageMol pre-trained on 8M molecular images. "MACCS-" represents methods based on the MACCS molecular fingerprint, and "FP4-" represents methods based on the FP4 molecular fingerprint.

Supplementary Tables
"CC-" represents combined algorithm (ensemble learning). For details on other methods, see [7]. ImageMol_NonPretrained and ImageMol are fine-tuned on PubChem Data Set I and evaluated on PubChem Data Set II. The best and second best result for each dataset is bolded and underlined, respectively.