Extracting Predictive Representations from Hundreds of Millions of Molecules

Although deep learning can automatically extract features in relatively simple tasks such as image analysis, the construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive, time-consuming, and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse datasets. In this work, we develop a self-supervised learning approach via a masking strategy to pre-train transformer models on over 700 million unlabeled molecules from multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuning process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a new protocol based on data traits to automatically select the optimal model for a specific predictive task. To validate the proposed representation and protocol, we consider 10 benchmark datasets in addition to 38 ligand-based virtual screening datasets. Extensive validation indicates that the proposed representation and protocol show superb performance.


In the past few years, machine learning (ML) has profoundly changed the landscape of science, engineering, technology, finance, industry, defense, and society in general. It has become a new approach for scientific discovery, following traditional experiments, theories, and simulations. In image analysis, deep learning algorithms, such as convolutional neural networks (CNNs), can automatically extract image features without resorting to hand-crafted descriptors. However, for molecular predictions, due to the internal complexity of molecules, generating molecular representations or descriptors is an essential issue that is as important as data and algorithms in determining ML performance [1,2]. It is a procedure that translates the chemical information in a molecule into a set of "machine"-understandable features.

In recent years, molecular fingerprints based on 3D structures have been developed to capture the 3D spatial information of molecules [8]. However, the complexity and elemental diversity of molecular structures are major obstacles to the design of 3D fingerprints [1]. A variety of advanced mathematics-based 3D molecular representations, including algebraic topology [9], differential geometry [10], and algebraic graph-based methods [11,12,13], were devised to generate molecular fingerprints that encode the 3D and elemental information of molecules by mathematical abstraction. These methods have been highly successful in the classification of proteins and ligands, as well as in the prediction of solubility, solvation free energy, protein-ligand binding affinity, protein folding stability changes upon mutation, and mutation-induced protein-protein binding affinity changes [1]. However, these approaches rely on high-quality 3D molecular structures, which limits their applications.

Deep learning (DL) has been a successful and powerful tool in various fields, such as natural language processing [14], image classification [15], and bioinformatics [16,17]. Conventional deep learning methods are constructed on deep neural networks (DNNs). In molecular sciences, the input to these models is usually a pre-extracted molecular descriptor, e.g., ECFP or MACCS. However, this type of input may not preserve certain molecular information and thus compromises the performance of downstream predictive tasks [18,19].

We use a self-supervised learning strategy, specifically the bidirectional encoder transformer (BET) [28,40], to obtain our pre-trained models.

Figure 1: Illustration of the self-supervised learning platform. a, Three public datasets are involved in the pre-training datasets module (blue rectangle): Set C only contains the ChEMBL dataset, Set CP consists of the ChEMBL and PubChem datasets, and Set CPZ contains the ChEMBL, PubChem, and ZINC datasets. b, Based on these three datasets, three pre-trained models (green rectangle) are obtained by self-supervised learning: Model C, Model CP, and Model CPZ, respectively. c, The dataset analysis module (purple rectangle) contains the Wasserstein distance analysis submodule and the decision submodule; it points to the best pre-trained model for a specific dataset. d, The fine-tune module (yellow rectangle) fine-tunes the pre-trained model using a specific dataset. Finally, the fingerprints are generated from the fine-tuned model and used as input for the downstream machine learning tasks.
[Figure caption fragment] a, The correlations between the pre-training datasets C, CP, and CPZ and the downstream datasets, including 5 classification (Classif.) and 5 regression (Regre.) datasets. f, Normalized prediction results of the fingerprints from pre-trained model C for the five regression datasets DPP4, ESOL, FreeSolv, Lipophilicity (Lipop), and LogS.
For the dataset analysis module, we use the Wasserstein distance analysis submodule and the decision submodule to decide the optimal model for the downstream task. First, we generate the distributions of the symbols listed in Table S1; the distributions of the complete set of symbols are shown in Figure S1. Additionally, we also count the distribution of the number of symbol types contained in each SMILES, as shown in Figure 2c.
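As an illustration, these symbol statistics can be collected with a few lines of Python. This is a minimal sketch assuming a character-level split, whereas the platform actually splits SMILES with the multi-character symbol vocabulary of Table S1; the function names are ours, not the authors'.

    from collections import Counter

    def symbol_percentages(smiles):
        # Percentage of each symbol within one SMILES string.
        counts = Counter(smiles)
        return {sym: 100.0 * n / len(smiles) for sym, n in counts.items()}

    def dataset_distributions(smiles_list, vocabulary):
        # For every symbol, collect its percentage in each SMILES; each list
        # is one empirical distribution over 0%..100% (cf. Figure 2).
        dists = {sym: [] for sym in vocabulary}
        n_types = []  # number of distinct symbol types per SMILES (cf. Figure 2c)
        for s in smiles_list:
            pct = symbol_percentages(s)
            for sym in vocabulary:
                dists[sym].append(pct.get(sym, 0.0))
            n_types.append(len(pct))
        return dists, n_types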

To analyze a specific dataset, we can generate these same distributions. In the second step, based on the various distributions obtained (63 in total), we use the Wasserstein distance analysis submodule to analyze the correlation between different datasets in several ways, as sketched below. Finally, using the decision submodule, a ridge linear regression model is used to determine the most suitable SSLP for a specific small dataset. Since the symbols in SMILES all have corresponding meanings, the dataset analysis module allows a comprehensive comparison of datasets from these distributions. In the SSLP, the fine-tune module is used to generate the task-specific fingerprints for the specific dataset, by fine-tuning the selected pre-trained model on the task-specific data. To fairly compare the performance of different fingerprints, we fix in advance a set of general machine learning parameters, as shown in Table S2.
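Concretely, the correlation analysis can be sketched as follows: for each of the symbol distributions, the one-dimensional Wasserstein distance between the task dataset and a pre-training dataset is computed, and the resulting vector serves as the feature of the decision submodule. The sketch below assumes the dictionaries produced by the (hypothetical) dataset_distributions helper above.

    from scipy.stats import wasserstein_distance

    def wasserstein_features(dists_task, dists_pre, vocabulary):
        # One Wasserstein distance per symbol distribution; concatenated over
        # the vocabulary, these form the feature vector of the decision module.
        return [wasserstein_distance(dists_task[sym], dists_pre[sym])
                for sym in vocabulary]

Computed against Sets C, CP, and CPZ, such vectors would form the input X of the ridge model described in the Methods.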

To reduce the systematic errors in the machine learning process, we applied different random seeds for each task. Our SSL-FPs, auto-FPs, and circular 2D FPs achieved the best results in 3, 4, and 3 tasks, respectively. For the circular 2D fingerprints, in each task, we pick the best fingerprint from nine parameter settings for comparison.

Although our SSL-FPs did not achieve the best performance on all tasks, they still perform at a comparable level on most prediction tasks. For the five regression tasks (except for the DPP4 dataset), deep learning-based fingerprints, including our fingerprints and the autoencoder-based fingerprints, perform better than the circular 2D fingerprints. The complete results with multiple metrics are listed in Table S3. We also compared the fingerprints generated by the different pre-trained models, as shown in Figure 3c and d.

It is interesting to see that model C achieved the best performance in 7 of the 10 tasks, even though for pre-trained model C we only applied about 1.9 million unlabeled data for pre-training, while models CP and CPZ were pre-trained on substantially larger collections.

[Figure caption fragment] In the VS experiment, for the DUD database, fingerprint laval was the best-performing fingerprint among 28 2D fingerprints, and in the MUV database, fingerprint ap was the best-performing one among all the 2D fingerprints. c, A summary of the VS experiments shows that the four fingerprints, ours, auto, laval, and ap, obtained the best performance on 18, 7, 6, and 7 datasets, respectively.
The performance of these fingerprints is only slightly lower than that of the SSL-FPs and laval, which indicates that other molecular fingerprints can also obtain very close performance on these datasets. In summary, our SSL-FPs showed stable, superior performance in the VS experiments. The complete results for all fingerprints are listed in Table S4 and Table S5.

In this work, we applied the SSL strategy to train different BET-based models using three datasets, i.e., Set C, Set CP, and Set CPZ (listed in Table 1). For a specific downstream task, such as a regression task, we simply use the task-specific dataset as input data to fine-tune the model so that task-specific molecular fingerprints can be generated from the fine-tuned model; a minimal sketch of this step is given below. We carry out the fine-tuning process to adapt the model to a specific task, allowing the resulting molecular descriptors to focus on relevant task-based information, thereby improving the accuracy of downstream tasks. Figure 1f shows the comparison of normalized prediction results for fingerprints generated from the pre-trained model. As shown in Figure 4, the SSL-FPs from pre-trained model C achieve the best results on 18 of the 38 tasks.
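A minimal PyTorch-style sketch of this fine-tuning step follows. The head dimensions, the losses, and the use of the first-token embedding follow the description in the Methods, but 'pretrained_encoder' and all other names are illustrative rather than the authors' actual code.

    import torch.nn as nn

    class FineTuneModel(nn.Module):
        # Pre-trained BET encoder plus a small task head.
        def __init__(self, pretrained_encoder, d_model, n_out):
            super().__init__()
            self.encoder = pretrained_encoder
            self.head = nn.Linear(d_model, n_out)  # n_out = 1 for regression

        def forward(self, token_ids):
            h = self.encoder(token_ids)   # (batch, seq_len, d_model)
            fp = h[:, 0, :]               # embedding of the first '<s>' symbol
            return self.head(fp), fp      # task prediction and fingerprint

    # Losses: nn.MSELoss() for regression, nn.CrossEntropyLoss() for classification.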

In contrast to the encoder-decoder structure of traditional autoencoder models, in this work we utilize the encoder-based BET, which greatly improves the efficiency of model training. For some downstream machine learning tasks, such as the molecular property prediction tasks and VS experiments discussed in this work, our self-supervised learning-based pre-trained encoder alone can be used to achieve excellent performance. Moreover, the parallel computing capability of the transformer was a crucial element that allowed us to engage over 700 million molecules in our training [28]. The structure of the BET is shown in Figure S4. The complete results are listed in Table S4 and Table S5, where model C can perform even better.

However, on some datasets, such as the LogS dataset in the regression tasks, the best performance is obtained by model CP. Based on this observation, we hypothesize that the performance of a model on a task depends on the correlation of the task-specific dataset with the pre-training dataset. To verify this hypothesis, we developed a dataset analysis module in our self-supervised learning platform, which aims to identify the pre-trained model that provides the best performance.

Based on the composition of symbols in SMILES strings, we counted 61 common symbols and all the symbols listed in Table S1. For each type of symbol, we calculated its percentage in each SMILES. Therefore, for each dataset, we obtain a distribution from 0% to 100% for each symbol, as shown in Figure 2. For these organic small-molecule databases, the distributions of carbon, oxygen, and nitrogen are the widest in each dataset, which indicates the high diversity of these essential elements. In addition, the symbol 'c' represents a carbon atom in an aromatic ring; as shown in Figure 2, the ring structures of dataset C have a higher diversity compared with datasets CP and CPZ. For the special symbols, dataset CPZ has a higher diversity for the symbols '[', ']', and '+', which indicates that there are more charged atoms in that dataset. In addition to the symbolic analysis, we also compared the datasets through the distribution of the number of symbol types per SMILES (Figure 2c).

Data processing for self-supervised learning. To enable self-supervised learning, we pre-process the input SMILES. A total of 51 symbols, as listed in Table S1, are used to split the SMILES strings. Two special symbols, '<s>' and '</s>', are added to the beginning and end of each input. Since the length of SMILES varies from molecule to molecule, '<pad>' is used as a padding symbol to fill short inputs up to the preset length. In the masking process, 15% of the symbols of each SMILES are selected for the masking operation. Among these selected symbols, 80% are masked, 10% are left unchanged, and the remaining 10% are randomly replaced. The strategy of dynamically changing the masking pattern is applied to the pre-training data [43]; a sketch of this scheme is given after the next paragraph.

Bidirectional encoders of transformer for molecular representation. Unlike sequence-learning models such as RNN-based models, the transformer is built on an attention mechanism [28], namely the scaled dot-product attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where Q, K, and V, namely the query matrix, key matrix, and value matrix, are mapped from the input data, and $d_k$ is the dimension of the keys. The structure of the BET is shown in Figure S4.
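A minimal sketch of the masking scheme from the data-processing paragraph above (15% of symbols selected; of those, 80% masked, 10% kept, 10% randomly replaced): re-sampling the pattern at every call yields the dynamic masking of ref. [43]. The token names and helper are illustrative, not the authors' code.

    import random

    def mask_tokens(tokens, vocabulary, rng=random):
        # 'tokens' is a symbol-split SMILES wrapped in '<s>' ... '</s>'.
        masked, labels = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if tok in ("<s>", "</s>", "<pad>"):
                continue
            if rng.random() < 0.15:          # select 15% of the symbols
                labels[i] = tok              # the model must recover these
                r = rng.random()
                if r < 0.8:                  # 80%: replace with the mask token
                    masked[i] = "<mask>"
                elif r < 0.9:                # 10%: keep the symbol unchanged
                    pass
                else:                        # 10%: replace with a random symbol
                    masked[i] = rng.choice(vocabulary)
        return masked, labels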

319
For a specific downstream task, we use supervised learning to fine-tune the pre-trained model. There is no additional pre-processing for the input SMILES. The Adam optimizer is set the same as in pre-training. The warm-up strategy is used for the first 2 epochs, and a total of 50 epochs are trained for each dataset. The mean square error and the cross-entropy are used in the fine-tuning stage for regression tasks and classification tasks, respectively. The process of fine-tuning is shown in Figure S5. The molecular representation is generated from the last encoder layer's embedding vector of the first symbol, i.e., '<s>'.

Wasserstein distance analysis of datasets. In this work, the Wasserstein distance is used to measure the correlation between two distributions. Mathematically, the Wasserstein distance is a distance function defined between two probability distributions µ and ν on a metric space M,
$$W(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)\, \mathrm{d}\gamma(x, y),$$
where Γ(µ, ν) denotes the collection of all distributions on M × M with marginals µ and ν on the first and second factors, respectively. The Wasserstein metric is equivalently defined by
$$W(\mu, \nu) = \inf \mathbb{E}\left[d(X, Y)\right],$$
where E represents the expected value and the infimum is taken over all joint distributions of the random variables X and Y with marginals µ and ν, respectively.
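For one-dimensional empirical distributions such as the symbol distributions used here, the Wasserstein-1 distance reduces to the area between the two empirical CDFs; a tiny numerical check on example values of our own choosing:

    import numpy as np
    from scipy.stats import wasserstein_distance

    x = np.array([0.0, 1.0, 3.0])
    y = np.array([5.0, 6.0, 8.0])
    print(wasserstein_distance(x, y))                # 5.0
    # For equal-size samples, the optimal coupling pairs sorted values:
    print(np.mean(np.abs(np.sort(x) - np.sort(y))))  # also 5.0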
In the decision submodule, the ridge regression model [44] is fitted by minimizing the penalized residual sum of squares,
$$\min_{w} \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2,$$
where α > 0 is the complexity parameter that controls the amount of shrinkage. Here, y corresponds to the index of the three pre-training models, i.e., 0, 1, and 2. Additionally, considering the influence of feature dimensionality on the accuracy of the least-squares fit, we use the principal component analysis (PCA) method to reduce the dimensionality of the features X. Figure S6 shows the accuracy of the protocol in selecting the best pre-trained model as the feature dimension increases.
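The decision submodule can thus be sketched with scikit-learn. The PCA dimension and α below are placeholders (Figure S6 studies the effect of the former), and X, y denote the Wasserstein feature vectors and best-model indices of previously benchmarked datasets.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    def fit_selector(X, y, n_components=10, alpha=1.0):
        # PCA-reduced features feed a ridge regressor onto the model index.
        selector = make_pipeline(PCA(n_components=n_components), Ridge(alpha=alpha))
        selector.fit(X, y)
        return selector

    def select_model(selector, x_new):
        # Round the continuous ridge output to the nearest model index 0/1/2.
        pred = float(selector.predict(np.asarray(x_new).reshape(1, -1))[0])
        return int(np.clip(round(pred), 0, 2))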

Downstream machine learning and evaluation metrics. For the downstream prediction tasks, three machine learning algorithms are used in this work, namely GBDT, RF, and SVM [41]. To better compare the performance of molecular fingerprints, we did not over-search for the best machine learning model hyperparameters. Instead, for these three machine learning methods, we simply set universal parameters based on the amount of data in the training set of the downstream task, as shown in Table S2 and sketched below.
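For instance, the three learners with fixed, dataset-size-dependent settings can be instantiated as below; the concrete numbers are placeholders, not the values of Table S2, and the regressors shown would be swapped for their classifier counterparts on classification tasks.

    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.svm import SVR

    def make_learners(n_train):
        # Universal hyperparameters chosen only by training-set size.
        n_estimators = 1000 if n_train > 1000 else 400
        return {
            "GBDT": GradientBoostingRegressor(n_estimators=n_estimators,
                                              learning_rate=0.05, max_depth=5),
            "RF":   RandomForestRegressor(n_estimators=n_estimators),
            "SVM":  SVR(kernel="rbf", C=10.0),
        }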

The authors declare no competing interests.

Supporting Information
The Supporting Information is available on the website at xxxxxx