An adaptive graph learning method for automated molecular interactions and properties predictions

Improving efficiency is a core and long-standing challenge in drug discovery. For this purpose, many graph learning methods have been developed to search for potential drug candidates with fast speed and low cost. However, the pursuit of high prediction performance on a limited number of datasets has crystallized their architectures and hyperparameters, limiting their ability to repurpose to new data generated in drug discovery. Here we propose a flexible method that can adapt to any dataset and make accurate predictions. The proposed method employs an adaptive pipeline to learn from a dataset and output a predictor. Without any manual intervention, the method achieves far better prediction performance on all tested datasets than traditional methods, which are based on hand-designed neural architectures and other fixed items. In addition, we found that the proposed method is more robust than traditional methods and can provide meaningful interpretability. Given the above, the proposed method can serve as a reliable way to predict molecular interactions and properties with high adaptability, performance, robustness and interpretability. This work takes a solid step towards aiding researchers to design better drugs with high efficiency.

Deep learning methods can provide useful predictions for drug design, but their hyperparameters need to be carefully tweaked to give good performance on a specific problem or dataset. Li et al. present here a method that finds appropriate architectures and hyperparameters for a wide range of drug design tasks and can achieve good performance without human intervention.

Drug discovery is a lengthy, costly and complex process that plays a crucial role in human health and well-being 1,2 . At present, experimental assays 3 remain the most reliable approach to screening compounds, but they are expensive. Although many computational methods 4 have been proposed to estimate molecular interactions and properties and thereby improve drug discovery efficiency, drug discovery remains a difficult process.
Graph learning methods have the potential to improve drug discovery efficiency dramatically because of their ability to amplify insights available from existing drug-related datasets 5 . Using the insights to predict molecular interactions and properties 6,7 is key to finding potential drug candidates from the vast chemical space with extremely fast speed and low cost. On the other hand, molecular generation 8,9 based on the insights can more efficiently traverse the vast chemical space to find potential drug candidates. Accordingly, graph learning is becoming a rising area of interest within the field of drug discovery 10 .
However, the pursuit of high prediction performance on a limited number of existing datasets has crystallized the architectures and hyperparameters of these methods, limiting their ability to repurpose to new data generated during drug discovery. In practice, researchers tend to crystallize, that is, select and optimize, architectures and hyperparameters from a huge design space to achieve the best performance on a dataset 11-13 . This heavily limits repurposing to newly generated data, which tend to be increasingly complex 14 in drug discovery. Most graph learning methods rely heavily on these architectures and hyperparameters to achieve their claimed state-of-the-art results, and if the authors do not release the specific architectures and hyperparameters, the claimed results cannot be reproduced 15 .
Recently, a few works have been reported to address the crystallization problem. MolMapNet introduced an out-of-the-box deep learning method based on broadly learning knowledge-based representations to achieve reliable prediction performance on more datasets without human intervention 7 . A recent work also introduced a neural architecture search-based method to design a neural architecture for any dataset of molecular property prediction automatically 11 .
In this Article we propose 'graph learning-based adaptive machine' (GLAM), a flexible method that can adapt to any dataset and make accurate predictions without human intervention. We compared our proposed method with previously reported methods in terms of prediction performance, on a wide range of datasets. The results show that our proposed method can adapt to all tested datasets exceptionally well and obtain far better prediction performance than the other reported methods. We also investigated the robustness and interpretability of our proposed method, and found that it was more robust than the other tested methods and can provide meaningful interpretability, making it a more reliable method.

Results
Method overview. Our method utilizes an automated pipeline to learn from datasets and build a predictor, as shown in Fig. 1. Previous graph learning methods 16-21 rely heavily on human experts to design the architecture, set the model hyperparameters, select the optimizer and select the loss function. We combine these four items to build a configuration space. Starting from this configuration space, GLAM performs a series of processes to build a blended predictor, as shown in Fig. 2.

Yuquan Li 1,2,9 , Chang-Yu Hsieh 2,9 , Ruiqiang Lu 1 , Xiaoqing Gong 1 , Xiaorui Wang 3 , Pengyong Li 4 , Shuo Liu 5 , Yanan Tian 5 , Dejun Jiang 6 , Jiaxian Yan 7 , Qifeng Bai 8 , Huanxiang Liu 5 , Shengyu Zhang 2 and Xiaojun Yao 1,3 ✉
We designed two general architectures, one for molecular interaction and another for molecular property, as shown in Fig. 3. Each block in the general architecture is created with its own design space, as shown in Extended Data Fig. 1. These architectures take graphs, including molecular graphs and protein graphs, as input. A molecular graph is constructed with atoms as nodes and bonds as edges. A protein graph is constructed with amino acids as nodes and contact information calculated by RaptorX 22 as edges. The architecture takes a molecular graph and protein graph as input when performing drug-target interaction tasks, two molecular graphs as input when performing drug-drug interaction tasks, and a single molecular graph as input when performing molecular property prediction tasks.
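As an illustrative sketch (not the GLAM implementation), a molecular graph of the kind described above can be represented with atoms as nodes and bonds as bidirectional edges; the atom and bond features below are hypothetical minimal choices (atomic numbers and bond orders), where a real pipeline would derive richer features with a toolkit such as RDKit:

```python
# Toy molecular-graph container: atoms as nodes, bonds as directed edge
# pairs. Feature choices here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    node_features: list                          # one feature vector per atom
    edges: list = field(default_factory=list)    # (src, dst, bond_feature)

    def add_edge(self, i, j, feat):
        # chemical bonds are undirected, so store both directions
        self.edges.append((i, j, feat))
        self.edges.append((j, i, feat))

    def neighbors(self, i):
        return [(j, f) for (src, j, f) in self.edges if src == i]

# Ethanol (CCO): three heavy atoms, two single bonds.
ethanol = MolGraph(node_features=[[6], [6], [8]])  # atomic numbers C, C, O
ethanol.add_edge(0, 1, [1.0])                      # C-C single bond
ethanol.add_edge(1, 2, [1.0])                      # C-O single bond

print(len(ethanol.node_features), len(ethanol.edges))
```

A residue-level protein graph would follow the same shape, with amino acids as nodes and predicted contacts as edges.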
Adapt to datasets for high performance. GLAM is designed to adapt to any given dataset to obtain a high prediction performance. To investigate the adaptability and performance of our method, we compared its performance on 14 datasets with a range of representative traditional methods 7,16-20,23-25 . The tested datasets cover drug-protein interactions, drug-drug interactions, physical chemistry properties, bioactivity, pharmacokinetics and toxicity.
Given that different splits of datasets produce different performances, we let all methods share the same splits of datasets so as to obtain a fair evaluation. We also manually adjusted the architectures and hyperparameters of previously reported methods to achieve their best performance on two representative datasets. Finally, we ran benchmarks and analysed their adaptability and performance on these datasets.
Compared to all traditional methods, our proposed method adapts to datasets well and achieves promising prediction performance, as shown in Tables 1 and 2 and Supplementary Table 1. GLAM thus establishes a new state of the art for the prediction of both molecular interactions and properties. The proposed method achieved an average 18.7% decrease in prediction error across the 14 datasets relative to the best scores of the traditional methods. In addition, GLAM consistently achieves the best scores without human intervention, whereas previously reported methods achieve uneven prediction performances. GLAM is therefore poised to be a flexible, reliable and trustworthy method that works well across a wide range of activities in drug design.

Fig. 2 | GLAM pipeline details.
Initially, GLAM samples many configurations from the configuration space. The dataset is then fed to these configurations for low-fidelity training, which produces evaluation scores used to select the high-performance configurations. In the subsequent high-fidelity training, each previously selected configuration is fed with the dataset to perform full-epoch training with three different random seeds. High-fidelity training produces predictors and the corresponding validation scores used to select the top-performing predictors. The number of selected predictors is predefined by the ensemble size. Finally, all selected predictors are blended into a single predictor. Low-fidelity training refers to a rapid training process with few epochs that obtains validation scores for all configurations to estimate their prediction performance quickly. High-fidelity training refers to a slower training process with enough epochs to accurately estimate the prediction performance of the selected configurations.
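The pipeline stages can be sketched in a few lines of Python; the function names, the toy `train` stand-in and the configuration strings are all hypothetical, chosen only to show the sample, low-fidelity, high-fidelity and blend stages of the data flow:

```python
# Illustrative sketch of the sample -> low-fidelity -> high-fidelity -> blend
# pipeline. `train` is a deterministic dummy standing in for real training.
import random

def train(config, seed, epochs):
    # placeholder "validation score" derived deterministically from the inputs
    key = sum(ord(ch) for ch in config) * 1000 + seed * 7 + epochs
    return random.Random(key).random()

def glam_pipeline(space, n_samples=20, n_select=5, ensemble_size=3):
    rng = random.Random(0)
    configs = [rng.choice(space) for _ in range(n_samples)]          # sample
    # low-fidelity: few epochs, one seed, keep the best n_select configs
    scored = sorted(configs, key=lambda c: train(c, seed=0, epochs=3),
                    reverse=True)[:n_select]
    # high-fidelity: full epochs, three seeds -> candidate predictors
    predictors = [(train(c, seed=s, epochs=100), c, s)
                  for c in scored for s in (1, 2, 3)]
    best = sorted(predictors, reverse=True)[:ensemble_size]
    # "blend": average the ensemble (here, just the scores)
    return sum(score for score, _, _ in best) / ensemble_size

space = ["mpnn-small", "mpnn-large", "gcn", "gin"]
blended_score = glam_pipeline(space)
print(round(blended_score, 3))
```

In the real pipeline the blended object is a predictor whose outputs are averaged, not a score, but the selection logic follows the same shape.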
High robustness against molecular structure perturbation. The next issue to consider is robustness 26 , another essential indicator of good practice in method development. We assume that a robust predictor should not change its output greatly when a small structural perturbation that has little effect on the molecular property is applied. Natural perturbations can always affect a machine learning method, and may lead to a wrong result with serious consequences in some safety-sensitive domains (such as healthcare). The robustness of a graph learning method is therefore also an essential issue.
To evaluate the robustness of our proposed method, we first introduced a principle termed property-slightly-affected structure perturbation (PASP), then built a real-world perturbed dataset following this principle from the PhysProp 27 dataset. We then performed a robustness experiment based on this dataset. More details are provided in the Methods. Table 3 shows that GLAM is less affected by molecular structure perturbations and demonstrates higher robustness than the conventional methods. We fed original molecules and perturbed molecules to the predictor and investigated the differences in the outputs to measure the effect score. Our method is less affected than the conventional methods at all three levels of PASP. The robustness of GLAM is most probably due to the model blending at the end of the pipeline. The main idea of blending is to train several models and draw a final prediction by averaging. Perturbing the molecular structure may therefore affect the individual predictors but not the blended model.
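The intuition that blending damps perturbations can be illustrated with a toy simulation, assuming each predictor reacts to a perturbation with independent zero-mean noise (synthetic numbers, not GLAM outputs):

```python
# Toy illustration: each simulated predictor's output shifts by independent
# zero-mean noise under a perturbation; averaging over an ensemble cancels
# part of that noise, so the blended shift is smaller on average.
import random
import statistics

rng = random.Random(42)
n_predictors, n_molecules = 7, 500

def perturbation_effect():
    # per-predictor output change caused by a perturbation
    return rng.gauss(0.0, 0.5)

single = [abs(perturbation_effect()) for _ in range(n_molecules)]
blended = [abs(statistics.mean(perturbation_effect()
                               for _ in range(n_predictors)))
           for _ in range(n_molecules)]

print(round(statistics.mean(single), 3))    # typical single-model shift
print(round(statistics.mean(blended), 3))   # typical blended shift (smaller)
```

When the individual errors are correlated the cancellation is weaker, which is one reason diverse configurations in the ensemble help.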

Interpretation cases.
To better understand the predictors generated by GLAM, we investigated its decision-making process and interpreted its learned knowledge. In the past, most machine learning models have been considered as black boxes. Previous works have adopted attention mechanisms 20,28,29 to aid interpretation of models.
Here we interpret the model by averaging its hidden states and visualizing the result, thus directly utilizing the information provided by the models in the predictor, as shown in Extended Data Fig. 2. Extended Data Fig. 2a presents some case studies of solubility prediction, which are consistent with the intuition of chemists. Generally, hydroxyl and amino groups are considered to be more hydrophilic, and alkyl and halogen groups are considered to be more lipophilic. We selected and visualized some representative molecules from the PhysProp dataset. The atoms in the hydrophilic groups tend to be bluer in our visualization, which means their weights are closer to 1. Meanwhile, the atoms in the lipophilic groups tend to be redder, which means their weights are closer to −1. These observations are consistent with the intuition of chemists, indicating that the models in the predictor can detect essential atomic groups with clear interpretability of their solubility.
In the same way, we also visualized some drug-drug interaction identification cases, as shown in Extended Data Fig. 2b. We considered the interactions between sildenafil/udenafil and nitrates (nicorandil/isosorbide dinitrate) as cases. These were combined into four pairs of drug-drug interactions and fed into the models in the predictor to visualize the decision process. (Typically, sildenafil/udenafil can selectively inhibit phosphodiesterase type 5 (PDE5) targets in the human body. The N-methyl groups in the pyrazolopyrimidone rings of sildenafil/udenafil are important for the activity and selectivity towards PDE5. For these reasons, combining sildenafil/udenafil with nitrate drugs may lead to serious drug-drug interactions.) The visualization results show that the models in the predictor pay more attention to the nitrate groups of isosorbide dinitrate and nicorandil, and to the N-methyl groups of sildenafil and udenafil. Accordingly, our visualization results are consistent with previous findings for these drug interactions, indicating that the models in the predictor can provide deep insights into molecular interactions.
Ablation studies. Time consumption and resource cost. We analysed the time consumption and resource costs of GLAM and other methods, as shown in Supplementary Table 2. We analysed multiple aspects, including computing time, computing device, training details and dataset size. From the comparison, GLAM costs roughly 10 times more computing time and four times more computing resources. We believe that this cost is worthwhile for the high performance achieved.
Preferences analysis. The preferences of GLAM may lead to ideas for the design of new methods, so we analysed the preferences of GLAM across multiple configuration items, as shown in Supplementary Table 3. These analyses were performed on three representative datasets: ESOL, BBBP and Tox21. GLAM prefers the global pool method and the Adam optimizer on all tested datasets. For the choice of message-passing core, GLAM tends to prefer message-passing networks (MPNs) on small datasets and more complex cores on bigger datasets. For the choice of message-passing steps, GLAM tends to prefer more steps on bigger datasets.
Ensemble size. It is well known that ensembling models can create better predictors. We designed and conducted two experiments to investigate the effect of ensemble size on performance and robustness, as shown in Table 3 and Supplementary Table 4. From Supplementary Table 4, we can see that the performance improved as the ensemble size increased from 1 to 3, and remained stable from 3 to 7. On the other hand, a bigger ensemble size, from 1 to 7, always brings a smaller effect from PASP, as shown in Table 3. We also tested several ensembles of competing graph models, such as 3*MPNN and GCN + GIN + MPNN. Their performance and robustness improved, but were still not as good as those of GLAM.

Comparison with automated machine learning methods. It is now common to feed an automated machine learning (AutoML) method 30,31 with structured data to obtain an excellent predictor. Auto-Sklearn and Auto-Gluon are well-known examples. However, we cannot make a fully fair comparison with them because they are very different in terms of the data they accept and the objectives they optimize. Auto-Sklearn and Auto-Gluon take structured data (images and so on) as input and optimize independent machine learning models with hyperparameters. GLAM takes multiple unstructured inputs (double/single molecular graphs) and optimizes configurations consisting of architectures, hyperparameters, optimizers and losses. Despite this, we processed the molecules into structured data (molecular fingerprints 32 ) and fed them to the AutoML methods so that we could make a comparison with GLAM, as shown in Supplementary Table 5. Although their performance is not as good as that of GLAM, they have a considerable advantage in computational speed and cost.

Discussion
We have shown that GLAM can adapt well to all tested datasets and make accurate predictions automatically. In the past, adaptability to new data has largely been ignored, as researchers have devoted almost all their attention to achieving high prediction performance. Our method can serve as a reliable way to predict molecular interactions and properties with high adaptability, prediction performance, robustness and interpretability. Furthermore, the automated pipeline of our proposed method enables more researchers, even those who lack machine learning experience, to make full use of the power of machine learning. These advantages will greatly increase the acceptance of machine learning-aided drug discovery.

Limitations and frontiers
Adaptive feature input could help the models in our proposed method to extract important and sufficient representations. In this Article we only describe the graph model with basic node features, such as the atomic/residue number. Adaptive feature input could be of great help in some particular prediction jobs, so adding feature decisions to the configuration space might improve our proposed method.
A strategy that provides neither too little information nor too much redundant information would contribute greatly to the representation extraction process. A more intelligent hyperparameter optimization algorithm built on multiple graphics cards could also increase the configuration search efficiency: the current version of GLAM uses a basic optimizer based on random search, and embedding a more intelligent optimizer would help GLAM find ideal configurations in less time.

Outlook
Our method is expected to advance and evolve automated drug design 1,33 . Recent advances, such as chemical retrosynthesis prediction 34,35 and molecular generation 36-38 , have laid the foundation for a future automated drug design pipeline. Automated or semi-automated drug design will become a trend, and the proposed method can serve as a predictor generator that contributes to it. For further applications, our proposed method can be repurposed to other scientific discovery fields, such as agrochemical and materials design.

Methods

Details of datasets.
We used LIT-PCBA 39 , BindingDB 40 and DrugBank 41 to evaluate our proposed method for molecular interaction prediction. From LIT-PCBA we selected four datasets of representative proteins based on the number of positive and negative samples. We used datasets in MoleculeNet 42 to evaluate the proposed method for molecular property prediction. MoleculeNet 42 is a set of benchmarking datasets for molecular machine learning that can be used to achieve a fair performance comparison.
LIT-PCBA is a virtual screening dataset consisting of 14 targets, with 7,844 confirmed active and 407,381 confirmed inactive compounds 39 .

Configuration space. The configuration space of GLAM has two parts: architecture decisions and training decisions. The architecture decisions determine how to build the architecture of a model. The general architectures for pairs of molecules and for single molecules contain eight and four blocks, respectively, each with its own independent design space, as shown in Fig. 1. The training decisions determine how to train the model, including the batch size, number of epochs, type of loss, type of optimizer, learning rate, reduction of the learning rate, patience of the learning-rate reduction and early-stop patience.
Graph learning in architectures. Given a molecular graph G with x_i denoting the node features of node i and e_ji denoting the edge features from node j to node i, the feedforward block can be described as

h_i = f_nn(x_i),    (1)

where h_i denotes the hidden node features and f_nn a feedforward neural network. The message-passing block can be described as

h′_i = f_u(h_i, Σ_{j∈N(i)} f_i(h_j, e_ji)),    (2)

where f_u denotes the update function, f_i the interaction function and N(i) the neighbours of node i. The output properties p, produced by the global pooling block followed by the final feedforward block, can be described as

p = f_nn(f_pool({h_i})),    (3)

where f_nn denotes a feedforward neural network and f_pool the global pooling layer.

Design spaces of blocks. Each block is created with its own design space, as shown in Extended Data Fig. 1. The feedforward block consists of a normalization layer, a dropout layer, a feedforward layer and an activation layer. The normalization, dropout and activation layers can each be chosen to be empty in this block. Most parts of the message-passing block are the same as in the feedforward block, but the core is changed to a message-passing layer with a choice of five possible types. The fusion block is designed to extract information on a pair of interacting molecules. The global pool block consists of one graph pooling layer with a choice of three types.
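The data flow through the three blocks can be sketched as follows; the learnable functions f_nn, f_i and f_u are replaced with fixed toy functions, so this illustrates only the structure of the block equations, not the GLAM implementation:

```python
# Plain-Python sketch of the feedforward, message-passing and readout blocks.
# The "learnable" functions are fixed toy stand-ins.

def f_nn(x):                        # feedforward block: h_i = f_nn(x_i)
    return [2.0 * v for v in x]

def f_interact(h_j, e_ji):          # interaction function f_i
    return [hv * ev for hv, ev in zip(h_j, e_ji)]

def f_update(h_i, msg):             # update function f_u
    return [a + b for a, b in zip(h_i, msg)]

def message_passing(h, edges):
    # h'_i = f_u(h_i, sum over incoming edges of f_i(h_j, e_ji))
    out = []
    for i, h_i in enumerate(h):
        msg = [0.0] * len(h_i)
        for (j, dst, e_ji) in edges:
            if dst == i:
                msg = [a + b for a, b in zip(msg, f_interact(h[j], e_ji))]
        out.append(f_update(h_i, msg))
    return out

def readout(h):                     # p = f_nn(f_pool({h_i})), mean pooling
    pooled = [sum(col) / len(h) for col in zip(*h)]
    return f_nn(pooled)

x = [[1.0], [2.0], [3.0]]                       # node features
edges = [(0, 1, [1.0]), (1, 0, [1.0]),
         (1, 2, [1.0]), (2, 1, [1.0])]          # undirected bonds
h = [f_nn(xi) for xi in x]                      # feedforward block
h = message_passing(h, edges)                   # one message-passing step
p = readout(h)                                  # pooled property
print(p)
```

Repeating the `message_passing` call corresponds to choosing a larger number of message-passing steps in the design space.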
Preparing for both molecular interactions and properties. We prepare two general architectures, for molecular interactions and for molecular properties prediction, as shown in Fig. 3. The pair-graph architecture for molecular interactions accepts a pair of molecules as input and outputs their interaction; some essential tasks in drug discovery, such as protein-ligand interactions, relate to molecular interactions. The single-graph architecture for molecular properties accepts a molecule as input and outputs one or multiple properties of the molecule. All molecules are processed into graphs with basic node attributes as input to the architectures. Small molecules are processed into atom-level molecular graphs, where the edge information is provided by chemical bonds. Proteins are processed into residue-level graphs, with the edge information provided by contact maps predicted by RaptorX 22 .
Multi-graphics-card parallelism. GLAM works in parallel across multiple graphics cards. The most time-consuming parts of a graph learning process are training, validation and testing. The proposed method contains many independent graph learning processes, and we let them work in parallel to fully utilize the computational resources. In detail, we build a queue and insert all these processes, as jobs, into it. Whenever a graphics card is free, a job is popped from the queue and assigned to that card, until all jobs have been processed.
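The queue-based scheduling can be sketched with one worker thread per (simulated) graphics card; all names below are illustrative, with a list append standing in for a full train/validate/test run:

```python
# Sketch of queue-based job scheduling: independent jobs go into a queue
# and one worker per simulated graphics card pops jobs until it is empty.
import queue
import threading

jobs = queue.Queue()
for job_id in range(8):                 # eight independent configurations
    jobs.put(job_id)

log = []                                # (card_id, job_id) records
lock = threading.Lock()

def card_worker(card_id):
    while True:
        try:
            job_id = jobs.get_nowait()  # a free card pops the next job
        except queue.Empty:
            return                      # no jobs left for this card
        with lock:
            log.append((card_id, job_id))   # stand-in for train/val/test

threads = [threading.Thread(target=card_worker, args=(c,)) for c in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(log))                         # every job ran exactly once
```

Because `Queue.get_nowait` is thread-safe, each job is claimed by exactly one card regardless of how the threads interleave.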

Robustness experiments.
This experiment aims to investigate the robustness of a predictor based on the perturbed dataset. If the prediction of a method is not greatly affected by a slight perturbation that has little effect on the molecular property, the method may be considered robust. Given a molecule set M with ground-truth properties Q and a trained predictor f (the training and validation sets are used to train f and save its weights), we predict the property set P by equation (4):

P = f(M).    (4)

Given the perturbed test set M′ with properties Q′, we predict the property set P′ by equation (5):

P′ = f(M′).    (5)

We estimate the robustness of the predictor by calculating the error between the predicted values P of the original molecule set M and the predicted values P′ of the perturbed molecule set M′. Given a distance function L, L(P, P′) represents the distance between P and P′. If L(P, P′) > L(Q, Q′), the predictor is not robust; that is, the perturbation has a big impact on the performance of the predictor. If L(P, P′) ≤ L(Q, Q′), the predictor is robust and the perturbation has less impact. We define the perturbation effect score Δ of method f with equation (6):

Δ = L(P, P′) − L(Q, Q′).    (6)

In this work we use the following settings. We picked out the original set M (N = 2,362) and the perturbed sets M′ (N = 2,362 for each of levels 1, 2 and 3) from all 14,176 molecules, and we use the r.m.s.e. as the distance function L.
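Assuming the perturbation effect score is the gap between the prediction shift L(P, P′) and the ground-truth shift L(Q, Q′), with r.m.s.e. as the distance L, the computation can be sketched as follows (all numbers are synthetic):

```python
# Sketch of the perturbation effect score, assuming the score is the gap
# between the prediction shift and the ground-truth shift under r.m.s.e.
# All property values below are synthetic.
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def effect_score(P, P_pert, Q, Q_pert):
    # positive -> predictions shift more than the true properties do
    return rmse(P, P_pert) - rmse(Q, Q_pert)

Q, Q_pert = [1.0, 2.0, 3.0], [1.1, 1.9, 3.1]   # true log P values
P, P_pert = [1.2, 2.1, 2.8], [1.6, 1.5, 3.4]   # model predictions

delta = effect_score(P, P_pert, Q, Q_pert)
print(round(delta, 3))   # > 0 here: this toy predictor is not robust
```

A robust predictor would give a score near or below zero, since its outputs would move no more than the true properties do.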
PASP. This principle is used to determine an ideal perturbed molecule set with small perturbations that do not significantly affect the properties. We need to ensure two conditions are met to let PASP work. The first condition is that the change in property should be within an acceptable range, and the second condition is that the molecular structures do not change much.
Given a molecule pair {x_i, x′_i}, let their properties be {q_i, q′_i}. Assume their molecular fingerprint similarity is S(x_i, x′_i) ∈ [γ_min, γ_max), where [γ_min, γ_max) is a predefined similarity range, and the difference of their properties satisfies L(q_i, q′_i) < ε, where ε is a predefined acceptable threshold; then the molecule pair is an ideal perturbed molecule pair that follows the principle.
Building the perturbed dataset. In this experiment we build a perturbed dataset for robustness estimation based on real-world datapoints by searching and selecting from the PhysProp 27 dataset. The PhysProp dataset consists of 14,176 molecular structures and their corresponding properties (log P). We compare all potential molecule pairs, calculating the fingerprint similarity of all molecules and their difference in log P, and pick out molecule pairs that meet the following two conditions. The first condition is that the difference in the log P of the molecule pair should be less than 0.2, and the second condition is that the molecular fingerprint similarity should be in the range 0.3-1.0. We then divide these molecule pairs into three levels (similarity ranges 0.8-1.0, 0.5-0.8 and 0.3-0.5, marked as levels 1, 2 and 3), pick out those molecules that exist in all three levels, and build the test set (N = 2,362) with them. The corresponding molecules in the pairs of the three levels are separated to build the perturbed test sets (N = 2,362 for each level). All remaining molecules are used to build the training set (N = 7,684) and validation set (N = 2,561) to train the models and save the model weights.
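The pair-selection rule can be sketched as follows, with toy set-based fingerprints, the Tanimoto coefficient as a stand-in for the unspecified fingerprint similarity, and synthetic log P values:

```python
# Sketch of the PASP pair-selection rule: keep a pair when the fingerprint
# similarity falls in a level's range and the log P difference is < 0.2.
# Fingerprints are toy sets of substructure IDs; Tanimoto similarity is an
# assumption, and all molecules/values are synthetic.

def tanimoto(fp_a, fp_b):
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

LEVELS = {1: (0.8, 1.0), 2: (0.5, 0.8), 3: (0.3, 0.5)}

def pasp_level(fp_a, fp_b, logp_a, logp_b, eps=0.2):
    if abs(logp_a - logp_b) >= eps:
        return None                       # property changed too much
    s = tanimoto(fp_a, fp_b)
    for level, (lo, hi) in LEVELS.items():
        if lo <= s < hi or (level == 1 and s == 1.0):
            return level                  # level 1 includes the closed top end
    return None                           # too dissimilar to count

# toy molecules: (fingerprint, log P)
a = ({1, 2, 3, 4, 5}, 1.00)
b = ({1, 2, 3, 4},    1.10)   # similarity 0.8 -> level 1
c = ({1, 2, 3},       1.15)   # similarity 0.6 -> level 2
d = ({1, 2, 3},       2.50)   # similar structure, but log P differs too much

print(pasp_level(a[0], b[0], a[1], b[1]))
print(pasp_level(a[0], c[0], a[1], c[1]))
print(pasp_level(a[0], d[0], a[1], d[1]))
```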
Node-level interpretation. We extract the output of the message-passing block of the best predictor generated by GLAM, which is a matrix X = {x_ij | 1 ≤ i ≤ N, 1 ≤ j ≤ M}, where N is the number of atoms and M the dimension of the outputs. We then obtain the weights W = {w_i | 1 ≤ i ≤ N} of the atoms according to w_i = (1/M) Σ_{j=1}^{M} x_ij, and visualize the molecule with the weights. In some cases, the weights w_i may be scaled to [−1, 1] or [0, 1].
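This interpretation step amounts to a row mean over the feature dimension followed by optional rescaling; a minimal sketch with a synthetic hidden-state matrix:

```python
# Sketch of node-level interpretation: average each atom's hidden state over
# the feature dimension, then scale the weights to [-1, 1].
# The hidden-state matrix X below is synthetic.

def atom_weights(X):
    # w_i = (1/M) * sum_j x_ij for each atom i
    return [sum(row) / len(row) for row in X]

def scale(ws, lo=-1.0, hi=1.0):
    w_min, w_max = min(ws), max(ws)
    if w_max == w_min:
        return [0.0 for _ in ws]        # all atoms equal: no contrast to show
    return [lo + (hi - lo) * (w - w_min) / (w_max - w_min) for w in ws]

X = [[0.2, 0.4],     # atom 1 hidden state (M = 2)
     [0.8, 1.0],     # atom 2
     [-0.4, -0.6]]   # atom 3

ws = atom_weights(X)     # ≈ [0.3, 0.9, -0.5]
print(scale(ws))         # atom 2 -> 1.0 (bluest), atom 3 -> -1.0 (reddest)
```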

Data availability
All data used in this paper are publicly available and can be accessed as follows: LIT-PCBA 39

Code availability
All code of GLAM is freely available at https://github.com/yvquanli/GLAM under an MIT licence. The version used for this publication is available at https://doi.org/10.5281/zenodo.6371164 43 .