Gaussian process regression for cyclodextrin host-guest binding prediction

Machine Learning (ML) techniques are becoming an integral part of rational drug design and discovery. Data-driven modeling regularly outperforms physics-based models for predicting molecular binding affinities, placing ML as a promising tool. Cyclodextrins are nano-cages used to improve the delivery of insoluble or toxic drugs. Due to their chemical similarity to proteins, ML approaches could greatly improve affinity prediction for cyclodextrins and enlarge the portfolio of drugs they can carry. Here we evaluate the performance of three well-known ML methods—Support Vector Regression (SVR), Gaussian Process Regression (GPR), and eXtreme Gradient Boosting (XGB)—to predict the binding affinity between cyclodextrins and known ligands. We perform hyperparameter tuning through Random Search. The results were compatible with the literature. We improved on our previous prediction performance and present a GPR model that fits the data well (R² = 0.803) with low prediction errors (RMSE = 1.811 kJ/mol and MAE = 1.201 kJ/mol).


Introduction
Cyclodextrins (CD) are widely used as drug nano-carriers, mainly for enhancing the solubility, stability, and bioavailability of a variety of drugs [17,21,22]. They are extensively applied in the pharmaceutical field, have relatively controllable variables, and specific evaluation indexes [12,22,29]. Drug-CD complexation and electrostatic interactions can significantly alter the drug's pharmacokinetic profile [30]. The CD-drug complexes work as a type of molecular encapsulation [7]: the CD wraps the drug and partly shields it from the external environment, preventing drug degradation [25] (Fig. 1). Also, due to their simplicity and chemical similarity to proteins, CD complexes serve as model systems for developing predictive models [33].
To enable the use of the CD nano-carrier system for a novel class of drugs, one should perform a screening and/or biochemical characterization of potential complexes, which can be quite a laborious and expensive process [28].
Computational methods aim to automate and reduce the time and cost of discovering new drugs [15,18]. However, traditional scoring functions (SF), empirical or physics-based, still struggle to rank potential binders unless rigorous free energy calculations are performed. This limitation has been addressed in several community challenges [24], highlighting the potential of Machine Learning Scoring Functions (MLSF) to improve predictions. Therefore, it is crucial to develop models for pre-screening to direct rigorous tests.
MLSFs consistently outperform classical scoring functions in predicting binding properties, leading to more accurate hit rates and potencies and placing ML techniques as a promising tool [33].
Zhao et al. [33] presented a comprehensive study applying different ML methods (LightGBM, Random Forest, and Deep Learning) to predict the interaction energy between a vast number of cyclodextrin classes and ligand molecules.
Solov'ev et al. [27] presented a study of 3D molecular fragment descriptors for structure-property modeling. The authors apply a multiple linear regression model to predict the complexation free energy between antipodal guests and β-cyclodextrins. Xu et al. [31] presented a quantitative structure-property relationship study of β-cyclodextrin complexation free energies of organic compounds applying multiple linear regression and an artificial neural network.
Di et al. [6] performed an in silico prediction of the binding capacity and interaction forces of organic compounds with α- and β-cyclodextrins, applying a consensus method coupling five ML methods (Radial Basis Function Regression, Gaussian Processes, Random Forest, Gradient Boosted Trees, and Tree Ensemble).
Kerner et al. [16] propose a computational strategy to predict drug interactions with unassociated biomedical implants and present tests on cyclodextrin host-guest systems. The authors applied Artificial Neural Networks coupled with an autoencoder structure to reduce the large dimensionality of the input data.
In this paper, we assess the performance of three well-known ML methods to predict the binding affinity between cyclodextrin and ligand molecules in host-guest systems: (i) Support Vector Regression (SVR), (ii) Gaussian Process Regression (GPR), and (iii) eXtreme Gradient Boosting (XGB). We perform hyperparameter tuning through Random Search (RS).
We focused on building a generalized model capable of pre-screening systems of different classes of cyclodextrin and a variety of ligand molecules.

Data collection
Data was curated and made available by the BindingDB community (https://www.bindingdb.org). Each record covers a host (large) and a guest (small) molecule. The database provides the molecules' structural information through SMILES, the experimental conditions, including pH and temperature (°C), and the binding free energy, measured as ΔG (kJ/mol).
Here we focused on the α-, β-, and γ-cyclodextrin (CD) classes [11], which differ in the number of glucose units: 6, 7, and 8, respectively. For the sake of consistency, we only considered the experiments with available pH and temperature measurements within the following ranges: 6.9 ≤ pH ≤ 7.4 and 14.5 °C ≤ Temp ≤ 30.1 °C, resulting in 280 unique observations of α-CD (73), β-CD (164), and γ-CD (43) experiments. Physical-chemical molecular properties were calculated using the RDKit descriptor calculation module [20] from KNIME (https://www.knime.com). Table 1 shows the calculated descriptors for the three CD hosts. The guests are represented with the same descriptors plus the formal charge (FC). Figure 2 shows some of the distributions of the guest molecule descriptors, plus the distribution of the complexation energy values (ΔG).
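To make the featurization step concrete, below is a minimal sketch of computing guest descriptors from SMILES directly with RDKit in Python (instead of through KNIME). The descriptor set shown is illustrative, not the exact Table 1 list; only SMR, TPSA, RB, and FC are named explicitly in the text.

```python
# Minimal featurization sketch with RDKit; the descriptor set is
# illustrative and may differ from the paper's Table 1 list.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdmolops

def featurize(smiles: str) -> dict:
    """Compute a few physical-chemical descriptors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "MolWt": Descriptors.MolWt(mol),            # molecular weight
        "MolLogP": Descriptors.MolLogP(mol),        # octanol/water partition
        "TPSA": Descriptors.TPSA(mol),              # topological polar surface area
        "SMR": Descriptors.MolMR(mol),              # molar refractivity
        "RB": Descriptors.NumRotatableBonds(mol),   # rotatable bond count
        "FC": rdmolops.GetFormalCharge(mol),        # formal charge (guests only)
    }

print(featurize("CCCC[NH3+]"))  # the 1-butylamine guest discussed in the results
```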

Machine learning approach
The database comprises 280 unique observations, 145 guests, 3 hosts, and 25 descriptors (9 for hosts, 10 for guests, 2 for the environment, 3 identifiers, and 1 objective variable). The dataset was randomly divided into a unique training set (224) and testing set (56) using the Stratified K-fold method to maintain the same proportion of instances of each CD class in both sets [8] (see the sketch below), coupled with a Kullback-Leibler divergence analysis between the sets. The data and code are available from the authors upon request. Additionally, Table 7 in the Appendix shows more descriptive information on the considered input and output variables, and Tables 8 and 9 in the Appendix show statistics of the training and testing sets, respectively. In the present paper, we compare the performance of three ML methods to predict the binding affinity of numerous ligands to cyclodextrins: (i) ε-Support Vector Regression, (ii) Gaussian Process Regressor, and (iii) eXtreme Gradient Boosting.
[Fig. 1 caption: The interaction of a drug and a cyclodextrin to form an inclusion complex [25].]
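A minimal sketch of the stratified split with scikit-learn is shown below; the file name and column names (cd_class, dG, and the identifier columns) are hypothetical, and the paper's pipeline additionally checks the Kullback-Leibler divergence between the resulting sets.

```python
# Stratified 224/56 split sketch; file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cd_binding.csv")  # hypothetical file with the 280 observations

# Keep numeric descriptors only; drop the objective and the 3 identifiers.
feature_cols = [c for c in df.columns
                if c not in ("dG", "host_id", "guest_id", "reference_id")]
X = df[feature_cols]                # host, guest, and environment descriptors
y = df["dG"]                        # binding free energy (kJ/mol)

# Stratifying on the CD class keeps the same alpha/beta/gamma proportions
# in the training (224) and testing (56) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=56, stratify=df["cd_class"], random_state=0
)
```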

Gaussian process regressor (GPR)
Gaussian Process Regressor (GPR) is a non-parametric method that, instead of inferring a distribution over the parameters of a parametric function, infers a distribution over functions directly. As described by Ou et al. [23], as for all regression approaches, a Gaussian process model can be defined as

y = f(X) + ε,

where f(·) is a non-linear function mapping the input vector X to a scalar output y, and ε is Gaussian noise with zero mean (i.e., ε ∼ G(0, σ_v²)). The prior for f(·) is assumed to be a Gaussian process, f(·) ∼ G(μ, C), where μ is usually set to 0 and C is the covariance matrix.
The prior's covariance is specified by passing a kernel object. Here, we applied both the RBF kernel and the Matern kernel, the latter given by

k(x_i, x_j) = (1 / (Γ(ν) 2^(ν−1))) ((√(2ν)/ℓ) d(x_i, x_j))^ν K_ν((√(2ν)/ℓ) d(x_i, x_j)),

where ℓ is the length scale, ν controls the smoothness of the resulting function, Γ(·) is the gamma function, and K_ν is a modified Bessel function [19].
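Below is a minimal GPR sketch with scikit-learn (whose Matern kernel implements the expression above), reusing the train/test split from the previous section; the kernel parameters and noise level are illustrative, not the Random-Search-tuned values reported later.

```python
# GPR sketch with a Matern kernel; parameter values are illustrative.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# nu controls smoothness; kernels.RBF is the alternative kernel tested.
kernel = Matern(length_scale=1.0, nu=1.5)

gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
gpr.fit(X_train, y_train)                         # descriptors -> dG (kJ/mol)
mean, std = gpr.predict(X_test, return_std=True)  # predictive mean and uncertainty
```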

"-support vector regression ( "-SVR)
The ε-Support Vector Regression (ε-SVR) is a version of the classical Support Vector Machine [3] applied in several fields and a potential tool for drug discovery studies [14]. Previous studies performed by our research group indicate good results with this method. The ε-SVR solves

min_w (1/2)‖w‖² + C Σ_{i=1}^{n} L_ε(y_i, f(x_i)),

where x_i is the input and y_i is the output data associated with instance i, L_ε is the ε-insensitive loss function [10], C is a regularization parameter, and ε is an ε-SVR parameter.
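A minimal ε-SVR sketch with scikit-learn (whose SVR implements the ε-insensitive formulation above) follows; the C, epsilon, and gamma values are placeholders for the quantities later tuned by Random Search.

```python
# eps-SVR sketch; hyperparameter values are placeholders, not tuned ones.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr = make_pipeline(
    StandardScaler(),  # SVR with an RBF kernel is sensitive to feature scales
    SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale"),
)
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)
```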

The eXtreme gradient boosting (XGB)
The eXtreme Gradient Boosting (XGB) [4] is an ensemble method that combines several low-complexity, high-error estimators, called weak learners, to generate a robust learner. In the case of XGB, the weak learners are regularized decision trees [5]. The XGB prediction for an instance i is

ŷ_i = Σ_{m=1}^{M} h_m(x_i),

where M is the number of estimators and h_m are the weak learners, built based on the max_depth parameter and a minimum loss factor for partitioning a new tree leaf (γ_tree).
Since it is an additive boosting method, it is built in a greedy fashion of the form

F_m(x) = F_{m−1}(x) + η h_m(x),

where η is a learning rate and the newly added tree h_m(x) is fitted to minimize a sum of losses L_m, given by Eq. (2).
In Eq. (2), we take

L_m = Σ_{i=1}^{N} l(y_i, F_{m−1}(x_i) + h_m(x_i)) + Ω(h_m),   (2)

with the regularization term

Ω(h) = γ_tree T + (1/2) λ_reg Σ_j w_j² + α_reg Σ_j |w_j|,

where N is the number of samples, Ω is the regularization term, T is the number of leaves in the decision trees (weak learners), w_j are the leaf weights, and α_reg and λ_reg are the L1 and L2 regularization terms on the weights, respectively.
Many researchers have recently obtained excellent results applying the XGB method to predict molecular interactions in host-guest systems [8,33]. Thus, we test this ML technique for our predictions, as sketched below.
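The short sketch below maps the quantities above onto the xgboost API (gamma for γ_tree, learning_rate for η, reg_alpha/reg_lambda for the L1/L2 terms); the values are illustrative, not the Random-Search-tuned ones.

```python
# XGB regressor sketch; values are illustrative, not the RS-tuned ones.
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=100,    # M, the number of weak learners
    max_depth=4,         # depth limit of each regularized tree
    gamma=0.0,           # minimum loss reduction to partition a new leaf (gamma_tree)
    learning_rate=0.3,   # eta in the greedy additive update
    reg_alpha=0.0,       # L1 regularization on leaf weights (alpha_reg)
    reg_lambda=1.0,      # L2 regularization on leaf weights (lambda_reg)
)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
```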

Randomized search (RS) strategy
In addition to the internal parameters adjusted in the training step, ML methods are generally sensitive to hyperparameter definitions [26]. In this work, we propose a hybrid approach coupling the ML methods with hyperparameter tuning through a Randomized Search (RS) strategy. The RS performs a random choice of values, where each setting is sampled from a distribution over possible parameter values [1]. The process allows a budget to be chosen independently of the number of parameters; moreover, adding parameters that do not influence performance does not decrease efficiency.
Each RS run considers 3-fold cross-validation and 1000 sampled machines. All optimized parameters follow uniform distributions.
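A minimal sketch of one RS run is shown below for the SVR case (GPR and XGB are analogous); the search ranges are illustrative, not the paper's, and scoring uses the negative RMSE so that the best machine is the one with the lowest training RMSE.

```python
# Randomized Search sketch: 1000 sampled machines, 3-fold CV, uniform priors.
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

param_distributions = {            # illustrative ranges, not the paper's
    "C": uniform(0.1, 100.0),
    "epsilon": uniform(0.0, 5.0),
    "gamma": uniform(1e-4, 1.0),
}
rs = RandomizedSearchCV(
    SVR(kernel="rbf"),
    param_distributions,
    n_iter=1000,                   # 1000 machine samples per run
    cv=3,                          # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
rs.fit(X_train, y_train)
print(rs.best_params_, -rs.best_score_)  # parameters of the lowest-RMSE machine
```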

Model performance criteria
We apply three metrics to calculate the prediction errors:

RMSE = √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)²,

where y is the measured value, ŷ is the predicted value, and ȳ is the average of the measured values, given by ȳ = (1/n) Σ_{i=1}^{n} y_i, considering the n instances i.
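These metrics map directly onto scikit-learn, as in the short sketch below (y_pred stands for the test-set predictions of any of the fitted models above).

```python
# Metric computation sketch for a fitted model's test-set predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # kJ/mol
mae = mean_absolute_error(y_test, y_pred)           # kJ/mol
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.3f} kJ/mol, MAE = {mae:.3f} kJ/mol, R2 = {r2:.3f}")
```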
We analyzed the application domain of the best regression through a Williams plot, a method widely applied for evaluating application domains, which plots leverage values against prediction errors [9]. The leverage value (h) measures how far a new instance i is from the centroid of the training set, and we can calculate it as

h_i = x_iᵀ (XᵀX)⁻¹ x_i,

where X is the training set descriptor matrix and x_i is the descriptor vector of instance i [9]. Here, we considered the warning leverage h* = 3(p+1)/n, where p is the number of molecular descriptors and n is the number of training samples. If a new CD-ligand instance has a leverage higher than the threshold h*, its predicted value is considered unreliable; such molecules are held to lie outside the descriptor space and, hence, outside the application domain.
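A minimal NumPy sketch of the leverage computation is given below; it assumes the descriptor matrices are converted to plain numeric arrays and uses a pseudo-inverse in case XᵀX is singular.

```python
# Leverage sketch for the Williams plot analysis.
import numpy as np

def leverages(X_ref: np.ndarray, X_new: np.ndarray) -> np.ndarray:
    """h_i = x_i^T (X^T X)^{-1} x_i for each row x_i of X_new."""
    G = np.linalg.pinv(X_ref.T @ X_ref)       # pinv guards against a singular X^T X
    return np.einsum("ij,jk,ik->i", X_new, G, X_new)

Xtr = np.asarray(X_train, dtype=float)        # numeric descriptor matrix
Xte = np.asarray(X_test, dtype=float)
n, p = Xtr.shape
h_star = 3 * (p + 1) / n                      # warning leverage threshold
unreliable = leverages(Xtr, Xte) > h_star     # flags out-of-domain instances
```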

Results and discussion
First, we present the results related to the hyperparameter search through Randomized Search (RS). Each run of RS finds an adjusted machine (model) with an associated training RMSE. We considered the machine associated with the lowest training RMSE to be the best-optimized model. Here we present the distribution of the optimized parameters selected over 1000 runs of Random Search for each of the ML strategies. The SVR parameters are shown in Fig. 3, the GPR parameters appear in Fig. 4, and Fig. 5 displays the XGB parameters. The hatched bar indicates the range containing the best-optimized model among the RS runs. Note that for SVR (Fig. 3), even though the search space was defined over a larger range, the best models were consistent, with values ≤ 3. A similar result is observed for γ, which essentially converges to values close to zero. For GPR, none of the best results were based on the RBF kernel. In Fig. 4, we notice that sigma increases for GPR with the Matern kernel while nu decreases in the best results.

Table 2 shows the overall average metrics on the training and testing sets for each method. The results demonstrate the consistency of the methods, with GPR achieving better results than the other methods. Table 3 shows the metrics for the best-optimized model of each tested method achieved by the RS strategy, and Table 4 presents the set of optimized hyperparameters associated with each best-optimized model. Again, we notice better results from GPR in every metric; thus, we focus the following analysis on this approach.

Figure 6a-c shows the high adjustment of the best-optimized model to the training data (R² = 0.953) along with the other evaluation metrics. We indicate instance attributes that frequently mislead the predictions; the definition of each class is described in Table 5. Figure 6d-f shows the model's prediction ability on the testing set. The generalization ability of the GPR model is evident from the performance metrics presented in Table 2. Compared to the model's performance on the training set, the increase in prediction errors for the test set was expected. However, there was no overfitting, since the performance drop on the test set is not significant, demonstrating the good performance of the strategy on a set of unseen samples.

Figure 6 shows the occurrence of an outlier prediction in the test set (the sample with the greatest error). The instance related to this prediction is the interaction between the CD host and the guest 1-butylamine (SMILES: CCCC[NH3+]), with ΔG_real = −17.124 kJ/mol and ΔG_pred = −8.095 kJ/mol. For this particular instance, the SVR and XGB models predicted ΔG_pred ∼ −13.5 kJ/mol, which still represents a high error but is closer to the actual value than the GPR prediction.

Table 5 shows the overall average RMSE (1000 runs) for each instance separation (charged ligand, ligand flexibility, and CD classes). There are indeed oscillations in the predicted measurements between the classes. Although the difference in RMSE does not exceed ∼0.8 kJ/mol, it should be taken into account before applying the ML method to new datasets.
Given the previous results, we need to verify whether the error levels are compatible with the molecular interactions in the model's application domain. In this context, electrostatic interactions, van der Waals forces, salt bridges, and hydrogen bonds (HB) are the prevalent forms of interaction.
Among them, hydrogen bonds and electrostatic interactions are the determining factors in receptor-ligand complexes [32].
HBs are weaker than electrostatic interactions but far more frequent. The literature highlights that HB interactions always have an interaction potential lower than 10 kJ/mol, and in the vast majority of cases lower than 3 kJ/mol [2,32]. In protein complexes, for example, the contribution from the formation of an HB is 5 ± 2.5 kJ/mol [32]. Thus, the error level in the predictions can be related to the number of HBs within the non-predicted range. For this work, an error of up to one non-predicted hydrogen bond is considered acceptable.
The RMSE and MAE metrics indicate an average error while keeping the same unit as our objective variable (kJ/mol). Observing our best GPR model's RMSE and MAE in training (RMSE = 1.034 kJ/mol and MAE = 0.396 kJ/mol) and testing (RMSE = 1.811 kJ/mol and MAE = 1.201 kJ/mol), we see that the values remain safely below the threshold of 5 kJ/mol. The same holds for our overall average analysis presented in Table 2. This guarantees an average error of at most one hydrogen bond between the real and the predicted ΔG. We can also verify the best GPR model's application domain through the Williams plot in Fig. 7. The majority of compounds in the training and testing sets are placed within the reliable area, indicating that these compounds are most likely to be well predicted by the model.
There are four training compounds with leverage values greater than the threshold h*. They differ from the other instances by having ligand SMR, TPSA, or RB equal to zero (more info in the Complementary Results 1). These instances may influence the model's performance on the training set but are not necessarily outliers to be deleted from the training dataset, since their standardized residual values are small and within the established limit. Overall, the standardized residuals are smaller than, or at the limit of, our safe region given by ±3σ_resid, where σ_resid is the residuals' standard deviation, except for one instance of the test set (instance #10). This instance has already been identified in Fig. 6d-f. Since its descriptor leverage does not exceed the value h*, it may be regarded as a worst-case scenario. The labeled data in Fig. 7 are instances arranged at the limit of, or outside, the Williams plot security region. Table 6 relates host and guest BindingDB identifiers with the relative h and standardized residuals obtained for the best GPR model.
[Table 5 caption: Experimental binding affinity (kJ/mol); moderate RB covers the range [4,7] and high RB the range [8,11]; the outlier-generating instance was disregarded for this RMSE calculation.]

Finally, we briefly compare the results with some papers presented in the literature. Note that this comparison is not over results based on the same datasets; we only seek to demonstrate that, for the current database, the error levels are compatible, a priori, with those frequently presented in the literature. Dimas et al. [28] selected a set of complexes between β-CD and 57 small organic molecules that had been previously studied with the binding energy distribution analysis method in combination with an implicit solvent model. Even though that study focused only on β-CD and applied a physics-based method, the error levels were R² = 0.66 and RMSE = 9.330 kJ/mol, values worse than those we achieved with our average metrics in Table 2 and our best model in Fig. 6d-f.
Zhao et al. [33], on the other hand, also applied ML methods to predict the interaction energy between a vast number of cyclodextrin classes and ligand molecules (see Sect. 1). Their best result was based on the Light Gradient Boosting Machine (LightGBM) approach, similar to the XGB approach, with metrics R² = 0.86, RMSE = 1.83 kJ/mol, and MAE = 1.38 kJ/mol. Comparing with our best XGB model presented in Table 3, we achieved competitive results considering our smaller database size. However, we may be on the edge of an overfitting situation, as our training data were fitted considerably better than our testing data. Since the XGB hyperparameter space has many degrees of freedom, it demands reasonably large datasets to achieve its optimal generalizability. Future analyses may address the XGB hyperparameter search space, reducing the possible values that drive the solution toward a local optimum corresponding to overfitting. We may indicate the learning rate (η) hyperparameter (Fig. 5c) as a possible origin of overfitting: our best machine has a high value for this parameter (η = 0.82), indicating the possibility of a biased solution with a high tendency to accumulate specific information from the training data. In this scenario, the ML does not reach generalized solutions for the testing data, as we desire. From our results in Fig. 5c, we may find a better solution for XGB when η ≤ 0.6, or with an even smaller cutoff value.
Comparing the Zhao et al. [33] results with our best GPR model presented in Table 3 and Fig. 6d-f, we see that we achieved better MAE and RMSE values. Moreover, GPR is a method of lower computational complexity than XGB. Disregarding the outlier prediction, the mean error metrics are even better. However, the R² did not reach the same value as Zhao et al. [33].

Conclusion
This paper proposes an approach combining machine learning methods (SVR, XGB, and GPR) with a Random Search hyperparameter tuning strategy to predict the interaction in cyclodextrin host-guest systems. GPR achieved better average and best-model results than the other methods. The proposed approach was enough to define a GPR model with good generalization (R² = 0.803) and low associated errors (RMSE = 1.811 kJ/mol and MAE = 1.201 kJ/mol), with reliable predictions within the application domain. The results were compatible with the presented literature, even though we used a method of lower computational complexity. As further research, we will apply other ML methods to this task and investigate the use of an evolutionary algorithm for optimizing the hyperparameters.