Self-Supervised Machine Learning Approach for Identifying Biochemical Influences on Protein-Ligand Binding Affinity

Drug discovery is incredibly time-consuming and expensive, averaging over 10 years and $985 million per drug. Calculating the binding affinity between a target protein and a ligand is critical for discovering viable drugs. Although supervised machine learning (ML) models can predict binding affinity accurately, they suffer from lack of interpretability and inaccurate feature selection caused by multicollinear data. This study used self-supervised ML to reveal underlying protein-ligand characteristics that strongly influence binding affinity. Protein-ligand 3D models were collected from the PDBBind database and vectorized into 2422 features per complex. LASSO Regression and hierarchical clustering were utilized to minimize multicollinearity between features. Correlation analyses and Autoencoder-based latent space representations were generated to identify features significantly influencing binding affinity. A Generative Adversarial Network was used to simulate ligands with certain counts of a significant feature, and thereby determine the effect of a feature on improving binding affinity with a given target protein. It was found that the CC and CCCN fragment counts in the ligand notably influence binding affinity. Re-pairing proteins with simulated ligands that had higher CC and CCCN fragment counts could increase binding affinity by 34.99-37.62% and 36.83%-36.94%, respectively. This discovery contributes to a more accurate representation of ligand chemistry that can increase the accuracy, explainability, and generalizability of ML models so that they can more reliably identify novel drug candidates. Directions for future work include integrating knowledge on ligand fragments into supervised ML models, examining the effect of CC and CCCN fragments on fragment-based drug design, and employing computational techniques to elucidate the chemical activity of these fragments.


INTRODUCTION
Drug discovery is the basis of the modern pharmaceutical market, and encompasses most of the industry's research and development funding [1]. On average, it takes 12-15 years and $985 million to deliver a drug to market, demonstrating the exhaustive time and effort required to complete the drug discovery process [2,3]. Drug-Target Interaction (DTI) analysis is one of the most critical parts of drug discovery, and it involves calculating the binding affinity between a target protein and a ligand molecule so that appropriate ligand candidates for drugs can be chosen. These ligand candidates go on to be included in in vitro experimentation in order to identify lead compounds for the final drug. The affinity of a ligand to bind with a protein depends on the atomic interactions between the ligand and the binding region (referred to as the "binding pocket") on the protein ( Fig. 1) [4].
Calculating the binding affinity between a protein and ligand can be completed through Virtual Screening (VS), where compounds are screened and binding affinity calculated using 1 computer software [5] (Fig. 2). The "Scoring Function", which is the function used to calculate binding affinity, is critical for VS. Machine Learning (ML) algorithms have demonstrated considerable promise as a scoring function compared to other standard function types [6]. Given a set of training data, ML algorithms are able to learn chemical features from protein-ligand models through supervised learning functions. This allows them to accurately predict the binding affinity based on learned features that have statistically high influence [7][8][9]11]. Beyond simply predicting the binding affinity, supervised ML algorithms can be used to determine the importance of certain pharmaco-like features in influencing the binding affinity, thereby revealing important insights that can inform the development of innovative drugs [8]. However, supervised ML algorithms suffer when multicollinearity -the presence of significant intercorrelations between two or more independent variables -exist in a dataset [12,13]. This is because supervised feature analysis methods, the most popular being Random Forest feature 2 selection, rely on differential performance with/without certain predictor variables in order to determine the importance of that variable [14]. Therefore, when predictor variables are highly intercorrelated with one another, calculated "feature importance" scores do not accurately represent the independent contribution one predictor variable has to a response. In addition, supervised nonlinear models (e.g. Random Forests or Deep Neural Networks) that are successful in predicting a response output accurately suffer from lack of interpretability, meaning that informative patterns learned by the algorithm cannot be easily extracted. Therefore, supervised learning models have been given the term"black box", and are problematic for analyzing the features patterns from large-scale datasets [15,16].
On the other hand, correlation analysis techniques such as Spearman's Rank Correlation and R 2 values have demonstrated high interpretability and computational efficiency in analyzing molecular binding properties [17,78]. In addition, self-supervised learning techniques such as Autoencoder Networks and Generative Adversarial Networks have been shown to be useful in learning low-dimensional representations from high-dimensional datasets, thereby capturing significant patterns that can verify the quantitative importance of a feature [18,19]. Correlation analysis and self-supervised learning techniques have not yet been applied to reveal the differences between protein-ligand complexes specifically in regards to their binding affinity. This is a research gap that can be filled to address the "black box" nature of supervised machine learning algorithms and also reveal significant biochemical insights into the most important features of protein-ligand complexes that influence their binding affinity.

Objectives:
There is a pressing need to more reliably identify and analyze biochemical features that influence binding affinity. Current literature either suffer from drawbacks in interpretability caused by supervised learning or do not account for the multicollinearity present in protein/ligand feature datasets. The objectives of this study are three-fold: 1) Account and rectify multicollinearity present between features of protein-ligand complexes, 2) Identify specific biochemical features responsible for high variance in binding affinities, and 3) Quantify the effect of these features on improving the binding affinity of drug complexes.
Gathering a greater understanding of which features influence binding affinity is necessary for developing ML algorithms that achieve higher hit rates during VS. VS will thereby 3 be more efficient and effective, significantly improving the critical stages of early drug discovery.

Dataset Preprocessing:
In this study, protein-ligand models were collected from the PDBBind database [19,41].
The 2015 "Refined" set and the 2015 "Core" set were downloaded. In order to extract relevant quantitative features of each model, a workflow described in [40] was utilized (Fig. 3).
For each complex, 2422 quantitative features were collected. The frequency of 2282 unique substructural molecular fragments were collected. The remaining 140 features were frequencies of amino-acid interactions, with seven types of interactions per amino acid: 1) Hydrophobic, 2) Face-to-face aromatic, 3) Edge-to-edge aromatic, 4) H-bond accepted by ligand, 4 5) H-bond donated by ligand, 6) Ionic bond (ligand partially negative), and 7) Ionic bond (ligand partially positive). Files with a resolution of <2.5 Å were retained to ensure the accuracy of all feature counts, resulting in 3481 complexes from the "Refined" set and 180 from the "Core" set.
The experimentally determined binding affinity (in pKd) for each complex was also collected from the PDBBind database to serve as a target variable.

Minimizing Multicollinearity:
Supervised ML models suffer from lack of interpretability when multicollinearity is present in a dataset [12,13]. Therefore, the variance inflation factor (VIF) of each feature was calculated to quantitatively determine multicollinearity. VIF measures the factor by which the variance of a feature's estimated ordinary least squares (OLS) regression coefficient is increased due to correlation with other features [20].
It was observed that the VIF of most features was significantly above the recommended value of 10 [21]. Therefore, Standard Scaling and LASSO Regression were used to identify and remove unimportant features (those with a regression coefficient of 0) from the dataset [86].
After the regression was performed, significantly high VIF were still observed in many features.
Hierarchical clustering was used to cluster the remaining features. One feature from each cluster was selected so that features providing redundant information were removed [87]. The remaining 34 features all had VIF < 10.
To ensure that the 34 features could accurately predict binding affinity, two Random Forest Regressors were trained on the "Refined" set (hereon referred to as the training set) and tested on the "Core" set (hereon referred to as the testing set) to predict binding affinity. The first regression tree was trained/tested on all 2422 features whereas the second was trained/tested on the 34 selected features. The R 2 , Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (PCC) between the predictions and ground-truth values on the test set was compared for each regressor to ensure that the 34 features retained significant predictive capability. A Random Forests model was used because of its exceptional predictive performance compared to more standard models such as Linear Regression, which cannot accurately represent the information stored in non-linear datasets [26].

Correlation Analysis:
Supervised feature importance calculation methods suffer from being uninterpretable, coined with the term "black box" [15,16]. Therefore, the correlation between features was 5 analyzed to determine potential significant features [80,81]. The Spearman Correlation between each feature and binding affinity was calculated. In addition, the adjusted R 2 value was calculated between each pair/triplet of features and binding affinity to ensure that features with high independent influence were not insignificant in the presence of certain other features [79,82,83]. Each of the three resulting lists were ordered from highest correlation to lowest. The features that were common among the top 25th percentile of correlations in each list were extracted (Fig. 4). The 25th percentile was chosen as the threshold because it was the maximum percentile in which not every one of the 34 features were common among that percentile between all the three lists. It was observed that 8 features were common among the top 25th percentile of each list. The description of each feature is discussed in the Results section.

Autoencoder to Filter Features:
It has been shown in current literature that autoencoders are effective self-supervised models for learning low-dimensional representations of multivariate data [18]. Therefore, an autoencoder was designed and implemented to analyze complexes with high and low counts of 6 each of the 8 features. A three-stage autoencoder was designed and implemented to compress the 34-feature dataset into two dimensions (Fig. 5).
The autoencoder was trained on the training set for 30 epochs with a batch size of 32. The testing set was used for validation at each training epoch. In order to use the autoencoder to verify if the 8 features were influential, the "latent space" of the autoencoder was extracted. This refers to the 2-dimensional output produced by the third Dense Layer. This is because if the latent space showed a difference in binding affinity between complexes with high and low counts of a certain feature, then that feature must be influential on the complex's binding affinity [18].
After training concluded, the Encoder and Bottleneck layers were used to calculate the latent space representation for all complexes in the training set. This resulted in a 2D representation of every complex in the training set.
For each of the 8 features, the 90th and 10th percentile of counts in the training set were calculated. Only complexes with a feature count greater than or equal to the 90th percentile or less than or equal to the 10th percentile for that feature were extracted. Each of these complexes' latent space was graphed on a 2D heatmap, with the "heat" determined by the binding affinity of that complex. This was repeated for each of the 8 features to determine if high/low counts of a certain feature held a significant influence on binding affinity in the latent space representation.
Following Cohen's Effect Size, which has shown to be useful for analyzing biomedical and 7 molecular datasets, any feature whose heatmap exhibited an adjusted R 2 value greater than 0.25 was determined to have a "large" Effect Size on binding affinity [22,77,84]. It was determined that out of the 8 features, the CC and CCCN substructural molecular fragments had large Effect Sizes. No augmented atoms or amino acid residue counts exhibited large Effect Sizes.

Generative Adversarial Network to Simulate Novel Higher-Affinity Complexes:
Although the autoencoder was utilized to determine the significance of CC and CCCN fragment counts, the effect of these features on improving binding affinity was not examined.
Analyzing the ability of the feature counts to improve the binding affinity of weakly-bound complexes is important to understand the application of these features to real-world VS and drug design. Generative Adversarial Networks (GANs) have been shown to be effective at generating synthetic data for molecular datasets and learning representations of protein-ligand complexes [19,[71][72][73]. They are also successful at generating de novo molecules, accurately simulating the development/discovery of new ligands for inclusion in novel drugs [74][75][76]. Therefore, in order to quantify the effect of significant features on increasing binding affinity, a GAN proposed in [23] was implemented and trained on the 34-feature training set. This GAN utilizes a Long Short-Term Memory (LSTM)-based generator and Multi-Layer Perceptron (MLP)-based discriminator. After training concluded, it was used to generate 3481 synthetic protein-ligand complexes, mirroring the size of the training set. Following [23], the quality of these synthetic complexes was determined by calculating a "similarity score": the Spearman Correlation between the logarithmic transformations of the means and standard deviations of all features in each dataset (real and synthetic). The log-transformed means and standard deviations for each dataset were concatenated to produce two lists that were used to calculate the Spearman Correlation.
After it was verified that the synthetic data was sufficiently representative of the training set, each of the high-Effect-Size features were used to identify ligands in the synthetic dataset that could improve the binding affinity of low-affinity (chosen as pKd<4 in this study) complexes from the testing set.
First, 26 low-affinity complexes were identified in the testing set. Every ligand in the synthetic dataset that had a certain feature count less than or equal to n higher than the same feature count in the low-affinity complex's ligand was selected. For each selected ligand, its features were concatenated into a list with the amino acid residue counts of the low-affinity  [24]. Only percent increases with an associated CDF value below 0.05 were considered statistically significant. Before performing the CDF test, the distribution was first determined to be approximately normal through the Kolmogorov-Smirnov test [25]. Distributions were concluded to be approximately normal if the p-value of the Kolmogorov-Smirnov test was below 0.05. This entire workflow was repeated for all integer values of n from 2-10 for a given feature, with the CC fragment count and CCCN fragment count being the features of interest (Fig. 6).

Minimizing Multicollinearity:
The two Random Forests models trained on the 2422-feature dataset and 34-feature dataset, respectively, did not show significantly different predictive performance on their corresponding testing sets (Table #1). All performance statistics were equivalent between both models with the exception of 0.01 increase in Pearson Correlation Coefficient for the 2422-feature model. This suggests that the 34 non-collinear features accurately represent the chemical properties influencing binding affinity stored in the original "Refined" dataset. Therefore, the 34 features were deemed sufficient for use in further analysis.

Correlation Analysis:
The correlation analysis described in the Methods section resulted in eight features that were common among the top 25% of correlations in each group (Table #2). These eight features were considered to likely hold significant influence on the binding affinity of a complex. 10 Autoencoder to Filter Features: The autoencoder described in the Methods section was evaluated by measuring the training and testing loss/accuracy over the training period. The autoencoder learned to accurately reconstruct a complex's features from a 2D representation, as it exhibited a high testing accuracy of 0.9667, a low testing loss of 0.8735, and exponential-like training loss with a final training loss of 0.9297 (Fig. 7). This suggests that the autoencoder was effective in extracting an accurate latent space representation of all complexes in the training set.
As described in the Methods section, complexes with feature counts above the 90th percentile or below the 10th percentile were graphed on a heatmap using both latent space dimensions and the binding affinity as the "heat". The CC and CCCN fragment count features demonstrated high Effect Size, with R 2 values of 0.39 and 0.30, respectively (Fig. 8C-D). No other features demonstrated high Effect Size (Fig. 8) [22]. This suggests that the CC and CCCN fragment count features play a significant chemical role in increasing binding affinity, and that they are distinctly more influential than any other feature [27]. 11 Generative Adversarial Network to Simulate Novel Higher-Affinity Complexes: The GAN described in the Methods section produced 3481 synthetic complexes, the quality of which was calculated by comparing the means and standard deviations of each feature between the original and synthetic dataset. It was observed that there is a strong correlation between the properties of the original and synthetic dataset, as evidenced by a high "similarity score" of 0.9912 (Fig. 9) [23]. This suggests that the synthetic data was accurately representative of the real dataset.

Table #3 (Continued)
in the synthetic dataset were filtered for high CC/CCCN counts and concatenated with the proteins from the low-affinity complexes to determine the effect of each feature on increasing the binding affinity of weakly-bound complexes. Due to non-collinearity, the effect of other features on binding affinity as lurking variables was minimized. The Kolmogorov-Smirnov test and CDF test were utilized to determine the statistical significance of a given increase in binding affinity.
Replacing ligands in low-affinity complexes with ligands that had higher CC fragment counts resulted in an average increase in binding affinity of 34.99-37.62% (Table #4). This suggests that 14 a higher CC fragment count can significantly increase the binding affinity of a ligand with a target protein [28][29][30]. No significant correlation between increase index (value of n by which for a generated ligand to be selected, it must have had a CC count less than or equal to n greater than the CC count in the low-affinity ligand) and mean increase in binding affinity was observed. This suggests that increasing the CC fragment count can consistently increase binding affinity, but that increasing the amount by which the CC fragment count is increased does not necessarily result in a greater binding affinity. A positive relationship between CC increase index and percent of increases that were statistically significant was observed. This is supported by the fact that increasing the value of n also resulted in a greater number of low-affinity complexes experiencing statistically significant increases. This collectively suggests that ligands with greater CC fragment counts more reliably result in greater binding affinities.
The same evaluation process was conducted on the CCCN fragment count feature.
Replacing ligands in low-affinity complexes with ligands that had higher CCCN fragment counts results in an average statistically significant increase in binding affinity of 36.83%-39.64% (Table #5). This suggests that a higher CCCN fragment count can significantly increase the binding affinity of a ligand with a target protein [28][29][30].
No significant correlation between CCCN increase index and mean increase in binding affinity was observed. This suggests that increasing the CCCN fragment count can consistently increase binding affinity, but that increasing the amount by which the CCCN fragment count is increased does not necessarily result in a greater binding affinity. A positive relationship between CCCN increase index and percent of increases that were statistically significant was observed. This is supported by the fact that increasing the value of n also resulted in a greater number of low-affinity complexes experiencing statistically significant increases. This collectively suggests that ligands with greater CCCN fragment counts more reliably result in greater binding affinities.
It was also observed that certain complexes experienced statistically significant increases in binding affinity in both groups. These complexes are those with I.D. 5, 24, 27, 33, 102, 132, 133, 137, 141, 145, and 169. Several complexes did not experience any statistically significant increase in binding affinity. These complexes are those with I.D. 1, 14, 50, 65, 84, 93, 113, and 125. This suggests that CC and CCCN fragments may play a varied chemical role in different low-affinity complexes depending on the ligand's binding mode properties [31,32].

CONCLUSIONS
In this study, three main objectives were achieved: 1) Multicollinearity was minimized in a dataset consisting of 2422 features per protein-ligand complex, 2) Specific ligand fragments were discovered to have a notable chemical influence on increasing binding affinity, and 3) It was shown that increasing the count of these fragments can significantly improve the binding affinity of protein-ligand complexes. The methods utilized in this study improved upon current literature by: 1) Minimizing the bias in results occurring from multicollinear datasets, and 2) Increasing the reliability and interpretability of feature selection by utilizing a unique pipeline of self-supervised learning and correlation analysis techniques instead of supervised learning methods. It is concluded in this study that CC and CCCN ligand fragments are chemically significant in determining the binding affinity of protein-ligand complexes and can determine ligands that notably improve binding affinity when bound to a target protein.
There are several applications of this work to real-world drug discovery. Understanding the influence of CC and CCCN fragments on binding affinity will produce a more accurate representation of ligand chemical activity, aiding researchers to build ML algorithms that more accurately predict binding affinity [33][34][35][36]. Integrating previous knowledge into supervised ML algorithms has been shown to increase predictive performance by an entire order of magnitude, demonstrating the significance of the knowledge uncovered in this study to supervised ML research [37,39,40]. More importantly, pre-training knowledge on ligand fragments will result in ML models that overfit less, making them more generalizable to new datasets and thus reliable for analyzing novel drug candidates [43][44][45]. Incorporating prior knowledge can also increase the explainability of supervised ML models, further reducing their "black box" effect [38].
Increasing the accuracy, generalizability, and explainability of supervised ML models using knowledge such as that concluded in this study will improve industry-wide VS processes by more reliably identifying ligand hits for inclusion in novel drug compounds [41,42].
The effect of improving ML models for effective VS are profound. It has already been demonstrated that for certain proteins such as Interleukin-1 receptor associated kinase-1 (IRAK1), ML models can increase novel ligand hit rates by over 1000% compared to standard scoring functions [46]. Using the information revealed in study to develop more accurate, generalizable, and explainable ML models can result in similar increases across wide ranges of proteins because models will be able to screen novel ligands without significant decreases in reliability. Using the conclusion of this study as well as others to develop more robust ML models is therefore critical for identifying promising drug candidates for innovative medicines.
It is significant to note that the discoveries of this study is useful in other scientific contexts, such as synthetic drug design. Using known information on fragments such as the two focused on in this study (CC and CCCN), synthetic ligands can be chemically designed to bind optimally to a target protein [42,43]. Computational tools (including, but not limited to, ML models) can also be developed to design novel synthetic drugs using known relationships between ligand fragments [44][45][46]. Gathering a clear, data-driven understanding of ligand fragment activity is a significant method by which synthetic drug design for new medications can be improved.
In industrial fields such as process chemistry, ligand fragment activity with target proteins can be used to direct enzyme evolution in biocatalytic reactions [52][53][54][55]. Biocatalysis involves the binding of a small molecule (ligand) and an enzyme (protein) to catalyze the chemical reaction of the small molecule [52]. It has been shown that ligand features can guide the development of metathesis catalysts and nitrile hydration catalysts, demonstrating the varied applications of ligand feature information [56][57][58].
Therefore, it is evident that the significant ligand features elucidated in this study have important applications to Virtual Screening research, synthetic drug design, and industrial processes such as biocatalysis.

LIMITATIONS AND FUTURE WORK
It is significant to note that there are limitations and directions for future work based on this study's methodology and results.
In this study, several arbitrary thresholds were chosen for the purpose of experimentation.
For example, it was chosen that complexes with feature counts above the 90th percentile or below the 10th percentile would be considered in determining that feature's influence on binding affinity. It was also chosen that the range of n values for the GAN-based simulative testing would be 2-10. Using statistical techniques to guide the decision of all thresholds may lead to slightly different results [75]. Further, employing additional statistical tests such as confidence intervals may increase the reliability of the results [77]. In addition, due to online availability and computational limits, only the PDB-Bind database was analyzed in this study. However, conducting the same methodology on different datasets may support or refute the results of this study [85].
There are several other directions for future work based on the methodology of this study.
For instance, other self-supervised and unsupervised learning algorithms such as Principal Component Analysis, t-Distributed Stochastic Neighbor Embedding, and Variational Autoencoders can be used to reveal significant feature-based influences on binding affinity [60][61][62][63]. Analyzing the relationship between features instead of just independent features' influence can also reveal significant chemical phenomena that influence binding affinity [64,65].
In addition to the methodology, the results of this study can lead to future work. For example, integrating the importance of CC and CCCN fragments as prior knowledge into ML models may lead to improved predictions of binding affinity [37,39,40,59]. Investigating the effect of these fragments on fragment-based drug design can lead to in-silico techniques that directly improve synthetic drug design [66,67]. Most importantly, utilizing methods such as 3D quantitative structure-activity relationships (3D-QSAR) will help expand on this study by 18 revealing the specific chemical reason why CC and CCCN fragments improve binding affinity [68][69][70]. Therefore, there are several exciting directions for future chemical research based on the methodology and conclusions of this study. 19

Authors' Contributions
The corresponding author of this study completed all experimentation described and wrote/reviewed/edited the manuscript independently.