Activity Cliffs As Protein-Related Phenomenon: Investigation Using Machine Learning Against Numerous Protein Kinases

Activity cliffs (ACs) are analogous compounds of signicant anity discrepancies against certain biotarget. We propose that the ACs phenomenon is protein-related and that the propensity of certain target to have ACs can be predicted by some intrinsic protein properties. We pursued this assumption by collecting the crystallographic structures of 84 protein kinases, each of which has numerous reported inhibitors (hundreds). Following data augmentation using synthetic minority oversampling technique (SMOTE), we attempted to correlate the presence/absence of ACs within the ligand pools of collected protein kinases with their corresponding protein properties using genetic algorithm (GA) coupled with variety of machine learners (MLs). Very good GA-ML models were achieved with accuracies of around 75% against external testing set. The models were further validated by Y-scrambling. Shapely additive explanations highlighted the signicance of protein rotatable bonds, hydrophobic and acidic residues in relation to the presence of ACs. These results support the hypothesis that ACs are protein-related.


Introduction
Activity cliffs (ACs) are de ned as pairs of closely analogous compounds that exhibit signi cant a nity differences against certain biotarget [1]. The widespread occurrence of ACs within SAR data [2] makes it imperative for modern computer-aided drug design and discovery to successfully handle this phenomenon [3][4][5][6][7][8][9]. Nevertheless, ACs represent signi cant hurdles for bioactivity-supervised discovery methods that assume smooth and continuous structure-activity relationships [10].
ACs are generally explained to exist due to subtle local differences between active and inactive pairs in their ligand-receptor enthalpic 3D contacts [15]. In this context, machine learning methods (e.g., random forests, support vector machine models and particle swarm optimization) were used to identify and predict activity cliff pairs through ligand descriptor patterns [2,[11][12][13] or target-speci c pharmacophore constraints [14]. Additionally, free energy perturbation (FEP) and molecular dynamics were also used to explain ACs [16][17][18][19][20], nonetheless with a nity prediction errors [21].
Upon careful evaluation of a single class of enzymes, namely, protein kinases, we noticed that some kinases showcase many ACs, while others seem to be resistant to this phenomenon despite numerous (hundreds) reported inhibitors. This observation prompted us to propose that the existence of ACs is target protein-related. However, since all protein kinases share analogous ATP catalytic sites, which are normally targeted by designed inhibitors, we propose that the propensity of having ACs is related to the whole protein matrix rather than the binding site only.
To pursue this assumption, we focused on protein kinases of high-resolution crystallographic structures and numerous reported inhibitors. We then determined the number of ACs within the inhibitors' population of each protein. Subsequently, we scanned numerous machine learners (MLs) to assess the possibility of linking protein properties with the existence and/or absence of ACs within the corresponding ligands population. Moreover, we applied genetic algorithm [22] to identify the most probable protein descriptors that control the ACs phenomenon. We subsequently evaluated the relative in uence of signi cant protein descriptors using Shapely additive explanations [23].

Data Collection
The crystallographic structures of protein kinases (kinase domains) were collected from the protein databank. Wild type proteins were given priority over mutants. In case that certain protein is represented by several entries in protein databank, we opted for the one that combines the best possible resolution and longest complete (uncut) peptide chain. Subsequently, ChEMBL database was queried to collect inhibitors for each protein kinase in the list. However, only inhibitors that have their bioactivities reported in Ki format were collected for subsequent processing to minimize errors resulting from bioassay discrepancies among different labs. Duplicate structures and compounds that have their bioactivities reported as being more or less than certain values were discarded. This left us with 84 proteins, each associated with list of inhibitors ranging from 393 to 2514.
AC pairs were identi ed in each pool of inhibitors using the "Find Activity Cliffs" protocol implemented in Discovery Studio 4.5 (Biovia Inc., USA). This protocol performs Matched Molecular Pairs (MMPs) transformations to nd activity cliffs [24,25]. MMPs are pairs of ligands that differ by a single localized structural change. MMPs with activity difference threshold exceeding 100 folds (i.e., 2 log cycles) were considered ACs provided that the active AC member is of Ki value ≤ 100 nM. Table 1 shows the selected proteins, their resolutions, count of collected of inhibitors, and count of AC pairs. The actual data (protein structures and corresponding ligands) is found in the supplementary materials.

Protein Descriptors
Each protein (kinase domain) was downloaded from the protein databank. Hydration water and cocrystallized ligands were removed. Hydrogen atoms were added utilizing the Discovery Studio 4.5 hydrogen atoms template. The following descriptors were determined for each protein using Discovery Studio 4.5 working environment: Count of non-hydrogen atoms, count of intramolecular hydrogen bonds (allowing hydrogen-bonds to exist over a maximum distance of 3.5 Å); count of intramolecular π-π stacking interactions; count of intramolecular bumps (pair of atoms at close proximity such that they violate each other VDW spheres by at least 30% without being covalently bonded); count of disul de linkages; and count of rotatable bonds. The normalized versions of these descriptors (calculated by dividing each descriptor by the count of non-hydrogen atoms within the particular protein) were also included. Additionally, the counts of each individual amino acid (i.e., Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, and Val), count of basic amino acids (summation of Lys, Arg and His residues), count of acidic amino acids (summation of Glu and Asp residues); count of hydrophilic amino acids (summation of Arg, Asn, Asp, Gln, Glu, His and Lys residues); count hydrophobic amino acids (summation of Ala, Cys, Ile, Leu, Met, Phe, Val) together with their normalized versions (calculated by division by the total count of amino acids of each protein) were also included as protein descriptors.

Machine Learning
For subsequent ML scanning, the protein descriptors were employed as explanatory variables, while the response variable was de ned as binary code based on the presence or absence of ACs, such that if the list of inhibitors for a particular protein is devoid of ACs, it is given response label of "Zero", while the presence of at least one pair of ACs warranted a response label of "One".
The collected kinase data ( Table 1) was divided into training and testing sets by ascending ranking of the collected proteins according to their count of intramolecular hydrogen bonds (arbitrary), then each fth protein was moved from the list to the testing set (16 proteins, marked with asterisks in table 1 including 6 devoid-of-ACs and 10 showing-ACs). The remaining proteins were retained as training set (68 proteins including 26 devoid-of-ACs and 42 showing-ACs).
However, since the number of collected protein kinases is inadequate for e cient ML modelling we decided to augment the training and testing lists, separately, using Synthetic Minority Oversampling Technique (SMOTE) [26] as implemented in the SMOTE KNIME (version 4.3.3) node. The algorithm works by creating synthetic rows by extrapolating between a real object of a given class and one of its nearest neighbors (of the same class). It then picks a point along the line between these two objects and determines the attributes (cell values) of the new object based on this randomly chosen point. We implemented the following parameters: Select a maximum of 5 neighbors and oversampling by 3 folds.
The resulting SMOTE-enhanced training list includes 272 observations (rows), of which 104 are labeled as "devoid of ACs" while 168 are labeled as "showing ACs". Similarly, the SMOTE-enhanced testing set includes 64 observations, of which 24 are labeled as "devoid of ACs" and 40 as "showing ACs".

Screening Machine Learners (MLs)
Initially, 10 MLs were scanned using all protein descriptors as explanatory variables, while the response variable was set to the presence or absence of ACs. The screened learners were: K-Star [27], Locally weighted learning (LWL) [28,29], Logistic Model Trees (LMT) [30,31], LogitBoost [32], Support Vector Machine (SVM) [33,34], Naïve Bayes [35], Random Forests [36], Probabilistic Neural Network (PNN) [37,38], Xgboost [39] and k-nearest neighbors (k-NN) [40]. The MLs were implemented as corresponding where N is the number of all observations in the training list, TP and TN are the numbers of truly identi ed proteins as "showing ACs" and "devoid of ACs", respectively, by the particular ML. P o is the relative observed accuracy (i.e., agreement among raters), and P e is the probability of chance agreement (hypothetical). This is done by using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by P e ), κ = 0.
Negative Cohen's kappa value implies the agreement is worse than random [46].

Genetic algorithm (GA) for descriptor selection
High-ranking MLs were subsequently individually combined with GA to select optimal groups of protein descriptors capable of achieving best Cohen's Kappa values in classifying the training set implementing leave-20%-out cross-validation. The GA work ow within KNIME Analytics Platform (Version 4.3.3) was implemented for this purpose.
GA operates through a cycle of four stages [22]: (i) Encoding mechanism; (ii) De nition of a tness function; (iii) Creating a population of chromosomes; (iv) Chromosomes Genetic manipulation. A genebased encoding system is implemented herein, whereby the presence or absences of a certain descriptor(s) in a particular suggested model is encoded by gene (chromosome) format, i.e., each value in the gene string represents an independent variable (0 = absent, 1 = present). Each chromosome is associated with a tness value that re ects how good it is compared to other solutions. Genetic manipulation involves mating among successful chromosomes and mutation of some genes within randomly selected chromosomes to help exit local minima. GA control parameters that need to be con gured prior to modeling are: the population of initial random chromosomes, the maximum number of generations to exit from a basic cycle and tness criterion [22]. In the current project, these were con gured as follows: population size = 500, maximum number of generations = 10000 and the tness criterion was set to Cohen's Kappa value of the ML model resulting from features selected by each genetic chromosome.

Shapley Additive Explanations (SHAP)
Shapley additive explanation (SHAP) of a particular feature (protein descriptor) indicates how much the feature has contributed to the deviation of the prediction of certain observation from the base prediction (i.e., the mean prediction over the full sampling data) [23]. The SHAP node within KNIME Analytics Platform (Version 4.3.3) was implemented using K-means to summarize the data of the evaluated feature before forming coalition among remaining features.

Data Collection
In order to correlate proteins' properties with propensities of having ACs, it was necessary to gather accurate ligand binding data for the collected protein kinases. Duplicate molecules and molecules of inaccurately reported structural information (e.g., unde ned stereochemistry) were removed. Additionally, we limited the collected inhibitors to molecules that have their bioactivities expressed as Ki values to minimize the effects of inter-laboratory assay variations commonly seen with other bioactivity indicators (e.g., IC 50 ) [47]. Moreover, ACs were strictly de ned as MMPs of bioactivity difference exceeding 100 folds provided that potent ACs members are of Ki values below 100 nM. On the proteins' side; only wide type protein kinases were considered, and if certain protein is represented by several entries in protein data bank, we opted for the one that combines best possible resolution and longest complete (uncut) peptide chain.
However, since the presence/absence of ACs can be a function of the explored chemical space of the particular protein kinase, it can be argued that absence of ACs for certain protein kinase is related to the limited medicinal chemistry efforts against this target rather than due to certain intrinsic factors related to the protein kinase itself. We adopted two measures to remedy this issue. Firstly, we only collected protein kinases of substantial number of reported ligands (Table 1). However, to propose a reasonable minimal number of ligands for a particular kinase to be included in the modeled list (i.e., Table 1) we looked at protein kinases that combine the highest counts of reported inhibitors and least number of ACs. Two kinases met these criteria, namely, LCK (1818 inhibitors and 3 ACs) and FYN (1643 inhibitors and 6 ACs) with ACs-to-ligands ratios of 1 in 606 and 1 in 274, respectively. Their average ratio is 1 in ca. 440. Accordingly, based on this ratio, it can be reasonably assumed that if the binding space of a particular protein kinase has been explored by ≥ 440 ligands without identifying any ACs then it is likely that the inhibitors space of this target is free from ACs. Based on this plausible assumption we collected protein kinases of at least 440 reported inhibitors. Still, we noticed 5 protein kinases of inhibitors' counts ranging from 393 to 437 (i.e., FES, BRK, ACK1, PYK2, and AXL, table 1), two of which have established ACs (BRK and ACK1) and thus can be safely included for ML. However, the other three are devoid of reported ACs, still we included them for ML because they have inhibitors' counts close to the proposed threshold (i.e., 440 inhibitors), i.e., PYK2 (428 inhibitors), AXL (437 inhibitors) and FES (393 inhibitors). Additionally, these are rather limited in number (just three cases) and therefore should have limited error contribution in ML (if any).
The second measure we took to minimize errors related to limited medicinal chemistry efforts is to squash ACs counts of each protein kinase into a binary response of either "having ACs" (given binary code of one) or "devoid of ACs" (given binary code of zero). This is done to emphasize that the mere presence of even a single valid AC indicates that protein is vulnerable to other ACs and that counts of reported ACs are irrelevant (in fact all our attempts to correlate the counts of ACs with protein properties were futile). However, although current absence of ACs for certain target does not guarantee that there will be no ACs upon future medicinal chemistry exploration of this target, still, targets labeled as being "devoid of ACs" with ligands counts exceeding 440 can be reasonably considered as being resistant to the ACs phenomenon. Accordingly, the binary label of being "showing ACs" or "devoid of ACs" is really surrogate for "AC-vulnerable" or "AC-resistant" targets. This conclusion is underlined in gure 1. Figure 1 shows the counts of "devoid of AC" and "showing AC" protein kinases as function of the counts of their reported inhibitors in ChEMBL database. Clearly from the graph the "ACs-devoid" category supersedes the "AC-showing" category in the rst two intervals, i.e., 390-700. However, protein kinases of higher ligands' counts tend to incline towards the "AC-showing" class despite the presence of signi cant "AC-devoid" minority. In the highest ligands' counts interval, i.e., 1301-2514, all 8 collected protein kinases showed ACs among their ligands. Nonetheless, even in this interval, two protein kinases were rather resistant to ACs, i.e., LCK (3 ACs out of 1818 inhibitors reported inhibitors) and FYN (6 ACs out of 1643 inhibitors). Conclusions from gure 1 provide impetus for our proposition that the presence or absence of ACs, within certain protein kinase binders, points to the level of resistance/vulnerability of that target to ACs. However, since it is hard to set certain numerical threshold for the number of ACs to consider a particular protein kinase as being AC-resistant or AC-vulnerable, our use of "devoid of AC" and "showing AC" labels is rather reasonable surrogate of AC resistance/vulnerability.

Machine Learning
To evaluate the relevance of protein descriptors to the propensity of exhibiting ACs, we initially t-tested the difference between protein kinases "showing-ACs" versus those "devoid-of-ACs" with respect to calculated protein descriptors. Four descriptors showed signi cant difference (p < 0.05) between the two groups, namely; normalized count of hydrogen bonds (N/H-bonds), normalized count of basic residues (N/basic amino acids), normalized count of hydrophobic residues (N/Hydrophobic amino acids) and normalized count of rotatable bonds (N/Rotatable bonds). Although the number of signi cantly different descriptors between the two protein kinase categories is rather limited (only four), it should not preclude the possibility of signi cant differences resulting from interactions between different descriptors warranting subsequent evaluation by MLs.
ML studies commenced by splitting the collected kinases into training and testing sets (as in table 1, testing compounds are marked with asterisks). However, due to the limited number of protein kinases, i.e., for effective machine learning, we opted to augment the training and testing lists, separately, using Synthetic Minority Oversampling Technique (SMOTE) [26]. However, SMOTE was used to augment both classes "Showing-ACs" or "Devoid-of-ACs" (i.e., not only the minority class) to yield an overall 272 training observations of which 104 are labeled as "devoid of ACs" while 168 are labeled as "showing ACs". The SMOTE-enhanced testing set included 64 observations, of which 24 are labeled as "devoid of ACs" and 40 as "showing ACs". Subsequently, we scanned ten ML algorithms (see experimental section) implementing "Showing-ACs" or "Devoid-of-ACs" as response labels, while all calculated protein descriptors were used as independent variables. However, seven MLs yielded accuracies exceeding 70% (based on leave-20%-cross validation) warranting subsequent combination with GA to identify the best predictive descriptors. Two MLs succeeded in maintaining signi cant accuracies and Cohen's Kappa values despite GA-driven feature reduction, namely; K-star (K*) and XGBoost (XGB). Table 2 summarizes the statistical criteria of the best models.
The K-star algorithm (K*) is an instance-based classi er that uses entropy-based distance function [48].
Classi cation with K* is made by summing the probabilities from the new instance to all of the members of a category [27]. On the other hand, eXtreme Gradient Boosting (XGBoost, or XGB) is a tree-based standardized ensemble method that relies on an ensemble of weak decision tree (DT)-type models to create new subsequent boosted DT-type models of a reduced loss function. XGB system includes a tree learning algorithm, a theoretically justi ed weighted quantile sketch procedure with parallel and distributed computing [49]. Table 2 suggest that it is possible to predict the propensity of ACs for certain protein kinase by applying few protein descriptors in certain ML model(s). Cohen's Kappa values of SMOTE-enhanced observations in table 2 position the corresponding ML models within moderate to substantial interrater reliability.

Accuracies and Cohen' Kappa values in
However, to exclude the possibility of chance correlation, we decided to reevaluate the models using the original unaugmented training and testing sets and to test the validity of the models using Y-scrambling [50]. The results are summarized in table 2. Interestingly, both ML models maintained successful statistical criteria even upon using the original unaugmented data.
In Y-scrambling, 100 random bioactivity data were generated based on either the SMOTE-enhanced training set or the original unaugmented training set. Subsequently, each successful learner and corresponding features were challenged to use these random data to generate ML models of equal or better accuracies compared to the original nonrandomized data, as judged by Leave-20%-Out crossvalidations [50]. Table 2 summarizes the results of Y-scrambling while supporting Tables S1 to S4 show the results in details. The results show that in both successful MLs, the non-randomized data yielded models of signi cantly superior accuracy and Cohen's Kappa values compared to all randomized trials. The effect is particularly evident in Cohen's Kappa values. Overall, the results support the validity of our ML models. Table 2 points to interesting facts: (i) Most of the signi cant descriptors, from both models, are normalized suggesting that protein size is not of signi cance with regard to ACs. (ii) The two ML models emphasize the signi cance of two classes of amino acid residues, namely, acidic and hydrophobic amino acids. Acidic amino acids are represented by N/Asp, N/Glu and N/Acidic amino acids, while hydrophobic groups are represented by N/Hydrophobic amino acids and N/Val. (iii) K* ML model uniquely emphasizes on the signi cance of N/Rotatable bonds, while the XGB model underlines the in uence of hydrogen bonds count on the propensity of having ACs.
Emergence of N/Hydrophobic, N/Rotatable and N/H-bonds agrees with the t-test results mentioned earlier. Strangely though, N/basic failed to emerge in any of the top models despite being signi cantly different between the two protein kinases (Showing and Devoid of-ACs) according to t-test. It appears that the interaction between different descriptors in the optimal ML models in table 2 overshadowed the effects of N/basic.

Signi cance of Individual Features
Although ML-models in table 2 have reasonable accuracies and Cohen's Kappa values, it is hard to infer the role contributed by each protein feature in predicting the class label of a particular testing protein kinase. For example, it is hard to tell how N/Rotatable bonds contributes to the "Showing ACs" class within the context of GA-K* ML model (table 2). Therefore, we decided to implement SHapley Additive exPlanations (SHAP) values to explain the relative contributions of individual descriptors in predicting class labels [23].
SHAP values originated in game theory, however, within the context of machine learning they are useful for explaining model predictions. The SHAP approach enables prioritization of features that determine the predicted classi cation of certain observation using any ML model. The SHAP value of certain feature for a certain observation indicates how much this feature (percentage wise) has contributed to the deviation of the prediction from the base prediction (i.e., the mean prediction over the full sampling data) for this observation. SHAP of all features should add up to the difference between the mean prediction and the actual prediction for certain testing observation. Figure 2 shows the average SHAP values for GA-selected descriptors in the optimal K* and XGB models. Each average SHAP was calculated as the mean of SHAP values of the particular descriptor across "showing ACs" or "devoid of ACs" SMOTE-enhanced testing compounds. Clearly from the gure, the selected descriptors exhibited SHAP probability contributions consistent with the class categories of testing observations, i.e., they yielded positive probabilities towards the "Showing ACs" label for protein kinases that have ACs, while they also gave positive probabilities for "Not showing ACs" label for protein kinases that do not show ACs.
It can be concluded from Figure 2 that the most signi cant contributor to probability predictions in the GA-K* model is the normalized number of rotatable bonds (N/Rotatable bonds). On the other hand, the most signi cant contributor to probability predictions in the GA-XGB model is the normalized number of aspartic acid moieties (N/Asp), while the normalized number of hydrogen bonds and hydrophobic groups (N/H-Bonds and N/Hydrophobic groups, respectively) come second.
Another interesting conclusion from Figure 2 is the orthogonality of the two models, as can be deduced from the differing probability contributions of descriptors among the two models suggesting the possibility of stacking the two ML models in a meta-learning model (e.g., consensus voting) [51].

Discussion
Emergence of N/Rotatable bonds, N/H-Bonds, H/Hydrophobic and N/acidic amino acids as major contributors in our successful GA-ML models probably encodes for signi cant role played by protein exibility in the propensity of having ACs: Presence of extensive hydrogen bonding, hydrophobic and/or electrostatic networks within certain protein matrix is suggestive of conformational rigidity, while on the other hand, numerous rotatable bonds suggest tendency towards molecular exibility [52][53][54].
Reports of some protein receptors to undergo drastic conformational rearrangements upon binding to certain ligand(s) while other analogous ligand(s) fail to cause similar response [55][56][57][58][59] prompted us to assume that this trend is related to our ndings, namely, the connection between protein exibility and the propensity of having ACs.
Interestingly, signi cant conformational modi cations in the receptor usually accompany entropy-driven ligand binding while enthalpy-driven binding involves only slight (or even no) associated receptor conformational modi cations [55,56,[60][61][62]. For example, the close potent analogues in table 3 [60] exhibit signi cantly discrete conformational in uences on their target protein in such a way that the entropic binder causes signi cant conformational rearrangement in the target protein, while its enthalpic twin causes only subtle protein modi cations. In this context, our ndings of connecting protein exibility with the propensity of having ACs, combined with the fact that activity cliffs have nearly identical 3D binding features lead us to theorize that AC twins differ in that one member, i.e., the more potent one, binds to the receptor and induces signi cant entropically-driven conformational modi cations in the host protein causing additional enthalpic binding interactions. The less potent AC member, on the other hand, is enthalpically-driven binder that fails to induce signi cant conformational modi cations in the protein receptor, and therefore fails to recognize additional hidden binding interactions [63]. This assumption is not without precedence. For example, Table 4 shows the thermodynamics and a nities of activity cliff pair against carbonic anhydrase (isoform CA-XIII). Clearly, the potent twin has entropic and enthalpic advantages over the inactive one [62]. Therefore, the existence of ACs should correlate with protein exibility, while more rigid receptors should me be immune to the ACs phenomena.

Conclusions
This study and related conclusions are of immense signi cance in the elds of drug design and discovery because it suggests that it is possible to predict whether a drug discovery endeavor will face signi cant ACs-related issues or not a priori.
Declarations Figure 1 Counts of protein kinases classes ("devoid of AC" and "Showing AC") as function of number of reported inhibitors in ChEMBL database.