Associating molecular properties with CDK2 inhibition
By analyzing the general physicochemical properties of compounds, we found no strong correlation between independent molecular features and the inhibition constant, pKi (Pearson’s correlation coefficient of up to 0.21). Across our datasets, both CDK2 inhibitors and non-inhibitors generally followed Lipinski's rule of five (RO5)22 and Veber's Rule23, reflecting an intrinsic bias in the screening libraries routinely used. Most of the active molecules evaluated had no more than 10 hydrogen bond acceptors, less than 5 hydrogen bond donors, octanol-water partition coefficient (log P) less than 5, no more than 10 rotatable bonds, polar surface area (TPSA) less than 140 Å2 (Figure S1).
Despite a modest correlation between inhibition strength and drug-likeness properties, some physicochemical properties did distinguish between CDK2 inhibitors and non-inhibitors. Potent CDK2 inhibitors had a lower log P (Figure S1 C) (p-value < 0.001, using a two-sample Kolmogorov − Smirnov test), indicating they are more hydrophilic and are more likely to be distributed in aqueous regions such as blood serum. Consistent with this observation, inhibitors also had a larger TPSA (Figure S1 E) (p-value < 0.001) compared to non-inhibitors, reflecting a potential to establish more interactions with kinases.
Molecular substructure mining
To further our understanding of the chemical landscape of known kinase inhibitors, we used molecular substructure mining to identify enriched chemical groups. Using the Molecular Substructure Miner (MoSS)24, we found two chemical fragments, sulfanilamide (16.2% support) and 2-(N-Anilino)pyrimidine (10.1% support), that occurred more frequently in CDK2 inhibitors compared to non-inhibitors (Figure S2). Atoms in these enriched groups include hydrogen bond donors and acceptors, and the two negatively charged oxygens in sulfonamide that can form electrostatic interactions with positively charged amino acids in the kinase. Additionally, the ring structures in the fragments can mimic the adenine component of ATP, important for competitive inhibitors.
Two enriched fragments, sulfanilamide (10.7% support) and 4-Amino-1,3-thiazole-5-carbaldehyde (8.6% support) were found in type I1/2 inhibitors (Fig. 2A). By searching molecules containing both fragments in the Protein Data Bank25, we found that both substructures can form hydrogen bonds with gatekeeper residues and the hinge region, respectively. While they occur together in type I1/2 inhibitors less frequently (4.1% support), these substructures never appear together in other types of inhibitors. This may suggest a distinctive binding mode of type I1/2 inhibitors, the coordinated interactions with the gatekeeper pocket and the hinge region caused by the αC-helix out and DFG in kinase conformation.
The enriched substructure (24.2% support) in type II inhibitors is composed of a 1-Phenylurea connected to a ring (Fig. 2B). The odds ratio is 64.7 compared to type I, and 41.6 compared to type I1/2, indicating confident enrichment. Urea can form a hydrogen bond donor-acceptor pair with the αC-helix and DFG motif, consistent with experimentally solved structures. The nitrogen atoms can establish hydrogen bonds with the glutamate side chain, which is conserved in αC-helix, while the carbonyl group can establish a hydrogen bond with the backbone amide of the aspartate in the DFG-motif. The benzene ring close to the donor nitrogen can form aromatic interactions with the gatekeeper residue in the kinase and a hydrophobic moiety (at the top right corner in Fig. 2B) accommodates it in the back pocket. Accordingly, the urea acts as a bridge between the two ring structures and the two pockets exposed by the DFG out and αC-helix out kinase conformation.
Substructure enrichment for type I and allosteric inhibitors was not thoroughly analysed. Type I inhibitors form stronger interactions with the hinge region similar to ATP, without having access to the back pocket and gatekeeper area (Fig. 2C). As both type I1/2 and type II inhibitors share common substructures capable of occupying the ATP binding site, no substructure was found exclusively in type I inhibitors. Additionally, the limited sample size for allosteric inhibitors (32 in 10-fold cross-validation, 15 in blind test) did not allow for an unbiased enrichment analysis.
Identifying CDK2 inhibitors
Our predictive model was trained using different supervised learning algorithms. The best performing algorithm, Extra Tree Classifier (M5P) with 23 features (identified via feature selection), was chosen. Table 1 shows the overall model performance. Although the dataset used is relatively unbalanced (595 non-inhibitors, 1040 inhibitors), the model still achieved high and consistent Matthew’s Correlation Coefficients (MCCs) on both 10-fold cross-validation (0.74) and independent blind test set (0.66). F1 score (0.91 on cross-validation, 0.88 on blind test) and AUC (0.86 on cross-validation, 0.84 on blind test) also demonstrated model robustness (Fig. 3). The performance metrics obtained via rigorous internal and external validation suggest potent CDK2 inhibitors can be correctly identified.
Table 1
Extra tree classifier performance for CDK2 inhibitor identification on training and blind test sets.
| MCC | F1 | AUC |
10-fold CV | 0.74 | 0.91 | 0.86 |
blind test | 0.66 | 0.88 | 0.84 |
To shed light into properties that can explain differences between CDK2 inhibitors and non-inhibitors, we conducted a two-sample Kolmogorov-Smirnov test on the feature set. Figure S3 A shows the top three features with the smallest p-values. Inhibitors tend to have higher partial charges and van der Waals surface area contributions (PEOE_VSA12 attribute), a higher frequency of sulfonamides, and more hydrogen bond donors (p-values < 0.001). These characteristics reveal different non-covalent interactions between enriched substructures (sulfanilamide and 2-(N-Anilino)pyrimidine) and CDK2, including electrostatic interactions, hydrogen bonds, and van der Waals forces which can stabilize favour inhibitor binding.
Compared to the deep learning models developed by Balachandar et al. on the same dataset, our classical machine learning algorithm has competitive performance. On the blind test, we achieved an AUC of 0.84, whereas Balachandar et al. achieved an AUC of 0.738. lthough the performance results are not directly comparable since the training and test set splits are different, our model does demonstrate satisfactory generalization. The small score difference between 10-fold cross-validation (0.86) and blind test (0.84) provides further confidence in model robustness. Additionally, by investigating both the significant features and enriched substructures, we inferred discriminative physicochemical properties of potent inhibitors and discussed their biological significance. In contrast, no relevant biochemical insight was drawn from previous works2, as features were encoded as bit strings to accommodate deep learning architectures, which are not explainable. Therefore, our model does not only have competitive prediction performance but also contributes to the detection of novel scaffolds among potent inhibitors and shed light into their potential mode of action.
Predicting CDK2 ligand-kinase inhibition constant (pKi)
By predicting the pKi values of small molecules, the inhibition strength can be quantified. A Random Forest Regressor (RF) with 22 features was trained and validated. Table 2 shows the overall model performance. We obtained a Pearson’s correlation coefficient of 0.76 (RMSE of 0.62) on 10-fold cross-validation, and 0.68 (RMSE of 0.65) on an independent blind test set. The consistent performance between internal and external validation indicates model generalisation. After removing 10% of outliers, Pearson’s correlation coefficients increased to 0.87 on cross-validation and 0.78 on blind test (Fig. 4). Here, no enriched substructures were observed exclusively in outlier molecules, indicating their structural diversity.
Table 2
Random Forest regressor performance on pKi prediction.
| Pearson | Spearman | Kendall | MSE | RMSE |
10-fold CV | 0.76 | 0.71 | 0.56 | 0.39 | 0.62 |
blind test | 0.68 | 0.59 | 0.45 | 0.43 | 0.65 |
By dividing the molecules into two groups with the cut-off pKi value of 6, we were able to compare the physicochemical differences between the defined potent inhibitors (pKi ≥ 6) and non-inhibitors (pKi < 6) in cell-based assays using the two-sample Kolmogorov-Smirnov test. Figure S3B depicts three significant features (p-values < 0.001) discriminating molecules with a high binding affinity (pKi ≥ 6). These features were consistent with those identified previously.
While the regression model with the pKi cut-off could also potentially be useful for classification purposes, in general continuous labels can have higher variance compared to discrete classes, and may lead to poor classification performance.
To test this assumption, we converted the predicted and the true pKi values to inhibitor and non-inhibitor classes. As expected, after the label transformation, the model achieved lower MCC (0.64 on cross-validation, 0.57 on blind test) compared to our dedicated CDK2 inhibitor classification model (0.74 on cross-validation, 0.66 on blind test). Accordingly, rather than being a substitute for the classification model, our regression model can serve as a tool to quantify and rank inhibition strength in addition to inhibitor identification.
Classifying different types of kinase inhibitors
The dataset for classification of inhibitor type is highly unbalanced (1425 type I, 394 type I1/2, 190 type II, and 47 allosteric inhibitors), which significantly increases the challenges of identifying the minority classes. However, our model was able to distinguish type II inhibitors from type I1/2 inhibitors, despite their smaller sample sizes. As shown in Table 3, the type I1/2 versus type II classifier achieved MCCs of 0.80 on cross-validation and 0.73 on blind test sets. Additionally, it also achieved the highest AUC with 0.91 on the blind test set (Figure S4). The method has also identified allosteric inhibitors effectively, with a MCC of 0.68 on cross-validation and 0.63 on blind test.
Table 3
Performance of the inhibitor type classification model on training and blind test sets.
Classifier | Metric | kinCSM cross validation | kinCSM blind test | Miljkovic et al.10blind test |
Type I versus II | F1 | 0.73 | 0.64 | 0.71 (± 0.03) |
BACC | 0.80 | 0.74 | 0.78 (± 0.02) |
MCC | 0.73 | 0.65 | 0.70 (± 0.04) |
Type I versus I1/2 | F1 | 0.54 | 0.43 | 0.58 (± 0.04) |
BACC | 0.69 | 0.64 | 0.74 (± 0.02) |
MCC | 0.50 | 0.41 | 0.47 (± 0.05) |
Type I1/2 versus II | F1 | 0.87 | 0.82 | 0.77 (± 0.03) |
BACC | 0.90 | 0.88 | 0.82 (± 0.02) |
MCC | 0.80 | 0.73 | 0.69 (± 0.03) |
Allosteric or not | F1 | 0.64 | 0.57 | 0.36 (± 0.18) |
BACC | 0.73 | 0.70 | 0.63 (± 0.07) |
MCC | 0.68 | 0.63 | 0.48 (± 0.09) |
Compared to the best machine learning model developed by Miljković et al.10 was validated on a randomly generated external blind test set, our model achieved higher MCCs in identifying allosteric inhibitors and distinguishing type I1/2 and II inhibitors even when the blind test set presents low similarity with the training set (Table 3). This means our model has a better generalisation for unseen data when the sample size is limited and unbalanced.
Another challenge for this task was to do with the molecular structures of the three ATP competitive inhibitor types, which can be modelled as a continuum instead of distinct categories as the kinase conformation they bind changes in a stepwise manner26. Type I inhibitors bind to the DFG-in, αC-helix in conformation, then the movement of αC-helix (DFG-in, αC-helix out) allows binding of type I1/2 inhibitors, and lastly, the DFG-out, αC-helix out conformation is recognized by type II inhibitors. Two selected machine learning features (fluorine and hydrophobe counts) demonstrate this continuum (Figure S5 A and B). The distributions of type I1/2 inhibitors can be visualized as a mixture of type I and type II inhibitors, biased towards type I. This may suggest that the shared substructures between types of binding mode can affect model performance.
Being positioned in the middle of the continuum, type I1/2 becomes the most challenging class, even though it has an adequate sample size. The type I versus type I1/2 classifier achieved the lowest performance (MCC of 0.50 on cross-validation, and 0.41 on blind test, shown in Table 3). After integrating the prediction outcomes from the four binary classification models, a large proportion of type I1/2 inhibitors were wrongly classified as type I inhibitors (Figure S6). One possible reason is that type I1/2 inhibitors share a larger proportion of common substructures with type I inhibitors in comparison with type II inhibitors. Although type I1/2 inhibitors can form interactions with residues in the gatekeeper pocket, making them distinguishable from type I, this characteristic may not be captured by our model. Rather, their strong affinity with the hinge region lead to similar physicochemical properties (e.g., low lop P as shown in Figure S5 C) as type I inhibitors. Nevertheless, our model does capture features capable of distinguishing type I1/2 inhibitors from others (e.g., higher frequency of nitrogen-containing functional groups attached to aromatics, as shown in Figure S5 D - Welch two-sample t-test p-values < 0.001 compared to type I and II).
Although type II inhibitors have a distinctive characteristic (back pocket access), larger sample size still causes biased predictions towards type I inhibitors (Figure S6). The type I versus II classifier achieved MCC of 0.73 and 0.65 for 10-fold cross-validation and blind test respectively (Table 3). However, insightful features were captured by our model. Type II inhibitors have higher log P (p-values < 0.001 compared to type I and I1/2) as shown in Figure S5 C, which means they are more hydrophobic. This is caused by their special interactions with the kinase hydrophobic back pocket. Additionally, fluorine and urea occur more frequently in type II inhibitors (p-values < 0.001, Figure S5 A and E). This may suggest both of them can contribute with interactions with the back pocket.
kinCSM Web Server
kinCSM has been made freely available through an easy-to-use web interface at http://biosig.unimelb.edu.au/kin_csm/. Users can identify CDK2 inhibitors, predict CDK2 pKi and possible binding modes irrespective of kinases by providing a single molecule or a list of molecules as SMILES strings (Fig. 5).