Performances of ML predictors
Herein, 10-times repeated ML modeling was performed based on the randomly divided training (60%) and validation (40%) sets (Figure 4). It can be observed that most of the ML models have satisfied prediction performances on the training and validation datasets. In consideration of the accuracy and balanced performances on the validation set, the LR model was chosen as the optimal predictor, of which the means of AUC, MCC, Acc, Spe and Sen are 0.90±0.008, 0.80±0.015, 0.90±0.008, 0.92±0.007, 0.88±0.014 for training set, and 0.75±0.012, 0.50±0.025, 0.75±0.013, 0.77±0.023, 0.73±0.025 for the validation set, respectively.
Then, 5-fold cross-validation and an independent external test by using 644 samples were also performed. The results showed that the optimal LR model achieved excellent prediction performances, of which the Acc for the 5-fold cross-validation and the independent test are 0.78±0.047 and 0.86, respectively (Table S4 and Table 1). Therefore, it can be concluded that the resulting LR model is a good predictor of the caspase-6 inhibitors.
Table 1. The performance of the optimal LR model on the 644 test samples
Confusion matrix
|
Performance
|
CP
|
CN
|
Acc
|
Spe
|
Sen
|
MCC
|
Independent test set
|
PCP
|
102
|
49
|
0.86
|
0.90
|
0.71
|
0.60
|
PCN
|
42
|
451
|
CP: condition positive; CN: condition negative; PCP: predicted condition positive; PCN: predicted condition negative.
The generative RNN modeling
Herein, 2393029 SMILES strings derived from Pubchem database were used for pre-training the RNN models. Firstly, the effect of the number of GRU layers on the performance of the generative RNN model was investigated based on the network architecture shown in Figure 3. It can be seen that, after 14000 steps of iterations, the loss values of the RNN models with one, two and three GRU layers reach the state of convergence (Figure 5a). At the mean time, the valid percentages of 128 SMILES strings sampled by the 3 RNN models reach to 0.85, 0.90 and 0.95, respectively. Besides, no significant improvement in the valid percentage was observed for the RNN models with more than 3 GRU layers. Thus, the RNN model with 3 GRU layers was chosen for the following transfer learning.
In this paper, the 433 caspase-6 inhibitors in the training and validation sets (Table S2) were used for the transfer learning of the pre-trained RNN model. From Figure 5b, it can be observed that, after 200 steps of fine-tuning, the loss value tends to converge and the valid percentage of the sampled SMILES strings reached to 99%. In order to evaluate the performance of the refined RNN model in generating potential caspase-6 inhibitors, a retrospective study was performed by using the 144 caspase-6 inhibitors in the test dataset (Table S2), which the RNN model had never seen before. At first, a total of 50,000 valid SMILES strings were randomly sampled by the fine-tuned RNN model. After structural description by using RDKit toolkit, the 50,000 molecules were then predicted by the LR predictor. Based on the predicted positive samples, the recall value of the 144 caspase-6 inhibitors was finally calculated. As shown in Table 2, it can be seen that the percentage of the predicted positive samples maintains at a relatively high level during the whole sampling process. Also, it can be noticed that the recall value of the 144 caspase-6 inhibitors increases gradually from the lowest value of 2.08% to the highest value of 13.19% (Table 2). Accordingly, it can be concluded that the RNN model can generate efficiently the potential caspase-6 inhibitors after transfer learning. It should be noted that the relatively low recall value is mainly caused by the small sample size of the test caspase-6 inhibitors.
Table 2. The recall value of the 144 caspase-6 inhibitors
Sampling Process
|
I
|
II
|
III
|
IV
|
V
|
VI
|
VII
|
VIII
|
IX
|
X
|
No. of SMILES strings
|
1000
|
2000
|
3000
|
4000
|
5000
|
10000
|
20000
|
30000
|
40000
|
50000
|
The predicted positive samples (%)
|
76.0
|
72.7
|
71.4
|
70.7
|
70.6
|
69.3
|
67.1
|
66.2
|
65.5
|
65.0
|
Recall (%)
|
2.08
|
2.08
|
3.47
|
5.55
|
6.94
|
8.33
|
10.41
|
11.80
|
13.19
|
13.19
|
The distribution in chemical space of the potential caspase-6 inhibitors
According to Table 2, a total of 6927 strings (69.3%) were predicted as positive samples from the 10,000 SMILES strings generated. Herein, based on the properties of H-Bond acceptor/donor, rotatable bonds, aromatic/aliphatic cycles, heterocycle atoms and molecular weight, the distribution of the potential 6927 caspase-6 inhibitors was explored by using t-distributed stochastic neighbor embedding (t-SNE) method.
From Figure 6, it can be inferred that the generated 6927 potential inhibitors have the similar chemical space as the known 577 caspase-6 inhibitors. Herein, three small clusters of the samples were selected to explore the structural features in detail. For each cluster, it can be observed that the generated molecules have similar molecular scaffolds with the known caspase-6 inhibitors (Figure 6). The structural modification mainly involves substituent modification, scaffold hopping, and chiral transformation etc., which are also the major means in traditional drug design.
Molecular docking-based ligand screening
Before docking-based screening of the caspase-6 inhibitors, the protocol of Surflex-dock was firstly validated by re-docking a co-crystallized ligand Ac-VEID-CHO into the binding pocket of caspase-6 (PDB: 3OD5). The results showed that Surflex-dock can reproduce the native ligand binding conformation with a docking score of 7.67 (Figure S1).
Based on the docking results of the 577 known caspase-6 inhibitors and the potential 6927 positive samples, the occurrence frequencies of the residues involved in the intermolecular interactions with the 577 caspase-6 inhibitors and 6927 potential inhibitors were investigated respectively. From Figure 7a, it can be clearly seen that the distributions in the occurrence frequencies of the binding residues are quite similar between the two cases, especially for the binding residues with the occurrence frequencies larger than 50%. Therefore, it can be deduced that the potential 6927 inhibitors have similar binding mode with the known 577 caspase-6 inhibitors.
Furthermore, Surflex-dock method was employed for predicting and ranking the generated potential inhibitors. Herein, take example for three representative positive samples (ID: 96, 2470 and 3262) with different scaffolds to explore the feasibility of molecular docking-based ligand screening. The docking scores of the 3 positive samples are higher than 9.0 (-logKD), which indicate potential inhibitory activities at nanomolar level. As shown in Figure 7b, both of the sample 96 and 2470 can form strong H-bond interactions with Arg220, while sample 3262 form 3 H-bonds with Arg64, His121 and Gln161. For sample 2470 and 3262, strong π-cation interactions with Arg220 can be also observed. Recent researches have proved that Arg64, Gln161, and Arg220 are closely related with the substrate-specificity of caspase-6, and that His121 is a key catalytic residue for substrate hydrolysis.[1] Besides, all the 3 samples can form strong hydrophobic interactions with the hotspot residue Tyr217, Val261, Cys264 and Ala269. Collectively, the 3 potential caspase-6 inhibitors with nanomolar-level activities are promising candidates for the further researches.