3.1. Dataset curation and feature selection
A total of 2854 compounds comprising in vitro activity data against RET protein were curated from the BindingDB repository for ML model development. Initially, these molecules along with external validation dataset were standardized using ChemSAR standardization toolkit. All the compounds were found to possess satisfactory stereochemical properties.
Notably, about 5358 features including 1875 molecular descriptors and 3483 bits of fingerprints were generated for each compound using PaDEL Descriptor. The average molecular weight of the compounds in training set varied between 6.195 g/mol and 12.368 g/mol indicating the drug likeliness of the compounds. While, the average molecular weight in the external test set varied between 7.632 g/mol and 12.335 g/mol. This depicts that both the training set and external validation set shared a similar chemical feature. Thus, the chemical space of external validation dataset lies within the scope of the training set.
To enhance the prediction abilities of the machine learning models, highly pertinent features were filtered out using feature extraction technique. Primarily, around 1321 features with a sum of zero were removed from the dataset. Later, 1061 features with variance less than 0.01 and 2919 features with correlation greater than 0.99 were removed during the study to prevent redundancy. Ultimately after pre-processing, 56 features containing 30 descriptors and 26 fingerprints were identified. Finally, the RFECV strategy was implemented to identify highly significant and relevant features. A subset of 29 features were chosen based on the RFE scoring function for further processing and model generation. It is evident from Fig. S1 that the selected 29 features have the probability of achieving predictive accuracy of 0.95 during 10-fold cross validation. Similarly, variable importance plot of the features selected using RFECV was also estimated and the results are demonstrated in Fig. 1. From the image, it is obvious that the descriptors MATS1c, AATS4e and ATSC0c were identified as highly important while GATS2c, AATS6e, ATSC3c and FP27 found to be least important among the selected features. It is certain that ML model built with these resultant features would discriminate the actives from inactive molecules with high precision.
3.2. Performance evaluation of machine learning models
Seven machine learning classifiers were generated using the selected molecular descriptors and fingerprints in the current investigation. Herein, the parameter tuning process of the ML models were executed using grid search 10-fold cross validation method. The models prior and after parameter tuning were evaluated using different performance metrices and the results are given in Table 1.
Table 1
Performance metrices of machine learning models prior and after parameter tuning.
S. No. | Machine Learning models | Prior Parameter Tuning | After Parameter Tuning |
Accuracy | Precision | Recall | F1 score | Accuracy | Precision | Recall | F1 score |
1 | Logistic regression | 0.652 | 0.622 | 0.758 | 0.683 | 0.908 | 0.906 | 0.913 | 0.91 |
2 | Random forest | 0.649 | 0.637 | 0.72 | 0.676 | 0.889 | 0.894 | 0.885 | 0.89 |
3 | Support Vector Machine | 0.668 | 0.607 | 0.809 | 0.694 | 0.912 | 0.913 | 0.914 | 0.913 |
4 | Naïve bayes | 0.517 | 0.508 | 0.926 | 0.656 | 0.909 | 0.91 | 0.911 | 0.91 |
5 | Gradient Boosting | 0.638 | 0.651 | 0.739 | 0.692 | 0.902 | 0.903 | 0.903 | 0.903 |
6 | Decision tree | 0.642 | 0.605 | 0.669 | 0.635 | 0.77 | 0.78 | 0.75 | 0.765 |
7 | Neural network | 0.656 | 0.612 | 0.718 | 0.66 | 0.872 | 0.859 | 0.892 | 0.876 |
It is evident from the Table 1 that the accuracy of models prior to parameter tuning ranged from 0.517 to 0.668. Notably, SVM achieved the highest accuracy while NB achieved the least accuracy during our analysis. In general, precision and recall indicate the capability of the models to classify the cases into true positive and negative respectively. Higher values of these performance metrices indicates the classifying capability and reliability of the models [34]. In the present study, the precision and recall score of the ML models ranged from 0.508 to 0.651 and 0.669 to 0.926, respectively. Moreover, scientific evidences suggest that F1 score is found to be a crucial parameter and served as a benchmark metric during the model selection process in many instances [35]. Hence, the F1 score of the models were determined and found to lie between 0.635 and 0.692.
On the other hand, the accuracy of models after parameter tuning are LR (0.908), RF (0.889), SVM (0.912), NB (0.909), XGB (0.902), DT (0.770) and NN (0.872) respectively. It is to be noted that parameter tuning increased the accuracy of all the ML models drastically during the analysis. For instance, the accuracy of SVM was raised from 0.668 to 0.912 after parameter tuning process. Similar pattern of rise in the values were also observed in other performance metrices. In specific, F1 score of SVM model hiked from 0.694 to 0.913 after tuning process. In addition, ROC curve was also plotted to demonstrate the ability of the classifier system to differentiate between inhibitors and non-inhibitors (Fig. S2). The results show that SVM model prior and after parameter tuning achieved the highest AUC of 0.81 and 0.95 respectively than the other investigated models. The modest performance of SVM than other models might be due to its capability to handle high dimension data and implementation of kernel-based technique to overcome complex problems including over fitting.
3.3. Cross and external validation of models
Five- and ten-fold CV were employed to evaluate the stratifying capability of the generated models after parameter tuning. The performance metrices of the ML classifiers with cross validation are represented in Fig. 2 and Table S2. It is to be emphasised that no discernible variation in accuracy was observed for any of the generated ML models during the five-fold cross-validation process. However, 10-fold CV improved the accuracy of DT and NB to more than 0.85. The improvement in accuracy may be attributable to the elimination of overfitting by averaging the errors during the validation process. Remarkably, SVM achieved the highest accuracy and F1 score of 0.98 and 0.90 respectively. Notably, the results correlate well with earlier findings of the study.
The fourth tier of model validation was accomplished with the external dataset containing 258 non-redundant structures and activity data retrieved from PubChem and ChEMBL database. The results are represented in Fig. 2 and Table S2. It is evident that the SVM had outperformed in stratifying the actives and inactives with an accuracy of 0.92 than other models in our study. On analysing the precision, recall and F1 score of the models in the current investigation, SVM achieved the highest scores of 0.893, 0.904 and 0.913 respectively illustrating the capability of classifying the chemical compounds as actives and inactives with high precision. Thus, the findings of our study suggest that the generated SVM could be a great choice for designing a virtual pipeline for screening larger databases.
3.4. Machine learning based virtual screening and molecular docking
The FDA approved subset of DrugBank database containing 2509 compounds were screened using the generated model to identify the potent RET inhibitor. The features were generated initially for all the 2509 compounds. The in-house SVM model was employed in the initial stages to screen out the actives from the data set. A total of 497 compounds were identified as active molecules by SVM model. The identified lead molecules were prepared and docked against the binding pocket of the RET protein using XP docking mode. The docking results highlights that about 381 compounds showed negative XP GScore revealing the effective binders against RET. Prominently, the reference ligand vandetanib exhibited XP GScore of -10.102 kcal/mol. Subsequently, all the compounds were rescored using MM-GBSA and RF score in the current investigation for the precise prediction of binding energies.
3.5. Rescoring using MM-GBSA and machine learning strategy
Prime MM-GBSA module of Schrödinger suite was employed to estimate the binding free energy for the docked poses of the protein-ligand complexes. The summary of MM-GBSA analysis of each RET-ligand complex is tabulated in Table S3. The results revealed that the ΔGbind of the complex ranges from − 100.8 kcal/mol to -8.71 kcal/mol. Note only two compounds DB09313 and DB00471 exhibited lower binding energy than vandetanib (-96.66 kcal/mol). Among the different energy contributors, Van der Waals energy and coulomb energy were found to be the primary contributors towards the total binding affinity, ΔGbind. The contribution of covalent interaction is very minimal, thus depicts the thermostability characteristic and stabilized association of compounds with the RET protein to a greater extent. In contrast, the other molecules despite having better XP GScore than vandetanib, had exhibited higher ΔGbind, suggesting these molecules might be unfavourable to interact with RET protein. In addition, RF-score was also calculated for vandetanib and the two lead compounds to estimate the binding affinities against RET protein. Interestingly, both the compounds found to have greater RF score than vandetanib. Hence, these two lead molecules were considered for further evaluation. The overall results of the two lead compounds along with the reference is consolidated in Table 2.
Table 2
MM-GBSA analysis of reference and lead compounds using Prime module of Schrödinger suite.
S. No. | Compound ID | XP Gscore (kcal/mol) | RF Score | dG Bind (kcal/mol) | Van der Waal’s Energy (kcal/mol) | Coulomb’s Energy (kcal/mol) | Covalent Interaction (kcal/mol) | Lipophilicity (kcal/mol) | Solvation Energy (kcal/mol) |
1 | Vandetanib | -10.102 | 5.998 | -96.66 | -57.27 | -16.23 | 1.71 | -46.97 | 22.85 |
2 | DB09313 | -7.757 | 6.186 | -100.8 | -55.27 | -44.24 | 8.76 | -43.22 | 34.52 |
3 | DB00471 | -8.012 | 6.017 | -99.68 | -45.48 | -42.29 | 2.89 | -44.98 | 31.14 |
3.6. Density function theory calculations
Jaguar module of Schrödinger was used to calculate energy gap, ionization potential, hardness, softness and electron donating capability of the lead compounds. Initially, the molecules were optimized using LACV3P++** basis set and B3LYP-D3 theory. Subsequently, the energy gap between HOMO and LUMO were estimated to determine the chemical reactivity and electron donating capability of the compounds. The energy of the compounds in their HOMO-LUMO orbitals were calculated and the results are represented in Fig. 3. The distribution of positive and negative charge in the reactivity plot is represented in blue and red respectively. It is evident from figure that the compound DB00471 have minimal energy gap of 0.139 eV whereas DB09313 had equivalent energy gap (0.152 eV) to that of vandetanib (0.150 eV). Lower energy gap between the frontier orbitals indicates the favourable potential reaction and high chemical reactivity of the lead compounds against RET protein [36]. In addition, the higher energy in HOMO orbital than the LUMO shows the electron donating capability of the compounds, thus facility the hydrogen bond formation during interaction.
Besides, the stability and chemical reactivity of reference and the lead compounds were also assessed in terms of ionization potential, softness, hardness, electron negativity and electron affinity. It is evident from Table 3 that the scrutinized lead compounds exhibited higher ionization potential than electron affinity, which in turn depicts the electron donating nature of the molecules. Moreover, higher hardness and lower softness of the lead compounds than vandetanib reveals the better chemical reactivity characteristic of the molecules. The preceding results from our analysis reveals that the identified compounds DB09313 and DB00471 are highly stable and possess high chemical reactivity and electron donating capability.
Table 3
Global descriptors of lead molecules calculated using the energy level LACV3P++**.
S. No. | Compound ID | Ionization Potential (IP) | Electron Affinity (EA) | Hardness (η) | Softness (S) | Chemical Potential (χ) |
1 | Vandetanib | 9.262 | 2.099 | 7.163 | 0.139 | 5.861 |
2 | DB09313 | 8.522 | 1.155 | 7.367 | 0.136 | 4.839 |
3 | DB00471 | 8.586 | 1.069 | 7.517 | 0.133 | 4.828 |
3.7. Electrostatic potential map
In an effort to comprehend the anti-cancer activity of compounds, electrostatic potential mapping (ESP) was plotted for the reference and the lead compounds (Fig. 4). In general, the electrostatic potential map represents the repulsive interaction of positive charged nuclei and negative charged electrons. Literature evidences report that higher negative potential of the compounds, greater will be the reactivity and vice-versa [37]. The red colour in the figure indicates high electron density area i.e., negative potential and blue colour denotes positive potential of the compounds. On comparing the ESP map of the molecules, the compounds DB09313 and DB00471 exhibited increased negative potential on the surface than vandetanib. In specific, lower negative chemical potential of the lead molecules DB09313 (4.839) and DB00471 (4.828) shows the higher stability than the reference compound vandetanib (5.861). Thus, the results from our study propose that the selected compounds might exhibit higher anti-cancer activity than vandetanib.
3.8. Pharmacokinetic and toxicity analysis
The pharmacokinetic and toxicity of the compounds were analysed using QikProp and ProTox II server to avoid elimination of the compounds in clinical trials. It is evident from Table 4 that the identified lead compounds have satisfactory ADMET properties. Notably, the human oral absorption of the lead compounds is similar to the reference compound vandetanib. The LD50 of the lead compounds DB09313 (20000 mg/kg) and DB00471 (1350 mg/kg) were found to be greater than vandetanib (1000 mg/kg) indicating less harmfulness of the molecules in human body. Undeniably, the lead compound DB00471 did not exhibit any toxic effect including hepatotoxicity and immunogenicity. In general, the half-life a drug plays a significant role in research and development for determining the dosing regimen. Higher the half-life lower will be the dosage frequency. While, shorter half-life period results in inadequate safety and efficacy [38]. It is evident from the table that DB00471 had higher half-life period than the other compounds. Thus, DB00471 is found to possess greater capability to achieve safety and drug efficacy than DB09313 and vandetanib with minimal dosage frequency. Overall, the scrutinized compounds were less toxic and had satisfactory half-life than vandetanib.
Table 4
Pharmacokinetic, toxicity and inhibitory activity analysis of vandetanib and the lead compounds.
S. No | Compound Name | CNS | HOA | Hepatotoxicity | Carcinogenicity | Immunogenicity | Mutagenicity | cytotoxicity | LD50 (mg/kg) | Toxicity Class | T1/2 (Hours) | IC50 µM |
1 | Vandetanib | 1 | 1 | Active | Inactive | Active | Inactive | Inactive | 1000 | Class 4 | 1.06 | 11.924 |
2 | DB09313 | -2 | 1 | Inactive | Inactive | Active | Inactive | Active | 20000 | Class 6 | 0.99 | 5.454 |
63 | DB00471 | -2 | 1 | Inactive | Inactive | Inactive | Inactive | Inactive | 1350 | Class 4 | 9.92 | 1.436 |
3.9. Ligand-protein interaction analysis
The interaction pattern of the ligands to the RET protein binding pocket residues is visualized in Fig. S3. In addition, the key scaffold of the ligands is illustrated in Fig. S4. It is evident from Fig. S3 (a) that vandetanib forms one hydrogen bond between positively charged ALA807 residue and the anilinoquinazoline group. On analyzing the interaction pattern of DB09313, four hydrogen bonds were found to be formed between the compound and the RET binding pocket. The hydrogen bonds formed between: (i) polar SER811 residue and O of tri-ionated benzoate scaffold, (ii) NH functional group and hydrophobic LEU730 residue, (iii) NH functional group and positively charged LYS808 residue and (iv) H-bond between OH group and polar ASN879 residue of RET protein. In addition, one halogen bond was found between hydrophobic TYR806 residue and iodine in the tri-ionated scaffold of DB09313. The literature evidences highlight that the formation of halogen bond between the ligand and the aromatic protein residue will significantly increase the residence time of the inhibitor in the human body [39]. Thus, the increased half-life period of DB09313 reported in Table 4 might be due to the halogen bond formation during the interaction. In the case of compound DB00471, three hydrogen bonds were observed in amino acid such as SER811, LEU730 and THR729. In addition, one halogen bond was observed between Cl group of chloroquinoline scaffold and negatively charged ASP892 residue. Likewise, two π – Cation interaction was found to be formed between the aromatic moiety of DB00471 and the cation LYS728 residue. It is evident from literature that the formation of π-cation interaction between protein and ligand not only offers high specificity, selectivity and increased recognition by the protein target but also improves bioavailability of the compound [40]. Since two π – Cation bonds were formed during the interaction of DB00471 against RET, we hope this compound might have increased bioavailability than DB09313 and vandetanib. Thus, the results of our study postulates that the screened lead compounds have significant interaction profile against RET protein than vandetanib.
3.10. Dynamic behaviour analysis of protein-ligand complexes
3.10.1. Stability analysis of RET-ligand complexes
The dynamic behaviour of protein-ligand complexes was analysed for 100 ns using gmx_rms utility of gromacs. The extent of deviation of atoms within the RET-lead complex systems is outlined in Fig. 5. In the case of RET-vandetanib complex system, deviation of about ~ 0.4 nm RMSD was observed at 70 ns, demonstrating loss of stability in this region. Whereas, in the case of RET-DB00471 and RET-DB09313, no major deviations have been observed during the 100ns simulation period which in turn depicts the greater stability of the complex systems throughout the whole dynamics simulation. Towards the end of 100ns simulation, the average RMSD of RET apoprotein, RET-vandetanib, RET-DB00471 and RET-DB09313 was observed to be 0.349, 0.341, 0.326 and 0.313 nm respectively. Hence, RMSD analysis reveals that the RET protein did not undergo any major confirmational changes and the lead molecules remained within the catalytic diad. A study by Guterres et al revealed that a minimal mean RMSD lower than 0.4 nm is required for the ligand to reside within the binding site of the target protein [41]. Notably, the RMSD of the obtained lead molecules were observed to be lie within 0.3 nm and were apparently lower than RET apoprotein and RET-vandetanib complex. From the results, we hypothesize that the obtained lead compounds might act as a potential inhibitor against the RET target with greater stability.
3.10.2. Fluctuation analysis of RET-ligand complexes
The fluctuation and mobility of protein residues within the complex systems were analysed using gmx rmsf tool (Fig. 6). In general, the high RMSF value denotes the loosely organized region whereas low RMSF value denotes well-structured ends [42]. In RET-vandetanib complex, GLY700, VAL706, ASP707, ALA708, PHE709, LYS710, ILE711, GLY949, ASN950 and GLU958 exhibited high flexibility. While LYS740, PHE776, VAL804 and PTR905 were found express subtle fluctuations. Similarly, in RET-DB00471 complex, around four residues namely THR754, GLY798, GLU902 and PRO953 displayed high flexibility whereas the residues TRP935, GLU943 and MET984 showed least fluctuation during the simulation. In case of RET-DB09313 complex, the residues GLY700, PRO715, GLY748, ASN763, SER795, ASP903, PRO957 and GLU958 disclosed higher fluctuation on binding of the ligand to the RET receptor. Conversely, LEU779, LEU802 and GLU1006 exhibited feeble fluctuation during the analysis. On examining the RMSF curve of RET apoprotein in Fig. 6, several residual fluctuations were observed in the protein structure. These fluctuations were found to be minimized at the residues LEU702, LEU704, SER705, LYS722 and GLY751 upon the binding of DB00471 and DB09313. At the end of the analysis, the RET-DB00471 and RET-DB09313 complexes showed decreased average RMSF values of 0.0556 and 0.0423 nm than RET-vandetanib complex (0.0563 nm) and RET apoprotein (0.0696 nm) which shows the ability of the lead molecules to bind against the important catalytic residues in protein with least fluctuation.
3.10.3. Stability analysis by hydrogen bond
Hydrogen bonds play a crucial role in maintaining the stability and compactness of the protein structure [32]. The number of H-bonds between protein and ligands were determined using gmx hbond and plotted against the duration of the simulation period (Fig. 7). The average number H-bonds formed during the 100ns simulation, between RET protein and the ligands vandetanib, DB00471 and DB09313 are 0–2, 0–4 and 0–7 respectively. Although the RET-ligand complexes displayed similar binding pattern during docking, this study reveals that the number of H-bonds produced in aqueous solution is significantly higher for all the ligands except DB00471. This might possibly be attributed to the fact that docking is performed in the absence of solvent, whereas the simulation carried out in the presence of solvent can have a significant impact on ligand binding. From the hydrogen bond analysis, we predicted that the RET-DB00471 and RET-DB09313 complexes displayed more stability and compactness than RET-vandetanib complex due to increased number of H-bonds. This result correlates well with the RMSD and RMSF results.
3.10.4. Compactness analysis
The level of compactness and the level of RET protein folding in the presence of ligands were determined using inbuilt gyrate utility of gromacs. In general, Rg indicates the weighted RMSD of all atoms calculated from its centre of mass [43]. Higher values of Rg indicates the protein unfolding within the system. Figure 8 demonstrates the Rg of RET apoprotein and RET-ligand molecule complexes. The average Rg value for RET apoprotein, RET-vandetanib, RET-DB00471 and RET-DB09313 were found to be 1.862, 1.953 nm, 1.858 nm and 1.929 nm respectively. RET-DB09313 and RET-DB00471 showed similar pattern of Rg till the end of 100 ns simulation. Moreover, DB00471 showed least Rg than other systems, illustrating more compactness in this complex. Altogether, this analysis reveals that RET-DB00471 was more stable and compact than RET apoprotein and other RET-ligand complexes.
3.10.5. Solvent accessible surface area
The solvent accessible surface area (SASA) of RET apoprotein and RET-ligand complexes were analysed to determine the interacting surface area of RET protein with its solvent molecules [44]. This analysis was accomplished using gmx sasa utility of gromacs. From Fig. 9, the average SASA values for RET apoprotein, RET-vandetanib, RET-DB00471 and DB09313 were found to be 162.011 nm2, 169.747 nm2, 163.483 nm2 and 168.879 nm2 respectively. There were no major variations spotted in SASA values due to ligands binding. Moreover, all the RET-ligand complexes exhibited equivalent SASA values during the 100 ns simulation. Moreover, the RET apoprotein, RET-vandetanib, RET-DB00471 and RET-DB09313 complexes displayed solvation free energy of 1084.566 nm2, 1136.356 nm2, 1094.419 nm2 and 1130.456 nm2 respectively. The equivalent solvation free energy of RET-ligand complexes to that of RET apoprotein demonstrates the stability of the system upon ligand binding. Thus, the overall results highlight the compactness and stability of the protein-ligand complexes in the solvent area.
3.10.6. Essential dynamics and Gibbs free energy landscape
Essential dynamics describes the overall expansion of RET protein during the entire simulation period. Initially, gmx_covar and gmx_anaeig module of gromacs was implemented to construct the covariance matrix comprising eigen values. The traces of covariance matrix for RET apoprotein, RET-vandetanib, RET-DB00471 and RET-DB09313 were found to be 7.121 nm2, 10.535 nm2, 8.018 nm2 and 7.54 nm2 respectively. The eigen values for RET apoprotein are comparatively lower than the other complexes which clearly demonstrates the rise of random fluctuation of protein upon ligand binding. Further, the gibbs free energy landscape between first two principal components were plotted in Fig. 10. The yellow colour in the corresponding contour map indicates the unfavourable conformation whereas dark blue colour denotes the energy minima favoured complex conformation. The energy landscape reveals a collection of distinct minima that are related to the metastable conformational states and are separated from one another by a negligible energy barrier. The more stable conformation states were observed in the binding sites of RET-DB00471, DB09313 and RET-DB09313, with local minima dispersed throughout three to four energy landscape areas. It is evident from Fig. 10 that RET -DB00471 and DB09313 possessed minimum of one deep and narrow energy basin whereas RET apoprotein and RET-vandetanib contained shallow energy basins. Thus the FEL analysis it is clear that the RET-lead molecule complexes were found to be thermodynamically stable than RET-apoprotein and RET-vandetanib complex.
3.11. Inhibitory activity prediction of ligands
Despite significant investments in drug development and repurposing, 97% of anticancer drugs fail in clinical trials due to off-target cytotoxicity and low target efficacy [45]. Hence, the effect of candidate drug molecules on specific biomolecular profiles were determined using PaccMann server. In the current investigation, the inhibitory activity of vandetanib and the lead molecules were determined over 2022 different cancer cell lines. Particularly, the activity of compounds was analysed against RET protein expressing LC-2/ad cell lines. In general, lower IC50 value indicates inhibition of 50% of cells with minimal concentration of drugs. From Table 4, it is evident that DB00471 have the least IC50 than vandetanib and DB09313. Thus, the results clearly shows that DB00471 have the potential to inhibit RET positive cancer cell in minimal concentration and with lower systemic toxicity.
On deciphering the existing application of the drugs, DB00471 denotes the drug montelukast, a selective leukotriene antagonist. It is widely used as an orally dosed drug for asthma, exercise-induced bronco-construction and seasonal allergic rhinitis [46]. Interestingly, a recent study revealed the apoptosis mediated cell death of lung cancer cells in A549 and CL1-5 cell lines [47]. Similarly, montelukast inhibited the growth of myeloid leukaemia cells with IC50 of 2µM through apoptosis [48]. In the light of these evidences, we propose from our study that montelukast (DB00471) could also be repurposed against RET driven NSCLC patients.