Dataset collection
The dataset of 14690 compounds was collected and downloaded as a “.csv” file from the ChemBl Database [32]. In the pre-treatment step with raw dataset, the following operations were done:
- Removing all rows (molecules) when the IC50 is not defined.
- Convert all units to nanomolar concentrations.
- All duplicate compounds are removed using discovery studio software [33].
- Then, the dataset was splitted into two groups active and inactive depending on the values of the IC50 column, where we have selected 200 nM as threshold. The molecules with values inferior or equal to 200 nM are designated as active compounds, elsewhere as inactive compounds [34], [35].
- IC50 inhibitory activity values are converted to logarithmic pIC50 values.
The obtained cleaned dataset (6953 compounds) is divided into two subsets: training set (75%: 5215 compounds composed of 2485 as actives and 2730 as inactive) and prediction set (25%: 1738 compounds with 828 as actives and 910 tagged as inactive) [34].
Generation of molecular descriptors and dataset subdivision
The molecular descriptors are usually used to represent physicochemical, geometric properties and explore the structural characteristics of the compounds [36] that may influence ErbB1 inhibitory activity [37]. In this work, a total of 4887 molecular descriptors, were calculated using the Dragon 06 software [38], [29]. The molecular descriptors with missing, constant values, highly correlated were deleted [37], using the Molegro Data Modeler module (MVD v 06 software) [39].
XGBoost and model construction for ErbB1
Extreme Gradient Boosting, commonly known as XGBoost, is an advanced and efficient implementation of the gradient boosting algorithm. It is an open-source software library in several programming languages, including Python, R, Java, [40] and others. XGBoost is a popular choice among data scientists for both regression and classification tasks because it is computational efficiency without sacrificing performance.
The strength of XGBoost lies in its ability to automatically capture and model complex and nonlinear patterns in the data. It uses a set of decision trees to create an ensemble model where newly added trees correct the errors made by the existing ones in the set. This characteristic makes it an ideal choice for tasks involving complex and non-linear relationships between input features and target variables, as commonly observed in drug design studies.
In this study, we have used XGBoost to construct a QSAR model based on molecular descriptors for binary classification. The target variable has two values, namely 'active' and 'non-active,' indicating whether the molecules exhibit activity or not against our target, which is the ErbB1 protein. To achieve this, we have used a Python notebook within the Google Colab environment [41], implementing the following methodology [34].
Data normalisation
The input matrix X_train is normalised by subtracting the minimum value from each column and then dividing by the range of values in that column (max-min). Normalisation is an important pre-processing step before training models because it scales the input features to a similar range, which accelerates the model convergence during training.
During the testing phase, the test data, X_test, is normalised using the same scale that was used on the training data. This is important to prevent data leakage from the test set, which can lead to an overoptimistic evaluation of the model's performance [42].
Model Building with XGBoost
Hyperparameter tuning is a critical aspect of building efficient machine learning models. To do this, we have tested multiple configurations of the XGBoost classifier (XGBClassifier), each with different hyperparameters, to identify the one with the highest accuracy.
Model Evaluation
To assess the predictive performance of our XGBoost classifier in drug design, we have used several metrics for evaluating classification models [43], [44]. These metrics are derived from four fundamental outcomes of a binary classification model: True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). These outcomes are visually represented in a confusion matrix (Table 2)[45]. The matrix's rows represent the molecules in the real class, while each column represents molecules in the predicted class. Furthermore, several key performance metrics derived from these outcomes were used, explained in Table 01 [27].
Table 01
metrics of XGBoost classifier
Metric | Description | Formula |
Accuracy | Represents the overall correctness of the model's predictions. | \(ACC = \frac{(TP + TN)}{(TP + TN + FP + FN)}\) |
Precision | Measures the proportion of positive predictions that were correct. | \(PR=\frac{TP}{TP+FP}\) |
Sensitivity (Recall) | Evaluates the model's ability to correctly identify positive cases. | \(SE=\frac{TP}{TP+FN}\) |
Specificity | Measures the model's ability to correctly identify negative cases. | \(SP= \frac{TN}{TN+FP}\) |
F-Measure (F-Score) | Provides a balanced measure of precision and sensitivity. | \(FM=\frac{2\times Recall \times Precision)}{Recall + Precision}\) |
Receiver Operating Characteristic (ROC) curve
This curve is used to evaluate the performance of our binary classification model. The ROC curve is a graphical tool that plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The TPR is the ratio of correctly classified positive molecules, while the FPR is the ratio of negatively classified molecules that were incorrectly identified as positive. A good ROC curve tends toward the upper left corner, indicating a high TPR and a low FPR. This means that the model correctly identifies positive molecules and does not misidentify negative molecules. The ROC curve of an ineffective model will closely follow the diagonal, indicating that the predictive ability is no better than random guessing [35], [46], [47].
Virtual Screening
Natural products are rich in their structural diversities and constitute an ideal source for new active entities in drug discovery. As well, natural products may be drug candidates themselves or be the starting point for an optimization program. Therefore, a number of natural products databases such as (NANDB, COCONUT online server) are built in order to be used as a pertinent bridge to identify and select new hits and lead compounds that will be considered as candidate drugs and valorise the plants kingdom.
Currently, the ZINC natural products database is one of the most used libraries in VS [48]. In this work, we are interested in screening the ZINC natural products database based on the developed and validated QSAR model. Hence, 80617 natural products are downloaded from the ZINC database and filtered based on the Lipinski rules using LigPrep and Ligand Filtering Commands embedded in Maestro modelling environment software [49]. As an additional filtering, the next rules were integrated:
- Molecular weight > 250,
- Number of chiral centres < 2, and
- Number of negative atoms = 0 and number of positive atoms = 1.
In the next step, the compounds that respected the above-mentioned criteria were subject to be filtered with the QSAR model and select only the actives for further calculations as described in the workflow [28].
Molecular docking
The active molecules resulting from the screening of Natural Products ZINC database by the application of QSAR model are ranked by molecular docking approach as more restricted tools. Molecular docking will guide us to select the most promising compounds based on the external and internal interactions between the protein and the selected molecules, and also based on the visual inspection.
The aims of molecular docking techniques are double: to predict the conformation, orientation, and position of the ligand within its target binding site and to give an estimation of the affinity energy of the system (protein - ligand).
Molecular docking approach generally based on two pillars: search algorithm and scoring function and it is composed of three main steps: ligand and protein preparation, search the best poses according to the vectors of orientation, position and conformation of the ligand inside the binding site (search space), and finally the scoring of the resultant poses in order to rank these poses and select the best ones.
In order to rank the molecules declared as actives in the QSAR model were subject to be docked into ErbB1 target represented by the PDB ID: 3POZ and compared with the reference ligand TAK285 (molecule in clinical trials). We have used the Molegro Virtual Docker software [39]as a molecular docking tool.
Molegro Virtual Docker is composed of different search algorithms, such as MolDock optimizer and MolDock SE, while the scoring functions are MolDock Score and PLANTS score.
To validate the protocol with the selected target, the redocking procedures have been tested on the 3POZ.pdb file and the optimal parameters obtained are configured according to the values of the RMSD parameter.
After validating the molecular docking protocol with MVD software, the selected from the former step were docked in the binding site of 3POZ.pdb according to the parameters obtained with the redocking step.
The 05 best ranking poses were visually and energetically discussed and analysed and compared to the reference ligand (TAK285).
Before declaring these molecules as hits, we have decided to do molecular dynamics simulation as the final step in order to confirm the molecular docking results [50].
Molecular Dynamics Simulations
MD simulations are a helpful and widely applied computational method for understanding biological macromolecule behaviour. Since MD is based on classical mechanics, Newton’s equations of motion are applied to calculate the position and speed of each atom of the studied system. Therefore, MD simulations carry out a more exhaustive conformational search than molecular docking methods do and give a more accurate representation of protein motions [51].For this purpose; there are 04 calculated essential parameters as, Root-mean‐square deviation (RMSD) which is the suitable, common technique to verify molecular dynamics simulation stability of the simulated structure over time. There more the Radius of gyration is an imperative parameter in protein stability during simulation. If the protein was stable during the simulation, the radius of gyration would plateau on average. And finally the Root‐mean square fluctuation (RMSF) and Binding energy and hydrogen bonds [52].
To get insights into the ErbB1 & protein-ligand interaction stability, molecular dynamics (MD) calculations were carried out on the best docking poses. The input files for MD calculations were generated using CHARMM-GUI solution builder using CHARMM force field parameters for protein [6]. The topology of the ligands was generated by the CHARMM General Force Field through the Param-Chem server. The CHARMM-GUI solution builder includes five steps. In the first step, the coordinates of the protein-ligand complex are read by the tool. The second step involves solvation of the protein-ligand complex as well as determining the shape and size of the system. Na+ and Cl− ions are added in this step to neutralise the system. Periodic Boundary Conditions (PBC) are set in the third step, which are used for approximation of a large system by using a unit cell which is then replicated in all directions. The simulation takes place only for the atoms that are present inside the PBC box. Bad contacts are removed in this step by running short minimization. Fourth and fifth step involves equilibration of the system and production. Equilibration is done in two phases-NVT ensemble and NPT ensemble to ensure that the system has achieved the desired temperature and pressure. The input files for equilibration and production are then downloaded and desired changes were made which include number of steps of MD run, frequency of saving of trajectories and calculation of energy etc. GROMACS 2020.2 [53]was used for both equilibration and production run during all MD calculations. All complexes were initially solvated in a cubic box of TIP3P waters and then Na+ and Cl− ions were added to neutralise the net atomic charge of the whole system by random replacement of water molecules. The periodic boundary conditions (PBC) were imposed considering the system shape and size. Non-bonded interactions were treated with a 12˚A cutoff distance and the neighbour searching list was buffered with the Verlet Cutoff-scheme and the long-range electrostatic interactions were treated with the particle mesh Ewald (PME) method. CHARMM36 force field was applied on the protein-ligand complex. Prior to production simulation, energy minimization of the system was carried out by using the steepest descent algorithm (5000 steps). The complex was then equilibrated for stabilising its temperature and pressure by subjecting it to NVT and NPT ensemble and simulating for 125 ps at 300.15 K temperature using 400 kJ mol − 1 nm − 2 and 40 kJ mol − 1 nm − 2 positional restraints on the backbone and side chains, respectively. Finally, the complex is subjected to production simulation run for 100 ns in NPT ensemble at 300.15 K and 1 bar. To maintain the temperature Nose-Hoover thermostat was used and similarly for maintaining the pressure Parrinello-Rahman barostat was used. The LINCS algorithm was used for constraining H-bonds using the inputs provided by CHARMM-GUI. The V-rescale thermostat at 300 K with a coupling constant of 01 ps was used. The trajectories were stored every 2 ps. Simulations of 100 ns in NPT assembly were performed for the production stage [54].
Trajectory analysis
GROMACS utilities were used for the analysis of the MD simulations. The root mean square deviation (RMSD) of atom position for ligand and protein was calculated by fitting the protein backbone atom with the gmx_rms subprogram. Similarly, root mean square fluctuations (RMSF) based on the protein C-alpha atoms were calculated using gmx_rmsf. Radius of gyration of all protein atoms was calculated with the gmx_gyrate and the number of hydrogen bonds was calculated (in-side the protein-ligand interface) with the gmx_hbond. The utility gmx_distance was used to calculate the centre of mass distance between the protein and the ligand during the simulation. The VMD molecular graphics program was used for trajectory visualisation and protein-ligand contact frequency analysis [55].
Binding Free Energy (MM/PBSA Calculations)
For systems which were chosen for further analysis, MM/PBSA (Molecular Mechanics/Poisson − Boltzmann Surface Area) calculations were done using g_mmpbsa, a GROMACS tool used to calculate an estimated binding affinity. In general terms, the binding free energy of the protein with ligand in solvent can be expressed as:
\(\Delta {G_{binding}}=\Delta {G_{complex}} - \,\,(\Delta {G_{protein}}+\Delta {G_{ligand}})\)
Where, \(\Delta {G_{complex}}\) is the total free energy of the protein–ligand complex, and \(\Delta {G_{protein}}\)and\(\Delta {G_{ligand}}\) are total free energies of the isolated protein and ligand in solvent, respectively. g_mmpbsa can also be used to estimate the energy contribution per residue to the binding energy. To decompose the binding energy, at first\(\Delta {E_{MM}}\), \(\Delta {G_{polar}}\) and \(\Delta {G_{non - polar}}\) were separately calculated for each residue and were then summed up to obtain the contribution of each residue to the binding energy. Considering that g_mmpbsa only read the files of some specific GROMACS versions, the binary run input file (.tpr) required for MM-PBSA calculation through the g_mmpbsa was regenerated by GROMACS 5.1.4. The molecular structure file (.gro), topology file (.top) and MD-parameter file (.mdp) were necessary to generate the binary run input file, and they all came from the MD process [56].
ADMET properties prediction
The most vulnerable causes to failure in the sequential stages of drug development are the unfavourable ADMET properties [2]. Therefore, poor absorption, distribution, metabolism, and excretion (ADME) properties and toxicities (Tox) are the main reasons for these failures. However, current approaches for evaluating ADME-Tox properties are expensive and time-consuming and usually require extensive animal testing. Since, ADME-Tox prediction based on computer aide techniques has become the preferred approach in drug discovery [24]. Once the hits or lead compounds are obtained, a series of tests and evaluations are carried out on the pharmacokinetic properties (absorption, distribution, metabolism, and excretion) and toxicity (ADME/T) of these compounds [57]. For this purpose, QSPR regression or classification models relating molecular descriptors to a target property of interest, have been developed to predict various pharmacokinetic and biopharmaceutical properties [28]. However, the model architecture has gradually changed from the original multivariate linear models, such as MLR and PLS method, to the nonlinear multivariate method based on AI algorithms [57].
Similarity test using Principal Component Analysis and K-means algorithms
In order to assess the similarity between the selected five hits and known ErBB1 drugs and drug candidates, we constructed a small dataset based on the CHEMBL database, which was then merged with the five structures validated by molecular dynamics simulations. The dataset comprises 17 molecules recognized as active compounds against the ErBB1 enzyme.
Initially, molecular descriptors were generated using Mordred software (more than 1870 descriptors). Subsequently, preprocessing steps were applied, including the removal of descriptors with missing values or constant values, and the elimination of one descriptor from pairs exhibiting a pair correlation coefficient exceeding 0.9. Following preprocessing, principal component analysis (PCA) and K-means clustering algorithms were performed on the reduced dataset respectively.