Virtual screening, XGBoost based QSAR modelling, Molecular Docking and Molecular Dynamics Simulation approach to discover a new inhibitor targeting ErbB1 Protein

doi:10.21203/rs.3.rs-4477079/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

ErbB1, also known as the epidermal growth factor receptor (EGFR), is a protein found on certain types of human cells that binds epidermal growth factor (EGF). The ErbB1 protein is involved in cell signalling pathways that control cell division, proliferation, and survival. Mutations in the EGFR gene can cause ErbB1 proteins to be produced in higher-than-normal amounts on some types of cancer cells.

The aim of this study is to identify novel antitumour small molecules by virtual screening that combines ligand-based drug design (QSAR) with structure-based drug design (molecular docking and molecular dynamics simulations). A QSAR model was developed and validated using XGBoost as the classification learning algorithm, trained on 5215 compounds. The validated model was used to screen more than 80k natural products downloaded and prepared from the ZINC database, of which only 36 were predicted to be potent inhibitors of ErbB1. These predicted active compounds were docked against the target, represented by PDB ID 3POZ. The five top-scoring compounds were compared with the reference ligand TAK285 and with the drugs Lapatinib and Erlotinib, and their stability in the ErbB1 binding site was then assessed by molecular dynamics simulation.

Cancer

ErbB1

virtual screening

QSAR

Molecular Docking

Molecular dynamics

ErbB1’s role in cancer signalling: Mutations in EGFR result in increased ErbB1 levels in cancer cells.
Advanced Drug Design Methods: QSAR, docking, and simulations identify novel anti-tumor compounds.
Robust QSAR Model Development: XGBoost-based model built and validated with 6953 compounds.
Effective screening of Natural Compounds: 80k compounds screened, yielding 36 potent inhibitors.
Molecular Dynamics Validation: Top compounds validated for stability in ErbB1 binding site.

Cancer is the second leading cause of human death in the world, and more than eight million people die from cancer each year. According to recent statistics, the incidence of cancer is expected to increase by 50% in the coming decades [1]. Cancer is a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body [2], [3].

Many factors are involved in human cancers. One of these is aberrant signalling by members of the ErbB family, which includes ErbB1 (EGFR; HER1), ErbB2 (Neu; HER-2), ErbB3 (HER3), and ErbB4 (HER4) [4]–[10]. These receptors play an essential role in the aetiology of several tumours, including those of the breast, ovary, lung, colon, nervous system, head and neck, prostate, and pancreas [11]–[14].

The ErbB1 protein is built up of three domains: an extracellular ligand-binding domain, a transmembrane domain, and a cytoplasmic (intracellular) kinase domain [15]. The catalytic activity of ErbB1 initiates downstream signalling pathways that are responsible for several critical processes, including cancer cell proliferation, arrest of apoptosis, and stimulation of metastasis [16]–[18].

Inhibition of ErbB1 can be achieved by competing with adenosine triphosphate (ATP) for its binding site on the intracellular domain [15], [19]–[21]. The development of low-molecular-weight compounds that inhibit ErbB1 is therefore an important therapeutic approach for treating a variety of cancers [22]. At present, two common classes of ErbB1 drugs are approved by the U.S. Food and Drug Administration [23]: monoclonal antibodies (mAbs) such as Cetuximab (Erbitux) and Pertuzumab, and small-molecule tyrosine kinase inhibitors such as Gefitinib (Iressa®), Erlotinib (Tarceva®), Lapatinib, and Afatinib [12], [15].

The search for such inhibitors, potential drugs and bioactive compounds increasingly relies on computer-aided drug design (CADD) methods, which offer a wide range of opportunities to speed up drug development and reduce the associated risks and costs. CADD [8], [24], [25] is one of the pivotal approaches to hit identification and lead optimization, in which various computational techniques and software programs are typically used in combination [26], [27]. It is generally divided into two groups of techniques: ligand-based drug design (QSAR, pharmacophore) and structure-based drug design (molecular docking, molecular dynamics simulation) [26], [28].

Many researchers [12], [14], [29] have constructed QSAR models from small series of compounds in order to model ErbB1 inhibitory activities. Several studies have also used structure-based drug design (molecular docking and pharmacophore) to design new inhibitors of the ErbB1 target; the designed compounds were compared with Erlotinib as a reference to predict their interactions in the binding site and to identify the structural features necessary for competitive inhibition.

Concettina La Motta et al. and Riccardo Concul et al. [14], [30] developed a receptor-based 3D-QSAR model for ErbB1 inhibition and a molecular docking procedure using a dataset of 200 compounds. The validated 3D-QSAR model was used to screen the Maybridge database in order to identify new ErbB1 inhibitors. The stability of the top compounds in the binding site was verified by molecular dynamics simulation, yielding two hits as potential leads against HER2 breast cancers.

In the present study, we apply computational screening methods to identify new potential ErbB1 inhibitors. The approach comprises three main phases. First, a dataset of more than 14000 compounds with measured ErbB1 inhibitory activity was used to build a classification QSAR model with the XGBoost machine learning algorithm; this model was then used to screen about 80K natural compounds from the ZINC Natural Products Database [31]. Second, the hit compounds from the first stage were passed through molecular docking simulations against the ErbB1 target. Finally, molecular dynamics simulations were performed to further validate the screening results and to investigate the binding interactions and binding energies of the hits in the ErbB1 binding site, in comparison with TAK285, a molecule in clinical trials.

Dataset collection

The dataset of 14690 compounds was downloaded as a “.csv” file from the ChEMBL database [32]. In the pre-treatment of the raw dataset, the following operations were carried out:

- All rows (molecules) for which the IC50 is not defined were removed.

- All units were converted to nanomolar concentrations.

- All duplicate compounds were removed using Discovery Studio software [33].

- The dataset was then split into two groups, active and inactive, according to the values in the IC50 column, with 200 nM selected as the threshold. Molecules with values lower than or equal to 200 nM were designated as active compounds, and all others as inactive compounds [34], [35].

- IC50 inhibitory activity values were converted to logarithmic pIC50 values.

The cleaned dataset (6953 compounds) was divided into two subsets: a training set (75%: 5215 compounds, of which 2485 are active and 2730 inactive) and a prediction set (25%: 1738 compounds, of which 828 are active and 910 inactive) [34].
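The pre-treatment steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run in Discovery Studio: the record layout `(smiles, ic50, unit)`, the unit labels, and the use of identical SMILES strings to detect duplicates are all hypothetical. The 200 nM threshold comes from the text, and pIC50 is computed as 9 − log10(IC50 in nM), which is equivalent to −log10(IC50 in mol/L).

```python
import math

# Conversion factors to nanomolar; the unit labels are assumed, not taken from the raw file.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}
ACTIVE_THRESHOLD_NM = 200.0  # activity threshold stated in the text


def prepare(records):
    """Clean raw (smiles, ic50, unit) records and label them active/inactive."""
    seen = set()
    cleaned = []
    for smiles, ic50, unit in records:
        # Drop rows without a usable IC50 value or with an unknown unit.
        if ic50 is None or ic50 <= 0 or unit not in TO_NM:
            continue
        # Drop duplicates (here: identical SMILES strings, an assumption).
        if smiles in seen:
            continue
        seen.add(smiles)
        ic50_nm = ic50 * TO_NM[unit]
        cleaned.append({
            "smiles": smiles,
            "ic50_nm": ic50_nm,
            "pic50": 9.0 - math.log10(ic50_nm),
            "label": "active" if ic50_nm <= ACTIVE_THRESHOLD_NM else "inactive",
        })
    return cleaned


# Hypothetical input: the third row has no IC50 and the fourth is a duplicate.
raw = [("CCO", 0.05, "uM"), ("CCN", 500.0, "nM"), ("CCC", None, "nM"), ("CCO", 1.0, "nM")]
for row in prepare(raw):
    print(row["smiles"], row["label"], round(row["pic50"], 2))
```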

Generation of molecular descriptors and dataset subdivision

Molecular descriptors are commonly used to represent the physicochemical and geometric properties of compounds and to explore the structural characteristics [36] that may influence ErbB1 inhibitory activity [37]. In this work, a total of 4887 molecular descriptors were calculated using the Dragon 06 software [38], [29]. Descriptors with missing values, constant values, or high correlation with other descriptors were deleted [37] using the Molegro Data Modeler module (MVD v 06 software) [39].
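As a rough illustration of this kind of descriptor clean-up (not the Molegro Data Modeler procedure itself), the sketch below drops descriptors with missing values, constant descriptors, and one descriptor of every highly correlated pair. The table layout (descriptor name mapped to a list of values, `None` for missing) and the correlation cutoff `max_corr=0.9` are assumptions made for the example; the cutoff used for the QSAR descriptors is not stated here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def clean_descriptors(table, max_corr=0.9):
    """Return the descriptors that survive the three filters, in input order."""
    kept = {}
    for name, values in table.items():
        if any(v is None for v in values):      # missing values
            continue
        if max(values) == min(values):          # constant descriptor
            continue
        # Keep the descriptor only if it is not highly correlated
        # with any descriptor that has already been retained.
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue
        kept[name] = values
    return kept


# Hypothetical descriptor table for four molecules.
descriptors = {
    "MW": [100.0, 200.0, 300.0, 400.0],
    "MW_scaled": [200.0, 400.0, 600.0, 800.5],  # almost perfectly correlated with MW
    "nRing": [1.0, 3.0, 2.0, 1.0],
    "const": [1.0, 1.0, 1.0, 1.0],
    "gap": [1.0, None, 2.0, 3.0],
}
print(list(clean_descriptors(descriptors)))
```

Which member of a correlated pair is kept depends on the input order; this is a design choice of the sketch, and other tools may resolve the pair differently.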

XGBoost and model construction for ErbB1

Extreme Gradient Boosting, commonly known as XGBoost, is an advanced and efficient implementation of the gradient boosting algorithm. It is an open-source software library available in several programming languages, including Python, R, Java and others [40]. XGBoost is a popular choice among data scientists for both regression and classification tasks because it is computationally efficient without sacrificing performance.

The strength of XGBoost lies in its ability to automatically capture and model complex and nonlinear patterns in the data. It uses a set of decision trees to create an ensemble model where newly added trees correct the errors made by the existing ones in the set. This characteristic makes it an ideal choice for tasks involving complex and non-linear relationships between input features and target variables, as commonly observed in drug design studies.
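The idea that each new tree corrects the errors of the trees already in the ensemble can be shown with a deliberately simplified sketch. It is not XGBoost: it uses plain gradient boosting with squared-error loss, one-split regression trees (stumps) on a single feature, and no regularisation, and all names (`fit_stump`, `boost`, `n_trees`, `learning_rate`) are chosen for this example only. The input must contain at least two distinct feature values.

```python
def fit_stump(x, residuals):
    """Fit the best one-split regression stump (least squares) on a 1-D feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:       # every split leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean


def boost(x, y, n_trees=20, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # errors of the current ensemble
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


# Toy data: one descriptor value per molecule, label 1 = active, 0 = inactive.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(x, y)
print(round(model(2.0), 3), round(model(5.0), 3))
```

In this sketch `n_trees` and `learning_rate` play the roles that `n_estimators` and `learning_rate` play in XGBoost: more trees or a larger step move the ensemble faster toward the training labels.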

In this study, we have used XGBoost to construct a QSAR model based on molecular descriptors for binary classification. The target variable has two values, namely 'active' and 'non-active,' indicating whether the molecules exhibit activity or not against our target, which is the ErbB1 protein. To achieve this, we have used a Python notebook within the Google Colab environment [41], implementing the following methodology [34].

Data normalisation

The input matrix X_train is normalised by subtracting the minimum value from each column and then dividing by the range of values in that column (max-min). Normalisation is an important pre-processing step before training models because it scales the input features to a similar range, which accelerates the model convergence during training.

During the testing phase, the test data, X_test, is normalised using the same scale that was used on the training data. This is important to prevent data leakage from the test set, which can lead to an overoptimistic evaluation of the model's performance [42].
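A minimal sketch of this min-max scaling is given below. The per-column minimum and maximum are learned from the training matrix only and then reused for the test matrix. The function names and the handling of constant columns (mapped to 0.0) are assumptions of the example, not details taken from the notebook used in the study.

```python
def fit_min_max(rows):
    """Learn per-column minimum and maximum from the training matrix."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform(rows, mins, maxs):
    """Scale each column with the supplied training statistics."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0   # constant column: map to 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical descriptor matrices (rows = molecules, columns = descriptors).
X_train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
X_test = [[4.0, 10.0]]

mins, maxs = fit_min_max(X_train)          # statistics come from the training set only
X_train_scaled = transform(X_train, mins, maxs)
X_test_scaled = transform(X_test, mins, maxs)
print(X_train_scaled)
print(X_test_scaled)
```

Because the test set is scaled with the training statistics, test values may fall outside the interval [0, 1]; that is expected and is the price of avoiding leakage.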

Model Building with XGBoost

Hyperparameter tuning is a critical aspect of building efficient machine learning models. To do this, we have tested multiple configurations of the XGBoost classifier (XGBClassifier), each with different hyperparameters, to identify the one with the highest accuracy.

Model Evaluation

To assess the predictive performance of our XGBoost classifier in drug design, we have used several metrics for evaluating classification models [43], [44]. These metrics are derived from four fundamental outcomes of a binary classification model: True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). These outcomes are visually represented in a confusion matrix (Table 2)[45]. The matrix's rows represent the molecules in the real class, while each column represents molecules in the predicted class. Furthermore, several key performance metrics derived from these outcomes were used, explained in Table 01 [27].

Table 01

metrics of XGBoost classifier
Metric	Description	Formula
Accuracy	Represents the overall correctness of the model's predictions.	\(ACC = \frac{(TP + TN)}{(TP + TN + FP + FN)}\)
Precision	Measures the proportion of positive predictions that were correct.	\(PR=\frac{TP}{TP+FP}\)
Sensitivity (Recall)	Evaluates the model's ability to correctly identify positive cases.	\(SE=\frac{TP}{TP+FN}\)
Specificity	Measures the model's ability to correctly identify negative cases.	\(SP= \frac{TN}{TN+FP}\)
F-Measure (F-Score)	Provides a balanced measure of precision and sensitivity.	\(FM=\frac{2\times SE \times PR}{SE + PR}\)
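The formulas in Table 01 translate directly into code. The sketch below evaluates them for a set of hypothetical counts chosen only for illustration; they are not results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the Table 01 metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
for name, value in classification_metrics(tp=90, fp=10, tn=80, fn=20).items():
    print(f"{name}: {value:.4f}")
```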

Receiver Operating Characteristic (ROC) curve

This curve is used to evaluate the performance of our binary classification model. The ROC curve is a graphical tool that plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The TPR is the fraction of truly positive (active) molecules that are correctly classified as positive, while the FPR is the fraction of truly negative (inactive) molecules that are incorrectly classified as positive. A good ROC curve tends toward the upper left corner, indicating a high TPR and a low FPR. This means that the model correctly identifies positive molecules and rarely misidentifies negative molecules as positive. The ROC curve of an ineffective model closely follows the diagonal, indicating that its predictive ability is no better than random guessing [35], [46], [47].
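To make the construction concrete, the following sketch builds the ROC points by sweeping the decision threshold over the predicted scores and integrates them with the trapezoidal rule to obtain the area under the curve. The labels and scores are made up for the example, and the function names are illustrative; the input must contain at least one positive and one negative molecule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold from the highest score down."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every molecule that shares this score so ties yield one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical predictions: label 1 = active, 0 = inactive.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```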

Virtual Screening

Natural products are rich in structural diversity and constitute an ideal source of new active entities in drug discovery. Natural products may be drug candidates themselves or serve as the starting point for an optimization program. A number of natural products databases (such as NANDB and the COCONUT online server) have therefore been built to serve as a bridge for identifying and selecting new hits and lead compounds that can be considered as candidate drugs, and to add value to the plant kingdom.

Currently, the ZINC natural products database is one of the most widely used libraries in virtual screening (VS) [48]. In this work, we screened the ZINC natural products database with the developed and validated QSAR model. To this end, 80617 natural products were downloaded from the ZINC database and filtered according to the Lipinski rules using the LigPrep and Ligand Filtering commands embedded in the Maestro modelling environment [49]. As additional filters, the following rules were applied:

- Molecular weight > 250,

- Number of chiral centres < 2, and

- Number of negative atoms = 0 and number of positive atoms = 1.

In the next step, the compounds that satisfied the above-mentioned criteria were filtered with the QSAR model, and only those predicted as active were selected for further calculations, as described in the workflow [28].
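The filtering cascade described above can be sketched as two predicates applied to precomputed molecular properties. This is an illustration only and not the Maestro ligand-filtering commands: the property keys are hypothetical, and the Lipinski check is written in its strict form (no violation allowed), which is an assumption since the number of tolerated violations is not stated.

```python
def passes_lipinski(mol):
    """Lipinski rule of five, strict form (assumed: no violation tolerated)."""
    return (mol["mol_weight"] <= 500
            and mol["logp"] <= 5
            and mol["h_donors"] <= 5
            and mol["h_acceptors"] <= 10)


def passes_extra_filters(mol):
    """Additional rules listed in the text."""
    return (mol["mol_weight"] > 250
            and mol["chiral_centres"] < 2
            and mol["neg_atoms"] == 0
            and mol["pos_atoms"] == 1)


def prefilter(library):
    """Keep only molecules that satisfy both sets of rules."""
    return [m for m in library if passes_lipinski(m) and passes_extra_filters(m)]


# Hypothetical library entries with precomputed properties.
library = [
    {"id": "mol_a", "mol_weight": 320.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5,
     "chiral_centres": 1, "neg_atoms": 0, "pos_atoms": 1},
    {"id": "mol_b", "mol_weight": 180.2, "logp": 1.0, "h_donors": 1, "h_acceptors": 3,
     "chiral_centres": 0, "neg_atoms": 0, "pos_atoms": 1},   # too light
    {"id": "mol_c", "mol_weight": 610.7, "logp": 6.2, "h_donors": 4, "h_acceptors": 9,
     "chiral_centres": 0, "neg_atoms": 0, "pos_atoms": 1},   # fails Lipinski
]
print([m["id"] for m in prefilter(library)])
```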

Molecular docking

The active molecules resulting from the QSAR-based screening of the ZINC natural products database were ranked by molecular docking, a more restrictive tool. Molecular docking guides the selection of the most promising compounds based on the external and internal interactions between the protein and the selected molecules, and on visual inspection.

The aims of molecular docking are twofold: to predict the conformation, orientation, and position of the ligand within the binding site of its target, and to estimate the affinity energy of the protein-ligand system.

Molecular docking generally rests on two pillars, a search algorithm and a scoring function, and comprises three main steps: preparation of the ligand and protein; a search for the best poses in terms of the orientation, position and conformation of the ligand inside the binding site (search space); and scoring of the resulting poses in order to rank them and select the best ones.

To rank them, the molecules predicted as active by the QSAR model were docked into the ErbB1 target, represented by PDB ID 3POZ, and compared with the reference ligand TAK285 (a molecule in clinical trials). We used the Molegro Virtual Docker software [39] as the molecular docking tool.

Molegro Virtual Docker provides several search algorithms, such as MolDock Optimizer and MolDock SE, and the scoring functions MolDock Score and PLANTS Score.

To validate the protocol for the selected target, redocking was tested on the 3POZ.pdb file, and the optimal parameters were selected according to the resulting RMSD values.
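The RMSD used to judge a redocking run compares the heavy-atom coordinates of the redocked pose with those of the crystallographic ligand, without any superposition, because both sit in the same receptor frame. A minimal sketch is shown below. The coordinates are made up, the atoms are assumed to be listed in the same order in both poses, and the 2.0 Å acceptance cutoff is a commonly used convention assumed here, not a value reported in this work.

```python
import math


def pose_rmsd(reference, pose):
    """RMSD (in the units of the input, e.g. Å) between two poses with matching atom order."""
    if len(reference) != len(pose):
        raise ValueError("both poses must contain the same number of atoms")
    squared = sum((a - b) ** 2
                  for ref_atom, pose_atom in zip(reference, pose)
                  for a, b in zip(ref_atom, pose_atom))
    return math.sqrt(squared / len(reference))


RMSD_CUTOFF = 2.0  # Å; assumed acceptance criterion for a successful redocking

# Hypothetical coordinates (Å) of a three-atom ligand fragment.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.4, 0.0)]
redocked_pose = [(0.3, 0.1, 0.0), (1.7, 0.2, 0.1), (1.4, 1.6, 0.2)]

rmsd = pose_rmsd(crystal_pose, redocked_pose)
print(f"RMSD = {rmsd:.3f} Å, accepted: {rmsd <= RMSD_CUTOFF}")
```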

After validating the molecular docking protocol with the MVD software, the compounds selected in the previous step were docked into the binding site of 3POZ.pdb using the parameters obtained in the redocking step.

The five best-ranked poses were analysed visually and energetically and compared with the reference ligand (TAK285).

Before declaring these molecules as hits, we performed molecular dynamics simulations as the final step in order to confirm the molecular docking results [50].

Molecular Dynamics Simulations

MD simulations are a helpful and widely applied computational method for understanding the behaviour of biological macromolecules. Since MD is based on classical mechanics, Newton’s equations of motion are applied to calculate the position and velocity of each atom of the studied system. MD simulations therefore carry out a more exhaustive conformational search than molecular docking methods and give a more accurate representation of protein motions [51]. For this purpose, four essential parameters were calculated. The root-mean-square deviation (RMSD) is the common technique for verifying the stability of the simulated structure over time. The radius of gyration is also an important indicator of protein stability during the simulation: if the protein remains stable, its radius of gyration plateaus on average. The remaining parameters are the root-mean-square fluctuation (RMSF), and the binding energy and hydrogen bonds [52].
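How Newton’s equations are propagated in time can be illustrated on a one-dimensional toy system. The sketch below uses the velocity Verlet scheme, one common MD integrator, chosen here only for illustration; it is not claimed to be the integrator used in the simulations of this work. The harmonic force, unit mass and time step are arbitrary example values.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation m*a = F(x) for one particle in one dimension."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt      # update position
        a_new = force(x) / mass                 # force at the new position
        v = v + 0.5 * (a + a_new) * dt          # update velocity with averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


# Toy system: harmonic spring with force constant k = 1 and mass m = 1 (arbitrary units).
k, m = 1.0, 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x, mass=m, dt=0.01, n_steps=1000)

x_end, v_end = traj[-1]
energy = 0.5 * m * v_end ** 2 + 0.5 * k * x_end ** 2
print(f"x(t=10) = {x_end:.4f}, exact = {math.cos(10.0):.4f}, total energy = {energy:.5f}")
```

The total energy stays close to its initial value of 0.5, which is the property that makes this family of integrators suitable for long trajectories.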

To gain insight into the stability of the ErbB1 protein-ligand interactions, molecular dynamics (MD) calculations were carried out on the best docking poses. The input files for the MD calculations were generated with the CHARMM-GUI Solution Builder, using CHARMM force field parameters for the protein [6]. The ligand topologies were generated with the CHARMM General Force Field through the ParamChem server.

The CHARMM-GUI Solution Builder comprises five steps. In the first step, the tool reads the coordinates of the protein-ligand complex. The second step solvates the complex and determines the shape and size of the system; Na⁺ and Cl⁻ ions are added in this step to neutralise the system. In the third step, periodic boundary conditions (PBC) are set, which approximate a large system by a unit cell replicated in all directions; the simulation is performed only for the atoms inside the PBC box. Bad contacts are removed in this step by a short minimization. The fourth and fifth steps are equilibration and production. Equilibration is done in two phases, in the NVT and NPT ensembles, to ensure that the system reaches the desired temperature and pressure. The input files for equilibration and production were then downloaded and modified as required, including the number of MD steps and the frequency of trajectory saving and energy calculation.

GROMACS 2020.2 [53] was used for both the equilibration and production runs in all MD calculations. All complexes were initially solvated in a cubic box of TIP3P water, and Na⁺ and Cl⁻ ions were then added by random replacement of water molecules to neutralise the net charge of the whole system. Periodic boundary conditions were imposed according to the shape and size of the system. Non-bonded interactions were treated with a 12 Å cutoff distance, the neighbour list was buffered with the Verlet cutoff scheme, and long-range electrostatic interactions were treated with the particle mesh Ewald (PME) method. The CHARMM36 force field was applied to the protein-ligand complex.

Prior to the production simulation, the energy of the system was minimized with the steepest descent algorithm (5000 steps). The complex was then equilibrated to stabilise its temperature and pressure in the NVT and NPT ensembles, simulating for 125 ps at 300.15 K with positional restraints of 400 kJ mol⁻¹ nm⁻² and 40 kJ mol⁻¹ nm⁻² on the backbone and side chains, respectively. Finally, the complex was subjected to a production run of 100 ns in the NPT ensemble at 300.15 K and 1 bar. The temperature was maintained with the Nose-Hoover thermostat and the pressure with the Parrinello-Rahman barostat. The LINCS algorithm was used to constrain bonds involving hydrogen atoms, using the inputs provided by CHARMM-GUI. The V-rescale thermostat at 300 K with a coupling constant of 01 ps was used. Trajectories were stored every 2 ps [54].

Trajectory analysis

GROMACS utilities were used to analyse the MD simulations. The root mean square deviation (RMSD) of the atomic positions of the ligand and protein was calculated with the gmx rms subprogram after fitting to the protein backbone atoms. Similarly, root mean square fluctuations (RMSF) based on the protein C-alpha atoms were calculated using gmx rmsf. The radius of gyration of all protein atoms was calculated with gmx gyrate, and the number of hydrogen bonds at the protein-ligand interface with gmx hbond. The gmx distance utility was used to calculate the centre-of-mass distance between the protein and the ligand during the simulation. The VMD molecular graphics program was used for trajectory visualisation and protein-ligand contact frequency analysis [55].
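For readers who want to see what two of these quantities measure, the sketch below computes a mass-weighted radius of gyration for one frame and per-atom RMSF over several frames. It is a plain-Python illustration of the definitions, not a re-implementation of the GROMACS tools; the coordinates and masses are invented, and the frames are assumed to have been superimposed on a reference structure beforehand.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of one frame."""
    total = sum(masses)
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    weighted = sum(m * sum((c[k] - com[k]) ** 2 for k in range(3))
                   for m, c in zip(masses, coords))
    return math.sqrt(weighted / total)


def rmsf(frames):
    """Per-atom root mean square fluctuation around the time-averaged position.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples,
    assumed to be already fitted to a common reference.
    """
    n_frames = len(frames)
    values = []
    for i in range(len(frames[0])):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        msd = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3)) for f in frames) / n_frames
        values.append(math.sqrt(msd))
    return values


# Hypothetical two-atom system observed in two frames (coordinates in Å).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsf(frames))
print(radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], masses=[12.0, 12.0]))
```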

Binding Free Energy (MM/PBSA Calculations)

For the systems chosen for further analysis, MM/PBSA (Molecular Mechanics/Poisson-Boltzmann Surface Area) calculations were performed using g_mmpbsa, a GROMACS tool for estimating binding affinity. In general terms, the binding free energy of the protein with the ligand in solvent can be expressed as:

\(\Delta {G_{binding}}=\Delta {G_{complex}} - \,\,(\Delta {G_{protein}}+\Delta {G_{ligand}})\)

where \(\Delta {G_{complex}}\) is the total free energy of the protein–ligand complex, and \(\Delta {G_{protein}}\) and \(\Delta {G_{ligand}}\) are the total free energies of the isolated protein and ligand in solvent, respectively. g_mmpbsa can also be used to estimate the energy contribution of each residue to the binding energy. To decompose the binding energy, \(\Delta {E_{MM}}\), \(\Delta {G_{polar}}\) and \(\Delta {G_{non - polar}}\) were first calculated separately for each residue and then summed to obtain the contribution of each residue to the binding energy. Because g_mmpbsa only reads the files of some specific GROMACS versions, the binary run input file (.tpr) required for the MM/PBSA calculation was regenerated with GROMACS 5.1.4. The molecular structure file (.gro), topology file (.top) and MD-parameter file (.mdp) needed to generate the binary run input file all came from the MD process [56].
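The bookkeeping behind these two steps is simple arithmetic, sketched below with invented energies in kJ/mol. The functions only mirror the expressions given in the text (the difference of free energies, and the per-residue sum of the molecular-mechanics, polar and non-polar terms); they do not reproduce how g_mmpbsa evaluates the individual terms, and the residue names and values are hypothetical.

```python
def binding_free_energy(g_complex, g_protein, g_ligand):
    """Binding free energy: complex minus the sum of isolated protein and ligand."""
    return g_complex - (g_protein + g_ligand)


def residue_contributions(e_mm, g_polar, g_nonpolar):
    """Per-residue contribution: sum of the MM, polar and non-polar terms."""
    return {res: e_mm[res] + g_polar[res] + g_nonpolar[res] for res in e_mm}


# Hypothetical energies (kJ/mol), for illustration only.
print(binding_free_energy(g_complex=-1250.0, g_protein=-1000.0, g_ligand=-150.0))

contrib = residue_contributions(
    e_mm={"MET793": -10.0, "ASP855": -6.0},
    g_polar={"MET793": 4.0, "ASP855": 8.5},
    g_nonpolar={"MET793": -1.5, "ASP855": -0.5},
)
# Residues sorted from the most favourable (most negative) contribution.
for residue, value in sorted(contrib.items(), key=lambda item: item[1]):
    print(residue, value)
```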

ADMET properties prediction

Unfavourable ADMET properties are among the most common causes of failure in the successive stages of drug development [2]. Poor absorption, distribution, metabolism, and excretion (ADME) properties and toxicity (Tox) are the main reasons for these failures. However, current experimental approaches for evaluating ADME-Tox properties are expensive and time-consuming and usually require extensive animal testing. Consequently, ADME-Tox prediction based on computer-aided techniques has become the preferred approach in drug discovery [24]. Once hits or lead compounds are obtained, a series of tests and evaluations are carried out on their pharmacokinetic properties (absorption, distribution, metabolism, and excretion) and toxicity (ADME/T) [57]. For this purpose, QSPR regression or classification models relating molecular descriptors to a target property of interest have been developed to predict various pharmacokinetic and biopharmaceutical properties [28]. The model architecture has gradually changed from the original multivariate linear models, such as the MLR and PLS methods, to nonlinear multivariate methods based on AI algorithms [57].

Similarity test using Principal Component Analysis and K-means algorithms

In order to assess the similarity between the five selected hits and known ErbB1 drugs and drug candidates, we constructed a small dataset from the ChEMBL database, which was then merged with the five structures validated by molecular dynamics simulations. The dataset comprises 17 molecules recognized as active compounds against the ErbB1 enzyme.

Initially, molecular descriptors were generated using the Mordred software (more than 1870 descriptors). Preprocessing steps were then applied, including the removal of descriptors with missing or constant values and the elimination of one descriptor from each pair with a pairwise correlation coefficient exceeding 0.9. Following preprocessing, principal component analysis (PCA) and K-means clustering were performed in turn on the reduced dataset.
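The clustering step can be illustrated with a small pure-Python version of K-means (Lloyd's algorithm) applied to two-dimensional points, which stand in for the first two principal-component scores of each molecule. The points are invented, and seeding the centroids with the first k points is a simplification made for reproducibility; library implementations typically use a smarter initialisation.

```python
def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:        # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Hypothetical (PC1, PC2) scores for six molecules forming two groups.
scores = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centres = kmeans(scores, k=2)
print(labels)
print(centres)
```

Molecules that receive the same cluster label lie close together in the reduced descriptor space, which is the sense in which the hits are compared with the known ErbB1 actives.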

In order to identify novel hit compounds as inhibitors of the ErbB1 protein, we performed a virtual screening of the natural products downloaded from the ZINC database, combining ligand-based and structure-based approaches. Because we chose to explore the large diversity of natural product scaffolds, the QSAR classification model had to be built on a large number of scaffolds. The principle of the ligand-based approach is to construct a QSAR model from an appropriate number of molecules, suitable molecular descriptors, and the best learning algorithm.

In this study, 14690 molecules targeting ErbB1 were collected from the ChEMBL database, which were reduced to 6953 molecules after the pretreatment described in the materials section. This dataset was then split into training and test sets, and molecular descriptors were generated using Dragon version 06 software.

We adopted a classification approach for the following two main reasons:

- The dataset is composed of different scaffolds,

- Quantitative QSAR models are more demanding than qualitative models, whereas our aim is to develop a qualitative model able to classify each new molecule as active or inactive based on the model prediction.

The classification QSAR model is built in order to determine the structural features that affect the molecule class in the training dataset. Our main objective is to build a robust QSAR model able to classify the molecules into active and inactive ones with high prediction accuracy.

The XGBoost model was parameterized using several hyperparameters, as explained in Table 01. These parameters control various aspects of the gradient boosting algorithm. For instance, max_depth determines the complexity of the individual trees, learning_rate dictates the rate at which the model learns, and n_estimators sets the number of trees to be constructed.

The XGBoost classifier was then applied to the training and test datasets, and the results are summarised in the confusion matrix shown in Table 02.

Table 02

Confusion matrix for the XGBoost model
	Actually positive	Actually negative
Predicted positive	716	112
Predicted negative	144	766
Total	860	878

The accuracies of the tested models varied between 84.8% and 85.3%, depending on the hyperparameters used. Notably, we found that leveraging the computational power of the graphical processing unit (GPU) in training, by setting the tree_method hyperparameter to 'gpu_hist', led to a slight improvement in accuracy (from 84.8% to 85.3%). We also experimented with varying the number of trees (n_estimators), testing values of 1000 and 10000. However, we observed that increasing n_estimators did not significantly improve accuracy beyond that obtained with the default value of 100.

Table 03

Performance of the individual XGBoost classifiers
Model parameters	TP	FP	TN	FN	Sensitivity	Specificity	F-Measure	Precision	Accuracy	Recall
Values	716	112	766	144	0.8647	0.8724	0.8647	0.8647	0.8527	0.86

Finally, our selected XGBoost Classifier was configured with the hyperparameters shown in Table 1. This configuration achieved an accuracy of 85.3%, but with only 80 trees (n_estimators = 80), making the model less prone to overfitting. This highlights the importance of hyperparameter tuning in achieving an optimal balance between model complexity and performance (Table 03).

To measure the performance of the model across all possible classification thresholds, the area under the ROC curve (AUC) was used in the present work. It was computed on data from the ChEMBL database consisting of active compounds (identified as drugs against the ErbB1 target), compounds that had been filtered out because of unreliable IC50 values, and decoys sourced from the dude.docking.org database [58]. These compounds were classified into active and inactive categories and then processed in the Maestro and Dragon software to calculate the descriptors, the features that describe the compounds. The descriptors were then used to plot the ROC curve. The AUC estimates the probability that our model will rank a randomly selected positive instance higher than a randomly selected negative instance. An AUC of 100% indicates a perfect model, while an AUC of 50% indicates that the model performs no better than random classification.

In summary, the ROC curve and the AUC of 0.92 shown in Fig. 01 provide a comprehensive evaluation of our model's ability to distinguish between active and inactive compounds. An AUC of 0.92, well above 0.5, denotes strong discriminatory power of the developed model.

The XGBoost model was used to select new molecules from the virtual library of natural products stored in the ZINC database. We first filtered the downloaded natural products according to the Lipinski rule, as explained above. The virtual screening of the resulting 10460 molecules yielded only 36 structures as potential active compounds.

Molecular Docking Analysis

In order to characterise the interaction profiles of the compounds predicted as active by the ErbB1 QSAR model, their best-scored poses from molecular docking were analysed [55]. Molegro Virtual Docker was used to predict the ErbB1–ligand interactions and to visualise the best docked ligands.

The 36 compounds were docked successfully into the binding site of ErbB1; the poses of the five top-ranked compounds are depicted in Fig. 02. The results were compared with those of known reference ErbB1 inhibitors: TAK285, Erlotinib (PDB ID 4I22) [59] and Lapatinib (PDB ID 1XKK) [60].

Energetically, the clinical trial compound TAK285, which serves as the reference ligand, has the best score. Its scoring energy (-203.749 kcal/mol) is about 20 kcal/mol lower than that of the best-ranked screened natural product, and the difference increases to about 40 kcal/mol for the fifth-ranked compound, as listed in Table 4 and shown in Fig. 02.

The two known ErbB1 competitive inhibitors (Lapatinib and Erlotinib) also scored less favourably than the TAK285 reference ligand.

Table 04

Five best compound poses, the reference compound, Lapatinib and Erlotinib, with their H-bond energy, MolDock score and docking interaction details.
Compounds	MolDock Score (kcal/mol)	H-bond score (kcal/mol)	H-bond value	H-bond length (Å)	Interacting amino acid
TAK285	-203.749	-5	-3.095	1.962	MET793
				2.907	ARG841
				2.579	ASN842
Lapatinib	-167.934	-2.5	-2.5	2.960	MET793
Erlotinib	-107.486	-2.5	-2.5	2.853	MET793
ZINC576	-171.898	-7.081	-7.079	2.020	LYS745
				2.391	MET793
				2.496	ALA743
				2.169	LEU788
ZINC469	-183.491	-5.440	-6.631	2.177	MET793
				2.123	ASP855
ZINC699	-162.889	-5.372	-6.550	2.426	MET793
				2.136	ASP855
ZINC901	-178.456	-4.617	-5.747	2.417	MET793
				2.101	ASP855
ZINC908	-176.287	-0.790	-2.049	2.428	THR854
				2.617	ASP855
				2.807	ASP855

According to the results illustrated in Fig. 03 and Table 04, three compounds (ZINC2_469, ZINC4_699 and ZINC6_901) form hydrogen bonds with MET793 and ASP855 of ErbB1. One compound (ZINC2_908) forms three hydrogen bonds: two with ASP855 and one with THR854. In addition, one compound (ZINC9_576) forms hydrogen bonds with LYS745, MET793, ALA743 and LEU788. Notably, the key residue MET793 also forms hydrogen bonds with the three reference ligands (TAK285, Lapatinib and Erlotinib).

Molecular Dynamics Simulations

To analyse the binding stability of the protein-ligand interactions, the best complexes of the ErbB1 protein with bound ligands (the native 3POZ complex and the complexes with ZINC908, ZINC576, ZINC901, ZINC469 and ZINC699) were subjected to 100 ns MD simulations at room temperature. Visualisation of the trajectories after the runs revealed that all ligands except ZINC901 remained bound in the ligand-binding groove of the protein pocket. RMSD, RMSF, radius of gyration, hydrogen bonding, the average center-of-mass (COM) distance between protein and ligand, and the binding free energy (MM/PBSA) were calculated to assess the stability of each structure.

The RMSD graphs (Fig. 04, A and B) show the protein backbone and ligand RMSD for each complex. Backbone RMSD curves are generally very stable after 20 ns of simulation for all complexes, with an average value of 2 Å. Complex RMSD shows the fluctuation of the protein-ligand system as a whole. Complex RMSD values behave like the backbone RMSD except for 3POZ_ZINC901, where the complex RMSD increases suddenly at 40 ns of simulation time because the ligand leaves the binding site. This is also confirmed by visual inspection and by the other analyses. Ligand RMSD is also very stable, at 1 Å or less for the native ligand and for ligands 2, 4 and 5, and at about 2 Å for ligands 1 and 3. The RMSD of ligand 3 (ZINC901) fluctuates strongly throughout the simulation, consistent with the ligand leaving its binding site.
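
For readers unfamiliar with the quantity, the RMSD between a trajectory frame and a reference structure can be sketched as below. This is an illustrative pure-Python version that assumes the two coordinate sets are already superimposed; it is not the GROMACS implementation, and the three-atom coordinates are hypothetical.

```python
import math


def rmsd(coords, ref_coords):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    No superposition is performed; the frames are assumed to be fitted already.
    """
    if len(coords) != len(ref_coords) or not coords:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords, ref_coords)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords))


# Hypothetical three-atom example: every atom is shifted by 1 unit along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
frame = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(frame, reference))
```

The result carries the unit of the input coordinates, so coordinates in nm must be multiplied by 10 to compare with the Å values quoted in the text.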

The radius of gyration analysis (Fig. 05, A) is consistent with the RMSD results, showing very small fluctuations for all complexes, with values ranging between 19.5 Å and 20.1 Å (around 0.5 Å of fluctuation), which indicates the compactness and stability of the protein-ligand systems.
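
The radius of gyration used as a compactness measure is the mass-weighted root-mean-square distance of the atoms from their center of mass. A minimal sketch follows, with hypothetical coordinates and masses; the function name is illustrative.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration about the center of mass."""
    total = sum(masses)
    com = tuple(
        sum(m * p[i] for p, m in zip(coords, masses)) / total for i in range(3)
    )
    weighted_sq = sum(
        m * sum((p[i] - com[i]) ** 2 for i in range(3))
        for p, m in zip(coords, masses)
    )
    return math.sqrt(weighted_sq / total)


# Hypothetical example: two equal masses 4 units apart give Rg = 2.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
masses = [12.0, 12.0]
print(radius_of_gyration(coords, masses))
```

A nearly constant value over the trajectory, as reported above, indicates that the protein neither unfolds nor collapses during the run.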

RMSF was calculated for the protein in each complex based on the Cα atoms using the GROMACS program. Overall, the fluctuations remain below 3.0 Å for all complexes, except for some residues located in loops or turns of the protein (Fig. 04, C).

The total number of hydrogen bonds formed between ligand and protein during the 100 ns of simulation is shown in Fig. 04, D. All the ligands maintain a stable network of hydrogen bonds with the protein, with an average of 2 hydrogen bonds at any time, except for ligand ZINC901, whose hydrogen bonding shows gaps after 40 ns of simulation time, which explains its dissociation from the protein. The average center-of-mass distance between ligand and protein during the 100 ns of simulation is shown in Fig. 05, B. All ligands maintain a stable COM-COM distance of 1 nm (around 10 Å) from the protein except ligand 3, which moves out of its binding site after 40 ns. The potential energy, pressure and temperature of the system during the 100 ns MD simulation, as obtained from the GROMACS edr file, are shown in Fig. 06.
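
The COM-COM distance reported above is the distance between the mass-weighted centers of the protein and ligand atoms in a given frame. The sketch below illustrates the calculation for a single frame; the coordinates (in nm) and masses are hypothetical and the function names are illustrative.

```python
import math


def center_of_mass(coords, masses):
    """Mass-weighted mean position of a set of (x, y, z) points."""
    total = sum(masses)
    return tuple(
        sum(m * p[i] for p, m in zip(coords, masses)) / total for i in range(3)
    )


def com_distance(coords_a, masses_a, coords_b, masses_b):
    """Distance between the centers of mass of two atom groups."""
    return math.dist(
        center_of_mass(coords_a, masses_a),
        center_of_mass(coords_b, masses_b),
    )


# Hypothetical frame: "protein" COM at the origin, "ligand" COM 1 nm away.
protein_xyz = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
protein_mass = [14.0, 14.0]
ligand_xyz = [(0.6, 0.8, 0.0)]
ligand_mass = [12.0]

d_nm = com_distance(protein_xyz, protein_mass, ligand_xyz, ligand_mass)
print(f"{d_nm:.3f} nm = {d_nm * 10:.1f} A")
```

Tracking this distance per frame gives a curve like the one in Fig. 05, B; a sustained rise signals that the ligand is leaving the pocket.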

The graph shows converged potential energy, pressure and temperature throughout the 100 ns simulations. The Molecular Mechanics/Poisson-Boltzmann Surface Area (MM/PBSA) method was selected for rescoring the complexes because it is a fast force-field-based method for computing the binding free energy, compared with other computational free energy methods such as free energy perturbation (FEP) or thermodynamic integration (TI). The MM/PBSA calculation was performed using the g_mmpbsa software. The calculated binding free energies are shown in Table 05.

Table 05

Calculated binding free energies of five best compounds [kJ/mol]
Complex ID	ΔG	van der Waals energy	Electrostatic energy	Polar solvation energy	SASA energy
3POZ	-182.403 ± 30.870	-277.486 ± 16.092	-127.047 ± 25.487	249.699 ± 18.781	-27.568 ± 0.905
ZINC908	-145.363 ± 16.620	-216.692 ± 23.298	-162.473 ± 24.031	257.148 ± 25.670	-23.346 ± 1.677
ZINC576	-162.892 ± 15.215	-253.616 ± 6.602	-79.840 ± 27.931	196.864 ± 30.899	-26.300 ± 0.566
ZINC901	-83.131 ± 55.386	-122.486 ± 71.560	-72.154 ± 60.576	126.113 ± 57.415	-14.604 ± 7.889
ZINC469	-116.500 ± 37.462	-239.014 ± 10.474	-55.185 ± 30.466	204.147 ± 13.902	-26.448 ± 1.067
ZINC699	-149.743 ± 19.561	-246.377 ± 10.317	-126.160 ± 25.120	247.882 ± 26.676	-25.088 ± 0.983
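
The values in Table 05 are consistent with ΔG being the sum of the four listed energy terms (van der Waals, electrostatic, polar solvation and SASA), with no separate entropy term. The sketch below checks this for each complex and ranks the complexes by ΔG; the variable and function names are illustrative.

```python
# Mean energy terms (kJ/mol) copied from Table 05:
# (dG, van der Waals, electrostatic, polar solvation, SASA).
table05 = {
    "3POZ":    (-182.403, -277.486, -127.047, 249.699, -27.568),
    "ZINC908": (-145.363, -216.692, -162.473, 257.148, -23.346),
    "ZINC576": (-162.892, -253.616,  -79.840, 196.864, -26.300),
    "ZINC901": ( -83.131, -122.486,  -72.154, 126.113, -14.604),
    "ZINC469": (-116.500, -239.014,  -55.185, 204.147, -26.448),
    "ZINC699": (-149.743, -246.377, -126.160, 247.882, -25.088),
}


def summed_terms(vdw, elec, polar, sasa):
    """Binding energy as the plain sum of the four reported components."""
    return vdw + elec + polar + sasa


for name, (dg, *terms) in table05.items():
    print(f"{name:8s} reported {dg:9.3f}  summed {summed_terms(*terms):9.3f}")

# Most negative (most favourable) binding free energy first.
ranking = sorted(table05, key=lambda n: table05[n][0])
print(ranking)
```

The reported and summed values agree to within rounding. The large standard deviation for ZINC901 in Table 05 is in line with its departure from the binding site during the simulation.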

Pharmacokinetics ADMET properties prediction

In the present study, we explored the properties of our MD-validated compounds with freely accessible online tools, such as pkCSM [61] and SwissADME [62], in order to verify their ADMET and medicinal properties [34], [63] and to identify possible safer drug candidates [2], [24]. The results are presented in Table 06.

Table 06

ADMET evaluation results of the four screened natural products
Model Name	Unit	ZINC908	ZINC576	ZINC469	ZINC699
Absorption
Human Intestinal absorption	% Absorbed	100	93.321	100	100
Distribution
BBB permeability	log BB	-1.253	0.028	-0.779	-0.825
CNS permeability	log PS	-2.816	-1.984	-2.529	-2.791
Metabolism
CYP2D6 substrate	Yes/No	No	No	No	No
CYP3A4 substrate	Yes/No	Yes	Yes	Yes	Yes
CYP1A2 inhibitor	Yes/No	No	Yes	No	No
CYP2C19 inhibitor	Yes/No	Yes	Yes	Yes	Yes
CYP2C9 inhibitor	Yes/No	Yes	No	Yes	Yes
CYP2D6 inhibitor	Yes/No	No	No	No	No
CYP3A4 inhibitor	Yes/No	Yes	Yes	Yes	Yes
Excretion
Total Clearance	log ml/min/kg	1.260	1.245	0.639	0.838
Toxicity
AMES toxicity	Yes/No	No	Yes	No	No

As shown in Table 06, the screened natural products satisfy all the ADMET criteria except compound ZINC576, which violates two ADMET categories, distribution and metabolism. In general, in silico ADMET prediction procedures are still limited in their precision and coverage for estimating biomedically relevant compound properties; nevertheless, they may provide useful initial filters to eliminate subsets of compounds with a high probability of being toxic or having insufficient bioavailability.
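
As an illustration of how such predictions can serve as an initial filter, the sketch below flags compounds using a few of the Table 06 values. The criteria are assumptions chosen for illustration (a log PS cutoff of -2 for predicted CNS penetration, CYP1A2 inhibition and AMES positivity) and are not rules stated in this study.

```python
# Selected predictions copied from Table 06.
admet = {
    "ZINC908": {"log_ps": -2.816, "cyp1a2_inhibitor": False, "ames_toxic": False},
    "ZINC576": {"log_ps": -1.984, "cyp1a2_inhibitor": True,  "ames_toxic": True},
    "ZINC469": {"log_ps": -2.529, "cyp1a2_inhibitor": False, "ames_toxic": False},
    "ZINC699": {"log_ps": -2.791, "cyp1a2_inhibitor": False, "ames_toxic": False},
}

# Assumed cutoff: log PS above -2 is treated as predicted CNS penetration.
CNS_LOG_PS_CUTOFF = -2.0


def admet_flags(pred):
    """Return the list of illustrative ADMET flags raised by one compound."""
    flags = []
    if pred["log_ps"] > CNS_LOG_PS_CUTOFF:
        flags.append("distribution: predicted CNS penetration")
    if pred["cyp1a2_inhibitor"]:
        flags.append("metabolism: CYP1A2 inhibitor")
    if pred["ames_toxic"]:
        flags.append("toxicity: AMES positive")
    return flags


for name, pred in admet.items():
    print(name, admet_flags(pred) or "no flags")
```

Under these assumed criteria only ZINC576 is flagged, which matches the compound singled out in the text.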

To assess the novelty of the captured hits, we performed a similarity analysis comparing them with potent ErbB1 inhibitors. The PCA algorithm was applied to a dataset of 23 molecules, comprising the five selected hits and 18 established inhibitors, described by 353 descriptors generated with the Mordred software. The first three principal components (PCs) collectively describe 46% of the total data variance, while the first two PCs individually represent 33%. Consequently, the total variance explained by the principal components is 0.56, indicating that together, they account for 56% of the total variance in the dataset.

As depicted in Fig. 07, the first two principal components clearly separate our hits from the remaining molecules, which include both marketed drugs and compounds in clinical trial phases (see Table 07). Moreover, we applied the K-means clustering algorithm to group the molecules into four clusters (Table 08 and Fig. 08). Our five hits clustered together and are represented by the molecule ZINC8764724, which is significantly different from the cluster centers of the other known potent inhibitors in Table 07.

Table 07

Clustering results for the five best compounds and a number of known inhibitors
Molecule_ID	Name	Cluster	PC1	PC2	Com._Name	Max Phase
0	CHEMBL4068839	2	-0.83331	-4.97943
1	CHEMBL939	2	7.305561	-5.67869	GEFITINIB	4
2	CHEMBL3590106	2	1.594856	-5.16333	ULIXERTINIB	2
3	CHEMBL4297865	2	6.115256	3.981508	ABIVERTINIB	3
4	CHEMBL553	2	1.716325	-5.67832	ERLOTINIB	4
5	CHEMBL5095167	1	3.356906	10.18783	BEFOTERTINIB	2
6	CHEMBL2110732	2	6.875112	-3.32659	DACOMITINIB	4
7	CHEMBL1173655	2	7.694708	-2.50953	AFATINIB	4
8	CHEMBL4558324	1	-4.12614	7.374259	LAZERTINIB	3
9	CHEMBL2087361	2	1.970834	-9.44488	ICOTINIB	3
10	CHEMBL545315	2	9.385163	-4.30508	CANERTINIB 2HCl	3
11	CHEMBL3989970	4	5.214741	-7.15509	MAVELERTINIB	2
12	CHEMBL3544964	2	7.946961	-10.01	RAVOXERTINIB	1
13	CHEMBL4650319	1	-3.47326	13.25095	MOBOCERTINIB	4
14	CHEMBL3545308	2	16.84791	11.98766	ROCILETINIB	3
15	CHEMBL4761468	1	-8.24079	8.339925	AUMOLERTINIB	3
16	CHEMBL3545063	1	-0.60507	12.72156	OSIMERTINIB MESYLATE	4
17	CHEMBL3786343	2	2.943855	3.408124	OLMUTINIB	2
18	ZINC8764724	3	-14.5535	-1.33959	2_469
19	ZINC8764728	3	-12.1882	-3.61347	2_908
20	ZINC8792448	3	-12.3917	-4.06662	4_699
21	ZINC8764260	3	-14.6102	-1.52185	6_901
22	ZINC96116466	3	-7.94605	-2.45933	9_576

Table 08

The four compounds representing the cluster centroids
Name	N° of cluster	cluster_centroid = Molecule ID
CHEMBL4650319	1	13
CHEMBL1173655	2	7
ZINC8764724	3	18
CHEMBL3989970	4	11
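
The separation of the hits from the known inhibitors can be summarised from the published coordinates alone. The sketch below groups the Table 07 rows by cluster label and computes the mean PC1 and PC2 of each cluster; it does not rerun PCA or K-means, and the function name is illustrative.

```python
from collections import defaultdict

# (name, cluster, PC1, PC2) copied from Table 07.
rows = [
    ("CHEMBL4068839", 2, -0.83331, -4.97943),
    ("CHEMBL939", 2, 7.305561, -5.67869),
    ("CHEMBL3590106", 2, 1.594856, -5.16333),
    ("CHEMBL4297865", 2, 6.115256, 3.981508),
    ("CHEMBL553", 2, 1.716325, -5.67832),
    ("CHEMBL5095167", 1, 3.356906, 10.18783),
    ("CHEMBL2110732", 2, 6.875112, -3.32659),
    ("CHEMBL1173655", 2, 7.694708, -2.50953),
    ("CHEMBL4558324", 1, -4.12614, 7.374259),
    ("CHEMBL2087361", 2, 1.970834, -9.44488),
    ("CHEMBL545315", 2, 9.385163, -4.30508),
    ("CHEMBL3989970", 4, 5.214741, -7.15509),
    ("CHEMBL3544964", 2, 7.946961, -10.01),
    ("CHEMBL4650319", 1, -3.47326, 13.25095),
    ("CHEMBL3545308", 2, 16.84791, 11.98766),
    ("CHEMBL4761468", 1, -8.24079, 8.339925),
    ("CHEMBL3545063", 1, -0.60507, 12.72156),
    ("CHEMBL3786343", 2, 2.943855, 3.408124),
    ("ZINC8764724", 3, -14.5535, -1.33959),
    ("ZINC8764728", 3, -12.1882, -3.61347),
    ("ZINC8792448", 3, -12.3917, -4.06662),
    ("ZINC8764260", 3, -14.6102, -1.52185),
    ("ZINC96116466", 3, -7.94605, -2.45933),
]


def cluster_means(data):
    """Return {cluster: (mean PC1, mean PC2, member count)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for _, cluster, pc1, pc2 in data:
        acc = sums[cluster]
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {c: (s[0] / s[2], s[1] / s[2], s[2]) for c, s in sums.items()}


means = cluster_means(rows)
for cluster in sorted(means):
    pc1, pc2, n = means[cluster]
    print(f"cluster {cluster}: n={n:2d}  mean PC1={pc1:8.3f}  mean PC2={pc2:8.3f}")
```

Cluster 3, which contains only the five hits, has the most negative mean PC1 of the four clusters, in line with the separation visible in Fig. 07.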

In the present study, a virtual screening approach combining QSAR, molecular docking and molecular dynamics simulation was used to identify novel anticancer natural products targeting the ErbB1 protein. The XGBoost classifier proved robust and predictive in the QSAR model for screening new molecules as ErbB1 inhibitors: the QSAR classifier model showed a good accuracy of 85% and an AUC of 92%.

The 36 natural products selected by the QSAR classifier from the ZINC natural products were then subjected to molecular docking. Based on the docking scores and the hydrogen bonds formed with the key residues of the ErbB1 protein, the selected natural products show results comparable to those of the reference ligand TAK285 and the known inhibitors Erlotinib and Lapatinib.

The five hits obtained from molecular docking were then processed through molecular dynamics simulation. All the hits showed good stability in the binding site during the simulation except compound ZINC901.

The ADMET filters were applied to the four remaining hits, which showed good ADMET properties except ZINC576, which violated the distribution and metabolism criteria.

Author Contribution

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Fatah Bouchama, Khairedine Kraim, Youcef Saihi, Mohammed Brahimi, Khorief Nacereddine Abdelmalek, Karima Mezghiche, Djerourou Abdelhafidh and Taha Mutasem. The first draft of the manuscript was written by Fatah Bouchama, Khairedine Kraim, Taha Mutasem and Mohammed Brahimi, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding Declaration:

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

A. H. Abbas et al., “New picolinic acid derivatives: Synthesis, docking study and anti-EGFR kinase inhibitory effect,” Materials Today: Proceedings, 2022, doi: 10.1016/j.matpr.2021.05.354.
A. H. Abbas et al., “WITHDRAWN: New picolinic acid derivatives: Synthesis, docking study and anti-EGFR kinase inhibitory effect,” Materials Today: Proceedings, May 2021, doi: 10.1016/j.matpr.2021.05.354.
S. K. Gautam et al., “Blocking c-MET/ERBB1 Axis Prevents Brain Metastasis in ERBB2+ Breast Cancer,” Cancers, vol. 12, no. 10, 2020, doi: 10.3390/cancers12102838.
K. Aertgeerts et al., “Structural analysis of the mechanism of inhibition and allosteric activation of the kinase domain of HER2 protein.,” J Biol Chem, vol. 286, no. 21, pp. 18756–18765, May 2011, doi: 10.1074/jbc.M110.206193.
J. B. Hu, M. J. Dong, and J. Zhang, “A holistic in silico approach to develop novel inhibitors targeting ErbB1 and ErbB2 kinases,” Tropical Journal of Pharmaceutical Research, vol. 15, no. 2, pp. 231–239, Feb. 2016, doi: 10.4314/tjpr.v15i2.3.
J. Kästner, H. H. Loeffler, S. K. Roberts, M. L. Martin-Fernandez, and M. D. Winn, “Ectodomain orientation, conformational plasticity and oligomerization of ErbB1 receptors investigated by molecular dynamics.,” J Struct Biol, vol. 167, no. 2, pp. 117–128, Aug. 2009, doi: 10.1016/j.jsb.2009.04.007.
R. Xu, G. K. Povlsen, V. Soroka, E. Bock, and V. Berezin, “A peptide antagonist of the ErbB1 receptor inhibits receptor activation, tumor cell growth and migration in vitro and xenograft tumor growth in vivo.,” Cell Oncol, vol. 32, no. 4, pp. 259–274, Jan. 2010, doi: 10.3233/CLO-2010-0515.
S. Fatima and S. M. Agarwal, “Exploring structural features of EGFR–HER2 dual inhibitors as anti-cancer agents using G-QSAR approach,” Journal of Receptors and Signal Transduction, vol. 39, no. 3, pp. 243–252, May 2019, doi: 10.1080/10799893.2019.1660896.
A. E. Maennling et al., “Molecular Targeting Therapy against EGFR Family in Breast Cancer: Progress and Future Potentials,” Cancers, vol. 11, no. 12, 2019, doi: 10.3390/cancers11121826.
P. Wee and Z. Wang, “Epidermal Growth Factor Receptor Cell Proliferation Signalling Pathways.,” Cancers (Basel), vol. 9, no. 5, May 2017, doi: 10.3390/cancers9050052.
R. C. Feiner and K. M. Müller, “Recent progress in protein-protein interaction study for EGFR-targeted therapeutics,” Expert Review of Proteomics, vol. 13, no. 9, pp. 817–832, Sep. 2016, doi: 10.1080/14789450.2016.1212665.
T. T. H. Hajalsiddig, A. B. M. Osman, and A. E. M. Saeed, “2D-QSAR Modeling and Molecular Docking Studies on 1H-Pyrazole-1-carbothioamide Derivatives as EGFR Kinase Inhibitors,” ACS Omega, vol. 5, no. 30, pp. 18662–18674, Aug. 2020, doi: 10.1021/acsomega.0c01323.
S. Rampogu et al., “Targeting natural compounds against HER2 kinase domain as potential anticancer drugs applying pharmacophore based molecular modelling approaches,” Computational Biology and Chemistry, vol. 74, pp. 327–338, Jun. 2018, doi: 10.1016/j.compbiolchem.2018.04.002.
R. Concu and M. N. Diassoeiro, “MDPI MOL2NET, International Conference Series on Multidisciplinary Sciences A novel QSAR model to predict epidermal growth factor inhibitors,” 2017, doi: 10.3390/mol2net-03-xxxx.
M. Al-Anazi, B. O. Al-Najjar, and M. Khairuddean, “Structure-Based Drug Design Studies Toward the Discovery of Novel Chalcone Derivatives as Potential Epidermal Growth Factor Receptor (EGFR) Inhibitors,” Molecules, vol. 23, no. 12, 2018, doi: 10.3390/molecules23123203.
X. Ding, C. Tong, R. Chen, X. Wang, D. Gao, and L. Zhu, “Systematic molecular profiling of inhibitor response to the clinical missense mutations of ErbB family kinases in human gastric cancer,” Journal of Molecular Graphics and Modelling, vol. 96, p. 107526, May 2020, doi: 10.1016/j.jmgm.2019.107526.
A. J. Shih, J. Purvis, and R. Radhakrishnan, “Molecular systems biology of ErbB1 signalling: bridging the gap through multiscale modelling and high-performance computing,” Mol. BioSyst., vol. 4, no. 12, pp. 1151–1159, 2008, doi: 10.1039/B803806F.
G. S. Omenn, Y. Guan, and R. Menon, “A new class of protein cancer biomarker candidates: Differentially expressed splice variants of ERBB2 (HER2/neu) and ERBB1 (EGFR) in breast cancer cell lines,” Journal of Proteomics, vol. 107, pp. 103–112, Jul. 2014, doi: 10.1016/j.jprot.2014.04.012.
S. J. Kaspersen et al., “Synthesis and in vitro EGFR (ErbB1) tyrosine kinase inhibitory activity of 4-N-substituted 6-aryl-7H-pyrrolo[2,3-d]pyrimidine-4-amines,” European Journal of Medicinal Chemistry, vol. 46, no. 12, pp. 6002–6014, Dec. 2011, doi: 10.1016/j.ejmech.2011.10.012.
M. J. Akhtar et al., “Design, synthesis, docking and QSAR study of substituted benzimidazole linked oxadiazole as cytotoxic agents, EGFR and erbB2 receptor inhibitors,” European Journal of Medicinal Chemistry, vol. 126, pp. 853–869, 2017, doi: 10.1016/j.ejmech.2016.12.014.
V. Ratushny, I. Astsaturov, B. A. Burtness, E. A. Golemis, and J. S. Silverman, “Targeting EGFR resistance networks in head and neck cancer,” Cellular Signalling, vol. 21, no. 8, pp. 1255–1268, Aug. 2009, doi: 10.1016/j.cellsig.2009.02.021.
C. A. Carter, R. J. Kelly, and G. Giaccone, “Small-molecule inhibitors of the human epidermal receptor family.,” Expert Opin Investig Drugs, vol. 18, no. 12, pp. 1829–1842, Dec. 2009, doi: 10.1517/13543780903373343.
Food and Drug Administration, “U. S FOOD & DRUG ADMINISTRATION,” Sep. 2016. https://www.fda.gov/
H. Andleeb et al., “Theoretical and computational insight into the supramolecular assemblies of Schiff bases involving hydrogen bonding and CH…π interactions: Synthesis, X-ray characterization, Hirshfeld surface analysis, anticancer activity and molecular docking analysis,” Journal of Molecular Structure, vol. 1235, p. 130223, Jul. 2021, doi: 10.1016/j.molstruc.2021.130223.
R. Ancuceanu, B. Tamba, C. S. Stoicescu, and M. Dinu, “Use of QSAR Global Models and Molecular Docking for Developing New Inhibitors of c-src Tyrosine Kinase.,” Int J Mol Sci, vol. 21, no. 1, Dec. 2019, doi: 10.3390/ijms21010019.
V. T. Sabe et al., “Current trends in computer aided drug design and a highlight of drugs discovered via computational techniques: A review,” European Journal of Medicinal Chemistry, vol. 224, p. 113705, Nov. 2021, doi: 10.1016/j.ejmech.2021.113705.
A. Ogunleye and Q. -G. Wang, “XGBoost Model for Chronic Kidney Disease Diagnosis,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 6, pp. 2131–2140, Dec. 2020, doi: 10.1109/TCBB.2019.2911071.
E. Glaab, “Building a virtual ligand screening pipeline using free software: a survey.,” Brief Bioinform, vol. 17, no. 2, pp. 352–366, Mar. 2016, doi: 10.1093/bib/bbv037.
M. Zhao et al., “2D-QSAR and 3D-QSAR Analyses for EGFR Inhibitors,” BioMed Research International, vol. 2017, p. 4649191, May 2017, doi: 10.1155/2017/4649191.
C. La Motta, S. Sartini, T. Tuccinardi, E. Nerini, F. Da Settimo, and A. Martinelli, “Computational Studies of Epidermal Growth Factor Receptor: Docking Reliability, Three-Dimensional Quantitative Structure−Activity Relationship Analysis, and Virtual Screening Studies,” J. Med. Chem., vol. 52, no. 4, pp. 964–975, Feb. 2009, doi: 10.1021/jm800829v.
J. J. Irwin et al., “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery,” J. Chem. Inf. Model., vol. 60, no. 12, pp. 6065–6073, Dec. 2020, doi: 10.1021/acs.jcim.0c00675.
M. Davies et al., “ChEMBL web services: streamlining access to drug discovery data and utilities,” Nucleic Acids Research, vol. 43, no. W1, pp. W612–W620, Jul. 2015, doi: 10.1093/nar/gkv352.
Dassault Systèmes, “BIOVIA Discovery Studio Visualizer.” San Diego: Dassault Systèmes, 175 Wyman Street Waltham, Massachusetts 02451-1223 USA, 2021. [Online]. Available: https://www.3ds.com/products-services/biovia/products/molecular-modeling-simulation/biovia-discovery-studio/
D. Jiang, T. Lei, Z. Wang, C. Shen, D. Cao, and T. Hou, “ADMET evaluation in drug discovery. 20. Prediction of breast cancer resistance protein inhibition through machine learning,” Journal of Cheminformatics, vol. 12, no. 1, p. 16, Mar. 2020, doi: 10.1186/s13321-020-00421-y.
O. Hermansyah, A. Bustamam, and A. Yanuar, “Virtual screening of dipeptidyl peptidase-4 inhibitors using quantitative structure–activity relationship-based artificial intelligence and molecular docking of hit compounds,” Computational Biology and Chemistry, vol. 95, p. 107597, Dec. 2021, doi: 10.1016/j.compbiolchem.2021.107597.
R. Todeschini and V. Consonni, Handbook of Molecular Descriptors. Wiley, 2000. doi: 10.1002/9783527613106.
S. E. Fioressi, D. E. Bacelo, and P. R. Duchowicz, “QSAR study of human epidermal growth factor receptor (EGFR) inhibitors: conformation-independent models,” Medicinal Chemistry Research, vol. 28, no. 11, pp. 2079–2087, Nov. 2019, doi: 10.1007/s00044-019-02437-y.
Talete srl, “Dragon 7.” https://www.talete.mi.it/, TALETE srl, Via V. Pisani, 13 - 20124 Milano - Italy. [Online]. Available: https://www.talete.mi.it/contact/contact.htm
Molexus Computational Drug Discovery, “Molegro Data Modeler.” Molexus IVS, Rorth Ellevej 3, Rorth DK-8300 Odder Denmark VAT no. 39905442, Aug. 2014.
Python Software Foundation, “Python Programming languages,” SPDX short identifier: PSF-2.0, 2023, [Online]. Available: https://www.python.org/
Google Research, “Colaboratory (Colab).” [Online]. Available: https://colab.research.google.com/
C. Chen et al., “Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier,” Computers in Biology and Medicine, vol. 123, p. 103899, Aug. 2020, doi: 10.1016/j.compbiomed.2020.103899.
N. Q. Le, D. T. Do, F.-Y. Chiu, E. K. Yapp, H.-Y. Yeh, and C.-Y. Chen, “XGBoost Improves Classification of MGMT Promoter Methylation Status in IDH1 Wild Type Glioblastoma,” Journal of Personalized Medicine, vol. 10, no. 3, 2020, doi: 10.3390/jpm10030128.
B. Noh et al., “XGBoost based machine learning approach to predict the risk of fall in older adults using gait outcomes,” Scientific Reports, vol. 11, no. 1, p. 12183, Jun. 2021, doi: 10.1038/s41598-021-91797-w.
J. Dj. Novaković, A. Veljović, S. S. Ilić, Ž. Papić, and M. Tomović, “Evaluation of Classification Models in Machine Learning,” 2017.
C. Chen et al., “DNN-DTIs: Improved drug-target interactions prediction using XGBoost feature selection and deep neural network,” Computers in Biology and Medicine, vol. 136, p. 104676, Sep. 2021, doi: 10.1016/j.compbiomed.2021.104676.
L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu, “Machine learning–XGBoost analysis of language networks to classify patients with epilepsy,” Brain Informatics, vol. 4, no. 3, pp. 159–169, Sep. 2017, doi: 10.1007/s40708-017-0065-7.
F. López-Vallejo et al., “Integrating virtual screening and combinatorial chemistry for accelerated drug discovery.,” Comb Chem High Throughput Screen, vol. 14, no. 6, pp. 475–487, Jul. 2011, doi: 10.2174/138620711795767866.
Schrödinger, “Maestro Software,” 2023, [Online]. Available: https://www.schrodinger.com/
A. Boudjedir, K. Kraim, Y. Saihi, O. Attoui-Yahia, F. Ferkous, and A. Khorief Nacereddine, “A computational molecular docking study of camptothecin similars as inhibitors for topoisomerase 1,” Structural Chemistry, vol. 32, no. 2, pp. 689–697, Apr. 2021, doi: 10.1007/s11224-020-01633-6.
L. H. S. Santos, R. S. Ferreira, and E. R. Caffarena, “Integrating Molecular Docking and Molecular Dynamics Simulations,” in Docking Screens for Drug Discovery, W. F. de Azevedo Jr., Ed., New York, NY: Springer New York, 2019, pp. 13–34. doi: 10.1007/978-1-4939-9752-7_2.
V. Assadollahi, B. Rashidieh, M. Alasvand, A. Abdolahi, and J. A. Lopez, “Interaction and molecular dynamics simulation study of Osimertinib (AstraZeneca 9291) anticancer drug with the EGFR kinase domain in native protein and mutated L844V and C797S,” Journal of Cellular Biochemistry, vol. 120, no. 8, pp. 13046–13055, Aug. 2019, doi: 10.1002/jcb.28575.
GROMACS development team, “GROMACS 2020.2,” 2020.
F. Wang, W. Yang, H. Liu, and B. Zhou, “Identification of the structural features of quinazoline derivatives as EGFR inhibitors using 3D-QSAR modelling, molecular docking, molecular dynamics simulations and free energy calculations,” Journal of Biomolecular Structure and Dynamics, vol. 40, no. 21, pp. 11125–11140, Dec. 2022, doi: 10.1080/07391102.2021.1956591.
F. Sangande, E. Julianti, and D. H. Tjahjono, “Ligand-Based Pharmacophore Modeling, Molecular Docking, and Molecular Dynamic Studies of Dual Tyrosine Kinase Inhibitor of EGFR and VEGFR2,” International Journal of Molecular Sciences, vol. 21, no. 20, 2020, doi: 10.3390/ijms21207779.
N. Moussa, A. Hassan, and S. Gharaghani, “Pharmacophore model, docking, QSAR, and molecular dynamics simulation studies of substituted cyclic imides and herbal medicines as COX-2 inhibitors,” Heliyon, vol. 7, no. 4, Apr. 2021, doi: 10.1016/j.heliyon.2021.e06605.
L. Wang, J. Ding, L. Pan, D. Cao, H. Jiang, and X. Ding, “Artificial intelligence facilitates drug design in the big data era,” Chemometrics and Intelligent Laboratory Systems, vol. 194, p. 103850, Nov. 2019, doi: 10.1016/j.chemolab.2019.103850.
M. M. Mysinger, M. Carchia, J. J. Irwin, and B. K. Shoichet, “Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking,” Journal of Medicinal Chemistry, vol. 55, no. 14, pp. 6582–6594, Jul. 2012, doi: 10.1021/jm300687e.
K. S. Gajiwala et al., “Insights into the aberrant activity of mutant EGFR kinase domain and drug recognition.,” Structure, vol. 21, no. 2, pp. 209–219, Feb. 2013, doi: 10.1016/j.str.2012.11.014.
E. R. Wood et al., “A unique structure for epidermal growth factor receptor bound to GW572016 (Lapatinib): relationships among protein conformation, inhibitor off-rate, and receptor activity in tumor cells.,” Cancer Res, vol. 64, no. 18, pp. 6652–6659, Sep. 2004, doi: 10.1158/0008-5472.CAN-04-1168.
D. E. V. Pires, T. L. Blundell, and D. B. Ascher, “pkCSM: Predicting Small-Molecule Pharmacokinetic and Toxicity Properties Using Graph-Based Signatures,” Journal of Medicinal Chemistry, vol. 58, no. 9, pp. 4066–4072, May 2015, doi: 10.1021/acs.jmedchem.5b00104.
A. Daina, O. Michielin, and V. Zoete, “SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules,” Scientific Reports, vol. 7, no. 1, p. 42717, Mar. 2017, doi: 10.1038/srep42717.
S. H. Abdullahi, A. Uzairu, G. A. Shallangwa, S. Uba, and A. B. Umar, “In-silico activity prediction, structure-based drug design, molecular docking and pharmacokinetic studies of selected quinazoline derivatives for their antiproliferative activity against triple negative breast cancer (MDA-MB231) cell line,” Bulletin of the National Research Centre, vol. 46, no. 1, p. 2, Jan. 2022, doi: 10.1186/s42269-021-00690-z.

No competing interests reported.
