Virtual screening of PEBP1 inhibitors by combining 2D/3D-QSAR analysis, hologram QSAR, homology modeling, molecular docking analysis, and molecular dynamic simulations

Human phosphatidylethanolamine binding protein 1 (hPEBP1) is a novel target affecting many cellular signaling pathways involved in the formation of metastases. It can be used in the treatment of many cases of cancer. For these reasons, pharmaceutical companies use computational approaches, including multi-QSAR (2D, 3D, and hologram QSAR) analysis, homology modeling, molecular docking analysis, and molecular dynamic simulations, to speed up the drug discovery process. In this paper, QSAR modeling was conducted using two quantum chemistry optimization methods (AM1 and DFT levels). As per PLS results, we found that the DFT/B3LYP method presents high predictability according to 2D-QSAR, CoMFA, CoMSIA, and hologram QSAR studies, with Q2 of 0.81, 0.67, 0.79, and 0.67, and external power with R2pred of 0.78, 0.58, 0.66, and 0.56, respectively. This result has been validated by CoMFA/CoMSIA graphics, which suggests that electrostatic fields combined with hydrogen bond donor/acceptor fields are beneficial to the antiproliferative activity. While the hologram QSAR models show the contributions of each fragment in improving the activity. The results from QSAR analyses revealed that ursolic acids with heterocyclic rings could improve the activities. Ramachandran plot validated the modeled PEBP1 protein. Molecular docking and MD simulations revealed that the hydrophobic and hydrogen bond interactions are dominant in the PEBP1’s pocket. These results were used to predict in silico structures of three new compounds with potential anticancer activity. Similar molecular docking stability studies and molecular dynamics simulations were conducted.


Introduction
According to GLOBOCAN 2018 (Global Cancer Statistics), cancer continues to spread at an alarming rate around the world, with 18.1 million new cases and 9.6 million deaths in 2018 [1]. Human phosphatidylethanolamine-binding protein 1 (hPEBP1), a novel member of the PEBP family of newly genes recently updated in the Universal Protein Resource UniProt (https:// www. unipr ot. org/), has been implicated in a variety of cellular processes, including "signal transduction, cell cycle, inflammation, adhesion, proliferation, differentiation, apoptosis, autophagy, circadian rhythm, and mitotic spindle checkpoint" [2]. Furthermore, this gene encodes a member of the PEBP1 family of proteins and regulates multiple signaling pathways, namely the MAPK, NF-kappa B [3], and glycogen synthase kinase-3 (GSK-3 signaling pathways) [4]. In addition, it can also inhibit the RAF1 kinase activity through inhibiting its activation and by dissociating the RAF1/ MEK complex and acting as a competitive inhibitor of MEK phosphorylation [5]. Also, it is associated with many human cancers by acting as a metastasis suppressor gene. Likewise, with multiple query sequences in the genome, it is necessary for the commitment of scientists and organizations to explore novel approaches for discovering the specific role of PEBP1 protein, which would give us new ways in cancer prevention and therapy.
Quantitative structure-activity relationship (QSAR) is a mathematical model describing the relationship between biological activities and the descriptors. The descriptors include physicochemical properties, various stereoelectronic characteristics, topological indices, and presence/absence of functional groups, and quantum chemistry (QC) descriptors [6]. The selection of accurate quantum chemical methods is of central importance in QSAR studies [7]. Semiempirical methods (like AM1) use empirical or experimental parameters to deal with the Schrödinger equation and omit some molecular integral calculations, so they are much faster than the DFT method, which is why it is utilized more widely. But the accuracy of their results is inferior to the DFT method. The QC calculation methods are rapidly improving as computer technology and quantum chemistry advance. Higher level methods such as DFT are being used more and more widely now. AM1 does not include the effects of instant electronic correlation, while the DFT-B3LYP method takes them into account [8]. Therefore, the choice of the quantum chemical method may produce errors in the calculation of the molecular descriptor and, ultimately, a lack of adjustment in QSAR model development. Whereas in the CoMFA approach, we calculate the steric and electrostatic potentials around the molecules and we relate the variations of the field calculated to the biological activity [9]. While the CoMSIA strategy is introduced as an extension of the CoMFA analysis [10], with the five similarity indices, namely hydrophobic effects and hydrogen-bond donor/acceptor potentials, however, as a result of the CoMFA and CoMSIA analyses, contour areas are undertaken to locate the favored or unfavored regions surrounding molecules that would be needed to increase the biological activity. On the other hand, CoMFA and CoMSIA require the best alignment of molecules to get better results, which poses a major problem in many cases of 3D-QSAR studies. To overcome this problem, the newly developed hologram QSAR (HQSAR) approach [11], which does not take into account molecular alignment in the QSAR model development, is used solve this problem. Homology modeling is a commonly known as one of the structure-based methods that can predict with high accuracy the 3D structure of a protein from its amino acid sequence. Therefore, computational structural determination methods are needed to bridge this growing gap between the number of available sequences and the 3D structures of experimentally resolved proteins [12]. Furthermore, to explore the binding affinity of the comparative protein model, molecular docking simulation was used to select the best ligand-protein interactions [13], while molecular dynamics simulation was investigated their stability into binding pocket [14].
Ursolic acid (3β-hydroxy-urs-12-en-28-oic acid) is a pentacyclic terpenoid, usually present in the stem bark, leaves, or fruit peel [15]. Recently, it has been reported that ursolic acid is one of the naturally abundant pentacyclic triterpene acids and has vast pharmacological activities including antidiabetic, antioxidative [16], antiHIV, antihepatodamage, antimalarial, antimicrobial, cardiovascular, immunomodulatory effects, anti-inflammatory, and antitumor activities [17,18]. Ursolic acid has the following characteristics: low toxicity, liver protection, and potential anti-cancer metastasis [19,20]. In addition, recent progress in the clinical trial proved that ursolic acid exhibits many anticancer effects with acceptable toxicity [21][22][23]. However, it is imperative to continue the research into the mechanism of action and signaling pathway studies of ursolic acid as a new anticancer drug.
The aim of this study was to select a good quantum chemistry (QC) method to obtain the most relevant descriptors and establish the best 2D-QSAR model. Meanwhile, the 3D-QSAR analysis has been undertaken to select the structural requirements needed to enhance the antiproliferative activities of ursolic acids through CoMFA, CoMSIA, and HQSAR analyses based on the studied quantum chemistry methods. Whereas, the homology modeling was carried out on the PEBP1 query sequence as a new target protein and its binding pocket was investigated by molecular docking simulation to explore the potential affinity of ursolic acid heterocyclic derivatives against the modeled PEBP1 protein.
At the end of this study, we predicted the design of three new heterocyclic ursolic acid derivatives with higher in silico activities than the most active compound (M30) of the series studied, and then evaluated their binding affinity and stability, respectively through molecular docking and molecular dynamics simulations.

Dataset treatment
A series of new synthesized ursolic acid (UA) derivatives (32 compounds of which only 30 have determined antiproliferative activities) has been selected from the literature [24] in order to develop, among others, 2D/3D and hologram-QSAR models. In this study, the antiproliferative activity IC 50 values (in μM) were converted to − Log IC 50 , as reported in Table 1. Then, all compounds were sketched and initially optimized by the MM2 force field in Chem-BioOffice software (15.1.0, PerkinElmer, Waltham, MA, USA). After that, we randomly selected 25 compounds as training set for QSAR development and 5 compounds as the test set to check the predictability of the retained QSAR models.

Computational details
Ursolic acids geometry optimization was performed with Gaussian 09 software [25] using two methods. The first optimisation was carried out by the semi-empirical method (AM1), followed by calculating excitation energies through the semi-empirical ZINDO method for structures optimized by the AM1 method. The second method adopted is empiric with the optimization that was determined by the DFT (B3LYP) method [26] with the 6-31G basis set [27,28], and the time-dependent density functional theory (TD-DFT) method [29,30], which was used to calculate the absorption energy. Based on the calculation methods used, the descriptors determined are the energy of the lowest unoccupied molecular orbital (E LUMO ), the energy of the highest occupied molecular orbital (E HOMO ), the frontier orbital energy (Egap), the dipole moment (DM), the energies E (ZINDO) and E (TD-DFT) , the oscillator strength (f), the excitation energy (ΔE), and the maximum wavelength (λmax) (Supplementary Data: Table S1). Supplementary topological descriptors were determined by using the ChemBioOffice 15.1 program. Consequently, two separate matrices were obtained, each comprising 38 descriptors, and then statistical tools were employed to select the most relevant among them. Those data sets were applied to determine the influence of quantum chemical methods on the QSAR models' performance.

Statistical analysis
Partial least squares (PLS) method Statistical analyses were conducted through QSARINS [31] and XLSTAT software. Then 2D-QSAR studies were investigated using PLS analysis at the 99% confidence level. The model's quality was assessed with R-squared (R 2 coefficient), R 2 adj (R 2 adjustment), the RMSE (root mean square error), the Studenttest and its probability (t-test, p-value), and the Fisher-test (F value, p-value) [32,33]. Furthermore, the collinearity between descriptors in the model was fixed by the correlation matrix and the variance inflation factor (VIF) [32,34].

2D-QSAR model assessment
Before any employment of the 2D-QSAR study in the prediction of new compounds, the relevant statistical metrics must check the predictive power of the retained 2D-QSAR model [35,36]. The criteria commonly employed to check the internal reliability of the QSAR model are the leave-one-out (LOO) strategy [37]. The latter allows the Q 2 coefficient to be determined as shown in Eq. 1. For an acceptable 2D-QSAR model, the Q 2 must be greater than 0.5 [37]. Furthermore, the Y-randomization test [38] was examined to avoid chance correlation between the retained descriptors and antiproliferative activity in the selected QSAR model. However, the external predictive capability was tested by the R 2 pred metric (Eq. 2).
Here, y i and y ipred are respectively the experimental and predicted pIC 50 values of each molecule (training set/test set), y Train represents the average pIC 50 values of the molecules in the training set.
Applicability domain (AD) Before any in silico prediction of the activities of new molecules, we should define the AD as reported by the Organization for Economic Cooperation and Development (OECD) QSAR validation. The AD of the developed QSAR model is defined as the chemical space in which the biological activity of new molecules can be predicted with confidence [39]. The prediction is considered reliable only if the predicted (1) compound is situated inside the domain of applicability. The data set was examined for the possible existence of outliers through William plot analysis. Herein, the plot shows the standardized residuals in relation to leverage. In this case, if the high lever (h) is greater than the warning lever (h*), the compounds have a significant influence on the predicted model [40,41].

Molecular alignment
Since the quality of a 3D-QSAR model is based on the choice of the correct molecular alignment [42], for the comparison and the development of the optimal 3D-QSAR models, we have investigated alignment by two different optimization methods to reach the best quantum level that will allow having the best alignment. Alignment I, as depicted in Fig. 1a, was studied by the semi-empirical (AM1) method, and the DFT/B3LYP optimization method was investigated to get alignment II (Fig. 1b). From both alignments, the distill-rigid strategy as introduced in SYBYL-X 2.0 software [43] was carried out to select the common scaffold. For alignment, the highly active compound 30 (pIC 50 = 5.72) was used as a molecule model, and the remaining 29 molecules were superimposed on it.

CoMFA strategy
The CoMFA approach was used to develop the reliable 3D-QSAR model based on the aligned compounds. The steric (S) field was calculated by Lennard-Jones potentials and Coulombic (electrostatic) potentials were calculated in a 3D grid regularly spaced (2.0 Å between two consecutive points) [40,44]. A carbon-probe sp3 hybridized with + 1 charge was fixed to calculate the steric (S)/electrostatic (E) potential effects [40]. The column filtering was set at 1.0 kcal/mol to reduce noise and improve the results of the constructed 3D-QSAR model. The cut-off value for steric and electrostatic fields was set to 30 kcal/mol.

CoMSIA strategy
The CoMSIA strategy was done with the standard grid box as employed in CoMFA studies. Besides the (S) and (E) potentials, the CoMSIA hydrophobic (H) and hydrogen-bond donor (HBD) and hydrogen-bond acceptor (HBA) descriptors were calculated to assess the similarity between compounds. The CoMSIA descriptors were calculated using a carbon-probe [40] with a + 1 charge for HBD and HBA at each lattice point. We have set column filtering at 1.0 kcal/mol to improve the performance of the CoMSIA model.

Partial least squares (PLS) analysis and 3D-QSAR models assessment
The PLS method was used to correlate the CoMFA and CoMSIA descriptors to biological activities to develop reliable 3D-QSAR models [40]. The Q 2 metric (Eq. 1) has been employed to set the optimum number of components (NOC) to achieve the high predictive power of the established 3D-QSAR models. As a result, the Q 2 metric (Q 2 > 0.5) allowed us to check the internal reliability of 3D-QSAR models [45,46]. Furthermore, the external assessment of the developed 3D-QSAR models was assessed by the external set (test set: 5 molecules). Therefore, the predictive R-squared coefficient (R 2 pred ) was calculated to assess the external reliability of the retained 3D-QSAR model.

Hologram QSAR (HQSAR)
3D-QSAR methods have been demonstrated to be reliable tools in drug design. However, the complexity of the choice of optimization methods (stable geometries) and the structural alignment method causes serious problems in the development of the 3D-QSAR model [11]. Furthermore, hologram QSAR (HQSAR) is a new approach developed to solve the problems facing 3D QSAR analysis. HQSAR uses molecular holograms (fragment fingerprints) as descriptors to determine which fragments contribute to the activity [11]. In this study, HQSAR analysis was carried out by using the following parameters: hologram length (HL) values (53 to 401) and the fragment distinction; atoms (A), bonds (B), connections (C), hydrogen atoms (H), chirality (Ch), and donor/ acceptor (D/A). The setting atom count in fragments was set at between 4 and 7 atoms in each fragment. The PLS technique was used to develop the HQSAR model and then validated by Q 2 , R 2 , and RMSE values.

Templates searching and modeled PEBP1 protein quality
The amino acid sequence of PEBP1 homo sapiens (ID: P30086) has been uploaded from further homology modeling. The BLASTp program was enrolled to search the best template protein with a high identity with the query sequence. After an alignment search, we have chosen the crystalized structure (PDB ID: 6ens) as the template protein. This template was uploaded from PDB the database (https:// www. rcsb. org/). The homology model of PEBP1 was built with the SWISS-MODEL web server [47], and the accuracy of the developed homology model was checked by Ramachandran plot [48], the GMQE score (Global Model Quality Estimation), and Qmean score [49].

PEBP1 model -ursolic acid docking
Once the homology model validation is completed, we have used the CavityPlus web server to identify the binding sites necessary to get the best binding affinity. The cavity pocket is constructed with the following amino acid residues (  used to prepare the entry PDBQT files for the Autodock vina docking program. Grid box was set to cover the maximum part of the predicted binding pocket of the homologous PEBP1 protein and to cover the studied triterpenoid heterocyclic compounds. The binding location was defined by a grid box including the whole amino acids included in the predicted binding pocket, with dimensions XYZ of 60 Å × 60 Å × 60 Å, with a grid spacing of 0.375 Å and grid center was designated at dimensions (Center x = 17.0, Center y = -6.928, Center z = 20.733). AutoDock/Vina was employed for docking using protein and ligand informations along with grid box properties in the configuration file. The best poses were carried out based on energy score (E in Kcal/mol). The generated conformations of PEBP1 protein and the studied compounds were analyzed using Discovery Studio molecular visualization software. The PEBP1-ursolic acid interactions were visualized using the Discovery Studio software.

Molecular dynamic simulations (MDs) and system building
For better understanding of the stability of the PEBP1-ligand complex, molecular dynamic simulations (MDs) were carried out using the conformations obtained by molecular docking studies. The molecular dynamics (MD) simulations were carried out for the most active compound, M30 (pIC 50 = 5.72) and the novel predicted ligands (X1, X2, and X3) in the binding pocket of the modeled PEBP1 receptor. Molecular dynamic (MD) simulation studies were carried out using the software package GROMACS [50] with the CHARMM36 force field [51,52]. The topology for the ligands was created by the CHARMM General Force Field (CGenFF) server [51]. The protein topology files were developed using the pdb2gmx module. The protein ligand complex system was built inside a dodecahedron box and was solvated with three-point water model (TIP3 water) [53]. Na + ions were added as counter-ions to neutralize the charge of the system. Steepest descent minimization was performed to remove bad van der Waals contacts. To remove any steric clashes or unusual geometry, which would artificially raise the energy of the system, we must relax the structure by running an energy minimization (EM) algorithm. After that, the system was relaxed through energy minimization process by using steepest descent until reaching a tolerance of 1000 kJ/ mol. Restrained constant number of particles, volume, and temperature (NVT) ensemble equilibration was performed for 100 picoseconds (ps) at 300 K with Berendsen thermostat temperature coupling [54]. A constant number of particles, pressure, and temperature (NPT) ensemble equilibration of 100 picoseconds (ps) followed this. Parrinello-Rahman barostat was used for pressure coupling at 1 bar [55]. Particle Mesh Ewald (PME) was applied for calculations of the longrange electrostatics [56]. The SHAKE method [57], with a tolerance of 10 −5 Å, was applied to constrain all covalent bonds involving hydrogen atoms. Linear Constraint Solver (LINCS) algorithm was applied to constraint the covalent bonds, including heavy atom-H bonds during the molecular dynamics (MD) simulations. Finally, the systems were simulated in water under the biological conditions, namely 300 K, approximately 1000 kg/m 3 water density and average pressure of 1 bar. The run time for each mode of complex was 10 ns. The gmx rms, rmsf, and gyrate were used for the calculation of RMSD, RMSF, and Rg.

2D-QSAR analysis
We calculated a data set of 38 descriptors based on quantum mechanics methods (AM1 and DFT) and molecular mechanic methods (MM2) (for details, see Table A1, ESI †). The partial least squares (PLS) method allowed us to select a set of 14 descriptors from the 38 initial descriptors.

Selection of descriptors
Due to the multicollinearity problem inherent in the development of 2D-QSAR model, correlation analysis is an essential tool for refining the most correlated descriptors with the antiproliferative activity. From the analysis of the results in Fig. 3a, we can deduce that the two principal axes, F1 and F2 obtained by the DFT-based method represent 71.12% and 15.47%, respectively, of the total variance, while those obtained by the AM1-based method represent only 46.19 and 27.88, respectively (Fig. 3b). The correlation matrices Based on those results, we have selected four descriptors determined by the AM1-based method and the topological index: Egap, C3, HBA, and TV connectivity, whereas, four descriptors were selected by using the DFT and topological index methods: Egap, E TD-DFT , f, and HBA . The matrix dataset for the retained descriptors in both models is tabulated in Supplementary Material (Table A4).

PLS regression
Using the antiproliferative activity values as the dependent variable and the quantum mechanics descriptors as independent variables, the established 2D-QSARs models through the PLS regression method are shown below: AM1-based method: DFT-based method: With N representing the number of molecules (in the training set), R 2 is the coefficient of determination, R 2 adj represents the coefficient of determination adjusted, F is the global significance of the retained 2D-QSAR model, and RMSE its root mean square error. The significance level α was set at 1%. (3) From the above results, we can observe that the statistical quality of the model in Eq. (4) is more significant than that obtained for the model in Eq. (3). The coefficients R 2 and R 2 adj in the DFT-based model are remarkably superior to those obtained in the AM1-based model with a lower error RMSE too. However, the F value of each model indicates that the selected model in Eq. (4) is more significant than the observed model in Eq. (3). The values of the Student test and its resulting probability (p-value) for each descriptor included in the retained 2D-QSAR models are listed in Table 2. We can note that the p values of the t-test are all less than 0.01. Therefore, the coefficients of both models are acceptable. Moreover, the VIF values in Eqs. (3) and (4) indicate that there is no collinearity between descriptors in both 2D-QSAR models ( Table 2). The observed versus predicted activities from both retained 2D-QSAR models are shown in Fig. 4. The predicted antiproliferative activities of the compounds with their residues from Eqs. (3) and (4) are given in Supplementary Material (Table A5).

Predictive powers of retained 2D-QSAR models
The internal predictive capacity of the two selected 2D-QSAR models was tested by a cross validation (LOO) procedure. Q 2 value (0.55) of Eq. (3) is relatively lower than that observed in Eq. (4) (0.81). These results show that the AM1-based model has lower predictive power than the DFT-based model. The correlation between observed and predicted activity values through internal and external validations are illustrated in Fig. 4.
The randomization test was used to avoid the chance correlation between activities and descriptors [58]. As we can see in Table 3, the initial models show high stability compared to randomized models, which allows us to conclude that there is a low probability of finding a chance correlation in the retained 2D-QSAR models. However, the selected test set was used in this study to prove the external predictive power of both selected 2D-QSAR models. The model obtained by DFT/B3LYP level has better external predictive power (R 2 pred = 0.78) than that obtained by the AM1 level (R 2 pred = 0.68).
The theory level of descriptor calculation is a critical step in 2D-QSAR model development. Thus, from the above presented results, we can see the effect of the chosen level of theory on the performance of 2D-QSAR models. The DFT-based method (Eq. 4) produced the best statistical quality with an R-squared (R 2 ) value of 0.87 and the best predictive internal and external power with Q 2 = 0.81 and R 2 pred = 0.78, respectively. The second 2D-QSAR model is based on the AM1-based method, with the significant coefficients R 2 = 0.77, Q 2 = 0.55, and R 2 pred = 0.68. The significant difference in the results obtained for both developed 2D-QSAR models could probably be explained by the fact that the semi-empirical AM1 method does not include the effects of instantaneous electronic correlation [59], compared to DFT/B3LYP method. Our results are in agreement with those of the other comparative studies [60,61], which have proved that the DFT/B3LYP method is the most appropriate method to determine the various properties of molecules.
From both 2D-QSAR models, the energy Egap is negatively correlated with antiproliferative activity, indicating that a decrease in Egap of ursolic acid derivatives is favorable to increase the antiproliferative activity. To ensure the significance of the energy Egap = E LUMO -E HOMO in the DFT-based model. As shown in Fig. 5, the E HOMO (AM1) energies are weakly correlated with E HOMO (DFT) energies (R 2 = 0.40), while the relation between E LUMO (AM1) and E LUMO (DFT) energies is highly correlated (R 2 = 0.95).
The main objective of this analysis is to prove that the E LUMO energies calculated by the two methods (AM1    which indicates that the presence of heterocyclic groups like imidazole is favorable for reducing Egap energy, thus leading to an increase in the antiproliferative activity. In addition, the E TD-DFT descriptor presents a positive contribution to the antiproliferative activity, which means that the lower the energy of the molecule, the more the molecule is stable. The active compound 30 could validate this analysis, which presents the lowest E TD-DFT = − 44,620 eV compared to the less active compound M1 (E TD-DFT = − 38,035 eV). The observed E TD-DFT energies of the two compounds could be due to heterocyclic rings grafted to ursolic acid (compound 1), which contribute favorably to improving its antiproliferative activities.

Training set Test set
Other topological descriptor retained in the model (Eq. (3)) is the Mulliken charge (C3), calculated by AM1 method, negatively correlated to pIC 50 , indicating that the presence of the N-alkyltriazole in R4 conjugated to an unsaturated ketone in R1 affords a positive charge of the compound 30, which is favorable to improving the antiproliferative activity. Whereas TVconn and oscillator strength (f) correlate positively to pIC 50 . Another descriptor involved in the AM1-based model is the hydrogen bond acceptor (HBA), which is a topological descriptor derived from the acceptor of nitrogen and oxygen atoms (in this study, the heterocyclic group).

Applicability domain (AD)
The plot of the applicability domain of the developed 2D-QSAR model-based DFT method is depicted in Fig. 7. The analysis of this figure shows that compound 1 has leverage above the threshold h* (h* = 0.6) and therefore, these compounds are considered outside of the applicability domain.

CoMFA results
The statistical metrics of the obtained CoMFA models for both the studied alignment methods are illustrated in Table 4. The CoMFA model generated by align II (DFTbased optimization method) shows better performance than that obtained by align I (AM1-based optimization method), with values of Q 2 , R 2 , and R 2 pred , respectively, equal to 0.67, 0.92, and 0.58, and a low value of the RMSE train of 0.1. In both alignment methods, we find a slight difference between the steric and electrostatic fields' contributions. The observed and predicted activities with their residual values are presented in Supplementary Material (Table A6) and the relationship between the observed versus predicted pIC 50 values of the training/test set is shown in Fig. 8.

CoMSIA results
In the case of the CoMSIA study, the statistical results are presented in Table 4. From the five similarity indices of CoM-SIA, we have identified that E, HBD, and HBA descriptors are favorable for predicting the antiproliferative activity of ursolic acid derivatives. As we can see in Table 4, the align II showed better results than align I with Q 2 , NOC, R 2 , F, RMSE train , and R 2 pred of 0.79, 2, 0.88, 86.46, 0.12, and 0.66, respectively, and a slight difference observed between E, HBD, and HBA field contributions. The observed and predicted activities with their residual values are presented in the supplementary material (Table A6) and the relationship between the observed versus predicted pIC 50 values of training and test sets is shown in Fig. 8.

HQSAR results
According to HQSAR analysis (Table 4), the identification of the fragment that best predicts the antiproliferative activity indicates that the model using the best fragment distinction (atoms (A), bonds (B), connections (C), and donor/ acceptor (DA)) gives significant results with a hologram length equal to 83. The selected hologram shows high-validated coefficients with values of Q 2 , NOC, R 2 , RMSE, and R 2 pred equal to 0.67, 4, 0.87, 0.15, and 0.56 respectively. As per Table 4, we can see that there is no significant change in the observed results of the two alignment methods. From  Training set Test set this analysis, we can conclude that the HQSAR model is independent of molecular alignment, which is in accordance with the literature [11,62]. The plot of the observed versus predicted activities of training/test sets is shown in Fig. 9.

Contour maps of CoMFA, CoMSIA, and HQSAR
The CoMFA and CoMSIA contour plots were shaped to show the contributions of each selected field in enhancing the antiproliferative activity of ursolic acid derivatives. The maps obtained represent regions with standard coefficients of over 80% (favorable) or under 20% (unfavorable). The active compound 30 (pIC 50 = 5.72) is taken as a template molecule to elucidate the contour maps.
CoMFA contour plots As per the CoMFA contour areas, the green colors show the regions near to the R4 position, indicating that this region must be occupied by bulky steric groups (Fig. 10a), which would enhance the antiproliferative activity. Another green region is observed around the R3 position. These results could be validated, on the one hand, by the active conformation 29 (pIC 50 = 5.67) that contains the 1-methylimidazole group at R4 and the methyl group at R3, which explains its potential activity. On the other hand, the lack of steric groups at the R3 and R4 positions considerably reduces the antiproliferative activity. Furthermore, the least active compounds (1, 2, 3, 4, and 5) contain no steric groups at these positions, reflecting the weaker antiproliferative activities. As a result, the yellow region at R1 indicates an area where an increase in steric volume would reduce antiproliferative activity. In the case of the CoMFA electrostatic contour maps, the blue-colored areas represent the favorable regions and the red-colored areas represent the unfavorable regions for antiproliferative activity. As you see in Fig. 10b, the big contour of blue is around the triterpenoid scaffold, which indicates that hydrogen bonds would be the reason for the high electropositive interactions around the triterpenoid. The red region near to R3 and R1 positions in the active compound 30 denotes that the electronegative group (ketone group) has likely contributed to the enhanced activity.
CoMSIA contour maps In Fig. 11a, we observe that the CoMSIA/E contour map is located at the R3 position, which indicates that electropositive groups like methyl is favorable (blue color) to the increase of the activity. Similarly, the electronegative (red color) region observed near R4 position provides that the presence of nitrogen atoms (N atom) provided by the heterocyclic rings (imidazole, triazole, 1-methyl imidazole) is favorable to enhance the conducive activities enhancement. Moreover, the favored regions of CoMSIA/HBD are displayed in cyan, while the unfavorable regions are shown in purple. Figure 11b shows a big cyan region around the R4 position suggesting that heterocyclic rings like the imidazole group may contribute favorably to the HBD interactions, consequently improving the antiproliferative activity. Furthermore, the CoMSIA/HBA presented in Fig. 11c indicates that the favored areas are visualized in magenta and the red color represents the unfavorable regions for the activity. From Fig. 11c, we can deduce that the ketone group at R1 and R3 positions are favorable to enhancing the activity, while the nitrogen atom acceptor is detrimental for the activity.

HQSAR contribution maps
In addition to the CoMFA and CoMSIA contour maps, HQSAR provides complete contribution maps showing where molecule fragments could contribute to the antiproliferative activity of ursolic acid derivatives. Figure 12 shows the colored contributions for the highly active compounds 30 (pIC 50 = 5.72) and 26 (pIC 50 = 5.67). The contribution map is categorized by different colors. The extreme green color represents positive

Training set Test set
contributions to the antiproliferative activity, the extreme red color represents negative contributions, and the observed white color represents neutral contributions to the activity.

Homology-based modeling
The sequence identity between the PEBP1 target and the protein (template PDB ID: 6ENS) has a value of 86.49%, which fulfills the condition proposed by Rost.B et al. (identity > 25%) [63]. Therefore, the crystal structure of the template is more suitable for homology modeling of the PEBP1 sequence. The initial alignment of the PEBP1 sequence with the template protein was examined using T-Coffee (http:// tcoff ee. crg. cat/) (Fig. 13a). The quality of the PEBP1 model was examined by the Ramachandran plot (Fig. 13b) and Table A7 (Electronic Supplementary Material, ESI †), indicating that 100% of residues are positioned in the favored regions. In addition, the GMQE score of 0.97, which is close to 1, indicates the accuracy of the alignment to build the PEBP1 model. In addition, the QMEAN value of − 0.10 indicates the best accuracy of the obtained PEBP1 model with the experimental data. Those results indicate that the PEBP1 model is stable and reliable to explain the further binding modes of ursolic acids using molecular docking.
Structurally, residue Trp84 interacts with the carbonyl group at the R3 position of compound M30, while residues Trp173 and Lys113 interact with the imidazole group at the R3 position via hydrogen bond interactions. A strong hydrophobic network is observed around the triterpenoid scaffold by Pi-Pi T-shaped, alkyl, and Pi-alkyl interactions. On the other hand, the less active compound (1) shows Electronic Supplementary Material (Fig. A6) no hydrogen bond interactions, which could explain its low antiproliferative activity compared to the active compound 30. These results are in accordance with the 2D-QSAR and 3D-QSAR analyses, which suggest that the heterocycle groups like imidazole contribute favorably to improving the antiproliferative activity. The 2D and 3D representations of the docked molecules M26 and M29 are displayed in the ESI † (Figs. A1 and A2).

Molecular electrostatic potential (MEP) analysis
The importance of MEP lies in its ability to predict the binding affinity of new candidate drugs [64]. This approach indicates the intermolecular specificity of Ramachandran plot of the modeled protein, c final 3D structure of PEBP1 protein molecules, necessary to predict their chemical reactivity and also some types of ligand-protein interactions [65]. Electrostatic molecular potential maps are formed by moving a charged probe across the surface of the molecule. At each point, the potential electrostatic energy of the probe is computed, and the surface of the molecule is coded accordingly by the specific colors. The blue color indicates the extreme electropositive region susceptible to reacting with a nucleophilic group, and the red color represents the extreme electronegative region likely to react with an electrophile group. The observed medium color regions represent the middle of the charge between both extremes' cases. In this study, the molecular potential electrostatic maps (MEP) were generated for M1: ursolic acid, less compound (pIC 50 = 4.89) and for the active compound in the series (30: pIC 50 = 5.72) using DFT (B3LYP/6-31G(d,p)) level. Analysis of the results shows, on the one hand, that the ESP surfaces of compound 30 show that extremely negative regions exist around N38, N39, O29, and O26 atoms, while the extremely positive surfaces are near H85, H83, and H42 atoms (see Fig. 15a). Moreover, the observed electronegative maps on atoms N38, N39, and O26 are connected with residues Trp173, Lys113, and Trp84, respectively, of the homology model PEBP1, and the electropositive surfaces observed on atoms H85, H83, and H42 are grafted with the amino acids of Pro112 and Tyr181, which is in agreement with the molecular docking study. On the other hand, the MEP maps of compound 1 show that extreme electropositive surfaces are around atoms H62 and H69 (see electronic supplementary material, Fig. A7), while electronegative surfaces are observed around atoms O22 and O27. From this analysis, we can see that heterocyclic ursolic acids (highest active compound 30) keep better stability in the active pocket compared to the less active compound 1 (ursolic acid without heterocycle group), these results are in accordance with CoMFA and CoMSIA analysis.

Newly predicted compound with potent anticancer activity
Based on the results obtained by 2D-QSAR, 3D-QSAR, hologram QSAR (HQSAR), and molecular docking simulations, a set of new molecules were predicted based on DFT/B3LYP optimization method as depicted in Table 6. Also, the predicted compounds (X1, X2, X3) were subjected to molecular docking simulations to test their stability against the modeled PEBP1 protein. The results of these simulations are presented in Table 6. From the analysis of this table, we can notice that all predicted compounds (X1, X2, X3) present a higher binding affinity than that observed for the compound with the highest activity in the series of molecules studied (M30). While the stability of those predicted compounds was also performed by ulterior molecular dynamics simulations. The 3D and 2D interactions of the predicted compounds with PEBP1 protein were shown in Figs. A3, A4, and A5 (Supplementary materials).

Structural deviation analysis
To evaluate the stability of the studied complexes and characterize the molecular interactions between ligands and PEBP1, we calculated the root-mean-square deviation (RMSD) through 10,000 ps of simulations. The RMSD calculation was based upon the complex backbone atoms over a 10,000 ps of molecular dynamic simulations. In our analysis, we have seen that all the complexes reach the equilibration state and produce a stable trajectory through 10,000 ps of MD simulations  (Fig. 16a). The average RMSD value for PEBP1-M30, PEBP1-X1, PEBP1-X2, and PEBP1-X3 was 0.12, 0.09, 0.11, and 0.10 nm, respectively. In the comparison of all predicted hits, the PEBP1-X1 showed a lower RMSD value. It means that this is a more stable complex than other hits. Compared to the active compound (M30), the RMSD graphs reveal that complexes with predicted ligands (X1, X2, X3) tend to have fewer deviations than complexes with the active compound (M30). In addition, the results of the present study do not show any larger conformational changes in the structure of the PEBP1 protein when subjected to molecular dynamic simulations.

Compactness analysis
Radius of gyration (Rg) was determined to understand the level of compactness in the structure of PEBP1 in the presence of three hits (X1, X2, X3) and the most active compound in the series (M30). The Rg is defined as the mass-weighted root mean square distance of a collection of atoms from their common center of mass [66]. The Rg values for each protein-ligand complex have been calculated as shown in Fig. 16b.The average Rg values for

Flexibility analysis
The root-mean-square fluctuations (RMSF) of the Cα atoms of all residues of protein were calculated by taking an average of all the conformations obtained during the entire simulations to  Fig. 16c. The average RMSF values for PEBP1-M30, PEBP1-X1, PEBP1-X2, and PEBP1-X3 were 0.064, 0.058, 0.06, and 0.06 nm, respectively. The average value showed that the PEBP1-X1 showed a very low RMSF value compared to other complexes. From this analysis, we have seen that all the protein-ligand complexes are good in terms of fluctuation analysis. However, N-terminal and C-terminal residues showed large conformational changes. This is due to the hanging position of the terminal residues that tend to fluctuate. The nominal variations in RMSF may be due to the dynamic nature of ligands inside the binding pocket.

Protein-ligands interactions analysis
To investigate the stability of the conformation obtained during molecular docking analysis, we proceeded to compare the binding interactions before and after MD simulations. This analysis is employed to explore more details about interactions between Fig. 17 Comparison between the conformation of the predicted ligand (X1) before and after the molecular dynamics simulations the modeled PEBP1 protein and lead compounds (X1, X3). Figures 17 and 18 show that all of the studied ligands are bound to the model PEBP1 protein with the same amino acids obtained from molecular docking analysis, indicating that this pocket is the binding site. The predicted compound (X1), with a binding energy of − 10.3 kcal/mol, has shown the same interactions with the amino acid residues as Trp84, Leu184, Asp70, Ala73, and Tyr181 during the MD simulations. Meanwhile, the predicted compound (X3) with a binding energy of − 11.1 kcal/mol showed better stability with the amino acid residues as Trp84, Tyr81, Tyr181, Ala73, Pro74, Pro112, and Pro178. As you can see in Fig. 18, the PEBP1-X3 complex obtained after molecular dynamics simulations shows additional hydrogen bond interactions with His118, and Asp72 residues and charge-charge interaction with Asp70 residue, which indicates more stability in the binding site of the studied PEBP1 protein. This analysis indicates that the conformations obtained by molecular docking analysis are kept constant and conserving their stability before and after molecular dynamic simulations.

Conclusion
In this work, we studied the anticancer activity of a series of 30 molecules of ursolic acid derivatives using the semiempirical method AM1 and the empirical method DFT. These methods have allowed us to build predictive 2D, 3D, Fig. 18 Comparison between the conformation of the predicted ligands (X3) before and after the molecular dynamics simulations and hologram QSAR models that link biological activities to molecular descriptors, using statistical analysis. The retained 2D-QSAR results proved that the DFT (B3LYP(d, p)) method is more accurate than the AM1 method, with a high Q 2 value of 0.81 and a maximum external validity R 2 pred value of 0.78. These results suggest that the DFT (B3LYP(d, p)) method is the most appropriate approach to predict the antiproliferative activities of ursolic acid derivatives. Thus, the same quantum level was used to select the appropriate molecular alignment, giving the best 3D and hologram QSAR models. The results found that the DFT/B3LYP-based method is more appropriate than AM1-based method to build the 3D-QSAR model. However, the contour maps of the CoMFA and CoMSIA analysis proved that steric, electrostatic, and hydrogen bond donor/acceptor fields are beneficial to the antiproliferative activity. Moreover, we found no significant change in the performance of the HQSAR model for the two theoretical levels. Homology modeling was used to build the 3D structure of target protein PEBP1 and then validate it with Ramachandran plots and Qmean scores for further molecular docking studies. As a result, the best binding affinities between the active ursolic acid derivatives and the modeled PEBP1 protein were obtained and validated by the molecular potential maps to select each atomic contribution responsible for the observed binding affinity. Consequently, the above studies allowed us to predict three new triterpenoid derivatives with potent anticancer activity. In addition, the molecular docking and molecular dynamics simulations suggested the high conformational stability of the predicted compounds against the modeled PEBP1 protein.
Author contribution All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript.
Funding This work was supported by the National Center for Scientific and Technical Research (CNRST -Morocco) as part of the Research Excellence Awards Program (no. 34USMBA2017).

Data availability
The data is integrated into the manuscript.
Code availability Not applicable.