4D-QSAR Analyses for EGFR Inhibitors Based on CDDA-OPS-GA Method

Epidermal growth factor receptor is a preferred target for treating cancer. Compared to 3D-QSAR, 4D-QSAR has the feature of conformational exibility and free alignment for individual ligands. In present studies, the 4D-QSAR of 131 analogs of 4-anilino quinazoline for EGFR inhibitors was built. The GROMACS package was employed to yield the conformational ensemble prole. The eld descriptors of Coulomb and Lennard−Jones potentials were calculated by LQTA-QSAR. The lter descriptors and variable selection is very important, which was performed by means of comparative distribution detection algorithm (CDDA), ordered predictors selection (OPS) and genetic algorithm (GA) method. Best 4D-QSAR model yielded satisfactory statistics (R 2 = 0.71), good performance in internal (Q 2LOO = 0.60) and external prediction (R 2pred = 0.69, k = 0.97, k′ = 1.01). The 4D-QSAR was shown to be robust (Q 2LMO = 0.59) and was not built by chance (R 2YS = 0.17, Q 2YS = −0.25). The model has a good potential for rational design new EGFR inhibitors.


Introduction
Epidermal growth factor receptor (EGFR) is the prototype of receptor tyrosine kinases (TKs) family ( . The EGFR inhibitor, lapatinib, is approved for the treatment of breast cancer by the FDA (Oda et al. 2005). Other EGFR inhibitors, such as lomustine, temozolomide, ge tinib, and erlotinib, have also been approved by the FDA for the treatment of glioma (Minkovsky and Berezov 2008). However, in many cancer types as breast cancer, hepatocellular carcinoma, pancreatic cancer, non-small cell lung cancer, colorectal carcinoma, glioblastoma, and melanoma, there are signi cant resistance to the used EGFR inhibitors (Burotto et al. 2014). All these ndings make requires to design and synthesize new potent EGFR inhibitors.
Before the synthesis, it is necessary to develop a prediction method for biological activities. Quantitative structure-activity relationship (QSAR) is an important part for modern drug design, including risk assessment, drug discovery and predictive toxicology (Dearden 2017;Cherkasov et al. 2014). Initially, two-dimensional quantitative structure-activity relationship (2D-QSAR) and three-dimensional quantitative structure activity relationship (3D-QSAR) were extensively explored in medicinal chemistry study.
However, the major constraint of 3D-QSAR is its dependency and sensitivity to conformations and alignments of compounds (Shim and Mackerell Jr 2011), because only one conformation, not a conformational ensemble pro le, is considered for each compound (Ghasemi et al. 2011). To overcome inherent constraint of 3D-QSAR, the four-dimensional quantitative structure activity relationship (4D-QSAR) was originally developed which includes the freedom of alignment and the conformational exibility to build 3D-QSAR by performing molecular state ensemble averaging, i.e., the fourth "dimension" (Hop nger et al. 1997).
The LQTA-QSAR (Laboratório de Quimiometria Teórica e Aplicada), a 4D-QSAR approach, often calculates a large number of descriptors (variable), frequently several thousands    , the stepwise heating method was run which included heating the system at 50 K, 100 K, 200 K and 350 K for 20 ps in 1 fs step size. The system was then backed to 300 K for 500 ps. The trajectory le was recorded every 10 ps simulation time.
The compound 126 was chosen as the reference of alignment due to the most active compound among all compounds. All conformations generated in MD simulations at 300 K were superimposed to the reference using the index number of common atoms. The atoms, which were selected for alignment, are shown in Fig. 1. During the alignment, the initial conformer generated at 20 ps was selected, then other trajectories, which were generated up to 100 ps times with 2 ps increment, were subjected to alignment using the least squares method to compute the minimum root mean-square of the distances (RMSD). The aligned CEPs of the most active compound 126 (reference) and alignment with CEPs of least active compound 47 are shown in Fig. 2

Descriptor Selection And Model Construction
First, it is necessary to truncate both LJ and C descriptors, in order to avoid large values with high orders of magnitude, and to keep information in the region close to the compounds (Ma et al. 2019). When the distance between the atoms of compound and probe is close to zero, interaction energy generates a large value which do not bene t to the model. Based on equation (1) and (2) (Fig. 3), if the absolute value of interaction energy was more than 30 kcal/mol, the logarithmic value of residual was added to 30 kcal/mol.
Second, if the variance of descriptors is below of 0.01, then the descriptors are excluded. Because the descriptors are far from compounds, and contains very little information.
Third, the pearson correlation coe cients between descriptors and dependent variables were calculated (r vector) using correlation coe cient cut-off according to the equation ( Forth, the CDDA was performed to exclude descriptors whose distribution is inconsistent with dependent variables [31]. The descriptors were sorted in descending order according to their absolute value of correlation coe cients. The hyperparameter m (0.05 -1, step 0.01) was applied to adjust the number of descriptors which were used to build 4D-QSAR model.

Model Evaluation
The internal validation of 4D-QSAR model was performed to establish robustness and internal stability.
For internal validation, leave-one-out cross-validation (Q 2 LOO ) is the most preferred technique in which each compound of the training set was removed once from the dataset, and the biological activity of removed compound was predicted from the model. Leave-many-out cross-validation (Q 2 LMO ) method was also used which carried out for 30% of data out of training each run. If the average R 2 LMO and Q 2 LMO is close to R 2 and Q 2 , the model is considered robust (Patil et al. 2018). In order to check the chance correlation, Y-Scrambling testing was performed in which the dependent variable vector, Y-vector, is randomly shu ed many times, then a new QSAR model is built making use of the original independent

Analysis of MD trajectories
Following successful simulation, the MD trajectories were investigated for dynamic behavior. The last trajectories, obtained for the 500 ps simulations, were analyzed. For some more active compounds  (compounds 73, 110, 125, 126, 130 and 131) and less active compounds (compounds 12, 36, 47, 48, 61 and 67), the root mean square deviations (RMSD) values, which were calculated from the reference trajectories obtained at the end of simulations of 350 K with the trajectories obtained during simulations of 300 K, stayed within 0.20 nm range (Fig. 4). The RMSD uctuation of compound 125 is greater than that of other compound, due to compound 125 containing the exible ester group at position 3. The conformations of target compounds did not drastically change during the MD simulations. This indicates that an equilibrium state of target compounds was reached characterized by the RMSD pro le.

4d-qsar Model
The 4D-QSAR model with 18 variables was shown to be the best model by the leave-one-out crossvalidation, which resulted in R 2 = 0.71 and Q 2 LOO = 0.60 and R 2 Pred = 0.69. The details of the 4D-QSAR model statistics are shown in Tab. 2. It can be concluded that the model presents relatively high predictive power, as the model satis ed the external and internal validation criteria: R 2 > 0.6, Q 2 LMO and Q 2 LOO > 0.5, CCC > 0.85, Q 2 F1 , Q 2 F2 and Q 2 F3 > 0.6, r 2 m > 0.5, △r 2 m < 0.2 and 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k′ ≤ 1.15 (Balupuri et al. 2020). The difference between R 2 and Q 2 LOO is 0.11 units, less than 0.2, indicating the absence of over ting (Kiralj and Ferreira 2009). In addition, other validation parameters were within acceptable limits. favorably contributing to the model, whereas the variable with negative coe cients is inversely contributing to the model. The plot for experimental pIC 50 against predicted pIC 50 is shown in Fig. 5.  was lowest in the Y-scrambling test which implies no chance correlation in the model (Fig. 6).

Model Applicability Domain
The leverage method was used to verify the chemical applicability domain (AD) and the robustness of model. The leverage, h * , for each molecule was calculated by this method. The warning leverage is generally xed at 3LV/m, where LV is the latent variable and m is the number of training set compounds. It can be seen from the AD analysis results presented in Fig. 7 that there are no outliers for all compounds. More importantly, the test compounds which were not applied to build model are predicted with similar accuracy of the training compounds.
Graphical representations of the 4D-QSAR model are shown in Fig. 8. Blue regions are the electrostatic descriptors corresponding to negative regression coe cients and red regions are the electrostatic descriptors related to positive regression coe cients. Likewise, Yellow regions and green regions denote steric descriptors with negative and positive regression coe cients, respectively. The descriptors LJ7 and LJ11 at 3 position (the numbers of compounds as shown in Fig. 1

Conclusion
In this paper, the CEPs were constructed by using GROMACS dynamics simulation and interaction energy descriptors, i.e. Lennard-Jones interaction energy descriptors and Coulomb interaction energy descriptors were calculated using LQTAgrid program. Through the combination of CDDA-OPS-GA method for ltration and selection descriptors, this 4D-QSAR model had achieved satisfactory results.

Declarations
Con ict of interest: The author(s) con rm that this article content has no con ict of interest.
Page 18/22 3 . Yewale C, Baradia D, Vhora I, Patil S, Misra A (2013) Epidermal growth factor receptor targeting in cancer: a review of trends and strategies. Biomaterials 34:8690-8707. Figure 1 The structure of data set and red atoms used for alignment The equation for truncating both LJ and C descriptors