Data set
All inhibitors of EGFR were taken from literature (Bridges et al. 1996; Li et al. 2012; Suzuki et al. 2011; Waterson et al. 2009). In order to provide numerically larger data values, the biological activities expressed as IC50 values in units of molarity were transformed to pIC50 (−logIC50) which were used as dependent variables. All of the compounds were divided into the training set of 105 compounds and the test set of 26 compounds taking into account both the distribution of dependent variables and the structural diversity. The training set was used to construct 4D-QSAR model, and the test set was used to evaluate the predictive quality. The chemical structures and IC50 values of the data set are presented in Fig. 1 and Tab. 1.
Table 1
Structures of training set and test set
Comp.
|
R1
|
R2
|
R3
|
X
|
Y
|
Z
|
pIC50
|
1
|
H
|
H
|
H
|
N
|
N
|
NH
|
6.46
|
2
|
H
|
2’-F
|
H
|
N
|
N
|
NH
|
7.25
|
3a
|
H
|
2’-Cl
|
H
|
N
|
N
|
NH
|
7.64
|
4
|
H
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.57
|
5
|
H
|
2’-I
|
H
|
N
|
N
|
NH
|
7.10
|
6
|
H
|
2’-CF3
|
H
|
N
|
N
|
NH
|
6.24
|
7
|
7-OMe
|
H
|
H
|
N
|
N
|
NH
|
6.92
|
8 a
|
7-OMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.00
|
9
|
7-NH2
|
H
|
H
|
N
|
N
|
NH
|
7.00
|
10
|
7-NH2
|
2’-F
|
H
|
N
|
N
|
NH
|
8.70
|
11
|
7-NH2
|
2’-Cl
|
H
|
N
|
N
|
NH
|
9.60
|
12
|
7-NH2
|
2’- Br
|
H
|
N
|
N
|
NH
|
10.00
|
13 a
|
7-NH2
|
2’-I
|
H
|
N
|
N
|
NH
|
9.46
|
14
|
7-NH2
|
2’-CF3
|
H
|
N
|
N
|
NH
|
8.48
|
15
|
7-NO2
|
H
|
H
|
N
|
N
|
NH
|
4.92
|
16
|
7-NO2
|
2’-F
|
H
|
N
|
N
|
NH
|
5.21
|
17
|
7-NO2
|
2’-Cl
|
H
|
N
|
N
|
NH
|
6.09
|
18 a
|
7-NO2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.00
|
19
|
7-NO2
|
2’-I
|
H
|
N
|
N
|
NH
|
6.27
|
20
|
7-OH
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.33
|
21
|
7-NHAc
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.40
|
22
|
7-NHMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.15
|
23 a
|
7-NHEt
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.92
|
24
|
7-NMe2
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.96
|
25
|
6-OMe
|
H
|
H
|
N
|
N
|
NH
|
7.26
|
26
|
6-OMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.52
|
27 a
|
6-NH2
|
H
|
H
|
N
|
N
|
NH
|
6.11
|
28
|
6-NH2
|
2’-CF3
|
H
|
N
|
N
|
NH
|
6.24
|
29
|
6-NH2
|
2’-Br
|
H
|
N
|
N
|
NH
|
9.11
|
30
|
6-NO2
|
H
|
H
|
N
|
N
|
NH
|
5.30
|
31
|
6-NMe2
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.08
|
32
|
6-NHCO2Me
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.92
|
33 a
|
6, 7-diOMe
|
H
|
H
|
N
|
N
|
NH
|
7.54
|
34
|
6, 7-diOMe
|
2’-F
|
H
|
N
|
N
|
NH
|
8.42
|
35
|
6, 7-diOMe
|
2’-Cl
|
H
|
N
|
N
|
NH
|
9.51
|
36
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
10.60
|
37
|
6, 7-diOMe
|
2’-I
|
H
|
N
|
N
|
NH
|
9.05
|
38 a
|
6, 7-diOMe
|
2’-CF3
|
H
|
N
|
N
|
NH
|
9.62
|
39
|
6-NO2, 7-NH2
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.28
|
40
|
6-NO2, 7-NHAc
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.55
|
41
|
6-NO2, 7-OMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.82
|
42
|
6-NO2, 7-Cl
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.60
|
43 a
|
6-NH2, 7-NHMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.96
|
44
|
6-NH2, 7-OMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.42
|
45
|
6-NH2, 7-Cl
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.19
|
46
|
6, 7-diOH
|
2’-Br
|
H
|
N
|
N
|
NH
|
9.77
|
47
|
6, 7-diOEt
|
2’-Br
|
H
|
N
|
N
|
NH
|
11.22
|
48
|
6, 7-diOPr
|
2’-Br
|
H
|
N
|
N
|
NH
|
9.77
|
49 a
|
H
|
2’-OMe
|
H
|
N
|
N
|
NH
|
6.07
|
50
|
H
|
2’-Me
|
H
|
N
|
N
|
NH
|
6.04
|
51
|
5, 6-diOMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
5.86
|
52
|
6, 7-diOMe
|
2’-Br
|
NH2
|
N
|
N
|
NH
|
6.33
|
53 a
|
5, 6, 7-triOMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
9.17
|
54
|
6, 7-diOMe
|
1’-Br
|
H
|
N
|
N
|
NH
|
6.89
|
55
|
6, 7-diOMe
|
2’,4’-diBr
|
H
|
N
|
N
|
NH
|
6.95
|
56
|
5-NO2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.45
|
57 a
|
8-NH2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.98
|
58
|
8-OMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.01
|
59
|
6-NO2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.05
|
60
|
6-NHMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
8.40
|
61
|
6, 7-diNH2
|
2’-Br
|
H
|
N
|
N
|
NH
|
9.92
|
62
|
6-NH2, 7-NMe2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.80
|
63 a
|
6-NO2, 7- NHMe
|
2’-Br
|
H
|
N
|
N
|
NH
|
7.17
|
64
|
6-NO2, 7- NMe2
|
2’-Br
|
H
|
N
|
N
|
NH
|
5.70
|
65
|
6, 7-diOBu
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.98
|
66
|
6, 7-diOMe
|
3’-Br
|
H
|
N
|
N
|
NH
|
9.02
|
67
|
6, 7-diOMe
|
2’, 3’-diBr
|
H
|
N
|
N
|
NH
|
10.14
|
68 a
|
5-NH2
|
2’-Br
|
H
|
N
|
N
|
NH
|
6.36
|
69
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
N
|
NMe
|
6.82
|
70
|
H
|
H
|
H
|
N
|
N
|
NH(CH2)2
|
5.39
|
71
|
H
|
H
|
H
|
N
|
N
|
NHCH2
|
6.49
|
72 a
|
H
|
3’-OMe
|
H
|
N
|
N
|
NHCH2
|
5.00
|
73
|
H
|
H
|
H
|
N
|
N
|
NMe
|
4.00
|
74
|
5-NO2
|
H
|
H
|
N
|
N
|
NHCH2
|
5.10
|
75
|
5-OMe
|
H
|
H
|
N
|
N
|
NHCH2
|
5.31
|
76
|
6-OMe
|
H
|
H
|
N
|
N
|
NHCH2
|
6.70
|
77
|
7-NO2
|
H
|
H
|
N
|
N
|
NHCH2
|
5.23
|
78 a
|
7-OMe
|
H
|
H
|
N
|
N
|
NHCH2
|
7.24
|
79
|
6-OMe, 7-OH
|
H
|
H
|
N
|
N
|
NHCH2
|
6.23
|
80
|
6-OH, 7-OMe
|
H
|
H
|
N
|
N
|
NHCH2
|
7.25
|
81
|
H
|
3’-Cl
|
H
|
N
|
N
|
NHCH2
|
5.15
|
82
|
6-NO2
|
H
|
H
|
N
|
N
|
NHCH2
|
5.00
|
83 a
|
6-NH2
|
H
|
H
|
N
|
N
|
NHCH2
|
5.85
|
84
|
6, 7-diOMe
|
H
|
H
|
N
|
N
|
NHCH2
|
8.00
|
85
|
H
|
2’-Br
|
H
|
N
|
N
|
O
|
6.12
|
86
|
H
|
2’-Br
|
H
|
N
|
CH
|
NH
|
5.26
|
87
|
6, 7-diOH
|
2’-Br
|
H
|
N
|
C-CN
|
NH
|
6.46
|
88 a
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CN
|
NH
|
6.72
|
89
|
6, 7-diOMe
|
2’-Cl
|
H
|
N
|
C-CN
|
NH
|
6.74
|
90
|
6, 7-diOMe
|
2’-CF3
|
H
|
N
|
C-CN
|
NH
|
5.84
|
91
|
6, 7-diOMe
|
2’, 3’-diOMe
|
H
|
N
|
C-CN
|
NH
|
4.30
|
92
|
6, 7-diOMe
|
2’-F
|
H
|
N
|
C-CN
|
NH
|
6.13
|
93 a
|
6, 7-diOMe
|
3’-CN
|
H
|
N
|
C-CN
|
NH
|
4.85
|
94
|
6, 7-diOMe
|
3’-F
|
H
|
N
|
C-CN
|
NH
|
4.32
|
95
|
6, 7-diOMe
|
2’-COMe
|
H
|
N
|
C-CN
|
NH
|
5.13
|
96
|
6, 7-diOMe
|
2’-NH2
|
H
|
N
|
C-CN
|
NH
|
6.09
|
97
|
6, 7-diOMe
|
2’-N(Me)2
|
H
|
N
|
C-CN
|
NH
|
6.07
|
98 a
|
6, 7-diOMe
|
2’-NO2
|
H
|
N
|
C-CN
|
NH
|
6.06
|
99
|
6, 7-diOMe
|
2’-Cl, 3’-F
|
H
|
N
|
C-CN
|
NH
|
6.27
|
100
|
6, 7-diOMe
|
3’-Cl, 1’-F
|
H
|
N
|
C-CN
|
NH
|
4.86
|
101
|
6, 7-diOMe
|
2’-OH
|
H
|
N
|
C-CN
|
NH
|
5.20
|
102
|
6, 7-diOMe
|
3’-Me
|
H
|
N
|
C-CN
|
NH
|
5.56
|
103 a
|
6, 7-diOMe
|
2’-CONH2
|
H
|
N
|
C-CN
|
NH
|
5.11
|
104
|
6, 7-diOMe
|
2’-Br, 3’-Me
|
H
|
N
|
C-CN
|
NH
|
5.75
|
105
|
6, 7-diOMe
|
2’-Cl, 3’-OH
|
H
|
N
|
C-CN
|
NH
|
5.04
|
106
|
6, 7-diOMe
|
2’, 4’-diCl, 3’-OH
|
H
|
N
|
C-CN
|
NH
|
5.50
|
107
|
6, 7-diOMe
|
1’-OH, 4’-Cl
|
H
|
N
|
C-CN
|
NH
|
4.77
|
108
|
6, 7-diOMe
|
2’-SMe
|
H
|
N
|
C-CN
|
NH
|
6.12
|
109 a
|
6, 7-diOMe
|
2’-Cl, 3’-Me
|
H
|
N
|
C-CN
|
NH
|
5.83
|
110
|
6, 7-diOMe
|
1’-F, 3’- Br
|
H
|
N
|
C-CN
|
NH
|
4.04
|
111
|
6, 7-diOMe
|
3’- CH(Me)2
|
H
|
N
|
C-CN
|
NH
|
6.64
|
112
|
6, 7-diOMe
|
1’- CH(Me)2
|
H
|
N
|
C-CN
|
NH
|
6.44
|
113 a
|
6, 7-diOMe
|
2’- CH(Me)2
|
H
|
N
|
C-CN
|
NH
|
4.21
|
114
|
6, 7-diOMe
|
3’-Br
|
H
|
N
|
C-CN
|
NH
|
5.18
|
115
|
6, 7-diOMe
|
1’-Br
|
H
|
N
|
C-CN
|
NH
|
5.38
|
116
|
6, 7-diOMe
|
2’-CF3, 3’-F
|
H
|
N
|
C-CN
|
NH
|
6.80
|
117
|
6, 7-diOMe
|
1’-Me, 2’- Br
|
H
|
N
|
C-CN
|
NH
|
6.08
|
118
|
6, 7-diOMe
|
2’-Me, 3’- Br
|
H
|
N
|
C-CN
|
NH
|
6.21
|
119 a
|
6, 7-diOMe
|
2’- N3
|
H
|
N
|
C-CN
|
NH
|
5.66
|
120
|
6, 7-diOMe
|
2’- NHCOMe
|
H
|
N
|
C-CN
|
NH
|
4.16
|
121
|
6, 7-diOEt
|
2’-Br
|
H
|
N
|
C-CN
|
NH
|
6.38
|
122
|
6, 7-diOEt
|
2’-Cl, 3’-F
|
H
|
N
|
C-CN
|
NH
|
6.01
|
123
|
6-OEt, 7-OMe
|
2’-Br
|
H
|
N
|
C-CN
|
NH
|
4.67
|
124 a
|
6-OMe, 7-OEt
|
2’-Br
|
H
|
N
|
C-CN
|
NH
|
4.48
|
125
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CO2Et
|
NH
|
4.09
|
126
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CH2OH
|
NH
|
3.99
|
127
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CHO
|
NH
|
5.47
|
128
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CO2H
|
NH
|
4.47
|
129 a
|
6, 7-diOMe
|
2’-Br
|
H
|
N
|
C-CONH2
|
NH
|
4.82
|
130
|
6, 7-diOMe
|
2’-Br
|
H
|
C-CN
|
N
|
NH
|
4.15
|
131
|
6, 7-diOMe
|
2’-Cl
|
H
|
C-CN
|
N
|
NH
|
4.05
|
a The test set compounds |
4d-qsar Study
The 3D structures of all of the compounds were built by means of Ghemical program (Hassinen and Peräkylä 2001). The structures were optimized with the ffG43a1 force field. Then partial atomic charges of AM1-BCC method were computed with AMBER ff03 atom types using UCSF Chimera (Pettersen et al. 2004). The topology files of compounds were obtained by the topobuild program. In order to obtain conformational ensemble profile (CEP), the molecular dynamics (MD) simulations of all compounds were performed by the GROMACS software (Van Der Spoel et al. 2005). All compounds were put in the dodecahedron box which filled with water molecules. Long-range electrostatics and van der Waals interaction energies were computed by means of Particle Mesh Ewald method with a cut off radius of 10 Å (Darden et al. 1993). System temperature was controlled by Berendsen thermostat coupling, and pressure was kept by Parrinelloe Rahman coupling (Berendsen et al. 1984). System was optimized by the steepest descent and conjugated gradient method. Using the script of LQTAgrid software (Patil and Sawant 2015), the stepwise heating method was run which included heating the system at 50 K, 100 K, 200 K and 350 K for 20 ps in 1 fs step size. The system was then backed to 300 K for 500 ps. The trajectory file was recorded every 10 ps simulation time.
The compound 126 was chosen as the reference of alignment due to the most active compound among all compounds. All conformations generated in MD simulations at 300 K were superimposed to the reference using the index number of common atoms. The atoms, which were selected for alignment, are shown in Fig. 1. During the alignment, the initial conformer generated at 20 ps was selected, then other trajectories, which were generated up to 100 ps times with 2 ps increment, were subjected to alignment using the least squares method to compute the minimum root mean- square of the distances (RMSD). The aligned CEPs of the most active compound 126 (reference) and alignment with CEPs of least active compound 47 are shown in Fig. 2.
The grid box was defined as 20 × 19 × 17 Å which was large enough to accommodate all conformers. The aligned molecules were submitted to the LQTAgrid program to calculate the energy descriptors of intermolecular interaction every grid point of a 1 Å grid cell lattice. The NH3+ probe was selected to simulated N-terminus moiety of protein. The Coulomb interaction descriptors (C descriptors) and Lennard-Jones potential descriptors (LJ descriptors) were calculated. The dimension of the descriptor matrix was 131 × 15,120, where each row is a compound and each column is a descriptor.
Descriptor Selection And Model Construction
First, it is necessary to truncate both LJ and C descriptors, in order to avoid large values with high orders of magnitude, and to keep information in the region close to the compounds (Ma et al. 2019). When the distance between the atoms of compound and probe is close to zero, interaction energy generates a large value which do not benefit to the model. Based on equation (1) and (2) (Fig. 3), if the absolute value of interaction energy was more than 30 kcal/mol, the logarithmic value of residual was added to 30 kcal/mol.
Second, if the variance of descriptors is below of 0.01, then the descriptors are excluded. Because the descriptors are far from compounds, and contains very little information.
Third, the pearson correlation coefficients between descriptors and dependent variables were calculated (r vector) using correlation coefficient cut-off according to the equation (3) (Barbosa and Ferreira 2012), where Z0.99 is the number of standard deviations equal to 2.33, extending from the mean of normal distribution (µ = 0) required to contain 99% of the area, and σ is the standard deviation of rrand. When the absolute value of |r|cut−off was lower than 0.3, the descriptors were excluded. This method can eliminate most of noise.
|r|cut−off = Z0.99σ (3)
Forth, the CDDA was performed to exclude descriptors whose distribution is inconsistent with dependent variables [31]. The descriptors were sorted in descending order according to their absolute value of correlation coefficients. The hyperparameter m (0.05 - 1, step 0.01) was applied to adjust the number of descriptors which were used to build 4D-QSAR model.
Fifth, the OPS method attaches importance to each descriptor based on a vector; then the matrix of descriptors is rearranged according to their relevancy. The most relevant/important descriptors are represented by the first column of the matrix (de Campos and de Melo 2014). Successive partial least squares (PLS) regressions are performed by increasing the descriptors number in order to select the set that build the best latent variables (LVs) for correlation with the endpoints (de Melo et al. 2016). The process is repeated in an iterative manner
Finally, GA in QSRINS package (Gramatica et al. 2013), which is a software for the development and validation of multiple linear regression (MLR) model, was applied to choose the most appropriate descriptors for model. The GA performs its optimization make use of variation and selection via the evaluation of the fitness function. GA is a stochastic technique well suited to the problem of variable selection and optimization, and is proved to be effective as a variable selection method.
Model Evaluation
The internal validation of 4D-QSAR model was performed to establish robustness and internal stability. For internal validation, leave-one-out cross-validation (Q2LOO) is the most preferred technique in which each compound of the training set was removed once from the dataset, and the biological activity of removed compound was predicted from the model. Leave-many-out cross-validation (Q2LMO) method was also used which carried out for 30% of data out of training each run. If the average R2LMO and Q2LMO is close to R2 and Q2, the model is considered robust (Patil et al. 2018). In order to check the chance correlation, Y-Scrambling testing was performed in which the dependent variable vector, Y-vector, is randomly shuffled many times, then a new QSAR model is built making use of the original independent variable matrix and the R2YS and Q2YS values are calculated each time. The averages R2YS and Q2YS should be much smaller than the values of the original model. The external validation was used to evaluate the predictive accuracy. The model equation, built using the training set compounds, was applied to test set compounds, and the biological activity of test set compounds was computed. The predictive accuracy is checked in terms of the values of root mean square error external (RMSEEXT), r2m (Average), Δr2m, mean absolute error in fitting (MAE), standard error of estimate (s), concordance correlation coefficient (CCC), predictive residual sum of squares (PRESS) and Golbraikh-Tropsha statistics which calculates the slopes (i.e. k and k′) of the regression lines of the external validation (Golbraikh and Tropsha 2002a; Golbraikh and Tropsha 2002b).