2.1. Biological dataset
For this study, total 32 molecules of N-phenyl-2,2-dichloroacetamide series reported for anticancer activity on non-small cell lung cancer cell line (A549) (Table 1) has been chosen to develop QSAR models [25].

Sr. No
|
Compound
|
R1
|
R2
|
R3
|
R4
|
IC50 (µm/ml)
|
pIC50
|
1.
|
5c
|
-H
|
-Cl
|
-H
|
-Cl
|
4.40
|
-0.64
|
2.
|
3e
|
-H
|
-I
|
-H
|
-H
|
4.76
|
-0.68
|
3.
|
5b
|
-H
|
-Cl
|
-Cl
|
-H
|
5.09
|
-0.71
|
4.
|
3q
|
-H
|
-SO2Ph (NHCOCHCl2)
|
-H
|
-H
|
5.89
|
-0.77
|
5.
|
3m
|
-H
|
-SO2CF3
|
-H
|
-H
|
6.53
|
-0.81
|
6.
|
3d
|
-H
|
-Br
|
-H
|
-H
|
7.80
|
-0.89
|
7.
|
4i
|
-H
|
-H
|
-SO2CF3
|
-H
|
8.50
|
-0.93
|
8.
|
4h
|
-H
|
-H
|
-SCF3
|
-H
|
10.56
|
-1.02
|
9.
|
3g
|
-H
|
-C≡CH
|
-H
|
-H
|
11.22
|
-1.05
|
10.
|
3j
|
-H
|
-CF3
|
-H
|
-H
|
12.18
|
-1.09
|
11.
|
5d
|
-Cl
|
-H
|
-H
|
-Cl
|
12.38
|
-1.092
|
12.
|
4e
|
-H
|
-H
|
-I
|
-H
|
12.54
|
-1.10
|
13.
|
4d
|
-H
|
-H
|
-Br
|
-H
|
13.74
|
-1.14
|
14.
|
3l
|
-H
|
-SCF3
|
-H
|
-H
|
14.07
|
-1.15
|
15.
|
3c
|
-H
|
-Cl
|
-H
|
-H
|
14.78
|
-1.17
|
16.
|
3k
|
-H
|
-OCF3
|
-H
|
-H
|
15.05
|
-1.18
|
17.
|
3h
|
-H
|
-NO2
|
-H
|
-H
|
18.04
|
-1.26
|
18.
|
4c
|
-H
|
-H
|
-Cl
|
-H
|
20.37
|
-1.31
|
19.
|
3i
|
-H
|
-OCH3
|
-H
|
-H
|
25.78
|
-1.41
|
20.
|
4f
|
-H
|
-H
|
-NO2
|
-H
|
34.42
|
-1.54
|
21.
|
2d
|
-Br
|
-H
|
-H
|
-H
|
36.99
|
-1.57
|
22.
|
3a
|
-H
|
-CH3
|
-H
|
-H
|
41.05
|
-1.61
|
23.
|
4a
|
-H
|
-H
|
-CH3
|
-H
|
48.92
|
-1.69
|
24.
|
2c
|
-Cl
|
-H
|
-H
|
-H
|
50.15
|
-1.70
|
25.
|
3f
|
-H
|
-C≡N
|
-H
|
-H
|
66.53
|
-1.82
|
26.
|
4b
|
-H
|
-H
|
-F
|
-H
|
68.41
|
-1.84
|
27.
|
4g
|
-H
|
-H
|
-OCH3
|
-H
|
80.88
|
-1.91
|
28.
|
2b
|
-F
|
-H
|
-H
|
-H
|
87.53
|
-1.94
|
29.
|
1
|
-H
|
-H
|
-H
|
-H
|
130
|
-2.11
|
30.
|
2f
|
-NHCOCHCl2
|
-H
|
-H
|
-H
|
130.94
|
-2.12
|
31.
|
2e
|
-NO2
|
-H
|
-H
|
-H
|
197.56
|
-2.30
|
32.
|
DCA
|
-
|
-
|
-
|
-
|
1011
|
-3.0048
|
DCA = Dichloroacetate
pIC50 = log1/IC50
Table-1 Selected series of compounds containing N-phenyl-2, 2-dichloroacetamide pharmacophore
2.2. Molecular modeling tools
All QSAR studies were performed using VLife Molecular Design Suite 3.5 & 4.0 [31]. Molecules were optimized by Merck Molecular Force Field (MMFF) energy minimization method [32].
2.3. Two-dimensional QSAR (2D QSAR) studies
2.3.1. Experimental design for 2D QSAR
Dataset of 32 molecules was divided into multiple training and test sets by using manual data selection method and no. of sets were generated using different combinations of molecules in training and test sets such that it covers each molecule in different set every time in an attempt to ensure robustness of QSAR model and increase its predictive ability [27]. From these sets, training & test sets which followed all model evaluation parameters were subjected to Y-randomization test (Fig. 2.).
Evaluation parameters of Y-randomization test:
n = Number of molecules
df = Degree of freedom (n-k-1) (higher is better)
k = Number of descriptors in a model (≤ n/5)
r2 = Coefficient of determination (>0.7)
q2 = Cross validated r2 (>0.5)
SEE = Standard error of estimate (smaller is better)
Pred_r2 = r2 of external test set (>0.5)
F-test = Statistical significance of the model (higher is better for same descriptors and compounds)
Best_ran_r2 = Highest r2 value in the Y-randomization test (as low as compared to r2)
Best_ran_q2 = Highest q2 value in the Y-randomization test (as low as compared to q2)
Z score = It is calculated by the Y-randomization test (higher is better)
Alpha = Statistical significance parameter by randomization test (<0.01)
Three models were selected which satisfied the results of Y-randomization test and were named as Training set-A, Training set-B and Training set-C. These models were subjected to two times for external validation by splitting them into two test sets viz. test set a1, a2 for training set-A, test set-b1, b2 for training set-B and test set c1, c2 for training set-C in order to avoid the chance correlated results. Only those models which satisfied both the test sets were selected for design of NMEs. We have ensured that selected training and test sets also satisfied the following criteria:
- Representative points of the test set must be close to those of the training set;
- Representative points of the training set must be close to representative points of the test set;
- Training set must have wide chemical and biological diversity.
2.3.2 Uni-column statistics
The comparative statistical parameters of training and test sets created by manual data selection method are reported in Table 2. The minimum and maximum values in both training and test set should be compared in a way that-
- The maximum of the test should be less than max of training set.
- The minimum of the test should be greater than min of training set.
It shows that the test set is interpolative i.e., derived within the min-max range of the training set. The mean and standard deviation of the training and test set provides insight to the relative difference of mean and point density distribution (along mean) of the two sets. Standard deviation of Training set A, B and C with test set a1, a2; b1, b2 and c1, c2 respectively were found to be nearly close to each other. This showed that even though the selected molecules in training or test sets are different, but the distribution pattern with respect to the biological activity of molecules in both the selection methods is quite similar.
Para-meters
|
Training Set B+C
|
Test set a1
|
Test set a1
|
Training Set A+C
|
Test set b1
|
Test set b2
|
Training Set A+B
|
Test set c1
|
Test set c2
|
Avg.
|
0.9653
|
0.8400
|
-1.0479
|
-0.9858
|
-0.8363
|
-0.9305
|
-0.9127
|
-0.8722
|
-0.8103
|
Max
|
0.0128
|
-0.4389
|
-0.6421
|
0.0128
|
-0.6675
|
-0.8311
|
-0.002
|
-0.4412
|
-0.2203
|
Min
|
-1.5923
|
-1.2149
|
-1.1155
|
-1.5923
|
-0.2331
|
-0.1089
|
-1.6212
|
-1.3274
|
-1.3788
|
S.D.
|
0.5629
|
0.5134
|
0.5928
|
0.5868
|
0.5626
|
0.5437
|
0.5549
|
0.6034
|
0.5273
|
Avg- Average
S.D.- Standard deviation
Table 2 Uni-Column statistics for training sets and test sets
2.3.3 Descriptor selection
Various 2D descriptors (a total of 338) were calculated and preprocessing of them was carried out by removing invariable columns.
It has been reported that there is high probability of chance correlation between the observed and predictive activity; especially when no. of descriptors is comparable or more than the no. of compounds in dataset for any QSAR analysis [31]. Thus, reduction in no. of descriptors is a very important step which is required to avoid the occurrences of chance correlation and inclusion of irrelevant descriptors in final QSAR model. We applied combinations of different descriptor selection methods viz. forward, forward-backward, genetic algorithm, simulated annealing etc. as well as different QSAR methods on same molecule sets and finally considered the results of forward variable selection method with Multiple Linear Regression (MLR) after comparing all results to improve performance as well as predictability of QSAR model.
2.3.3.1 Correlation matrix
It is very popular and crucial technique used for QSAR studies [28]. We have considered the correlation between descriptor with activity as well as their inter-correlation i.e., descriptor-descriptor correlation. We have shown only those descriptors contributed for the selected series of compounds in 2D QSAR studies; which show either direct or inverse correlation with biological activity.
DESCRIP-TORS
|
Xlog10P
|
T_N_N_3
|
SssSE_index
|
T_N_Cl_4
|
T_O_O_2
|
SddssS(sulphate)count
|
T_O_Cl_6
|
T_2_S_0
|
Common (set I & II)
|
Descriptor set - I
|
Descriptor set – II
|
Xlog10P
|
1.00
|
-
|
-
|
-
|
-
|
-
|
-
|
-
|
SssSE_index
|
-0.407
|
-
|
1.00
|
-
|
-
|
-
|
-
|
-
|
T_N_N_3
|
-0.138
|
1.00
|
0.067
|
-
|
-
|
-
|
-
|
-
|
T_N_Cl_4
|
0.275
|
-0.093
|
0.092
|
1.00
|
-
|
-
|
-
|
-
|
T_O_O_2
|
-0.396
|
0.176
|
0.137
|
-0.188
|
1.00
|
-
|
-
|
-
|
SddssS(sulphate) Count
|
0.125
|
-0.083
|
0.121
|
-0.002
|
-0.21
|
1.00
|
-
|
-
|
T_O_Cl_6
|
0.275
|
-0.092
|
0.087
|
-0.017
|
-0.013
|
-0.151
|
1.00
|
-
|
T_2_S_0
|
0.135
|
-0.075
|
0.183
|
0.009
|
0.142
|
0.103
|
0.213
|
1.00
|
pIC50
|
0.805
|
-0.428
|
0.147
|
0.358
|
-0.122
|
0.347
|
0.358
|
0.332
|
Table 3 Correlation matrix of descriptors (2D QSAR; set-I & set-II)
2.3.3.2 Fitness plot
Correlation coefficient cannot give information about data spread between the descriptor and activity. There may be some descriptors showing chance correlation with activity because each variable selection method is based upon correlation between descriptor and activity and not on the type of data spread. To avoid above said pitfall, the proper observation of fitness plot between descriptor and activity is needed [Fig 3].
The following are few important points that we have taken into consideration while selecting proper descriptors for QSAR model generation:
- We have ensured that percentage distribution of data points on both sides of best fit line should be nearly 50-50%. (We preferred slope value more than 0.15)
- In case of topological descriptor, no. of occurrences of particular data point was observed in fitness plot which gave information about the frequency of occurrence of each particular substituent in series. Thus, although the particular descriptor shows good correlation with activity as well as comes in the QSAR final model result, but we cannot take it into final consideration unless and until it shows well spread fitness plot. In conclusion, we can say that careful observation and right analysis of a fitness plot helped us to reduce no. of descriptors.
2.3.3.3 Variance
Another significant way to find out unimportant descriptors is by using information of variance of descriptors [39]. There were some descriptors which showed consistently high variance even if there was small change in physicochemical properties and vice versa. After close analysis of the output of our study, we conclude that we should focus more on correlation between descriptors and activity instead of considering descriptors of highest variance, as final results rely more on correlation than on variance.
The algorithm we followed for variable reduction is as follows –
- Define appropriate correlation cutoff value between descriptor & activity, which is mentioned as Amax. Remove all descriptors which have value less than Amax
- Define appropriate cross correlation cutoff value between descriptor-descriptor, which is defined as Cmax. Remove all descriptors which have values larger than Cmax.
- Define variance cutoff value for descriptor which is mentioned as Vmax. The descriptors having variance value less than Vmax were removed.
We observed that this algorithm reduced no. of descriptors nearly up to 50%. After this, we applied manual variable selection method and multiple linear regression (MLR) as it is ensured that each remaining descriptor is significantly contributing for QSAR model. The only thing we have to find out is no. of descriptors in final equation should be as low as possible which must be contributing highly and should be seen in the structural features of the reported compounds of the series as well [29].
2.4 Three-dimensional QSAR (3D QSAR) studies
2.4.1 Alignment of molecules
Proper alignment of molecules is the prerequisite for studying 3D QSAR as well as in almost all the fields of drug discovery for getting reliable results. So, after optimization we carried out alignment of all molecules using MolSign which also serves as the basic tool to identify the common pharmacophore features as well as the individual molecular feature.
The color scheme for identification of various chemical features is as follows:
Hydrogen bond donor: magenta color
Hydrogen bond acceptor: Buff color
Hydrophobic: Orange color
Aliphatic: Orange color
Negative ionizable: green color
Positive ionizable: Violet color
The larger tessellated spheres are indicative of the common pharmacophore features identified in the molecules and the smaller solid features are of the individual molecules.
2.4.2 3D QSAR by SA-kNN-MFA
3D QSAR studies were performed by generation of numerous models by taking same molecules in the respective training and test sets as in 2D QSAR by using k-Nearest Neighbor-Molecular Field Analysis (kNN–MFA) methodology with Simulated Annealing (SA) variable selection method as it has been reported as the more relevant and suitable method to perform 3D QSAR [36, 37, 40]. kNN–MFA requires suitable alignment of given set of molecules after optimization which had already been carried out by MolSign, but it was again carried out to generate a folder of aligned molecules to proceed for 3D QSAR by atom-based alignment which gives alignment based on each and every individual atom of the pharmacophore. Molecular alignment was used to visualize the structural diversity in the given set of molecules. It was followed by generation of common rectangular grid around the molecules. Steric and electrostatic interaction energies were computed at the lattice points of the grid using a methyl probe of charge +1. These interaction energy values at the grid points are considered for relationship generation using kNN method and utilized as descriptors for obtaining distances within this method. Resulting set of aligned molecules was then used to build 3D QSAR model.
2.5 Group based QSAR (G-QSAR) studies
Group based QSAR (G-QSAR) is a new approach to investigate structure activity relationship based on molecular fragments of a set of molecules that significantly enhances the use of QSAR as an approach for new molecule design. As a predictive tool for activity, G-QSAR is significantly superior to conventional 2D and 3D QSAR as in this method, every molecule of the data set is considered as a set of fragments, the fragmentation scheme being either template based or user defined. The descriptors are evaluated for each fragment and a relationship between these fragment descriptors is formed with the activity of the whole molecule. Unlike conventional QSAR, with the G-QSAR, we get critically important site-specific clues within a molecule where a particular descriptor needs to be modified. G-QSAR studies were also performed by generation of multiple models of the same training & test sets as in 2D & 3D QSAR by using MLR analysis. G-QSAR was carried out using template-based fragmentation scheme and forward variable selection method.
2.6 Design of New Molecular Entities (NMEs) containing N-phenyl-2, 2-dichloroacetamide pharmacophore
The information obtained from 2D, 3D and G-QSAR studies was utilized in optimizing N-phenyl-2, 2-dichloroacetamide pharmacophore and to design potent anticancer chemical entities Substitution pattern around pharmacophore, shown in Fig. 10 was used to design designed chemical entites using Combilib tool of VLife MDS software. Designed compounds were subjected to Lipinski’s screen [38] to ensure their drug like pharmacokinetic profile in order to ensure their drug like bioavailability. The following parameters were used as Lipinski’s filters (values in parenthesis indicate ideal requirements):
1. Number of Hydrogen Bond Acceptor (A) (<10)
2. Number of Hydrogen Bond donor (D) (<5)
3. Number of Rotatable Bond (R) (<10)
4. XlogP (X) (<5)
5. Molecular weight (W) (<500 g/mol)
6. Polar surface area (S) is (<140 Ǻ)
The fig. 10 (pg. no.33) indicates the structural substitution pattern required around N-phenyl-2, 2-dichloroacetamide based on the result of 2D and 3D QSAR studies.
2.7 Molecular Docking studies
All the designed compounds that showed good predicted activity by all QSAR studies and followed Lipinski’s rule as well as reported series of molecules for comparison purpose were subjected to molecular docking for studying the binding mode of designed compounds and were further screened to sort out the best compounds having good binding affinity compared with binding mode of standard DCA. The main molecular docking tool used was GLIDE (Maestro; Schrödinger Inc., USA) for protein-ligand docking studies in to the receptor Pyruvate Dehydrogenase Kinase (PDK) enzyme binding pocket [43]. The crystal structures of PDKs were obtained from Protein Data Bank. (PDK II-PDB Code: 2BU8; PDK I-PDB Code: 2Q8H) [42]. All structures were prepared for docking using ‘Protein preparation wizard’ and ‘Ligand preparation wizard’ in Maestro wizard of Schrödinger 10.2. In the refinement component, a restrained impact minimization of the co-crystallized complex was carried out. This helps in reorientation of side chain hydroxyl groups. It uses the OPLS-AA force field for this purpose. The co-crystallized ligand was removed from active site and the grids were defined by centering them on the ligand in the crystal structure. Then our structures (ligands) were imported in the project table, built using maestro structure builder panel and prepared by Ligprep module which produces the low energy conformers of ligands using MMFF94 force field. The lower energy conformations of the ligands were selected and docked into the grid generated from protein structure using extra precision (XP) docking mode. In this docking method, the ligands are flexible and receptor is rigid, except the active binding site of protein which has slight flexibility. The final evaluation is done with glide score (docking score) and single best pose is generated as the output for particular ligand.
G-score = a*vdw + b*coul + Lipo + H-bond + Metal + BuryP + RotB + Site ……equation 1
where, vdW, Van der Waals energy; Coul_Coulombic energy; Lipo-Lipophilic contact term; HBond, Hydrogen-bonding term; Metal_Metal-binding term; BuryP-Penalty for buried polar groups; RotB, Penalty for freezing rotatable bonds; Site, Polar interactions at the active site. The coefficients of vdW and Coul are: a = 0.065, b = 0.130 respectively.
The accurate prediction of protein-ligand interaction geometries is essential for the success of virtual screening approach in structure-based drug design. The docking results were evaluated based on Glide Score (G-Score), Hydrogen bonds (H-bond) and Vander Waals (vdW) interactions between ligand and receptor.
2.7.1 Molecular docking studies using other software tools (VLife MDS 3.5 & Molegro Virtual Docker, 2007)
After consideration of docking results obtained using glide (Schrödinger), same compounds were again subjected to molecular docking studies by using VLife MDS 3.5 & MVD 2007 to only check out whether the obtained docking score for all the compounds are in similar order with that of the obtained by glide or not. The results of docking studies using glide, Schrödinger are shown in Table no.10 and Figure no.11 to 15.
2.7.1.1 VLife MDS 3.5
In VLife MDS 3.5, the compounds were subjected to grid-based batch docking with keeping fitness function as dock score and by setting all the parameters to their default values.
Grid batch docking parameters are:
Default grid interval size = 1 O
Rotation angle for ligand rotation = 100⁰
No. of bumps allowed = 4
Specify cavity = 1
Log = Enable logging
VLife MDS uses following fitness functions for calculating docking score:
E (docking score) = InterEq + InterEvdW + IntraEq + IntravdW + IntraEtor…. equation 2
Where, InterEq = Intermolecular electrostatic energy of complex
InterEvdW = Intermolecular vdW energy of complex
IntraEq = Intramolecular electrostatic energy of ligand
IntraEvdW = Intramolecular vdW energy of ligand
IntraEtor = Intramolecular torsion energy of ligand
All the energy components were calculated using MMFF force field.
2.7.1.2 Molegro Virtual Docker, 2007
In MVD 2007, initially the cavity detection was carried out followed by which the compounds in. mol format were imported in the same workshop and docking wizard was set according to the binding site cavity selection and volume size of the axis adjustment. Customize search algorithm parameters were set at their default values. After checking again for any warning with the whole setup, the docking wizard was started as a separate process.
Customize search algorithm parameters are:
Algorithm = MolDock optimizer
No. of runs = 10
Population size = 50
Max iterations = 2000
Scaling factor = 0.50
Crossover rate = 0.90
Maximum No. of multiple poses of each run = 5
MVD uses following equation for calculating MolDock score:
Escore = Einter + Eintra…. equation 3
Where, Einter = Ligand-Protein interaction energy
Eintra = Internal energy of the ligand
2.8 Prediction of ADMET properties
Sometimes compounds that show very high activity in vitro however are proved later to have no in vivo activity, or to be highly toxic in in-vivo models. Lack of in vivo activity may be attributed to undesirable pharmacokinetic properties and the toxicity may result from the formation of reactive metabolites. The failure of NMEs at latter stages of drug discovery process due to lack of drug like pharmacokinetic profile has forced us to set filters of ADMET properties. Thus, we have ensured that only drug-like NMEs would be selected for experimental validation. All designed compounds which showed good binding affinity were filtered by predicting their Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) properties by means of QikProp tool of Schrödinger 10.2 [44] as well as Discovery Studio (DS), Accelrys [45]. Prediction of ADMET properties was used as the last screen to sort out those compounds which already followed Lipinski’s rule, showed good predicted activity as well as good binding affinity with PDK enzymes.
2.8.1 ADMET prediction by QikProp, Schrödinger 10.2
In order to ensure druglike pharmacokinetic profile, ADMET properties were predicted. QikProp tool of Schrödinger predicts both physicochemical significant descriptors and pharmacokinetically relevant properties. It also evaluates the acceptability of analogues based on Lipinski’s rule of 5, which is essential to ensure drug like pharmacokinetic profile while using rational drug design. All the analogues were neutralized before being used by QikProp. This program is designed using the BOSS program and the OPLS-AA force field. It uses Monte Carlo statistical mechanics simulations on organic solutes in periodic boxes of explicit water molecules to perform all predictions. This process resulted in configurationally averages for a no. of descriptors, including H-bond counts and solvent-accessible surface area (SASA). Correlations of these descriptors to determine properties experimentally were obtained and then algorithms that mimic the full Monte Carlo simulations and produce comparable results were developed by the QikProp tool.
2.8.2 ADMET prediction by Discovery Studio, Accelrys
DS provides the methods for assessing the disposition and potential toxicity of a ligand within an organism as the ADMET protocols contain published models which are used to compute and analyze ADMET properties. Thus, ADMET properties specify the rules to remove ligands which are not drug-like, unsuitable leads etc. based on the presence or absence and frequency of certain chemical groups.