3.1. QSAR equation
Choosing the best descriptors for a given dataset is an important step in machine learning because the choice strongly influences the predictive performance of the model. Descriptors can be selected by several methods, including exhaustive search, optimization algorithms, and multivariate statistical analysis. Four main concerns guide the selection: the correlation between a descriptor and the property to be predicted, the interpretability of the descriptor, its computational cost, and the amount of data required in a machine learning project. Common optimization algorithms for descriptor selection include evolutionary programming (EP), ant colony optimization (ACO), and genetic algorithms (GA).
In this investigation, the stepwise (SW) method was employed to identify the most suitable descriptors, and the model was built by multiple linear regression (MLR). MLR is widely used in QSAR because it is straightforward to execute, gives consistent results, and allows convenient interpretation of the outcomes [23]. Once the optimal descriptors were selected, the correlation between them and the activity of the drug compounds was established using a stepwise incremental approach. The samples were divided randomly into a training set (75%) and a test set (25%), and stepwise multiple linear regression was carried out in IBM SPSS 20 [24]. This division was repeated three times to obtain the optimal correlation. The training set was used to develop the model, while the test set was used to assess its performance. The resulting QSAR equation is as follows:
pIC50 = 3.202 + 4.016(HATS8e) - 1.792(MATS1v) - 0.731(GATS6m) + 18.455(R5u+) - 2.363(HATS7u) + 5.496(G2u)    (1)
The derived equation shows that MATS1v, GATS6m, and HATS7u have negative coefficients, which implies that the activity of the compounds decreases as the values of these descriptors increase. Conversely, HATS8e, R5u+, and G2u have positive coefficients, so the activity increases as these values increase. According to the obtained model, compound 24 has the highest biological activity and compound 48 the lowest. Compound 24 carries electron-withdrawing substituents in the R2 and R3 positions. In the R5 position, it has a phenol ring with an electron-withdrawing chlorine substituent in the meta position; this ring can participate in π-π interactions and thereby interact better with the binding site. In the R6 position, there is a -CH=CH-CH2-OH group, which can draw the electrons of the ring towards itself by resonance. Compound 48 carries resonance electron-withdrawing substituents, which decrease the activity.
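As a quick check of how Eq. (1) is applied, the following minimal sketch evaluates it for the first entry of Table 3. The coefficients are copied from Eq. (1); the names `COEFFICIENTS`, `predict_pic50`, and `entry_1` are illustrative and do not come from the original workflow, which was carried out in SPSS.

```python
# Coefficients copied from Eq. (1); the dictionary layout is only illustrative.
INTERCEPT = 3.202
COEFFICIENTS = {
    "HATS8e": 4.016,
    "MATS1v": -1.792,
    "GATS6m": -0.731,
    "R5u+": 18.455,
    "HATS7u": -2.363,
    "G2u": 5.496,
}


def predict_pic50(descriptors):
    """Evaluate Eq. (1) for one compound given a name -> value mapping."""
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Descriptor values of entry 1 in Table 3.
entry_1 = {
    "HATS8e": 0.505, "MATS1v": -0.043, "GATS6m": 0.994,
    "R5u+": 0.039, "HATS7u": 0.456, "G2u": 0.275,
}

print(round(predict_pic50(entry_1), 2))  # 5.73, the pIC50theo listed for entry 1
```

Because the descriptors span different numerical ranges (R5u+ is of the order of 0.04, GATS6m of the order of 1), the raw coefficients alone do not indicate which descriptor matters most.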
Table 2 presents the statistical parameters of the QSAR equation. To identify the most suitable regression model, several statistics were employed, including R, R2, adjusted R2, F, RMSE, and Q2cv. R2, the coefficient of determination, measures the extent to which the independent variables explain the variance in the dependent variable. Although R2 is widely used across research fields, there are no standardized guidelines for the level at which a prediction becomes acceptable. Additional statistical parameters, such as the mean squared error (MSE), the root mean squared error (RMSE), and the cross-validated coefficient of determination (Q2cv), therefore also contribute to assessing the equation's acceptability, validity, and predictive capability. The MSE, defined as the average of the squared differences between the predicted and experimental values, measures the overall accuracy of the model; a desirable model has an MSE close to zero [25]. The RMSE is the square root of the MSE, and lower values of both indicate better performance. The coefficient of determination (R2) of the equation is 0.74, indicating a moderate level of correlation. Q2 represents the predictive power and validity of the model and is considered acceptable if it exceeds 0.5. Table 3 lists the descriptor values together with the experimental and predicted activities, and Figures 1 and 2 illustrate the correlation between the experimental and predicted values.
Table 2. Statistical parameters for the QSAR model
R (train) | R2 (train) | F (train) | MSE (train) | R2 (test) | RMSEP (test) | Q2LOO | cRp2
0.86 | 0.74 | 15.4 | 0.3 | 0.6 | 0.48 | 0.62 | 0.56
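The statistics discussed above follow standard definitions, which the sketch below spells out with NumPy. It is an illustration under stated assumptions rather than the SPSS procedure used in this study: the function names are hypothetical, the MSE divides by the number of compounds (as in the definition given in the text, whereas regression software often divides by the residual degrees of freedom), and Q2 is computed by the common leave-one-out formula 1 - PRESS/SStot.

```python
import numpy as np


def regression_metrics(y_exp, y_pred):
    """Return R2, MSE and RMSE for paired experimental and predicted values."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_exp - y_pred) ** 2))
    ss_tot = float(np.sum((y_exp - y_exp.mean()) ** 2))
    mse = ss_res / y_exp.size  # average squared difference
    return {"R2": 1.0 - ss_res / ss_tot, "MSE": mse, "RMSE": mse ** 0.5}


def q2_loo(X, y):
    """Leave-one-out Q2 of a least-squares model with an intercept."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Refit without compound i, then predict compound i.
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Experimental and predicted pIC50 values of the first five entries of Table 3.
y_exp = [6.07, 4.75, 4.84, 4.49, 5.00]
y_pred = [5.73, 5.15, 5.36, 5.01, 4.92]
print(regression_metrics(y_exp, y_pred))
```

Five compounds are far too few for meaningful statistics; the call only shows the expected input format.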
Table 3. Chemical descriptors, experimental and predicted activities
Num | HATS8e | MATS1v | GATS6m | R5u+ | HATS7u | G2u | pIC50exp | pIC50theo
1 | 0.505 | -0.043 | 0.994 | 0.039 | 0.456 | 0.275 | 6.07 | 5.73
2 | 0.511 | -0.033 | 1.227 | 0.039 | 0.461 | 0.201 | 4.75 | 5.15
3 | 0.512 | -0.032 | 0.951 | 0.038 | 0.453 | 0.201 | 4.84 | 5.36
4 | 0.327 | -0.115 | 1.311 | 0.074 | 0.446 | 0.171 | 4.49 | 5.01
5 | 0.506 | -0.046 | 1.703 | 0.039 | 0.463 | 0.223 | 5.00 | 4.92
6 | 0.513 | -0.032 | 0.981 | 0.038 | 0.449 | 0.223 | 5.39 | 5.47
7 | 0.481 | -0.046 | 1.703 | 0.032 | 0.481 | 0.207 | 4.97 | 4.56
8 | 0.208 | -0.038 | 1.696 | 0.037 | 0.206 | 0.228 | 4.30 | 4.31
9 | 0.506 | -0.046 | 1.703 | 0.039 | 0.463 | 0.223 | 5.00 | 4.92
10 | 0.574 | -0.037 | 1.927 | 0.04 | 0.49 | 0.228 | 5.18 | 5.00
11 | 0.448 | -0.046 | 1.576 | 0.057 | 0.416 | 0.201 | 4.69 | 5.11
12 | 0.177 | -0.067 | 1.083 | 0.047 | 0.22 | 0.211 | 4.54 | 4.75
13 | 0.639 | -0.159 | 1.264 | 0.048 | 0.555 | 0.174 | 5.36 | 5.66
14 | 0.684 | -0.012 | 1.172 | 0.042 | 0.511 | 0.179 | 5.84 | 5.67
15 | 0.475 | -0.07 | 1.216 | 0.064 | 0.383 | 0.173 | 6.66 | 5.57
16 | 0.631 | -0.017 | 1.2 | 0.042 | 0.514 | 0.179 | 5.80 | 5.43
17 | 0.53 | -0.191 | 1.214 | 0.044 | 0.429 | 0.169 | 5.23 | 5.51
18 | 0.604 | -0.126 | 1.087 | 0.035 | 0.417 | 0.155 | 5.32 | 5.57
19 | 0.696 | -0.137 | 1.212 | 0.048 | 0.552 | 0.174 | 6.66 | 5.89
20 | 0.566 | -0.062 | 0.811 | 0.044 | 0.358 | 0.162 | 5.71 | 5.85
21 | 0.632 | -0.112 | 1.659 | 0.032 | 0.646 | 0.182 | 4.64 | 4.79
22 | 0.657 | -0.083 | 1.184 | 0.044 | 0.506 | 0.196 | 5.55 | 5.82
23 | 0.512 | -0.059 | 1.185 | 0.061 | 0.378 | 0.185 | 5.92 | 5.75
24 | 0.728 | -0.012 | 1.164 | 0.042 | 0.511 | 0.193 | 5.43 | 5.56
25 | 0.446 | -0.199 | 1.172 | 0.045 | 0.318 | 0.18 | 5.20 | 4.78
26 | 0.375 | -0.164 | 0.544 | 0.04 | 0.391 | 0.186 | 5.65 | 5.47
27 | 0.331 | -0.189 | 1.393 | 0.036 | 0.391 | 0.217 | 5.14 | 5.05
28 | 0.426 | -0.229 | 1.247 | 0.045 | 0.464 | 0.164 | 4.84 | 5.00
29 | 0.313 | 0.204 | 1.005 | 0.042 | 0.42 | 0.206 | 5.67 | 5.61
30 | 0.395 | 0.11 | 0.621 | 0.045 | 0.398 | 0.217 | 5.55 | 5.70
31 | 0.345 | 0.151 | 1.185 | 0.044 | 0.397 | 0.2 | 5.31 | 4.96
32 | 0.433 | 0.114 | 0.588 | 0.045 | 0.421 | 0.209 | 5.55 | 5.70
33 | 0.339 | 0.166 | 0.978 | 0.028 | 0.225 | 0.193 | 5.52 | 5.19
34 | 0.534 | 0.125 | 0.824 | 0.04 | 0.613 | 0.195 | 4.80 | 5.33
35 | 0.499 | 0.125 | 0.72 | 0.038 | 0.618 | 0.194 | 5.38 | 5.13
36 | 0.539 | 0.204 | 0.395 | 0.034 | 0.655 | 0.183 | 5.72 | 5.29
37 | 0.551 | 0.16 | 0.468 | 0.036 | 0.6 | 0.204 | 5.14 | 5.54
38 | 0.528 | 0.188 | 1.656 | 0.029 | 0.516 | 0.175 | 5.42 | 5.46
39 | 0.304 | -0.143 | 0.729 | 0.395 | 0.034 | 0.163 | 4.51 | 4.74
40 | 0.224 | -0.118 | 1.227 | 0.468 | 0.043 | 0.166 | 5.02 | 4.67
41 | 0.306 | -0.146 | 1.311 | 1.018 | 0.033 | 0.179 | 4.47 | 4.34
42 | 0.31 | -0.135 | 1.703 | 1.04 | 0.033 | 0.171 | 4.15 | 4.32
43 | 0.335 | -0.24 | 0.701 | 0.035 | 0.296 | 0.177 | 3.60 | 4.52
44 | 0.338 | -0.149 | 1.047 | 0.034 | 0.276 | 0.192 | 4.07 | 4.56
45 | 0.342 | -0.095 | 1.016 | 0.034 | 0.276 | 0.185 | 4.70 | 4.60
46 | 0.342 | -0.095 | 1.47 | 0.034 | 0.276 | 0.182 | 4.85 | 5.07
47 | 0.256 | -0.118 | 0.346 | 0.056 | 0.269 | 0.115 | 5.27 | 5.10
48 | 0.301 | -0.146 | 1.05 | 0.032 | 0.3 | 0.179 | 4.54 | 4.28
49 | 0.306 | -0.135 | 1.081 | 0.033 | 0.301 | 0.201 | 4.48 | 4.40
50 | 0.346 | 0.149 | 0.183 | 0.033 | 0.285 | 0.171 | 4.44 | 4.41
51 | 0.353 | 0.095 | 1.046 | 0.033 | 0.284 | 0.188 | 4.40 | 4.66
52 | 0.145 | 0.524 | 0.729 | 0.036 | 0.25 | 0.157 | 5.08 | 5.13
3.2. Investigation of the effect of descriptors
Standardized coefficients in regression analysis allow the relative importance of the variables in a model to be compared. Standardization scales each variable to a mean of 0 and a standard deviation of 1, which makes the coefficients unitless and directly comparable. This is particularly useful when the units of the variables have no intrinsic meaning, because it yields a measure of effect size that does not depend on the scale of the variables. Standardized coefficients therefore show which predictors have the more substantial impact on the dependent variable, even when the predictors were originally measured in different units [26, 27]. As Figure 3 shows, HATS8e has the largest positive effect, while HATS7u has the most negative effect.
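A minimal sketch of how standardized coefficients can be obtained is given here, assuming an ordinary least-squares model: each variable is z-scored and the regression is refitted, which is equivalent to multiplying each raw coefficient by the ratio of the descriptor's standard deviation to that of the response. The function name and the synthetic data are illustrative only and are not the values behind Figure 3.

```python
import numpy as np


def standardized_coefficients(X, y):
    """Slopes of a least-squares fit after z-scoring every column of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # No intercept column is needed because all variables are centered.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta


# Synthetic example: two predictors on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) * np.array([1.0, 100.0])
y = 2.0 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(scale=0.1, size=40)
print(standardized_coefficients(X, y))
```

In this synthetic case the raw slopes differ by a factor of 100 only because of the units, whereas the standardized values are of similar size, which is the comparison the section relies on.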
3.3. Evaluation of the applicability domain (AD)
Defining the applicability domain (AD) of a QSAR model is important for ensuring the reliability of the model's predictions and for estimating the uncertainty of the prediction for a given chemical compound. The AD identifies the boundary within which the model's predictions are reliable, typically for interpolation rather than extrapolation [24]. The threshold leverage (h) used to construct the Williams plot is given by:
h = 3(d - 1)/m    (2)
Here, d is the number of selected descriptors and m is the number of compounds in the training set. The Williams plot is constructed with the standardized residuals on the y-axis and the leverage values on the x-axis, and it is used to assess the model's acceptability: points are deemed acceptable if they fall within the plotted limits. To assess the extent to which the model can be applied, the residual error technique was employed to gauge how well the model predicts the experimental behavior (Figure 4).
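The quantities plotted in a Williams plot can be sketched as follows. This is an illustration under several assumptions that the text does not specify: leverages are taken from the hat matrix of a least-squares model with an intercept, residuals are scaled by the residual standard deviation of the training fit, and the residual limit is set to ±3 standard deviations, a common convention. The threshold follows Eq. (2) as printed; note that the QSAR literature more often quotes 3(d + 1)/m. The function name is hypothetical.

```python
import numpy as np


def williams_plot_data(X_train, y_train, X_query, y_query):
    """Leverage, standardized residual and in-domain flag for query compounds."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    y_query = np.asarray(y_query, dtype=float)
    m, d = X_train.shape  # training compounds, descriptors

    A = np.column_stack([np.ones(m), X_train])
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

    # Leverage h_i = q_i (A^T A)^-1 q_i^T for every query row q_i.
    xtx_inv = np.linalg.pinv(A.T @ A)
    leverage = np.einsum("ij,jk,ik->i", Q, xtx_inv, Q)

    # Residuals scaled by the residual standard deviation of the training fit.
    train_res = y_train - A @ coef
    s = np.sqrt(np.sum(train_res ** 2) / (m - d - 1))
    std_res = (y_query - Q @ coef) / s

    h_threshold = 3 * (d - 1) / m  # Eq. (2) as printed in the text
    inside = (leverage <= h_threshold) & (np.abs(std_res) <= 3.0)  # assumed +/-3 limit
    return leverage, std_res, h_threshold, inside
```

Passing the training set as both the training and the query data gives the training points of the plot; passing the test set as the query data gives the test points against the same threshold.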
3.4. Y-randomization test
Y-randomization is a technique for validating Quantitative Structure-Activity Relationship (QSAR) and Quantitative Structure-Property Relationship (QSPR) models. It compares the performance of the original model in data description (R2) with that of models built for randomly shuffled responses, using the original descriptor pool and the original model-building procedure. This comparison helps assess whether the model's performance reflects actual relationships in the data or arises simply by chance, and it is therefore a valuable tool for ensuring the robustness and reliability of QSAR and QSPR models [28]. The cRp2 parameter is calculated from the difference between the coefficient of determination of the original model and that of the random models. The statistical parameters of the Y-randomization test are provided in Table 4.
Table 4. Y-randomization test results for the MLR model
Model type | R | R2 | Q2LOO
Original | 0.861742 | 0.742599 | 0.619
Random 1 | 0.251976 | 0.063492 | -0.37421
Random 2 | 0.236189 | 0.055785 | -0.3814
Random 3 | 0.436241 | 0.190307 | -0.18281
Random 4 | 0.447886 | 0.200602 | -0.2198
Random 5 | 0.294407 | 0.086675 | -0.27785
Random 6 | 0.256614 | 0.065851 | -0.51294
Random 7 | 0.418105 | 0.174812 | -0.28456
Random 8 | 0.376966 | 0.142103 | -0.19086
Random 9 | 0.425881 | 0.181374 | -0.2586
Random 10 | 0.224218 | 0.050274 | -0.42575
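The logic of a Y-randomization test can be sketched as follows, using synthetic data rather than the compounds of this study. Two simplifications are assumed and should be read as such: the descriptors are kept fixed and only the regression is refitted for each shuffled response (a full test would also repeat the descriptor selection), and cRp2 is computed with one commonly cited definition, cRp2 = R × sqrt(R2 - mean Rr2), which may differ from the exact formula used to obtain the value in Table 2. All function names are illustrative.

```python
import numpy as np


def r_squared(X, y):
    """R2 of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_trials=10, seed=0):
    """Refit the model on shuffled responses and summarize the outcome."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r2_original = r_squared(X, y)
    # Each trial breaks the descriptor-activity link by permuting y.
    r2_random = np.array(
        [r_squared(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    gap = max(r2_original - r2_random.mean(), 0.0)
    c_rp2 = np.sqrt(r2_original) * np.sqrt(gap)  # assumed definition of cRp2
    return r2_original, r2_random, c_rp2


# Synthetic data with a real descriptor-activity relationship.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.2, size=40)
r2_original, r2_random, c_rp2 = y_randomization(X, y)
print(r2_original, r2_random.max(), c_rp2)
```

A model that captures a genuine relationship should give an original R2 well above every randomized R2, which is the pattern shown in Table 4.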