Machine learning algorithms have become important tools in the drug discovery process, and a variety of machine learning methods are now used to establish QSAR models. In our study, the QSAR model generated with open-source Python tools predicted the external test-set compounds well. The ordinary least squares (OLS) regression fitted through the statsmodels formula module gave R2 = 0.542, F = 8.773, adjusted R2 (Q2) = 0.481, and std. error = 0.061 (Table 2), with regression coefficients (reg.coef_) of −0.00064 (PC1), −0.07753 (PC2), −0.09078 (PC3), −0.08986 (PC4), and +0.05044 (PC5) and an intercept (reg.intercept_) of 4.79279. Test-set prediction was performed with the MLR, SVM, and PLS regression estimators of the sklearn module, which returned model scores of 0.5424, 0.6422, and 0.6422, respectively. The model was validated on the test set and predicted values comparable to those of the training set. A linear curve plotted between the predicted and actual pIC50 values shows the data falling close to the central identity line (Fig. 6). The model scores obtained with these three algorithms correlated well, with little variance among them, and the model may be useful in the design of similar thiazole analogs as anticancer agents.
Generated multiple linear regression model:
pIC50 = 4.79279 − 0.00064 (PC1) − 0.07753 (PC2) − 0.09078 (PC3) − 0.08986 (PC4) + 0.05044 (PC5)
Table 2
OLS Regression Results

Dep. Variable:      target            R-squared:             0.542
Model:              OLS               Adj. R-squared:        0.481
Method:             Least Squares     F-statistic:           8.773
Date:               Mon, 27 Dec 2021  Prob (F-statistic):    1.47e-05
Time:               13:43:07          Log-Likelihood:       -18.552
No. Observations:   43                AIC:                   49.10
Df Residuals:       37                BIC:                   59.67
Df Model:           5
Covariance Type:    nonrobust

              coef    std err        t      P>|t|     [0.025     0.975]
Intercept   4.7928      0.061   78.262      0.000      4.669      4.917
PC1        -0.0006      0.009   -0.074      0.941     -0.018      0.017
PC2        -0.0775      0.019   -4.072      0.000     -0.116     -0.039
PC3        -0.0908      0.022   -4.174      0.000     -0.135     -0.047
PC4        -0.0899      0.033   -2.761      0.009     -0.156     -0.024
PC5         0.0504      0.034    1.495      0.143     -0.018      0.119

Omnibus:            2.933             Durbin-Watson:         1.185
Prob(Omnibus):      0.231             Jarque-Bera (JB):      2.283
Skew:               0.564             Prob(JB):              0.319
Kurtosis:           3.054             Cond. No.              7.08
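The workflow that produces a summary of this kind can be sketched with the statsmodels formula interface mentioned above; the descriptor matrix here is synthetic stand-in data (in the actual study the PC1–PC5 scores come from PCA of the computed descriptors):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 43 compounds described by 5 principal-component
# scores (PC1-PC5); these are random here, purely for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(43, 5)),
                  columns=[f"PC{i}" for i in range(1, 6)])
df["target"] = (4.79 - 0.08 * df["PC2"] - 0.09 * df["PC3"]
                + rng.normal(scale=0.3, size=43))

# Fit OLS through the statsmodels formula module, as in the text
model = smf.ols("target ~ PC1 + PC2 + PC3 + PC4 + PC5", data=df).fit()
print(model.summary())                   # prints a table in the style of Table 2
print(model.rsquared, model.rsquared_adj)
```

With 43 observations and five predictors plus an intercept, the residual degrees of freedom are 43 − 6 = 37, matching the table.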

The purpose of linear regression analysis is to find a linear function of a collection of predictor variables that minimizes the distance between the predicted and observed outcome values over a set of training data points. Multivariate linear regression was used extensively in early QSAR approaches, e.g., Hansch and Free–Wilson analysis. However, the use of linear regression models in QSAR is complicated by feature correlations and high-dimensional feature spaces, which can lead to model overfitting. Several approaches are available to tackle these two issues, such as regularization, dimensionality reduction, and genetic algorithms [20]. Dimensionality reduction approaches, such as principal component analysis (PCA), reduce large sets of correlated variables into smaller groups of uncorrelated variables [21]. Gao et al. (1999) [22] employed this approach to reduce the correlation between the variables in predicting estrogen receptor interactions.
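The PCA step can be sketched minimally with scikit-learn on a hypothetical correlated descriptor matrix (the shapes and variable names are illustrative, not taken from the study):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical descriptor matrix: 43 compounds x 20 correlated descriptors,
# built from 5 latent factors so that 5 components capture nearly all variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(43, 5))
X = latent @ rng.normal(size=(5, 20))

pca = PCA(n_components=5)
scores = pca.fit_transform(X)     # uncorrelated PC scores used as regressors
print(scores.shape)                          # (43, 5)
print(pca.explained_variance_ratio_.sum())   # close to 1.0 here by construction
```

The resulting score columns are mutually uncorrelated, which is exactly the property that makes them safer inputs for the downstream regression.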
The supervised machine learning algorithm SVM solves the classification problem by mapping data into a high-dimensional space with nonlinear kernel functions and determining the best separating hyperplane [23]. The hyperplane is a linear decision boundary that maximizes the margin across the support vectors, the points nearest to the boundary. Nekoei et al. (2015) [24] employed a genetic variable selection technique in conjunction with SVMs to identify several structural features of aminopyrimidine-5-carbaldehyde oxime analogs that are essential for their strong VEGFR-2 inhibition effect. In our study, test-set prediction with the SVM estimator of the sklearn module gave a model score of 0.6422, and the linear curve between the predicted and actual pIC50 values is plotted in Fig. 7.
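The scikit-learn scoring step can be sketched as follows; the data are synthetic, and a linear kernel is used for this toy example (the kernel choice is an assumption, not the study's setting):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic PC-score matrix and a pIC50-like response (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(43, 5))
y = 4.8 - 0.08 * X[:, 1] - 0.09 * X[:, 2] + rng.normal(scale=0.01, size=43)

# Hold out an external test set, fit support vector regression, and report
# the coefficient of determination R^2 returned by .score()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
svr = SVR(kernel="linear", C=10.0, epsilon=0.01).fit(X_tr, y_tr)
print(svr.score(X_te, y_te))   # the "model score" on the held-out set
```

For a continuous endpoint such as pIC50, `SVR` is the regression counterpart of the SVM classifier, and its `.score()` is R^2 on the supplied data.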
On the other hand, the PLS regression algorithm combines dimensionality reduction with multiple regression, transforming the variables into uncorrelated components that are optimally associated with the activity or feature of interest. PLS is widely utilized in 3D-QSAR [25], and Eriksson et al. (2003) [26] suggest it as a first-line technique for QSAR modeling because of its improved efficiency and accuracy compared with naively combining unsupervised dimensionality reduction and multiple regression. Although linear regression analysis has been effective in many drug optimization applications, its underlying linearity and vector space assumptions do not hold for most QSAR applications. As a result, even careful selection of features and of the analyzed system is sometimes not enough to ensure the success of linear regression models. Our PLS model showed a fit score of 0.6422.