Quantitative structure–activity relationship modeling of hydroxylated polychlorinated biphenyls as constitutive androstane receptor agonists

Hydroxylated polychlorinated biphenyls (OH-PCBs), a series of toxic chemical compounds produced via biotic and abiotic transformation of polychlorinated biphenyls (PCBs), are known to cause endocrine disruption by interacting inappropriately with human nuclear receptors. Due to occurrence of high numbers of inactive OH-PCB congeners recorded in many experimental toxicity studies, it is pertinent to develop rapid and inexpensive QSAR models that can reliably predict the activities of OH-PCB congeners prior to experimental testing. Using a combination of genetic function approximation and multiple linear regression methods, a local QSAR model, consisting of six 2D descriptors (MATS1s, VE3_DzZ, VE1_Dzp, SpMin8_Bhv, SpMax5_Bhi, topoRadius) and two 3D descriptors (RDF95u, RDF45m), was developed from a training set of 44 OH-PCBs. Statistical parameters for fitting (R2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}^{2}$$\end{document} = 0.8902, Radj2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}_{adj}^{2}$$\end{document} = 0.8651, s = 0.2840), cross-validation (QLOO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Q}_{LOO}^{2}$$\end{document} = 0.8201, RMSECV\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${RMSE}_{CV}$$\end{document} = 0.3242), and Y-randomization (cRp2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${cR}_{p}^{2}$$\end{document} = 0.8019) obtained for the developed QSAR model indicate that the model is reliable, robust, and provides good fit to the data in the training set. The results of external validation carried out on 20 OH-PCBs in the test set also indicate that the developed QSAR model possessed good external predictivity and can be used to predict the agonistic activities of untested OH-PCB congeners to constitutive androstane receptor.

In recent times, research efforts are being directed towards understanding the mechanisms involved in endocrine disruption by persistent organic pollutants. One of the most widely reported group of persistent organic pollutants with high endocrine disrupting potencies is polychlorinated biphenyls (PCBs) and their metabolites. Hydroxylated polychlorinated biphenyls (OH-PCBs) are a group of exogenous chemicals that are produced by oxidation of PCBs through a variety of mechanisms, including metabolic transformation in living organisms and abiotic reactions with hydroxyl radicals [48]. Recently, the possible roles of nuclear receptors in mediating the endocrine disrupting effects of OH-PCBs are being vigorously studied by many research groups. Using a variety of in vitro bioassay methods, researchers have shown that some OH-PCB congeners act as agonists of estrogen receptor α (ERα) [49], estrogen receptor β (ERβ) [49], constitutive androstane receptor (CAR) [50], retinoid X receptor β (RXRβ) [51], retinoic acid receptor γ (RARγ) [51], and estrogen-related receptor γ (ERRγ) [52]. Some OH-PCB congeners have also been demonstrated in some in vitro studies to act as ERα antagonists [49], ERβ antagonists [49], androgen receptor (AR) antagonists [49], glucocorticoid receptor (GR) antagonists [49], and thyroid hormone receptor (TR) antagonists [53].
The major downside of the in vitro studies reviewed in the preceding paragraph is the high numbers of inactive OH-PCB congeners encountered in most of the reported experiments. For instance, only 22% and 38% of the OH-PCB congeners selected for investigation exhibited agonistic activities against RXRβ and RARγ, respectively [51].
Similarly, only 9%, 6%, and 30% of the OH-PCB congeners selected for investigation exhibited antagonistic activities against ERα, ERβ, and GR, respectively [49]. The low numbers of active OH-PCB congeners recorded in most of the experimental studies reviewed was due to the approach adopted by researchers in selecting OH-PCB congeners used in the experiments. In most experimental studies, the selections of chemicals used for toxicity testing were usually based on trial-and-error approach rather than on rational basis. This has resulted in waste of time and other valuable resources. In order to prioritize the choice of OH-PCB congeners for toxicity testing, it is pertinent to develop rapid and inexpensive quantitative structure-activity relationship (QSAR) models that can reliably predict the biological activities of OH-PCBs prior to experimental studies. QSAR modeling is a computational approach that establishes a correlation between the biological activities and measured or computed molecular features of a series of chemical compounds [54]. It is based on the assumption that changes in molecular structures of chemical compounds reflect corresponding changes in the observed biological activities [54]. Exposure of humans to PCBs and OH-PCBs has been linked to onset of obesity, diabetes, and fatty liver disease, and these metabolic disorders are thought to be mediated through activation of constitutive androstane receptor by this group of persistent organic pollutants [55]. Although QSAR models have been developed for toxicity prediction of some persistent organic pollutants, no QSAR model is currently available in the literature for the prediction of agonistic activities of OH-PCB congeners to constitutive androstane receptor. The objective of this study was to develop a QSAR model that can reliably predict the agonistic activities of untested OH-PCBs to constitutive androstane receptor using the limited experimental data available in literature.

Methods
The procedures adopted for constructing and validating the QSAR model described in this paper are summarized in the workflow displayed in Fig. 1 [56]. Details of the steps involved in these procedures are described below.

Dataset acquisition and molecular descriptor computation
The dataset used for the construction and validation of the QSAR model developed in this paper was obtained from the literature [50]. This dataset consists of structural formulae of 64 OH-PCB congeners and the experimental values of their agonistic activities to constitutive androstane receptor (Table S1 in the Supplementary Information). The agonistic activity of each OH-PCB congener in the dataset (reported as tenfold effective concentration) was measured in a reporter gene assay using yeast cells transduced with human constitutive androstane receptor [50]. This agonistic activity (EC × 10) was defined as the concentration of OH-PCB in solution that produced luminescence intensity that was 10 times greater than the luminescence intensity of a blank solution [50]. Before being used as response variable in the model building step, the original activity values were converted to logarithmic scale ( log1∕(EC × 10) ). Using semi-empirical PM6 optimized molecular structures obtained with Spartan '14 program (version 1.1.4) as input [57], a total of 1875 one-dimensional (1D), two-dimensional (2D) and three-dimensional (3D) molecular descriptors, representing the numerical information encoded within the molecular structure of each chemical compound in the dataset, were calculated using PaDEL-Descriptor software [58].

Variable elimination, molecular descriptor standardization, and dataset division
Existence of irrelevant and redundant molecular descriptors in multiple regression modeling is problematic and must n = number of objects p = number of predictor variables s Standard error of estimate Symbols as above y i∕i = predicted value of the response calculated excluding the ith element from the model computation RMSE CV Root-mean-square error in CV prediction Symbols as above therefore be eliminated [59]. Molecular descriptors with constant or nearly constant values (variables with low variance) were removed from the descriptors set because these variables are considered irrelevant for model building. In this paper, a descriptor is considered to have constant or nearly constant values if its variance is less than 0.0001. Multicollinearity among molecular descriptors introduces redundancy in a QSAR model since highly correlated descriptors contribute essentially the same information in the model [60]. In this paper, two molecular descriptors are considered redundant if the correlation coefficient between them exceeds 0.90. Multicolinear and low-variance descriptors were removed from the pool of 1875 descriptors calculated in the preceding section using V-WSP algorithm [61] as implemented in V-WSP tool (version 1.2) developed by Ambure et al. [62]. Pairwise correlations in a correlation matrix and variance inflation factor (VIF) were calculated and used to examine the presence or absence of multicollinearity among variables utilized in

Coefficient of determination for the prediction set
y i = observed dependent variable of test set y i = predicted dependent variable of test set y = mean value of the dependent variable of training set Coefficient of determination for the prediction set on forcing to origin Coefficient of determination for the prediction set on interchanging the axes and forcing to origin Symbols as above k Slope of the regression line on forcing to origin Symbols as above Variance explained in external prediction Variance explained in external prediction r 2 = squared correlation value between the observed and predicted values with intercept r 2 0 = squared correlation value between the observed and predicted values without intercept r ′2 0 = squared correlation value between the observed and predicted values on interchanging the axes and without intercept

RMSE EXT
Root-mean-square error in external prediction Symbols as above (15) MAE EXT Mean absolute error in external prediction MAE EXT = |yi−ŷi| n Symbols as above (16) the final QSAR model [59]. After removing irrelevant and redundant descriptors from the descriptors set, the remaining descriptors were transformed into auto-scaled descriptors using standard normalization method [63]. Standardization of the original descriptors to auto-scaled descriptors ( X n ik ) was accomplished by subtracting the mean of the descriptors ( k ) from the original descriptor values ( X ik ) and then divided by their standard deviations ( K ). The main reason for using standardized regression coefficients in a QSAR model is that the magnitude of the standardized regression coefficient reflects the relative contribution of each descriptor to the predicted activity in the developed model [64]. Transformation of the original descriptor to auto-scaled descriptors was done using Minitab ® 18.1 [65]. Finally, the entire dataset was divided into training set (70% of the entire dataset) and test set (30% of the entire dataset) using Kennard-Stone algorithm [66][67][68] as implemented in Dataset Division GUI 1.2 [62]. This corresponds to 44 OH-PCBs in the training set and 20 OH-PCBs in the test set. In Table S1 (Supplementary Information), compounds assigned to the test set are marked with single asterisk to distinguish it from compounds assigned to the training set.

Variable selection and model building
Selection of predictor variables for inclusion in a regression model is one of the most crucial aspects of a QSAR study. In this paper, genetic function approximation (GFA) [54], as implemented in Materials Studio 7.0 [69], was used to select the most appropriate combination of molecular descriptors for inclusion in the QSAR model developed in this study. This was then followed by a comprehensive regression analysis using multiple linear regression (MLR) method [70] as implemented in Minitab ® 18.1 [65] and Materials Studio 7.0 [69]. In this model building step, efforts were made to establish a linear relationship between the dependent variable ( log1∕(EC × 10) ) and the selected independent variables (standardized molecular descriptors) of the series of chemical compounds assigned to the training set in Table S1 (Supplementary Information).

Internal and external validation of QSAR model
The goodness-of-fit of the QSAR model developed from the application of the GFA and MLR procedures described in the preceding section was evaluated using the following statistical parameters: R 2 , R 2 adj , and s in Eqs. (1), (2), and (3), respectively. The stability and robustness of the developed QSAR model were evaluated using leave-one-out cross-validation and Y-randomization techniques [71]. In the leave-one-out cross-validation technique, one chemical compound from the training set was omitted and a model was developed using the data of the remaining chemical compounds. The model developed was then used to predict the agonistic activity of the omitted compound. This procedure was repeated iteratively and values of Q 2 LOO and RMSE CV in Eqs. (4) and (5) respectively were calculated. In the Y-randomization procedure, the dependent variable of compounds in the training set was randomly shuffled 50 times while keeping the independent variables as they are. New MLR model was then developed after each random shuffling. Statistical parameters for the generated random models (R, R 2 , and Q 2 values for random models) and the Y-randomization parameter, cR 2 p in Eq. (6), were calculated. All the statistical parameters used for evaluating the fitness and robustness of the developed QSAR model were calculated using Minitab ® 18.1 [65], Materials Studio 7.0 [69], and Y-randomization tool 1.2 [62]. In addition to the statistical parameters listed in Table 1 for evaluating the goodness-of-fit and robustness of the developed model, the QSAR model constructed in this paper was also externally validated for its ability to predict the agonistic activities of untested OH-PCB congeners. This task was accom-

Evaluation of model applicability domain
Applicability domain expresses the scope and limitation of a QSAR model by defining the range of chemical structures for which the QSAR model is considered applicable [72]. In this paper, the leverage approach was used to evaluate the applicability domain of the QSAR model developed in this study. A measure of how far a chemical compound is from the applicability domain of a QSAR model is its leverage in the original variable space, h ii [64]. This measure is defined as: the descriptor row-vector of the query compound, and X is the n × p matrix of p model descriptor values for n training set compounds. The superscript T refers to the transpose of the matrix vector. The warning leverage h* is generally fixed at 3k∕n , where n is the number of training compounds and k is the number of model descriptors plus one (p + 1) [64]. To visualize the applicability domain of the developed model, a plot of standardized residuals versus leverages (Williams plot) was constructed.

Results and discussion
The QSAR model obtained on correlating the agonistic activities ( log 1∕(EC × 10) values) of 44 OH-PCB congeners assigned to the training set in Table S1 (Supplementary information) with the standardized molecular descriptors computed for these 44 OH-PCB congeners ( Table S2 in the Supplementary information) is displayed in Eq. (17). As shown in Eq. (17), a linear relationship was established between log 1∕(EC × 10) and eight structural features of OH-PCB congeners. Table 3 shows the descriptions, classes, and types of the molecular descriptors utilized in the developed QSAR model. As shown in Table 3, the QSAR model displayed in Eq. (17) contained six 2D descriptors (MATS1s, VE3_DzZ, VE1_Dzp, SpMin8_Bhv, SpMax5_Bhi, and topoRadius), belonging to four classes of descriptors-autocorrelation descriptor, Barysz matrix descriptor, Burden modified eigenvalues descriptor, and topological descriptor [73]. Table 3 also shows that the two 3D descriptors (RDF95u and RDF45m) in the QSAR model displayed in Eq. (17) belong to the same class of descriptor-radial distribution function (RDF) descriptor [73].
In Table 4, the standardized regression coefficients and the statistical significance of the predictor variables utilized in the QSAR model displayed in Eq. (17)are presented. By  examining the magnitudes and signs of the regression coefficients in the QSAR model, a broad interpretation of the model can be made [74]. While the descriptor that plays the most important role in the predictive ability of a model is identified by the magnitude of its regression coefficient in the model, the sign of this regression coefficient indicates the direction of the relationship between the descriptor and the predicted activity [74]. As shown in Table 4, the relative contributions of the eight descriptors in Eq. (17) to the predictive ability of the developed QSAR model, as can be inferred from the magnitudes of the standardized regression coefficients of the descriptors, decreased in the following order: VE1_Dzp > SpMax5_Bhi > RDF95u > RDF45m > topoRadius > SpMin8_Bhv > MATS1s > VE3_DzZ.  [75][76][77]. The p values shown in Table 4 are all less than 0.05, indicating that the relationship between log 1∕(EC × 10) and each of the eight descriptors in the QSAR model displayed in Eq. (17) was statistically significant [78]. This indicates that the strength of the association between log 1∕(EC × 10) and each of the eight molecular descriptors is strong and reliable. Variance inflation factor (VIF) and correlation matrixtwo approaches used for detecting multicollinearity among molecular descriptors in a QSAR model-are presented in Tables 4 and 5, respectively. Values of VIF reported in Table 4 (17) contains no multicollinearity [79]. In the correlation matrix shown in Table 5, the absolute values of the correlation coefficients ranged from 0.000 (between VE1_Dzp and SpMin8_Bhv) to 0.798 (between SpMin8_Bhv and SpMax5_Bhi). Although no generally agreed cut-off value is currently available in literature for defining lack of collinearity between two descriptors, a squared correlation coefficient lower than 0.80 has been suggested as a threshold value for accepting lack of collinearity between two variables [63]. Detraction of QSAR model from mechanistic interpretation and deterioration of the model's statistical parameters are the two major problems encountered when multicollinearity exists among the descriptors utilized in a QSAR model [60,80].
The values of coefficient of multiple determination ( R 2 ) and standard error of estimate (s)-two statistical metrics used for evaluating the goodness-of-fit of a QSAR modelare shown in Table 4. The values of R 2 and s reported in Table 4 were 0.8902 and 0.2840, respectively. This R 2 value indicates that 89.02% of total variation in the response variable ( log1∕(EC × 10) ) could be accounted for by its relationship with the eight predictor variables (molecular descriptors) utilized in the developed QSAR model. Value of R 2 > 0.6, as obtained in this paper, suggests that the QSAR model displayed in Eq. (17) provides a good fit to the data used in   [81]. The standard error of estimate (s) measures the dispersion of the observed values from the regression line [64]. The low value of s reported in Table 4 indicates that the observed values of the agonistic activities of OH-PCBs are close to the regression line predicted by the developed QSAR model. Another statistical parameter that is used as a measure of goodness-of-fit of a QSAR model is the adjusted R 2 ( R 2 adj ). Unlike the value of R 2 which always increases regardless of whether the addition of an extra predictor variable to a QSAR model improves the model or not, value of R 2 adj only increases when addition of an extra predictor variable improves the model, thus eliminating the possibility of overfitting [82]. The value of R 2 adj reported in Table 4 for the QSAR model displayed in Eq. (17) was 0.8651. The eight predictor variables retained in the QSAR model are considered acceptable because further addition of extra independent variable to the model caused significant reduction in the reported R 2 adj value [82]. Furthermore, since the ratio of the number of compounds in the training set to the number of descriptors in the QSAR model displayed in Eq. (17) was at least 5:1 as suggested by Topliss and Costello [83], the risk of chance correlation in the developed QSAR model was avoided.
To assess the stability and robustness of the QSAR model displayed in Eq. (17), parameters obtained from leaveone-out cross-validation and Y-randomization techniques were used. In Table 4, values of Q 2 LOO and RMSE CV -two parameters obtained from leave-one-out cross-validation technique-are presented. The QSAR model displayed in Eq. (17) is considered robust and stable because the value of Q 2 LOO reported in Table 4 ( Q 2 LOO = 0.8201) was greater than the cut-off value ( Q 2 LOO > 0.5 ) suggested in the literature [71,84]. Furthermore, a difference of less than 0.3 between R 2 value and Q 2 LOO value, as obtained in this study, indicates that the QSAR model displayed in Eq. (17) did not suffer from overfitting [84]. The low value of root-mean-square error in cross-validation ( RMSE CV = 0.3242) reported in Table 4 suggests that the accuracy of the prediction made by the developed model was good [85]. The results of Y-randomization procedure conducted to check the stability and robustness of the QSAR model developed in this paper are shown in Table 6. The R 2 and Q 2 values reported in Table 6 for each of the 50 random models generated were significantly lower than the values of R 2 and Q 2 obtained for the original QSAR model displayed in Eq. (17) ( R 2 and Q 2 values obtained for the original model were 0.8902 and 0.8201, respectively). The lower values of R 2 (average was 0.1770) and Q 2 (average was −1.3278) obtained for the random models ruled out the possibility of chance correlation in the original QSAR model displayed in Eq. (17) [86]. The value of cR 2 p reported in Table 6 was 0.8019. Value of cR 2 p higher than 0.5, as obtained in this study, indicates that the QSAR model displayed in Eq. (17) was robust and stable [86].
One of the core functions of a QSAR model is to make reliable prediction of biological activities for yet to be tested chemical compounds. Because generating an entirely new set of experimental data for the purpose of external validation is often difficult in everyday practice, the strategy suggested by QSAR experts is to divide the available dataset into two subsets-the training set and the test set [64]. In this paper, the 44 OH-PCB congeners assigned to the training set (70% of the entire dataset) were used for model building while the remaining 20 OH-PCB congeners assigned to the test set (30% of the entire dataset) were reserved for external validation of the developed model. Using the computed molecular descriptors presented in Table S3 (Supplementary information) as inputs, agonistic activities of OH-PCBs in the test set were predicted with the QSAR model displayed in Eq. (17). The values of the external validation metrics used for evaluating the predictive ability of the developed QSAR model and the threshold values used for judging the acceptability of these metrics are presented in Table 7. As shown in Table 7, the values of the external validation metrics obtained when prediction was made on chemical compounds assigned to the test set, before and after removing 5% of data with high residuals from the test set, indicate that all the validation metrics satisfied the requirements for accepting the external predictivity of the QSAR model displayed in Eq. (17) [71,84,[87][88][89][90]. Using the values of observed activities, predicted activities and residuals shown in Tables S4 and S5 (Supplementary information), a plot of residuals versus experimental values of log 1∕(EC × 10) (Fig. 2) and a plot of predicted values of log 1∕(EC × 10) versus experimental values of log 1∕(EC × 10) (Fig. 3) were constructed. The random distribution of the residuals on both sides of the horizontal zero line in Fig. 2 indicates that there was no systematic error in the QSAR model developed in this paper. Figure 3 also shows that there was a positive and strong correlation between the activities predicted by the developed QSAR model and the experimental activities obtained from literature. These results indicate that the prediction made by the QSAR model displayed in Eq. (17) was accurate and reliable. Applicability domain-a concept that expresses the scope and limitation of a QSAR model by defining the range of chemical structures for which the model is considered applicable-is another crucial aspect that must be included in a QSAR study [72]. To visualize the outliers in both the descriptor space and the response space, the Williams plot shown in Fig. 4 was constructed from the values of standardized residuals and leverages presented in Tables S4 and S5 (Supplementary information). In the Williams plot displayed in Fig. 4, a compound with a standardized residual greater than three standard deviation units is considered to be a response outlier while a compound with leverage higher than the warning leverage (h* = 0.6136) is considered to be a structurally influential chemical in the developed model [88]. As shown in Fig. 4, none of the 64 OH-PCBs in the dataset was response outlier but 2′,4′,6′-trichlorobiphenyl-4-ol and 2′,3,4′-trichlorobiphenyl-2-ol were found to be structurally influential chemicals because of their high leverage values. As structurally influential chemical in the training set, the compound 2′,3,4′-trichlorobiphenyl-2-ol greatly influenced the regression parameters by forcing the fitted regression line near its observed value [64]. On the other hand, the predicted response of a compound in the test set (compound 2′,4′,6′-trichlorobiphenyl-4-ol in this case) with a leverage value greater than the warning leverage may not be reliable because the prediction was probably a result of substantial extrapolation of the QSAR model [64].

Conclusion
Some congeners of hydroxylated polychlorinated biphenyls are known to cause endocrine disruption in humans by acting as nuclear receptor agonists or nuclear receptor antagonists. However, the high numbers of inactive OH-PCB congeners recorded in many experimental toxicity studies designed to measure the agonistic and antagonistic activities of OH-PCBs necessitate the need to develop QSAR models that can predict the activities of OH-PCBs prior to in vitro experiments. Use of QSAR models for large-scale screening of OH-PCB congeners for the purpose of prioritizing the congeners for further experimental testing offers the advantages of being rapid, inexpensive, and high-throughput. In this paper, a local QSAR model for predicting the agonistic activities of OH-PCBs to constitutive androstane receptor was constructed and validated using certain internal and external validation criteria. The developed QSAR model was found to be statistically reliable, robust, and possessed good external predictivity. The QSAR model developed in this paper can therefore be used by toxicologists to predict the agonistic activities of untested OH-PCBs to constitutive androstane receptor prior to experimental studies. Development of more QSAR models for the prediction of OH-PCB activities in toxicity studies involving other nuclear receptors is recommended.