An effective prediction of biomagnification factors for organochlorine pollutants

Biomagnification factor (BMF) is an important index of pollutants in food chains, but its experimental determination is quite tedious. In this contribution, Tchebichef moments (TMs) were calculated from images of molecular structures and used as feature descriptors of molecular information. Stepwise regression was then employed to establish a prediction model for the logBMF of organochlorine pollutants. The correlation coefficient with leave-one-out cross-validation ($R_{cv}$) was 0.9570; the correlation coefficient of prediction ($R_p$) and root mean square error ($RMSE_p$) for the external independent test set reached 0.9594 and 0.2129, respectively. Compared with traditional two-dimensional (2D) quantitative structure-property relationship (QSPR) and the reported augmented multivariate image analysis applied to QSPR (aug-MIA-QSPR), the proposed approach is simpler, more accurate and more reliable. This study not only obtained a model with better stability and predictive ability for the BMF of organochlorine pollutants, but also provided another effective approach to QSPR research.


Introduction
With the development of modern industry, environmental pollution has received more and more attention because it affects human health. Persistent organic pollutants (POPs) are one of the main contributors to environmental pollution, and include three categories: certain industrial chemicals, certain by-products and contaminants, and certain industrial processes (Hillman 1998). A great many POPs are organochlorine compounds (Rosa Vilanova 2001) that easily circulate into organisms through ecosystem cycles (both on land and in aquatic environments), which damages the physiological functions of the organisms (Jepson et al. 2016, Paul D. Jepson 2009). More importantly, because their stability makes their concentration increase faster than they degrade, organochlorine pollutants can accumulate along food chains, which magnifies toxicity in the organisms at the top of the food chains (the biomagnification phenomenon) (Birgit M. Braune 1989). In one study of the Lake Ontario ecosystem, the content of polychlorinated biphenyls (PCBs) was observed to increase with trophic level (Nliml 1988). To assess this toxicity of POPs, the biomagnification factor (BMF) was defined and is calculated by the following formula (D. Mackay 2000):
$$\mathrm{BMF} = \frac{C_{\mathrm{organism}}}{C_{\mathrm{diet}}} \qquad (1)$$

where $C_{\mathrm{organism}}$ is the concentration of the chemical in the organism and $C_{\mathrm{diet}}$ is the concentration in the organism's diet.

Although BMFs can be determined by experimental approaches (Charles J. Henny 2003,

where is the reconstructed image, nN and mM are the maximum orders of n and m (n = 0 to nN, nN < N-1; m = 0 to mM, mM < M-1). The reconstruction error can be calculated: (4)
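The moment calculation and truncated reconstruction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the orthonormal basis is obtained by QR factorization of a polynomial Vandermonde matrix on a uniform grid (which yields the orthonormal discrete Tchebichef polynomials up to sign) rather than by the usual recurrence, n is assumed to index image rows, and the error measure is assumed to be the sum of squared pixel differences because Eq. 4 is not recoverable from the text.

```python
import numpy as np
from numpy.polynomial.legendre import legvander


def tchebichef_basis(size, max_order):
    """Orthonormal discrete polynomials on a uniform grid of `size` points.

    Column k is a polynomial of degree k, orthogonal to all lower degrees
    under unit weight, i.e. the orthonormal discrete Tchebichef polynomial.
    """
    x = np.linspace(-1.0, 1.0, size)             # pixel indices rescaled to [-1, 1]
    q, r = np.linalg.qr(legvander(x, max_order))
    return q * np.sign(np.diag(r))                # make leading coefficients positive


def tchebichef_moments(image, n_max, m_max):
    """Moments T[n, m] for n = 0..n_max (rows) and m = 0..m_max (columns)."""
    pn = tchebichef_basis(image.shape[0], n_max)
    pm = tchebichef_basis(image.shape[1], m_max)
    return pn.T @ image @ pm


def reconstruct(moments, shape):
    """Rebuild an image of the given shape from a truncated moment matrix."""
    pn = tchebichef_basis(shape[0], moments.shape[0] - 1)
    pm = tchebichef_basis(shape[1], moments.shape[1] - 1)
    return pn @ moments @ pm.T


def reconstruction_error(image, n_max, m_max):
    """Assumed error measure: sum of squared pixel differences."""
    rec = reconstruct(tchebichef_moments(image, n_max, m_max), image.shape)
    return float(np.sum((image - rec) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((12, 16))                    # stand-in for a grayscale structure image
    for n_max, m_max in [(3, 3), (8, 8), (11, 15)]:
        print(n_max, m_max, reconstruction_error(img, n_max, m_max))
```

Scanning the error over increasing maximum orders, as in the loop at the end, is one way to choose nN and mM: the orders are raised until the error stops decreasing appreciably.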

Modeling and evaluation
Stepwise regression was employed to establish the linear prediction model, in which the TMs were regarded as the independent variables and logBMF as the dependent variable. The performance of the obtained model was evaluated by means of various statistical parameters such as the determination coefficient ($R_c$), the adjusted determination coefficient ($R_{adj}$), the root mean square error ($RMSE_c$), the correlation coefficient with leave-one-out (LOO) cross-validation ($R_{cv}$) and LOO where $R$ is $R_c$ of the model and $R_r^2$ is the average of $R^2$ for the randomized models.

The predictive capability of the model can be validated by an external test, and the related parameters ($k$, $k'$, $r_m^2$, $r_m'^2$ and $\Delta r_m^2$) are defined as follows (Ojha et al. 2011, Roy et al. 2013): Here, $k$ and $k'$ are the slopes for the experimental and predicted values, respectively. $Y_{obs}$ and $Y_{pred}$ are the observed and predicted values, respectively. $r_m^2$ and $r_m'^2$ are modified $r^2$ values. $r^2$ and $r_0'^2$ are the determination coefficients between the observed and predicted values for least-squares linear regression with and without intercept, and $\Delta r_m^2$ is the absolute value of the difference between $r_m^2$ and $r_m'^2$.

Meanwhile, it is necessary to discuss the applicability domain (AD) of the established model
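The defining equations for these parameters are missing from this text, so the sketch below uses the forms commonly cited in the QSPR validation literature; treat them as assumptions rather than the authors' exact definitions. The function names and dictionary keys are hypothetical, and the absolute value inside the square root is a guard added here against small negative differences.

```python
import numpy as np


def c_rp2(r_model, r2_random_mean):
    """Randomization parameter, assumed form: cRp^2 = R * sqrt(R^2 - Rr^2)."""
    return r_model * np.sqrt(max(r_model ** 2 - r2_random_mean, 0.0))


def external_metrics(y_obs, y_pred):
    """External-test parameters k, k', r^2, r0^2, r0'^2, rm^2, rm'^2, delta rm^2."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Slopes of the regression lines forced through the origin.
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)
    k_prime = np.sum(y_obs * y_pred) / np.sum(y_obs ** 2)

    # Squared correlation for the ordinary fit (with intercept).
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2

    # Determination coefficients for the fits without intercept.
    r02 = 1.0 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    r02_prime = 1.0 - np.sum((y_pred - k_prime * y_obs) ** 2) / np.sum(
        (y_pred - y_pred.mean()) ** 2
    )

    # Modified r^2 values and their absolute difference.
    rm2 = r2 * (1.0 - np.sqrt(abs(r2 - r02)))
    rm2_prime = r2 * (1.0 - np.sqrt(abs(r2 - r02_prime)))
    return {
        "k": k, "k_prime": k_prime, "r2": r2, "r02": r02, "r02_prime": r02_prime,
        "rm2": rm2, "rm2_prime": rm2_prime, "delta_rm2": abs(rm2 - rm2_prime),
    }


if __name__ == "__main__":
    obs = [0.5, 1.2, 2.0, 2.9, 3.4]               # illustrative values, not data from the paper
    pred = [0.6, 1.1, 2.05, 2.85, 3.5]
    for name, value in external_metrics(obs, pred).items():
        print(f"{name}: {value:.4f}")
```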

Comparison with 2D QSPR as well as the reported method
The traditional 2D QSPR method was also applied to the same data set. The molecular descriptors of the 40 samples were calculated by CODESSA (v2.63) after optimization in HyperChem (v7.5), and a total of 337 common descriptors were obtained (Supporting information, Table S1). means that molecular structures do not need to be precisely aligned in their images.

Model and evaluation
After the TMs were calculated directly from the grayscale images of the molecular structures, the maximum orders were determined as nN=28 and mM=43 according to the change in reconstruction error (Eq. 4). A linear quantitative model was then established by stepwise regression on the training set, with the TMs as the independent variables and logBMF as the response variable. The values of the TMs in the following model are listed in Table S2 Fig. 1)

To investigate the robustness and reliability of the TM model, further evaluation was carried out. For the randomization test, the parameter $^cR_p^2$ is 0.5988 (above its threshold value of 0.5), indicating that the model is not the result of chance correlation. For the external test, the obtained parameters (listed in Table 2) All the above results and discussion indicate that the proposed method is reliable and reasonable, and that the established model possesses high robustness and prediction ability. The calculated values of logBMF are also listed in Table 1. The obtained statistical parameters (listed in Table 2) and the Williams plot (shown in Fig. 2B) illustrate that the established 2D-QSPR model was robust and reliable. The comparison of statistical parameters in Table 2 indicates that the TM model is slightly better than the 2D-QSPR model owing to its higher prediction ability, which suggests the feasibility of the proposed approach. Table 3. It can be seen that the predicted results from the proposed model are significantly better than those of the aug-MIA-QSPRcolor model, which demonstrates that the proposed model possesses stronger predictive ability and reliability. (Table 3
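As a rough illustration of building a linear model from many candidate moments and checking it by leave-one-out cross-validation, the sketch below uses forward selection driven by the LOO determination coefficient. This is a simplified stand-in: classical stepwise regression adds and removes terms by F-tests, and the text does not state which criteria the authors used. The function names, the `max_terms` and `min_gain` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def loo_predictions(X, y):
    """Leave-one-out predictions of an OLS model with intercept.

    Uses the hat-matrix identity e_loo = e / (1 - h_ii), so the model
    does not need to be refitted once per left-out sample.
    """
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    resid = y - H @ y
    return y - resid / (1.0 - np.diag(H))


def loo_q2(X, y):
    """LOO cross-validated determination coefficient."""
    y_loo = loo_predictions(X, y)
    return 1.0 - np.sum((y - y_loo) ** 2) / np.sum((y - y.mean()) ** 2)


def forward_stepwise(X, y, max_terms=5, min_gain=1e-3):
    """Greedy forward selection: add the descriptor that most improves LOO Q^2."""
    selected, best_q2 = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_terms:
        q2, j = max((loo_q2(X[:, selected + [j]], y), j) for j in remaining)
        if q2 - best_q2 < min_gain:               # stop when the gain is negligible
            break
        selected.append(j)
        remaining.remove(j)
        best_q2 = q2
    return selected, best_q2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 12))                 # synthetic descriptors, not the paper's TMs
    y = 2.0 * X[:, 3] - 1.0 * X[:, 7] + 0.05 * rng.normal(size=30)
    print(forward_stepwise(X, y))
```

Selecting on a cross-validated score rather than the training fit helps when the number of candidate moments is large relative to the number of compounds, since the training fit alone always improves as terms are added.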

Conclusion
In this study, the TM method was used to extract feature information from molecular structure images and to establish a linear quantitative model for predicting the logBMF of organochlorine pollutants. The results of a comprehensive evaluation indicate that the established model has satisfactory robustness and predictive ability. As an effective way of extracting feature information, the TM method could be applied to a wide range of QSPR research.

Acknowledgement
This research did not receive any specific grant from funding agencies.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Compliance with ethical standards
Ethics approval and consent to participate: Not applicable.