A Novel Approach to the Prediction of Biomagnification Factors Based on Molecular Structure Images

Although the biomagnification factor (BMF) is an important index of pollutants in food chains, its experimental determination is quite tedious. In this contribution, Tchebichef moments (TMs) were calculated directly from molecular structure images as the feature information, and stepwise regression was then employed to establish a prediction model for logBMF. The proposed approach was applied to the logBMF prediction of organochlorine pollutants: the correlation coefficient with leave-one-out cross-validation ($R_{cv}$) of the obtained model was 0.96, and the root mean square error ($RMSE_p$) for the external independent test set was 0.21. Compared with the traditional two-dimensional (2D) quantitative structure-property relationship (QSPR) method as well as a previously reported method, the proposed approach was simpler, more accurate and more reliable. This study not only obtained a satisfactory prediction model for organochlorine pollutants, but also provided another effective approach to QSPR research.


The data set was taken from the literature (da Mota et al. 2017) and consisted of 30 polychlorinated biphenyl (PCB) congeners and 10 organochlorine pesticides (DDT, DDE, HCB, TCDF, OCDF, TCDD, H6CDD, H7CDD, OCDD and DDD). Their logBMF values are listed in the Exp. column of Table 1. The 40 samples were randomly divided into a training set (30 samples) and a test set (10 samples). The training set was used to establish the prediction model, and the test set served as an external independent sample set to evaluate the prediction capability of the obtained model.

Thus the image can be reconstructed from $T_{n,m}$, where $\hat{f}(x, y)$ is the reconstructed image, and $nN$ and $mM$ are the maximum orders of $n$ and $m$ ($n = 0$ to $nN$, $nN < N-1$; $m = 0$ to $mM$, $mM < M-1$). The reconstruction error $\varepsilon$ can then be calculated (Eq. 4).
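As an illustration of the moment calculation and reconstruction described above, the following Python sketch computes orthonormal discrete Tchebichef moments of a grayscale image, reconstructs the image from moments up to chosen maximum orders, and reports a reconstruction error. It is a minimal sketch under stated assumptions, not the authors' implementation. The polynomial basis is obtained by QR orthogonalisation of a Vandermonde matrix, which yields the orthonormal discrete polynomials on a uniform grid but is numerically reliable only for small images; larger images would normally use a three-term recurrence. The mean squared difference is used as a stand-in for the reconstruction error, and all function names are hypothetical.

```python
import numpy as np


def tchebichef_basis(size, max_order):
    """Rows are orthonormal discrete polynomials of degree 0..max_order on `size` points."""
    if max_order > size - 1:
        raise ValueError("max_order must not exceed size - 1")
    # Assumption: QR of a Vandermonde matrix replaces the usual recurrence.
    x = np.linspace(-1.0, 1.0, size)
    vander = np.vander(x, max_order + 1, increasing=True)
    q, r = np.linalg.qr(vander)
    # Fix the sign so that every polynomial has a positive leading coefficient.
    q = q * np.sign(np.diag(r))
    return q.T


def tchebichef_moments(image, n_max, m_max):
    """Moment matrix T[n, m] for n = 0..n_max and m = 0..m_max."""
    pn = tchebichef_basis(image.shape[0], n_max)
    pm = tchebichef_basis(image.shape[1], m_max)
    return pn @ image @ pm.T


def reconstruct(moments, shape):
    """Rebuild an image of the given shape from a (possibly truncated) moment matrix."""
    pn = tchebichef_basis(shape[0], moments.shape[0] - 1)
    pm = tchebichef_basis(shape[1], moments.shape[1] - 1)
    return pn.T @ moments @ pm


def reconstruction_error(image, rebuilt):
    """Mean squared difference (an assumed error measure, not the paper's Eq. 4)."""
    return float(np.mean((image - rebuilt) ** 2))


# Toy 6 x 8 "image"; a real molecular structure image would be loaded instead.
rng = np.random.default_rng(0)
image = rng.random((6, 8))
rebuilt_low = reconstruct(tchebichef_moments(image, 2, 2), image.shape)
rebuilt_full = reconstruct(tchebichef_moments(image, 5, 7), image.shape)
err_low = reconstruction_error(image, rebuilt_low)
err_full = reconstruction_error(image, rebuilt_full)
```

Scanning the error over increasing orders and stopping once it levels off is one plausible way to choose the maximum orders.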

Modeling and evaluation
Stepwise regression was employed to establish the linear prediction model, in which the TMs were the independent variables and logBMF was the dependent variable. The performance of the obtained model was evaluated by means of several statistical parameters: the determination coefficient ($R_c$), the adjusted determination coefficient ($R_{adj}$), the root mean square error ($RMSE_c$), the correlation coefficient with leave-one-out (LOO) cross-validation ($R_{cv}$) and the LOO root mean square error ($RMSE_{cv}$) for the training set; the F-test for the model and t-tests for the regression coefficients; and the correlation coefficient ($R_p$) and root mean square error ($RMSE_p$) for the test set (Gadaleta et al. 2016).

To further inspect the robustness of the model, a randomization test was performed, in which models are established with an unchanged X-matrix and a randomized Y-matrix (Mitra et al. 2010). To assess the reliability of the model, $^{c}R_p^2$ was calculated with the corrected formula (Todeschini 2010):

$$^{c}R_p^2 = R \sqrt{R^2 - R_r^2}$$

where $R$ is $R_c$ of the model and $R_r^2$ is the average $R^2$ of the randomized models.

The predictive capability of the model can be validated by an external test, and the related parameters …

… is a significant advantage in QSPR studies based on molecular structure images, which could guarantee the stability of the obtained feature information.
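The variable selection and leave-one-out evaluation described in this section can be sketched as follows. This is a simplified stand-in rather than the procedure actually used: classical stepwise regression enters and removes variables according to F-tests, whereas this sketch performs greedy forward selection that keeps a variable only if it lowers the LOO root mean square error. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def loo_rmse(X, y):
    """Leave-one-out RMSE of an ordinary least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors.append(y[i] - A[i] @ coef)
    return float(np.sqrt(np.mean(np.square(errors))))


def forward_select(X, y, max_terms=5):
    """Greedy forward selection driven by LOO RMSE (assumed selection criterion)."""
    selected = []
    best = loo_rmse(np.empty((len(y), 0)), y)  # intercept-only baseline
    while len(selected) < max_terms:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = {j: loo_rmse(X[:, selected + [j]], y) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining variable improves the cross-validated error
        selected.append(j_best)
        best = scores[j_best]
    return selected, best


# Synthetic stand-in for a moment matrix: 30 samples, 8 candidate variables,
# of which only columns 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.05 * rng.normal(size=30)
selected, rmse_cv = forward_select(X, y)
```

With real Tchebichef moments the candidate matrix has far more columns than samples, so capping the number of terms (here `max_terms`) matters for avoiding overfitting.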

Model and evaluation
After the TMs were calculated directly from the grayscale images of the molecular structures, the maximum orders were determined as $nN = 28$ and $mM = 43$ according to the change in reconstruction error (Eq. 4). A linear quantitative model was then established by stepwise regression on the training set, with the TMs as the independent variables and logBMF as the response variable. The values of the TMs in the following model are listed in Table S1.

To investigate the robustness and reliability of the TM model, further evaluation was carried out. For the randomization test, $^{c}R_p^2$ is 0.5988 (above its threshold value of 0.5), indicating that the model is not the result of chance correlation. For the external test, the obtained parameters (listed in Table 2) also conform to the requirements (0.85 ≤ $k$ ≤ 1.15; 0.85 ≤ $k'$ ≤ 1.15; $r_m^2$ ≥ 0.5; $r_m'^2$ ≥ 0.5; …

… sample 8 may differ from the others, so that it is not well modeled by the adopted variables. Another possible reason is that the sample belongs to a different type of chemical. Sample 1 has the same structure as sample 8, so the model may not predict its logBMF well either.

All of the above results and discussion indicate that the proposed method is reliable and reasonable, and that the established model possesses high robustness and prediction ability.
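The randomization test reported in this section can be sketched in Python as follows. The sketch assumes the commonly cited corrected form $^{c}R_p^2 = R\sqrt{R^2 - R_r^2}$, with $R_r^2$ averaged over models refitted to permuted responses. The number of permutation rounds, the function names and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np


def r_squared(X, y):
    """Determination coefficient of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return float(1.0 - np.sum(residual ** 2) / np.sum((y - np.mean(y)) ** 2))


def c_rp2(X, y, n_rounds=100, seed=0):
    """Y-randomization: keep X fixed, shuffle y, and compare R^2 values."""
    rng = np.random.default_rng(seed)
    r2 = r_squared(X, y)
    r2_random = float(np.mean([r_squared(X, rng.permutation(y))
                               for _ in range(n_rounds)]))
    # Assumed formula: cRp2 = R * sqrt(R^2 - Rr^2); clipped at zero for safety.
    value = np.sqrt(r2) * np.sqrt(max(r2 - r2_random, 0.0))
    return float(value), r2, r2_random


# Synthetic example: a response that depends on X versus pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y_signal = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
y_noise = rng.normal(size=40)
score_signal, _, _ = c_rp2(X, y_signal)
score_noise, _, _ = c_rp2(X, y_noise)
```

A model with genuine structure should score well above the 0.5 threshold quoted above, whereas a response unrelated to the descriptors should fall well below it.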

Comparison with 2D-QSPR method
Based on the obtained 337 common molecular descriptors (Supporting information,

Principal moment of inertia A.
The calculated values of logBMF are also listed in Table 1. The obtained statistical parameters (listed in Table 2) and the Williams plot (shown in Fig. 2B) illustrate that the established 2D-QSPR model was robust and reliable. The comparison of the statistical parameters in Table 2 indicates that the TM model was slightly better than the 2D-QSPR model owing to its higher prediction ability, which supports the feasibility of the proposed approach.
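For readers unfamiliar with the Williams plot mentioned above, the sketch below computes its two coordinates for a linear model: the leverage of each sample and its standardized residual. It rests on assumptions that the source does not state: the warning leverage is taken as the conventional $h^* = 3(p+1)/n$ for $p$ descriptors and $n$ samples, residuals are scaled by the residual standard deviation, and the function name and data are hypothetical.

```python
import numpy as np


def williams_points(X, y):
    """Leverages, standardized residuals and warning leverage for an OLS model."""
    A = np.column_stack([np.ones(len(y)), X])       # design matrix with intercept
    hat = A @ np.linalg.solve(A.T @ A, A.T)         # hat (projection) matrix
    leverage = np.diag(hat)
    residuals = y - hat @ y
    dof = len(y) - A.shape[1]                       # n - (p + 1)
    std_residuals = residuals / np.sqrt(np.sum(residuals ** 2) / dof)
    h_star = 3.0 * A.shape[1] / len(y)              # assumed convention 3(p+1)/n
    return leverage, std_residuals, h_star


# Synthetic example with 20 samples and 2 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=20)
leverage, std_residuals, h_star = williams_points(X, y)
```

Samples with leverage above $h^*$ or standardized residuals beyond about ±3 are the ones usually inspected as structural or response outliers.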

Comparison with the method in reference
Compared with the best results obtained by the aug-MIA-QSPR color method in the literature ( … Table 3. It can be seen that the results predicted by the proposed model are significantly better than those of the aug-MIA-QSPR color model, which demonstrates that the proposed model possesses stronger predictive ability and reliability.

Conclusion
The extraction and selection of features are the most important factors in QSAR/QSPR research. In this study, the TM method was used to extract feature information from molecular structure images, and stepwise regression was used to select the effective feature variables and establish a linear quantitative model for predicting the logBMF of organochlorine pollutants. The results of the comprehensive evaluation and comparison indicate that the established model has satisfactory robustness and predictive ability, although it cannot provide an explicit physicochemical meaning for the variables in the model. This study shows that, as an effective way of extracting feature information, the TM method is well suited to QSPR/QSAR research based on molecular structure images.

Supplementary Files
Supplementary file associated with this preprint: SupplementaryInformation.docx