Direct prediction of bioaccumulation of organic contaminants in plant roots from soils with machine learning models based on molecular structures

Root concentration factor (RCF) is an important characterization parameter to describe accumulation of organic contaminants in plants from soils in life cycle impact assessment (LCIA) and phytoremediation potential assessment. However, building robust predictive models remains challenging due to the complex interactions among chemical-soil-plant root systems. Here we developed end-to-end machine learning models to devolve the complex molecular structure relationship with RCF by training on a unified RCF data set with 341 data points covering 72 chemicals. We demonstrate the efficacy of the proposed gradient boosting regression tree (GBRT) model based on the extended connectivity fingerprints (ECFP) by predicting RCF values and achieved prediction performance with R-squared of 0.77 and mean absolute error (MAE) of 0.22 using 5-fold cross validation. In addition, our results reveal nonlinear relationships among properties of chemical, soil, and plant. Further in-depth analyses identify the key chemical topological substructures (e.g., -O, -Cl, aromatic rings and large conjugated π systems) related to RCF. Stemming from its simplicity and universality, the GBRT-ECFP model provides a valuable tool for LCIA and other environmental assessments to better characterize chemical risks to human health and ecosystems.

tools to evaluate the transfer of contaminants from soil to plants based on molecular signature of chemicals 5,6 . These tools could be utilized in life cycle impact assessments (LCIA) for organic chemicals within their life cycle assessment (LCA) framework 7 .
As an essential substance-specific characterization parameter in LCIA, the accumulation of environmental contaminants in plant roots is routinely evaluated in many experimental and modeling studies by root concentration factor ( ). RCF is defined as the ratio of contaminant concentration in roots to that in soil at a steady state or equilibrium state among compartments including soil solids, soil solution (interstitial water) and the plant roots 5,8,9 . is controlled by chemical, plant and soil properties, as a result of sorption-desorption processes at the interface of water-soil components, including soil organic matter (OM) and clay minerals, and partitioning processes at the water-plant root system that is influenced by lipid and carbohydrate contents.
Compared to based on water concentration that can be well predicted by empirical models, directly based on soil concentration is a more relevant and measurable characterization factor in LCIA. However, it is difficult to accurately estimate the value due to the complexity of multiple interactions for contaminants at the interfaces of soils, water, plant roots and microorganisms (i.e., rhizosphere). Models for the direct accurate prediction of are lacking, as most available empirical models have reported relative uncertainties for predictions around one order of magnitude 10 . Furthermore, the relationships between the chemical structures of contaminants and their uptake mechanisms from soil is not well understood.
Machine learning and deep learning algorithms have been widely used in image recognition, natural language processing, and with chemistry applications including reaction prediction and molecular property prediction 11,12 . Recently, as big data-based assessment and decision-making tools, machine learning models were successfully applied to predict some characterization parameters of LCIA, such as chemical USEtox HC50 values [13][14][15] . However, applying these algorithms to other LCIA characterization parameters, such as in the plant-soil system, remain challenging. This not only requires computational models to be able to learn complex relationships among multiple variables but also depends on the selection of well representative variables involved. Similar to the traditional empirical models, e.g., compartment models and quantitative structure-activity relationship (QSAR) models, common representations could be physicochemical properties [16][17][18] . For example, adsorption predictions of organic molecules on carbonaceous materials and polymers have used molecular physicochemical properties such as Abraham descriptors to represent the molecules 16,18 . However, these physicochemical properties are usually selected based on the assumptions of their importance in sorption processes and do not adequately account for molecular structure. Therefore, a more objective approach will be encoded with fundamental molecular structure information, rather than the general physicochemical properties.
Molecular fingerprint is a way of encoding molecular structures with a series of binary bits to describe the presence or absence of a certain chemical structure. One of the most commonly used fingerprint is Extended Connectivity Fingerprints (ECFP) 19  In this study, the uptake of organic chemicals by plants from soils was predicted by machine learning methods based on the collection of reliable data from literatures. ECFP were used in conjunction with gradient boosting decision tree (GBRT) methods for the development of unbiased predictive models. The workflow from data collection (Fig. 1a), construction of machine learning models ( Fig. 1b-   An overall comparison of molecule structure similarity using ECFP is shown in Fig. 2c.
Dice similarity, a feature-based similarity coefficient, was calculated for each pair of molecules.
The structure similarity reflected by dice similarity scores ranged from 0.0 to 1.0 (More details can be found in SI Table S2). Chemicals with similar molecular structures are supposed to have a higher score than structurally dissimilar chemicals.

Model interpretation by substructure analysis
Interpretability of machine learning models is one of the key challenges for their wide applications in environmental scenarios. ECFP accounts for the presence of particular substructures in the chemicals, and these substructures contribute differentially to the model.  A new plant uptake dataset that consists of 341 data points covering 72 chemicals was collected, together with their molecular structure specifications in the form of SMILES strings and related physio-chemical information such as and ) (SI Table S3).

Molecular, plant and soil property representation
The predictive power of machine learning models relies not only on the algorithms but also on the quality of input data. The inputs to our machine learning models represented plant, Another consideration of selecting physicochemical properties was that we tried to avoid those properties require expensive computation from molecule electronic structures, which may limit the model's applications in practice. For comparison, an empirical predictive LR model based on , , and was constructed to predict , as shown in Equation (1).

Molecular representation with ECFP fingerprint
For molecular fingerprints, ECFP based on the Morgan Algorithm implemented in Python Rdkit package was employed 19 . ECFP is a circular fingerprint which represents the presence of particular substructures in a molecule. It was generated iteratively, starting from assigning an integer identifier for each atom. These initial atom identifiers formed the initial fingerprint set for the molecule. In the subsequent step, each atom's own identifier together with the identifiers of its immediate neighboring atoms were fed into a hash function to generate a new, single-integer identifier. These new identifiers were then added into the fingerprint set. This iteration is repeated for a prespecified number of times (2 in this study) and all duplicate identifiers in the set are removed. The remaining integer identifiers in the fingerprint set define the ECFP fingerprint. In this study, ECFP fingerprints with 1024 bits were used to avoid bit collision.

Gradient boosting regression tree
Gradient Boosting Regression Tree (GBRT) is a popular machine learning algorithm that has been widely used in many predictive models [47][48][49] . Gradient boosting is an additive model which minimizes the loss function by adding weak learners. These weak learners are usually decision trees. The basic idea is that by adding a new tree into the model each time, the output of the existing sequence of trees will help improve the prediction, in this case , and decrease loss as shown in Fig. 1c. Our GBRT model utilized a mean squared error loss function defined as

Model validation
Hyperparameters including max depth and number of estimators were carefully tuned through 5-fold cross validation as shown in Fig. 1d , where ̂ is the predicted value of -th sample, is the corresponding true value, and ̅ = 1 ∑ =1 .

Data availability
The authors declare that all data supporting the findings of this study are available within the paper and its supplementary information files.      Statistic analysis of the RCF_soil dataset. The dataset showed a large variation of chemicals with molecular weights ranging from 128 to 959 and logRCF_soil ranging from -2.9 to 1.0.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. DirectpredictionofRCFsoilMLSItableS2heatmapdataframe.pdf DirectpredictionofRCFsoilMLSI.docx NATSUSTAIN21028965SI.pdf