A Deep-Learning Approach for Identifying Prospective Chemical Hazards

doi:10.21203/rs.3.rs-3121421/v1

This work is licensed under a CC BY 4.0 License

Version 1

With the aim of helping to set safe exposure limits for the general population, various techniques have been implemented to conduct risk assessments for chemicals and other environmental stressors; however, none of these tools facilitate the identification of completely new chemicals that are likely hazardous and elicit an adverse biological effect. Here, we detail a novel in silico, deep-learning framework that is designed to systematically generate structures for new chemical compounds that are predicted to be chemical hazards. To assess the utility of the framework, we applied the tool to four endpoints related to environmental toxicants and their impacts on human and animal health: (i) toxicity to honeybees, (ii) immunotoxicity, (iii) endocrine disruption via ER-α antagonism, and (iv) mutagenicity. In addition, we characterized the predicted potency of these compounds and examined their structural relationship to existing chemicals of concern. As part of the array of emerging new approach methodologies (NAMs), we anticipate that such a framework will be a significant asset to risk assessors and other environmental scientists when planning and forecasting. Though not in the scope of the present study, we expect that the methodology detailed here could also be useful in the de novo design of more environmentally-friendly industrial chemicals.

Risk assessment is an essential element in protecting public health by quantifying the probability of adverse effects following exposures to chemicals and other stressors. Aims of this process include helping to set safe exposure limits for the general population and those in occupational settings and establishing clean-up limits for contaminated sites. These assessments are conducted by both government agencies [1] and industry [2], often in response to regulations.

Though there are several different guidelines available for conducting a chemical risk assessment [3–5], virtually all contain the component of hazard identification, which is the process of determining whether exposure to an agent can cause an increase in the incidence of specific adverse health effects and whether these effects are likely to occur in humans. Hazard identifications are generally performed on existing chemicals in production, often with the idea of prioritizing agents for safety evaluation [6, 7].

Over the years, risk assessment scientists have developed a spectrum of powerful techniques to aid in conducting hazard identifications, including in vitro systems, in silico physiologically-based models, computational tools for predicting toxicity [8], and a variety of ’omics technologies [9]. More recently, chemical hazard assessments have employed new approach methodologies (NAMs) [10], which include tools like adverse outcome pathways [11], quantitative structure activity relationships (QSARs) [12, 13], in vitro to in vivo extrapolation (IVIVE) [14, 15], and read-across [16].

In the field of pharmaceutical sciences, aside from QSARs, another important approach has recently been developed and applied: de novo molecular design [17–21]. This methodology uses computational methods to generate novel molecules subject to various property and biological-activity constraints, and implements algorithms based on graph theory [22], recurrent neural networks [21, 23], variational autoencoders [17], and a specialized form of the latter, grammar variational autoencoders [18].

Unfortunately, to the best of our knowledge, these de novo molecular design techniques have not been applied in the environmental sciences. In this domain, the approach could have numerous applications, including forecasting chemicals that may be of concern in the future, designing greener industrial chemicals, and developing compounds having varied physicochemical properties for use in studies of environmental fate and transport.

In this paper, we describe a novel in silico, deep-learning-based framework, IProCH (Identification of Prospective Chemical Hazards), designed to address the first application above by systematically generating new chemical compounds that are predicted to be chemical hazards.

To demonstrate the application of IProCH, we selected four diverse case studies involving environmental toxicants based on adequacy of data to develop quantitative models, good diversity of chemical structures across the cases, availability of existing toxicity models for comparison against framework components, and at least one example relevant to wildlife health. The specific case studies and rationale for each were as follows:

Toxicity to honeybees: The western honeybee (Apis mellifera L.) ranks as the most frequent single species of pollinator for crops worldwide [24] and the most frequent floral visitor in natural habitats worldwide [25]. Unfortunately, honeybee colonies are in decline worldwide [26]. Although the reasons are not fully known and remain the subject of debate, exposure to pesticides is one of the suspected causes [27, 28]. A variety of chemicals have been shown to have adverse effects on honeybee health [29]; compounds of particular concern are the pyrethroids (e.g., cypermethrin [30] and \(\beta\)-cyfluthrin [31]) and the neonicotinoids (e.g., imidacloprid [32]).

Immunotoxicity: Exposure to immunotoxicants has been linked to several infectious diseases, autoimmune disorders, and some types of cancers [33]. Accordingly, diverse classes of chemicals have become of concern owing to their immunotoxic potential [34, 35]. These include chemicals that have garnered significant public attention, like bisphenol A (BPA) [36] and the per- and poly-fluorinated alkylated substances (PFAS) [37, 38].

Endocrine disruption: Endocrine disrupting chemicals (EDCs) interfere with the action of hormones and disrupt homeostasis and have been linked to developmental, reproductive, neurological, cardiovascular, metabolic, and immune effects in humans [39, 40]. Of significant concern are EDCs that exert their effects by degrading, agonizing, or antagonizing endocrine receptors ER-\(\alpha\) and ER-\(\beta\), which may lead to sexual and reproductive abnormalities in exposed organisms. In this study, we therefore focused on endocrine disrupting compounds that act via ER-\(\alpha\) antagonism.

Mutagenicity: In evaluating the carcinogenic potential of chemicals, national and international regulatory agencies use a wealth of genotoxicity information [41, 42], including studies related to mutagenicity. Under European Chemicals Agency (ECHA) guidelines, mutagenicity alone is an important aspect of hazard identification [43]. Additionally, information on mutagenicity is important in characterizing the risk of other adverse effects, such as mutation of germ cells or genotoxicity that occurs in somatic cells during embryogenesis and fetal development [44].

2.1 Information flow and components

Here, we outline the information flow through IProCH and describe its major components. A schematic of the framework is shown in Fig. 1.

Information format

Because of their ubiquitous use in cheminformatics, we used SMILES (Simplified Molecular-Input Line-Entry System) [45] as the representation for molecular structures within the framework.

Data

We divided each full data set into training (60%), validation (20%), and testing (20%) subsets. The training and validation subsets were used to develop and optimize the model performance by fine-tuning model parameters. A fraction of the training data (20–30 molecules) was randomly selected to train the structure generator. The testing subset was used to critically assess the predictive capabilities and generalizability of the activity scoring functions.
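The partitioning described above can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function names, the random seed, the rounding rule, and the placeholder records are assumptions made for the example, and the text does not say whether the actual splits were stratified by class.

```python
import random


def split_dataset(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle records and cut them into training, validation, and testing subsets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_val = int(round(fractions[1] * len(shuffled)))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the testing subset
    return train, val, test


def sample_generator_inputs(train, n_molecules=25, seed=0):
    """Randomly pick a small subset of the training data for the structure generator."""
    return random.Random(seed).sample(train, min(n_molecules, len(train)))


# Placeholder identifiers stand in for SMILES strings (hypothetical data).
records = [f"mol_{i}" for i in range(236)]
train, val, test = split_dataset(records)
generator_inputs = sample_generator_inputs(train)
```

Fixing the seed makes the partition reproducible, which matters when the same testing subset is reused to compare alternative scoring functions.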

Structure generator

The structure generator was first trained on the input molecules, with training guided by the activated scoring functions. The generation process was then initiated, and the design of new structures was again constrained or guided by these functions.

Scoring functions

A scoring function (SCF) is a means of directing or constraining the nature of the molecules produced by the structure generator. The function takes a molecule’s SMILES as input and outputs a numerical value roughly indicating the extent to which future generated structures should reflect the current one. Such functions can be written for almost any quantifiable structure-derived attribute, including the presence of a particular structural feature (e.g., an aromatic group), a physicochemical property (e.g., solubility), or the biological activity of the molecule with respect to some endpoint (e.g., the LD₅₀ associated with cytotoxicity). Scoring functions for pharmaceutical applications might comprise those aimed at optimizing solubility, logP, binding affinity, and structural diversity, while those for the present application focus on three areas:

Activity (\({\text{SCF}}_{\text{activity}}\)): We independently developed activity scoring functions as machine learning classification models for three of the endpoints: toxicity to honeybees, immunotoxicity, and endocrine disruption via ER-\(\alpha\) antagonism. During development, we implemented techniques to help ensure that all models accurately reproduced the testing data, showed minimal bias, and did not suffer from overfitting. These techniques included iterative refinement, cross-validation, and oversampling using synthetic data. For the fourth endpoint, mutagenicity, we used the well-validated QSAR component of the VEGAHUB platform [46, 47] to illustrate the flexibility of the framework. Details about the endpoints and data to support the models are given in Section 2.2.

Structure (\({\text{SCF}}_{\text{structure}}\)): The structure score was based on the presence or absence of a specified chemical substructure within the molecule. Depending on the objective, substructures can be chosen to represent anything from a single atom to a functional group to a core structure or scaffold. As a demonstration for the present study, we designed \({\text{SCF}}_{\text{structure}}\) to target scaffolds found in the molecules comprising the underlying data.

Synthesizability (\({\text{SCF}}_{\text{synth}}\)): To generate a synthesizability score, we utilized the Retro* package [48], which is designed to find a retrosynthesis path to build a molecule from a known list of compounds and synthesis steps [49].

Filter: In the filtering step, structures can be accepted or rejected based on user-defined criteria. For the present work, the relevant criterion was novelty, where novel molecules were defined as those absent from all of the following large chemical databases (approximate numbers of unique chemicals listed in each database are given in parentheses): PubChem [50] (112 million), ZINC [51] (230 million), ChEMBL [52] (2.3 million), and NCI Open Database [53] (300,000).
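The novelty criterion in the filtering step amounts to a set-membership test, sketched below under stated assumptions. The in-memory reference index and the function names are hypothetical; the databases named above are far too large to hold this way and would be queried through their own services or local indexes. The `canonicalize` argument is a placeholder for a real SMILES canonicalization routine (for example, one from a cheminformatics toolkit such as RDKit), because the same molecule can be written as many different SMILES strings.

```python
def build_reference_index(database_smiles, canonicalize=str.strip):
    """Collect canonical forms of all known structures into a set."""
    return {canonicalize(smiles) for smiles in database_smiles}


def filter_novel(generated_smiles, reference_index, canonicalize=str.strip):
    """Keep generated structures that are absent from the reference index.

    Repeated structures within the generated batch are also dropped.
    """
    novel = []
    seen = set()
    for smiles in generated_smiles:
        key = canonicalize(smiles)
        if key in reference_index or key in seen:
            continue
        seen.add(key)
        novel.append(smiles)
    return novel


# Toy data (hypothetical): two "known" structures and four generated ones.
known = build_reference_index(["c1ccccc1", "CCO"])
candidates = ["CCO", "CCN", "CCN", "CC(=O)O"]
novel_candidates = filter_novel(candidates, known)
```

With the placeholder canonicalizer, matching is by exact string only, so a real implementation would need proper canonicalization before any claim of novelty.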

2.2 Data and modeling approaches

2.2.1 Toxicity to honeybees

To develop and assess the classification model related to toxicity to honeybees, we used the data from Spruill and coworkers [54], who collected data from four publicly-accessible databases. These data comprised information about each compound’s oral LD₅₀, expressed as µg active ingredient (a.i.) per bee.

Adopting the designations from the above investigators, and consistent with ideas from the US Environmental Protection Agency [55], we divided the chemicals into three groups: highly toxic (LD₅₀ \(<\) 2 µg a.i./bee), moderately toxic (LD₅₀ between 2 and 11 µg a.i./bee), and practically nontoxic (LD₅₀ \(>\) 11 µg a.i./bee).

After eliminating duplicates, the final data set consisted of 236 compounds, and using the above classification system, 37 were categorized as highly toxic, 13 as moderately toxic, and 186 as practically nontoxic.
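The three-class binning of oral LD₅₀ values described above can be written as a small helper. This is a sketch only: the function name is hypothetical, and assigning the boundary values (exactly 2 and exactly 11 µg a.i./bee) to the moderately toxic class is an assumption, since the text does not state how boundary values were handled.

```python
def honeybee_toxicity_class(ld50_ug_per_bee):
    """Map an oral LD50 (ug active ingredient per bee) to a toxicity class."""
    if ld50_ug_per_bee <= 0:
        raise ValueError("LD50 must be positive")
    if ld50_ug_per_bee < 2.0:
        return "highly toxic"
    if ld50_ug_per_bee <= 11.0:  # assumption: both boundary values count as moderate
        return "moderately toxic"
    return "practically nontoxic"


# Hypothetical LD50 values used only to exercise each branch.
example_classes = [honeybee_toxicity_class(v) for v in (0.05, 2.0, 11.0, 100.0)]
```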

2.2.2 Immunotoxicity

Naidenko and coworkers [56] investigated the relevance of 29 immunotoxicity-focused assays conducted under the ToxCast program [57] with respect to a set of compounds for which the immunotoxic potential was known. These assays focused on three classes of targets: (i) proteins involved in cellular communication within the immune system [15 assays]; (ii) immune cell surface receptors [5 assays]; and (iii) cell adhesion molecules [9 assays].

We collected the ToxCast information about the compounds for which these assays had been conducted to develop a model to classify chemicals with regard to their immunotoxic potential. Current data in the literature are insufficient to discriminate the extent to which each of the assays is predictive of immunotoxicity, so we defined a simple Aggregate Toxicity Score, \(ATS\), as the ratio of the number of active (or positive hit call) assays to the total number of relevant assays for the chemical in question. We then used the results from Naidenko et al. to develop a cutoff-based classifier for immunotoxicity in terms of the \(ATS\). Using this information, we determined that the most accurate discrimination occurred when compounds with \(ATS\ge 0.15\) were classified as toxic, while those with \(ATS<0.15\) were classified as nontoxic.

The immunotoxicity-focused data set contained a total of 3214 compounds, 502 of which were toxic and 2712 of which were nontoxic under the above classifier.
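The Aggregate Toxicity Score and the cutoff rule described above reduce to a few lines of arithmetic. In this sketch, the function names and the representation of hit calls as a list of booleans (one per relevant assay tested for the chemical) are assumptions made for illustration; only the ratio definition and the 0.15 cutoff come from the text.

```python
def aggregate_toxicity_score(hit_calls):
    """Return the fraction of relevant assays with an active (positive) hit call."""
    calls = list(hit_calls)
    if not calls:
        raise ValueError("at least one relevant assay is required")
    return sum(1 for call in calls if call) / len(calls)


def classify_immunotoxicity(hit_calls, cutoff=0.15):
    """Label a chemical as toxic when its ATS meets or exceeds the cutoff."""
    return "toxic" if aggregate_toxicity_score(hit_calls) >= cutoff else "nontoxic"


# Hypothetical chemicals tested in all 29 assays: 5 active hits versus 4.
five_hits = [True] * 5 + [False] * 24
four_hits = [True] * 4 + [False] * 25
labels = (classify_immunotoxicity(five_hits), classify_immunotoxicity(four_hits))
```

For a chemical tested in all 29 assays, 5 active calls give an ATS of about 0.17 (toxic), whereas 4 active calls give about 0.14 (nontoxic), so the cutoff falls between four and five hits.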

2.2.3 Endocrine disruption

For this case, we focused on developing a classification model for chemicals that have the potential to disrupt endocrine function by altering estrogen receptor (ER) signaling [40, 58]. Though signaling via the ER can be affected by both agonists and antagonists, we limited our attention to ER-\(\alpha\) receptor antagonists and queried the BindingDB database [59] to find IC₅₀ values for relevant compounds. For classification, we created a three-level system, where compounds were divided into highly potent (IC₅₀ \(<\) 1 µg), moderately potent (IC₅₀ between 1 and 10 µg), and non-potent (IC₅₀ \(>\) 10 µg).

After eliminating duplicates, the resulting data set contained a total of 3490 compounds, and using the above classification system, 2140 were highly potent, 738 were moderately potent, and 612 were non-potent.

2.2.4 Mutagenicity

As noted earlier, for the mutagenicity classification model, we utilized the quantitative structure activity relationship related to mutagenicity from the VEGAHUB project [47]. The approach and data underlying this model are detailed in Benfenati et al. [46].

The data used to generate the chemical taxonomy and train the structure generator in IProCH were taken from a large, consolidated database of Ames test results [60, 61].

2.3 Software tools

Overall framework

The IProCH framework was written in Python (v3.9), utilizing packages for data management (pandas [62], v1.4.1), manipulation of SMILES representations (RDKit [63], v2020.03.1), tree diagram generation (anytree [64], v2.8.0), and chemical structure depiction (mols2grid [65], v0.2.2).

Scoring functions

The machine learning analyses for the scoring functions were conducted using the scikit-learn package [66] (v1.1.3), while molecular properties were computed using RDKit. For the scoring function related to synthesizability, we utilized the Retro* package [48] developed by Chen et al. [49].

Structure generator

For this study, we selected a structure generation algorithm based on data-efficient graph grammar learning, as developed and implemented by Guo et al. [67] in their package DEG. This technique has been shown to produce robust samples of output molecules based on input training sets with sizes that are significantly smaller than those commonly required for other algorithms. In contrast to usage in drug development, this attribute is highly desirable for the present study in which relatively small numbers of molecules have been assessed for the toxicological endpoints of interest.

Structure classification and scaffold determination

To automate the process of determining the class and taxonomy of molecules, we used the web-based platform ClassyFire [68], and to identify the common core structures for sets of molecules, we used the Murcko scaffolds algorithm [69] as implemented in RDKit. This algorithm aims to find scaffolds comprising a union of ring systems and linker atoms connecting the ring systems.

Additional details about the software and algorithms employed are given in the Supplementary Material.

For each of the activity scoring functions developed in house, we computed evaluation metrics [70], including the F1 scores, overall accuracy, and a normalized confusion matrix/contingency table. We then examined the chemical heredity and similarity of the compounds relevant to each endpoint through a taxonomy diagram. Each taxonomy diagram was not intended to be exhaustive but to represent the major classes of compounds and their chemical lineage.

For all the studies, we found that the evaluation metrics associated with the testing data set were quite similar to those found using the data resulting from multi-fold cross-validation, which suggests robust values for the metrics and a lack of bias across samples.
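The evaluation metrics named above can all be derived from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below uses plain Python rather than the scikit-learn routines used in the study, and the confusion matrix values, function names, and class labels are hypothetical, chosen only to illustrate the arithmetic (rows are true classes, columns are predicted classes).

```python
def normalize_rows(confusion):
    """Divide each row by its total so that entries are per-class fractions."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in confusion]


def per_class_metrics(confusion, labels):
    """Compute one-vs-rest sensitivity, specificity, and F1, plus overall accuracy."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    results = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp  # predicted k, true class differs
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        }
    return results, correct / total


# Hypothetical three-class confusion matrix (not data from the study).
class_labels = ["high", "moderate", "nontoxic"]
confusion = [
    [8, 1, 1],
    [1, 2, 0],
    [1, 0, 33],
]
metrics, accuracy = per_class_metrics(confusion, class_labels)
normalized = normalize_rows(confusion)
```

Reporting per-class values alongside overall accuracy is useful for imbalanced data sets like those used here, where a large nontoxic class can dominate the accuracy figure.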

Finally, although we generated numerous novel structures for each endpoint, we display only a small set of randomly selected compounds for each case as illustrative examples. The SMILES representations for additional molecules are given in the Supplementary Material (Tables S1, S3-S5).

3.1 Toxicity to honeybees

Activity scoring function (honeybee pesticide classifier)

To assess the performance of this classification model, we evaluated its ability to classify chemicals from the testing subset data into the categories noted earlier. We found that the model generated results having F1 scores of 0.9, 0.78, and 0.95 for highly toxic, moderately toxic, and nontoxic compounds, respectively. We computed the overall accuracy score for the model as 88%, which is comparable to the cited accuracy of 91% for BeeToxAI [71]. Further, calculations showed that the classifier had high sensitivity (98%, 99%, and 65%), which indicates a low probability of false negatives, and high specificity (92.5%, 84%, and 99%), which suggests a low probability of false positives across all three classes. Details of the classification performance are given in Figure S1 of the Supplementary Material in the form of a confusion matrix.

Taxonomy

Although molecules with documented toxicity to honeybees are diverse, a significant fraction of the highly toxic molecules belong to the chemical classes pyrethroids, benzene and substituted derivatives, pyridines, and organonitrogen compounds. Specifics of the chemical classes and taxonomy for members of the honeybee toxicant data set are shown in Figure S2.

Novel compounds

For this case study, 30 pyrethroids and benzenoids were used to train the structure generator. Examples of output from the framework in this case are shown in Fig. 2, where labels indicate the taxonomic classes.

3.2 Immunotoxicity

Activity scoring function (immunotoxicity classifier)

Using the testing subset data for immunotoxicants, we quantified the classification performance of the activity scoring function using the metrics noted earlier. Among the values for the performance metrics were a sensitivity of 68.7%, specificity of 83.8%, and an overall accuracy score of 76%, which compare well to those of the immunotoxicity prediction model developed and implemented by ProTox-II [72], for which the authors reported a 69.5% sensitivity, 79.5% specificity, and an overall accuracy of 75%. The computed F1 scores for our model were 0.70 and 0.73 for toxic and nontoxic classes, respectively. Details of the classification performance, as represented by a confusion matrix, are given in Figure S3.

To further assess the performance of the classifier, we tested its ability to correctly identify chemicals with well-documented immunotoxicity. The model showed a good ability (82.6% accuracy) to recognize the immunotoxic potential of members from chemical classes of concern (e.g., pyrethroids and per- and poly-fluoroalkyl substances). See Table S2 of the Supplementary Material for more information.

Taxonomy

Chemical compounds with potential immunotoxicity are structurally diverse. A taxonomy analysis of the full data set of immunotoxic molecules revealed that many of the compounds with documented or suspected immunotoxicity belonged to the chemical classes diphenylmethanes, alkyl fluorides, azoles, stilbenes, and carbonyl compounds. The full taxonomy is depicted in Figure S4.

Novel compounds

In this analysis, 20 immunotoxicants spanning several chemical classes were used to train the structure generator. A sample set of generated molecules is displayed in Fig. 3.

3.3 Endocrine disruption

Activity scoring function (ER-\(\alpha\) antagonist classifier)

We evaluated the accuracy of this model by examining its ability to classify chemicals from the testing subset for endocrine disruptors into the categories noted previously. We computed evaluation metrics and found that the classifier produced F1 scores of 0.94, 0.76, and 0.88 for the classes highly potent, moderately potent, and non-potent, respectively. We further found that the model had an overall accuracy of 89%, which was comparable to the value of 91% for the two-category estrogen receptor prediction model developed and implemented in ProTox-II. Further, the model showed high sensitivity (77.8%, 90.6% and 89.4%) and high specificity (98.1%, 94.5%, and 86.1%) for highly potent, moderately potent, and non-potent classes, respectively. Figure S5 depicts the details underlying these metrics.

Taxonomy

Ligands with ER-\(\alpha\) antagonistic properties belong to a broad range of chemical classes, including stilbenes, indoles, naphthalenes, and hydroxyflavonoids (see Figure S6).

Novel compounds

In the case study of potential endocrine disruptors, we used 20 randomly selected structures from the training subset as the input to IProCH. Figure 4 shows samples of the generated structures and their chemical classes.

3.4 Mutagenicity

Activity scoring function (mutagenicity classifier)

As noted earlier, we used a previously published model as the basis for this test case. Because information was already available regarding the model’s accuracy and performance [73], we did not perform our own assessment.

Taxonomy

An analysis of over 800 chemicals reveals that molecules with documented or potential mutagenicity belong to a broad range of chemical classes and span both the organic and inorganic chemical kingdoms. This diversity is illustrated in Figure S7.

Novel compounds

We used a training subset of 25 structures spanning various classes of the taxonomy as input to the structure generator. A sample of the generated novel and synthesizable chemicals with mutagenicity potential is shown in Fig. 5.

4.1 Limitations of study

We developed a framework to generate novel molecules and examined the molecules in the context of case studies focused on environmental chemical hazard identification. The framework can be adapted to utilize user-defined scoring functions and filtering components to produce molecules satisfying application-specific criteria and constraints. However, the specific study detailed here had several limitations:

Though they had satisfactory to very good accuracy metrics and guided the production of plausible novel molecules, the activity scoring functions we developed for this study may not be “best in class”. It may well be that other models would produce better results for specific types of chemicals and endpoints of interest. For instance, the use of similarity descriptors [16, 74, 75] may provide the means to produce more accurate models. Based on its modular design, alternative models can easily be employed within IProCH.

The scoring function \({\text{SCF}}_{\text{synth}}\) utilized a data-backed retrosynthesis approach and may not reflect synthesizability in the eyes of a trained synthetic organic chemist. Likewise, so that operating IProCH would require minimal manual intervention, the taxonomy determination and chemical classifications were conducted using an automated procedure, and the results may not be consistent with those from individuals experienced in this area.

We defined structural novelty in terms of a molecule’s absence from several large chemical databases; however, it is quite possible that many of the generated molecules can be found in other databases or have been synthesized previously without their structures having been incorporated into the queried databases. Other filtering strategies could be employed within IProCH, depending on the objectives of the end user.

4.2 Applicability and scope

In this study, we concentrated on the application of IProCH in a hazard identification context. The case studies shown were chosen to represent a range of compounds and endpoints of interest in environmental toxicology and risk assessment. We feel that a framework like IProCH, with its ability to aid in the prediction of prospective chemical hazards, can give risk assessors and other environmental scientists another tool to employ, particularly for activities related to planning and forecasting.

In addition to the present application, we anticipate that the IProCH approach has other uses, for instance in the de novo design of more environmentally friendly industrial chemicals. In this case, appropriate scoring functions could be written to guide designs toward structures with desirable physicochemical properties, like those associated with rapid environmental degradation and low persistence. Also, for scientific studies involving environmental transport and fate, it may be useful to design surrogate or probe molecules with various physicochemical characteristics that map out molecular property domains of interest.

Finally, there is a significant gap between the number of compounds that have been subject to authoritative hazard classifications and those currently in commerce (Guyton et al. 2009). To address this gap and provide information on the existing in-use chemicals that are lacking hazard evaluations, a variety of methods have been employed, including QSARs, IVIVE, and read-across. Components used in IProCH, namely the scoring functions and methodology used to identify common core structures across a set of compounds, could supplement these existing methods and provide additional capabilities for this important effort.

Data availability

The software and datasets generated and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.8113642 .

Landrigan PJ, Goldman LR. Chemical safety, health care costs and the Affordable Care Act. American Journal of Industrial Medicine. 2014;57: 1–3. doi:10.1002/ajim.22268
Fisk P. Chemical Risk Assessment: A Manual for REACH. 1st edition. Chichester: Wiley; 2014.
National Research Council. Risk Assessment in the Federal Government: Managing the Process. National Academies Press; 1983.
National Research Council. Science and decisions: advancing risk assessment. National Academies Press; 2009.
OECD. OECD cooperative chemicals assessment programme (CoCAP). 2017.
Barupal DK, Schubauer-Berigan MK, Korenjak M, Zavadil J, Guyton KZ. Prioritizing cancer hazard assessments for IARC Monographs using an integrated approach of database fusion and text mining. Environment International. 2021;156: 106624. doi:10.1016/j.envint.2021.106624
Wood WP. Safety evaluation under the toxic substances control act. The Journal of Toxicological Sciences. 1987;12: 179–184.
Cavasotto CN, Scardino V. Machine Learning Toxicity Prediction: Latest Advances by Toxicity End Point. ACS Omega. 2022;7: 47536–47546. doi:10.1021/acsomega.2c05693
EFSA. Modern methodologies and tools for human hazard assessment of chemicals. EFSA Journal. 2014;12: 3638. doi:10.2903/j.efsa.2014.3638
Isaacs KK, Egeghy P, Dionisio KL, Phillips KA, Zidek A, Ring C, et al. The chemical landscape of high-throughput new approach methodologies for exposure. J Expo Sci Environ Epidemiol. 2022;32: 820–832. doi:10.1038/s41370-022-00496-9
Edwards SW, Tan Y-M, Villeneuve DL, Meek ME, McQueen CA. Adverse Outcome Pathways-Organizing Toxicological Information to Improve Decision Making. J Pharmacol Exp Ther. 2016;356: 170–181. doi:10.1124/jpet.115.228239
Benfenati E, Pardoe S, Martin T, Gonella Diaza R, Lombardo A, Manganaro A, et al. Using toxicological evidence from QSAR models in practice. ALTEX - Alternatives to animal experimentation. 2013;30: 19–40. doi:10.14573/altex.2013.1.019
Wagner PM, Nabholz JV, Kent RJ. The new chemicals process at the Environmental Protection Agency (EPA): structure-activity relationships for hazard identification and risk assessment. Toxicology Letters. 1995;79: 67–73. doi:10.1016/0378-4274(95)03358-R
Bell SM, Chang X, Wambaugh JF, Allen DG, Bartels M, Brouwer KLR, et al. In vitro to in vivo extrapolation for high throughput prioritization and decision making. Toxicol In Vitro. 2018;47: 213–227. doi:10.1016/j.tiv.2017.11.016
Breen M, Ring CL, Kreutz A, Goldsmith M-R, Wambaugh JF. High-throughput PBTK models for in vitro to in vivo extrapolation. Expert Opin Drug Metab Toxicol. 2021;17: 903–921. doi:10.1080/17425255.2021.1935867
Rovida C, Barton-Maclaren T, Benfenati E, Caloni F, Chandrasekera PC, Chesné C, et al. Internationalization of read-across as a validated new approach method (NAM) for regulatory toxicology. ALTEX. 2020;37: 579–606. doi:10.14573/altex.1912181
Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent Sci. 2018;4: 268–276. doi:10.1021/acscentsci.7b00572
Kusner MJ, Paige B, Hernández-Lobato JM. Grammar Variational Autoencoder. arXiv:1703.01925 [stat]. 2017 [cited 12 Apr 2021]. Available: http://arxiv.org/abs/1703.01925
Meyers J, Fabian B, Brown N. De novo molecular design and generative models. Drug Discovery Today. 2021; S1359644621002531. doi:10.1016/j.drudis.2021.05.019
Olivecrona M, Blaschke T, Engkvist O, Chen H. Molecular De Novo Design through Deep Reinforcement Learning. arXiv; 2017. doi:10.48550/arXiv.1704.07555
Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks. arXiv:1701.01329 [physics, stat]. 2017 [cited 28 Jun 2021]. Available: http://arxiv.org/abs/1701.01329
Atance SR, Diez JV, Engkvist O, Olsson S, Mercado R. De Novo Drug Design Using Reinforcement Learning with Graph-Based Deep Generative Models. Journal of Chemical Information and Modeling. 2022. doi:10.1021/acs.jcim.2c00838
Huang S, Mei H, Lu L, Qiu M, Liang X, Xu L, et al. De Novo Molecular Design of Caspase-6 Inhibitors by a GRU-Based Recurrent Neural Network Combined with a Transfer Learning Approach. Pharmaceuticals (Basel, Switzerland). 2021;14: 1249. doi:10.3390/ph14121249
Calderone NW. Insect pollinated crops, insect pollinators and US agriculture: trend analysis of aggregate data for the period 1992-2009. PLoS One. 2012;7: e37235. doi:10.1371/journal.pone.0037235
Hung K-LJ, Kingston JM, Albrecht M, Holway DA, Kohn JR. The worldwide importance of honey bees as pollinators in natural habitats. Proc Biol Sci. 2018;285: 20172140. doi:10.1098/rspb.2017.2140
Oerke E-C. Crop losses to pests. The Journal of Agricultural Science. 2006;144: 31–43. doi:10.1017/S0021859605005708
Serrão JE, Plata-Rueda A, Martínez LC, Zanuncio JC. Side-effects of pesticides on non-target insects in agriculture: a mini-review. Die Naturwissenschaften. 2022;109: 17. doi:10.1007/s00114-022-01788-8
Steinhauer N, Kulhanek K, Antúnez K, Human H, Chantawannakul P, Chauzat M-P, et al. Drivers of colony losses. Current Opinion in Insect Science. 2018;26: 142–148. doi:10.1016/j.cois.2018.02.004
Mullin CA, Frazier M, Frazier JL, Ashcraft S, Simonds R, Vanengelsdorp D, et al. High levels of miticides and agrochemicals in North American apiaries: implications for honey bee health. PLoS One. 2010;5: e9754. doi:10.1371/journal.pone.0009754
Bendahou N, Bounias M, Fleche C. Toxicity of Cypermethrin and Fenitrothion on the Hemolymph Carbohydrates, Head Acetylcholinesterase, and Thoracic Muscle Na+, K+-ATPase of Emerging Honeybees (Apis mellifera mellifera L.). Ecotoxicology and Environmental Safety. 1999;44: 139–146. doi:10.1006/eesa.1999.1811
Rundlöf M, Andersson GKS, Bommarco R, Fries I, Hederström V, Herbertsson L, et al. Seed coating with a neonicotinoid insecticide negatively affects wild bees. Nature. 2015;521: 77–80. doi:10.1038/nature14420
Rondeau G, Sánchez-Bayo F, Tennekes HA, Decourtye A, Ramírez-Romero R, Desneux N. Delayed and time-cumulative toxicity of imidacloprid in bees, ants and termites. Scientific Reports. 2014;4: 5566. doi:10.1038/srep05566
Semwal R, Semwal RB, Lehmann J, Semwal DK. Recent advances in immunotoxicity and its impact on human health: causative agents, effects and existing treatments. International Immunopharmacology. 2022;108: 108859. doi:10.1016/j.intimp.2022.108859
DeWitt JC, Shnyra A, Badr MZ, Loveless SE, Hoban D, Frame SR, et al. Immunotoxicity of perfluorooctanoic acid and perfluorooctane sulfonate and the role of peroxisome proliferator-activated receptor alpha. Critical Reviews in Toxicology. 2009;39: 76–94. doi:10.1080/10408440802209804
Segner H, Bailey C, Tafalla C, Bo J. Immunotoxicity of Xenobiotics in Fish: A Role for the Aryl Hydrocarbon Receptor (AhR)? International Journal of Molecular Sciences. 2021;22: 9460. doi:10.3390/ijms22179460
Silano V, Bolognesi C, Castle L, Cravedi J-P, Engel K-H, Fowler P, et al. A statement on the developmental immunotoxicity of bisphenol A (BPA): answer to the question from the Dutch Ministry of Health, Welfare and Sport. EFSA Journal. 2016;14: e04580.
Grandjean P. Delayed discovery, dissemination, and decisions on intervention in environmental health: a case study on immunotoxicity of perfluorinated alkylate substances. Environmental Health. 2018;17: 62. doi:10.1186/s12940-018-0405-y
Liu C, Gin KY-H. Immunotoxicity in green mussels under perfluoroalkyl substance (PFAS) exposure: Reversible response and response model development. Environmental Toxicology and Chemistry. 2018;37: 1138–1145. doi:10.1002/etc.4060
Delfosse V, Maire A le, Balaguer P, Bourguet W. A structural perspective on nuclear receptors as targets of environmental compounds. Acta Pharmacologica Sinica. 2015;36: 88–101. doi:10.1038/aps.2014.133
Gore AC, Chappell VA, Fenton SE, Flaws JA, Nadal A, Prins GS, et al. EDC-2: The Endocrine Society’s Second Scientific Statement on Endocrine-Disrupting Chemicals. Endocr Rev. 2015;36: E1–E150. doi:10.1210/er.2015-1010
Food and Agriculture. Genotoxicity. Principles and Methods for the Risk Assessment of Chemicals in Food. World Health Organization; 2020. pp. 471–471.
US Environmental Protection Agency. Guidelines for Mutagenicity Risk Assessment. 1986 p. 23.
European Chemicals Agency. Guidance on information requirements and chemical safety assessment Part B: Hazard Assessment. European Chemicals Agency; 2011 Dec. Available: https://echa.europa.eu/documents/10162/17235/information_requirements_part_b_en.pdf/7e6bf845-e1a3-4518-8705-c64b17cecae8?t=1323782779823
Meier MJ, O’Brien JM, Beal MA, Allan B, Yauk CL, Marchetti F. In Utero Exposure to Benzo[a]Pyrene Increases Mutation Burden in the Soma and Sperm of Adult Mice. Environmental Health Perspectives. 2017;125: 82–88. doi:10.1289/EHP211
Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28: 31–36. doi:10.1021/ci00057a005
Benfenati E, Manganaro A, Gini G. VEGA-QSAR: AI inside a platform for predictive toxicology. Proceedings of the workshop "Popularize Artificial Intelligence 2013". CEUR Workshop Proceedings. 2013;1107: 8.
Roncaglioni A, Lombardo A, Benfenati E. The VEGAHUB Platform: The Philosophy and the Tools. Altern Lab Anim. 2022;50: 121–135. doi:10.1177/02611929221090530
Chen B. Retrosynthetic Planning with Retro*. 2022. Available: https://github.com/binghong-ml/retro_star
Chen B, Li C, Dai H, Song L. Retro*: Learning retrosynthetic planning with neural guided A* search. In: Daumé III H, Singh A, editors. Proceedings of the 37th International Conference on Machine Learning. PMLR; 2020. pp. 1608–1616. Available: https://proceedings.mlr.press/v119/chen20k.html
PubChem. PubChem. 2022. Available: https://pubchem.ncbi.nlm.nih.gov/
Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG. ZINC: A Free Tool to Discover Chemistry for Biology. Journal of Chemical Information and Modeling. 2012;52: 1757–1768. doi:10.1021/ci3001277
Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. ChEMBL Database. Nucleic Acids Res. 2017;45. Available: https://www.ebi.ac.uk/chembl/
NCI/CADD Group. Downloadable Structure Files of NCI Open Database Compounds. In: NCI Open Database [Internet]. 2022 [cited 1 Nov 2022]. Available: https://cactus.nci.nih.gov/download/nci/
Spruill SE, O’Neill BF, Hinarejos S, Cabrera AR. A Comparison of Acute Toxicity Endpoints for Adult Honey Bees with Technical Grade Active Ingredients and Typical End-use Products as Test Substance. Journal of Economic Entomology. 2020;113: 1015–1017. doi:10.1093/jee/toz305
U.S. Environmental Protection Agency-Office of Pesticide Programs. Guidance on Exposure and Effects Testing for Assessing Risks to Bees. 2016. Available: https://scholar.google.com/scholar_lookup?title=Guidance+on+exposure+and+effects+testing+for+assessing+risks+to+bees&publication_year=2016&
Naidenko OV, Andrews DQ, Temkin AM, Stoiber T, Uche UI, Evans S, et al. Investigating Molecular Mechanisms of Immunotoxicity and the Utility of ToxCast for Immunotoxicity Screening of Chemicals Added to Food. International Journal of Environmental Research and Public Health. 2021;18: 3332. doi:10.3390/ijerph18073332
Dix DJ, Houck KA, Martin MT, Richard AM, Setzer RW, Kavlock RJ. The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicological sciences. 2007;95: 5–12.
Shanle EK, Xu W. Endocrine disrupting chemicals targeting estrogen receptor signaling: Identification and mechanisms of action. Chem Res Toxicol. 2011;24: 6–19. doi:10.1021/tx100231n
Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Research. 2016;44: D1045–D1053. doi:10.1093/nar/gkv1072
Kirkland D, Zeiger E, Madia F, Corvi R. Can in vitro mammalian cell genotoxicity test results be used to complement positive results in the Ames test and help predict carcinogenic or in vivo genotoxic activity? II. Construction and analysis of a consolidated database. Mutation Research/Genetic Toxicology and Environmental Mutagenesis. 2014;775–776: 69–80. doi:10.1016/j.mrgentox.2014.10.006
Madia F, Kirkland D, Morita T, White P, Asturiol D, Corvi R. EURL ECVAM Genotoxicity and Carcinogenicity Database of Substances Eliciting Negative Results in the Ames Test: Construction of the Database. Mutation Research/Genetic Toxicology and Environmental Mutagenesis. 2020;854–855: 503199. doi:10.1016/j.mrgentox.2020.503199
pandas development team. pandas-dev/pandas: Pandas. Zenodo; 2020. Available: https://doi.org/10.5281/zenodo.3509134
Landrum G. RDKit: Open-source cheminformatics. 2022.
c0fec0de. anytree: Python tree data library. 2022. Available: https://github.com/c0fec0de/anytree
Bouysset C. mols2grid: An interactive molecule viewer for 2D structures, based on RDKit. 2022. Available: https://github.com/cbouy/mols2grid
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12: 2825–2830.
Guo M, Thost V, Li B, Das P, Chen J, Matusik W. Data-Efficient Graph Grammar Learning for Molecular Generation. 2022. Available: https://openreview.net/forum?id=l4IHywGq6a
Djoumbou Feunang Y, Eisner R, Knox C, Chepelev L, Hastings J, Owen G, et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. Journal of Cheminformatics. 2016;8: 61. doi:10.1186/s13321-016-0174-y
Bemis GW, Murcko MA. The Properties of Known Drugs. 1. Molecular Frameworks. J Med Chem. 1996;39: 2887–2893. doi:10.1021/jm9602928
Kroese DP, Botev ZI, Taimre T, Vaisman R. Data Science and Machine Learning: Mathematical and Statistical Methods. 1st edition. Boca Raton London New York: Chapman and Hall/CRC; 2019.
Moreira-Filho JT, Braga RC, Lemos JM, Alves VM, Borba JVVB, Costa WS, et al. BeeToxAI: An artificial intelligence-based web app to assess acute toxicity of chemicals to honey bees. Artificial Intelligence in the Life Sciences. 2021;1: 100013. doi:10.1016/j.ailsci.2021.100013
Banerjee P, Eckert AO, Schrey AK, Preissner R. ProTox-II: a webserver for the prediction of toxicity of chemicals. Nucleic Acids Research. 2018;46: W257–W263. doi:10.1093/nar/gky318
Bakhtyari NG, Raitano G, Benfenati E, Martin T, Young D. Comparison of in silico models for prediction of mutagenicity. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev. 2013;31: 45–66. doi:10.1080/10590501.2013.763576
Banerjee A, Roy K. First report of q-RASAR modeling toward an approach of easy interpretability and efficient transferability. Mol Divers. 2022;26: 2847–2862. doi:10.1007/s11030-022-10478-6
Banerjee A, Roy K. On Some Novel Similarity-Based Functions Used in the ML-Based q-RASAR Approach for Efficient Quantitative Predictions of Selected Toxicity End Points. Chem Res Toxicol. 2023;36: 446–464. doi:10.1021/acs.chemrestox.2c00374

No competing interests reported.

supplementarymaterial.pdf
