Discovering Pesticides and their Transformation Products in Luxembourg Waters using Open Cheminformatics Approaches

The diversity of hundreds of thousands of potential organic pollutants and the lack of (publicly available) information about many of them is a huge challenge for environmental sciences, engineering, and regulation. Suspect screening based on high-resolution liquid chromatography-mass spectrometry (LC-HRMS) has enormous potential to help characterize the presence of these chemicals in our environment, enabling the detection of known and newly emerging pollutants, as well as their potential transformation products (TPs). Here, suspect list creation (focusing on pesticides relevant for Luxembourg, incorporating data sources in 4 languages) was coupled to an automated retrieval of related TPs from PubChem based on high confidence suspect hits, to screen for pesticides and their TPs in Luxembourgish river samples. A computational workflow was established to combine LC-HRMS analysis and pre-screening of the suspects (including automated quality control steps), with spectral annotation to determine which pesticides and, in a second step, their related TPs may be present in the samples. The data analysis with Shinyscreen (https://git-r3lab.uni.lu/eci/shinyscreen/), an open source software developed in house, coupled with custom-made scripts, revealed the presence of 162 potential pesticide masses and 135 potential TP masses in the samples. Further identification of these mass matches was performed using the open source MetFrag (https://msbi.ipb-halle.de/MetFrag/). Eventual target analysis of 36 suspects resulted in 31 pesticides and TPs confirmed at Level-1 (highest confidence), and five pesticides and TPs not confirmed due to different retention times. Spatio-temporal analysis of the results showed that TPs and pesticides followed similar trends, with a maximum number of potential detections in July. The highest detections were in the rivers Alzette and Mess and the lowest in the Sûre and Eisch.
This study (a) added pesticides, classification information and related TPs into the open domain, (b) developed automated open source retrieval methods - both enhancing FAIRness (Findability, Accessibility, Interoperability and Reusability) of the data and methods; and (c) will directly support “L’Administration de la

is still surprisingly difficult to access information on TPs in a central and "FAIR" (Findable, Accessible, Interoperable and Reusable 8,9) manner, with much valuable information documented as detailed reaction schemes (e.g. as images) or descriptive text in regulatory reports that are not always easily or publicly accessible. In this study, not only is the presence of pesticides in samples investigated, but also the presence of their documented TPs in openly available information sources, as well as in the samples.
Previous work by Moschet et al. 10 and Kiefer et al. 11 both characterised the relevance of pesticide transformation products in their findings and shared their lists afterwards (SWISSPEST 12 and SWISSPEST19 13, respectively) on the NORMAN Suspect List Exchange (NORMAN-SLE) 14, thus making them more "FAIR" 8. The SWISSPEST suspect list was a starting point for the pesticide suspect list developed in this work, with additional chemicals of local relevance added as described below (note: SWISSPEST19 was published in parallel during the early stages of this work).
For the identification of unknown contaminants in the environment, a technology that is sensitive, fast, and accurate is required, capable of confidently identifying chemical contaminants emerging at trace concentrations in complex environmental and biological matrices. High resolution mass spectrometry (HR-MS) coupled with liquid chromatography has become an established technique for the monitoring of thousands of chemicals in water (and other) samples. 15,16 Various computational approaches can help screen non-target HR-MS measurements for large numbers of suspect chemicals using suspect lists and/or mass spectral libraries 15,17, or to discover and identify new, previously unknown chemicals in the environment. 15,18 These two non-targeted analysis strategies are called suspect screening and non-targeted screening, respectively. 10 Suspect screening, the strategy used in this study, uses only the information of the chemical structure and its mass (and/or spectrum) a priori and is, therefore, a very promising approach for the efficient tentative identification of compounds. 10,19 Consequently, suspect screening can be used to perform extensive analytical screening for specific chemicals suspected to be in the samples without necessarily needing reference standards in advance. 10 Targeted analysis is a more classical approach for quantification, providing high sensitivity and high selectivity, that requires preselection of the chemicals in advance and the availability of reference standards. Nevertheless, this approach is the only way to verify and quantify the tentative candidates in the end. The increasing number of chemicals of interest in environmental and exposomics studies makes it practically impossible for target analyses dependent on individual standards to cover all potentially occurring chemicals. 10
Suspect screening methods are therefore developed to reveal a fuller picture of occurring chemicals and can be performed with suspect chemical lists, 10,15,16 allowing for eventual prioritization for target analysis and confirmation efforts. 10 Confidence in HR-MS-based identifications inherently varies between compounds, since it is not always possible or reasonable to synthesize each substance or confirm them via complementary methods (e.g. nuclear magnetic resonance) at very low environmental concentrations and in complex mixtures. 20 These varying levels of confidence and the need for a standardized manner to report the results were motivating reasons for a level system that was introduced in 2014. 20 The system contains five identification confidence levels, which can be achieved through experimental and computational analysis of the compound(s) measured in HR-MS experiments, with the objective to achieve the highest possible identification level that is realistic with the available evidence. Suspect screening can generally be considered to start at an identification confidence of Level-3 (tentatively detected candidates following pre-screening; see below), and through data analysis compounds can obtain the confidence Level-2a, i.e. probable structures via a high-quality spectral library match. Should target analysis reveal a suitable match with a reference standard measured in house with the same method, this results in a Level-1 confirmed identification.
Since a suspect list is often set up based on a substance class (or classes) of interest, there is no guarantee that the suspects are present in the sample. Thus, a pre-screening step helps to determine which suspects may be present with matching MS1 and MS2 spectra of sufficient quality for further data analysis. This step was performed using Shinyscreen (https://git-r3lab.uni.lu/eci/shinyscreen/) 21, a semi-automated, open-source alternative to vendor software for peak inspection, with built-in quality control criteria as described recently by Lai et al. 22 Potential suspects with MS1 and MS2 spectra passing the Shinyscreen pre-screening were promoted for additional identification efforts via MS2 spectra annotation using the open source in silico fragmentation approach MetFrag (https://msbi.ipb-halle.de/MetFrag/). 23 MetFrag combines compound database searching and fragmentation prediction plus other experimental and metadata terms for molecule identification using HR-MS2 fragmentation information. 23 Given a single MS2 spectrum of a suspect and the neutral mass of the parent ion, MetFrag first selects matching candidates from databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov/) 24 and CompTox (https://comptox.epa.gov/dashboard/) 25, before each of the retrieved candidates is fragmented in silico using a bond-disconnection method and ranked using various scoring terms (see methods for further details). 23 For this study, the US Environmental Protection Agency (US EPA) CompTox Chemicals Dashboard was used as the main compound database, consistent with Lai et al. 22, because of its relatively small size (~ 880,000 chemicals), and the extensive environmentally-relevant metadata such as toxicity, exposure, and presence integrated in CompTox from various information sources. 25
The recently-released PubChemLite for Exposomics collection 26, which demonstrated very good performance particularly for agrochemicals (pesticides), was under development at the time that this work was performed.
The main goals for this study were (a) establishing a new high-throughput suspect screening workflow based on open resources coupled with semi-automatic screening and annotation steps, (b) the discovery and FAIRification of TP information based on their parent compounds using text-mining methods, and (c) application of these combined approaches on surface water samples to gain an overview of the pesticide and pesticide TP presence in Luxembourgish rivers. The resulting suspect lists, classification and permission information were uploaded to various open databases and repositories to contribute to open and "FAIR" data management for exposomics.

Material And Methods
The high-throughput suspect screening workflow developed here is shown in Figure 1 and explained in the following sections.

Sampling and Solid Phase Extraction
Different river surface water samples, collected throughout Luxembourg, were selected by the "L'Administration de la Gestion de l'Eau" (the Luxembourgish Water Administration, hereafter AGE) for chemical monitoring; pesticides and their TPs are the specific focus of these efforts (additional activities are ongoing). Nine different locations (Fig. 2 and Supplementary Materials (SM) Table S1) covered the various river catchments, and the data used in this study were sampled monthly between April 2019 and October 2019 (no sampling in June 2019).
The surface water samples were filled in 1000 mL amber bottles and stored for up to one week at 5°C (± 3°C) in darkness until extraction. To assess possible contamination from sample handling, ultrapure water was analogously enriched and analysed as blank samples.
For the solid-phase extraction, Atlantic® HLB SPE Disks from Horizon (Salem, NH, USA) with a 47 mm diameter were used. The disks were conditioned twice for 1 minute (min) with acetonitrile, and then twice for 1 min with Milli-Q water. 1000 mL of sample was pumped through each disk at a flow rate of roughly 30 mL/min, using the SPE-DEX 47900 system from Horizon. Sample loading was followed by washing the disks twice for 1 min with Milli-Q water and drying by air flow for 15 min. The analytes were eluted for 1 min with cyclohexane, followed by an acetone elution for 1 min, then 4 times for 1 min with acetonitrile. After each elution step, the disks were air-dried for 1 min. The combined extracts were dried under nitrogen flow in a water bath heated to 40°C. The samples were resuspended in 2 mL acetonitrile/water (10/90) by sonication for 5 min and remaining particles were removed by passing the extracts through a 0.7 µm glass-fibre filter (Sartorius, Brussels, Belgium).

LC-HRMS Analysis
Reversed-phase chromatography was accomplished using an Acquity Ultra Performance Liquid Chromatography (UPLC) BEH C18 column (dimensions: 1.7 µm, 2.1 x 150 mm) from Waters. The flow was set to 0.20 mL/min using water (0.1 % formic acid, A) and methanol (B) as the mobile phase. The mobile phase gradient started at 90 % of A and 10 % of B at 0 min and was kept for 2 min before linearly ramping to 100 % B at 15 min. This condition was kept for another 5 min before returning to the starting mobile phase conditions after 21 min. The column was allowed to re-equilibrate for 9 min before the next injection.
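The gradient programme described above can be written as a piecewise function of time, which is convenient for relating an observed retention time to the approximate eluent composition. The following is a minimal sketch only: the function name is hypothetical, the return to starting conditions between 20 and 21 min is assumed to be linear, and the total cycle length of 30 min is inferred from the 9 min re-equilibration.

```python
def percent_b(t_min: float) -> float:
    """Approximate % methanol (B) at time t_min within one run cycle."""
    if t_min < 0 or t_min > 30:
        raise ValueError("time outside the assumed 0-30 min run cycle")
    if t_min <= 2:
        return 10.0  # initial hold: 90 % A, 10 % B
    if t_min <= 15:
        # linear ramp from 10 % B at 2 min to 100 % B at 15 min
        return 10.0 + (t_min - 2.0) * 90.0 / 13.0
    if t_min <= 20:
        return 100.0  # hold at 100 % B for 5 min
    if t_min <= 21:
        # return to starting conditions (assumed linear over 1 min)
        return 100.0 - (t_min - 20.0) * 90.0
    return 10.0  # re-equilibration at starting conditions


if __name__ == "__main__":
    for t in (0, 2, 8.5, 15, 20, 25):
        print(f"{t:>5} min: {percent_b(t):.1f} % B")
```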

Pesticide Substance Selection
The plant protection product list from the Luxembourgish "Administration des Services Techniques de l'Agriculture" (ASTA) 27 and the SWISSPEST list of registered insecticides and fungicides in Switzerland 10 were used as starting points for the suspect list. Several (multilingual) documents provided by collaborators in the Clinical & Experimental Neuroscience group at the Luxembourg Centre for Systems Biomedicine as part of previous work 28 were also included, as documented in the "LUXPEST" dataset available on Zenodo 29 and briefly below.
The final LUXPEST pesticide suspect list included 386 pesticides, 29 classified into different classes along with information about their use authorisation in Luxembourg. 30,31,32 Out of the 386 pesticides, 196 are permitted to be used in Luxembourg whereas 169 are not, while for 21 pesticides, no permission information was available. The classification efforts revealed that most of them were fungicides and herbicides (96 and 93 respectively); 49 were already classified as pesticide TPs (SM Figure S1). As a part of "FAIRifying" this dataset, the LUXPEST list is openly available on the NORMAN-SLE 14, PubChem 24 and CompTox 25,33 websites, and the detailed classification information was added to the PubChem NORMAN-SLE Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101) and into the individual records for the pesticides (see SM Figures S2 and S3).

Pre-screening with Shinyscreen
Pre-screening was performed using Shinyscreen 21 with the following settings for extraction and automatic quality control (explained in greater detail in Lai et al. 22): coarse precursor m/z error ± 0.5 Da, fine precursor m/z error ± 2.5 ppm, extracted ion chromatogram m/z error ± 0.001 Da, retention time (RT) tolerance ± 0.5 min, an MS1 intensity threshold of 1.0 x 10^5 and an MS2 intensity threshold relative to the MS1 peak intensity of 0.05. Features that fulfilled the following four criteria were considered as passing the quality control: 1) MS1 peak intensity > 1 x 10^5, 2) presence of MS2 spectrum, 3) alignment of MS1 and MS2 peaks within the RT tolerance, 4) signal to noise ratio > 3.
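To make the four quality control criteria concrete, they can be expressed as a single boolean check per feature. This is an illustrative sketch and not the Shinyscreen implementation: the `Feature` class, its field names and the helper functions are hypothetical, and only the thresholds are taken from the settings listed above.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """One suspect/sample/mode combination (hypothetical structure)."""
    ms1_intensity: float      # MS1 peak intensity
    has_ms2: bool             # whether an MS2 spectrum was recorded
    rt_ms1: float             # retention time of the MS1 peak, min
    rt_ms2: float             # retention time of the MS2 scan, min
    signal_to_noise: float    # signal to noise ratio of the MS1 peak


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Relative m/z error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def passes_qc(f: Feature, ms1_threshold: float = 1e5,
              rt_tol: float = 0.5, sn_min: float = 3.0) -> bool:
    """Apply the four criteria: intensity, MS2 present, RT alignment, S/N."""
    return (
        f.ms1_intensity > ms1_threshold
        and f.has_ms2
        and abs(f.rt_ms1 - f.rt_ms2) <= rt_tol
        and f.signal_to_noise > sn_min
    )


if __name__ == "__main__":
    example = Feature(2.0e5, True, 15.6, 15.7, 10.0)
    print(passes_qc(example))
```

A feature failing any single criterion is rejected, which matches the description of the criteria as jointly required.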

Candidate Identification with MetFrag
The features that passed the quality control were then analysed using MetFrag 23 coupled to CompTox 25,34 to achieve tentative identifications 23, generally consistent with Lai et al. 22 Candidates were retrieved using an (exact mass + 10 ppm) window, where the exact mass settings included the measured ion mass plus adduct species ([M+H]+ for positive and [M-H]- for negative mode, automatically detected from the Shinyscreen mode output) for internal correction to neutral mass in MetFrag for candidate retrieval. The InChIKey filtering (default setting) was left on, i.e., candidates that vary only in the stereochemistry are merged in the output, and the highest scoring candidate is considered. Several MetFrag scoring terms were included. The two most relevant scoring terms for this study are the MetFrag in silico fragmentation score (settings: mzabs = 0.001; frag_ppm = 5; adduct setting as per candidate retrieval) and the MoNA (MassBank of North America) score. 35 While MetFrag compares the experimental results with in silico fragmentation results, it also searches the experimental data against online mass spectral records from a public spectral library, MoNA, and presents these outcomes to users via the MoNA spectral similarity scoring term (hereafter "MoNA Score"). Several additional metadata terms were used in the MetFrag calculation (generally consistent with Lai et al. 22) but were not considered further here, yielding in the end a maximum score of 10 where every scoring term has the same weight (10 scoring terms each with a weight of 1). However, as described below, the MoNA Score became the primary decision-making criterion in this work. The additional scoring terms were CPDAT__COUNT, PUBMED_ARTICLES, DATA_SOURCES, PUBCHEM_SOURCES, TOXCAST_PERCENT_ACTIVE_BIOASSAYS, PREDICTED_EXPOSURE, KEMIMARKET_EXPO and KEMIMARKET_HAZ.
All the chemicals that achieved a MoNA Score greater than or equal to 0.9 (scoring range between 0 and 1) were assigned as Level-2a compounds according to the scheme described by Schymanski et al. 20 and as described above. In this study, four different MoNA score scenarios were defined in the context of the results available, also in line with commonly used thresholds in the community. The four scenarios were defined as follows: 1) "very good" describes the cases with a MoNA score greater than or equal to 0.9, i.e., a Level-2a; 2) "good" describes the cases with a MoNA score between 0.7 and 0.9, which can be considered in some cases sufficient for Level-2a but based on experience not always sufficient; 3) "poor" describes the cases with a MoNA score greater than 0 and smaller than 0.7; and 4) "no spectrum" describes the cases with a MoNA score equal to 0. The first scenario led to a Level-2a as described above and the three other scenarios remained at a Level-3 for further inspection.
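The adduct correction and the equally weighted score combination can be sketched as follows. This is not MetFrag code: the function names are hypothetical, the retrieval window is assumed here to be symmetric around the neutral mass, and the score terms are assumed to be already normalised to the range 0 to 1 before summation.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def neutral_mass(ion_mz: float, mode: str) -> float:
    """Convert a singly charged ion m/z to the neutral monoisotopic mass."""
    if mode == "positive":    # [M+H]+
        return ion_mz - PROTON_MASS
    if mode == "negative":    # [M-H]-
        return ion_mz + PROTON_MASS
    raise ValueError("mode must be 'positive' or 'negative'")


def mass_window(mass: float, ppm: float = 10.0) -> tuple:
    """Lower and upper mass bounds for candidate retrieval (assumed symmetric)."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def total_score(term_scores: dict) -> float:
    """Sum of normalised scoring terms, each with a weight of 1."""
    return sum(term_scores.values())


if __name__ == "__main__":
    m = neutral_mass(230.1167, "positive")  # illustrative m/z value
    low, high = mass_window(m)
    print(f"neutral mass {m:.4f} Da, window {low:.4f} to {high:.4f} Da")
```

With ten terms each bounded by 1, `total_score` cannot exceed 10, which corresponds to the maximum score stated above.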
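The four MoNA score scenarios and the resulting confidence levels amount to a simple threshold mapping. The sketch below uses a hypothetical function name; the boundary handling (0.7 counted as "good", 0.9 counted as "very good") follows the definitions given above.

```python
def mona_scenario(score: float) -> tuple:
    """Map a MoNA score (0 to 1) to (scenario, identification level)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("MoNA score must lie between 0 and 1")
    if score >= 0.9:
        return "very good", "Level-2a"
    if score >= 0.7:
        return "good", "Level-3"
    if score > 0.0:
        return "poor", "Level-3"
    return "no spectrum", "Level-3"


if __name__ == "__main__":
    for s in (0.95, 0.8, 0.3, 0.0):
        print(s, mona_scenario(s))
```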

Transformations
In a collaborative effort between PubChem and the NORMAN-SLE, several lists of chemicals including parent-TP information were mapped into a standardized format and added into PubChem as "Transformations", as described elsewhere 26 (see Fig. 3).
The so-called "parents" were termed "predecessor" to avoid terminology clashes (as the term "parent" has a different meaning in PubChem), and the TPs or metabolites were termed "successors" in PubChem. At the time this study was performed, the NORMAN-SLE lists included were S60 SWISSPEST19 13  or via scripting queries. Custom-made R functions were designed to access this as a part of this work. 39

Hazardous Substance Database (HSDB) Metabolites
A further information source of TPs within PubChem is the "Metabolism and Metabolites" section which, unlike the table above, contains human-readable text excerpts from several data sources, including the Hazardous Substance Database (HSDB) from the US National Library of Medicine (NLM), recently fully integrated within PubChem. As a pilot project as part of this work, a data extraction workflow was designed based on the HSDB annotation file (available in JavaScript Object Notation, JSON, format). In short, text excerpts are automatically screened for recognized synonyms PubChem-side and, where detected, hyperlinked (shown as blue text in Figure 4, and recognizable in the annotation file by CID).
This information can be automatically retrieved from the JSON file. Additionally, the text also contains many descriptive reactions that are not suitable for automated synonym recognition, but interpretable by chemists. Thus, information was automatically extracted in a tabular form for manual curation (e.g., removal of irrelevant matches, addition of new chemicals) with full provenance suitable for conversion into a "Transformations"  40,41 . To describe the challenges visually, the predecessor (Fig. 4, atrazine) is circled in purple and was automatically extracted, along with two TPs, 2-hydroxyatrazine (red; two different synonyms mapping to the same structure) and 2-hydroxydesethylatrazine (orange, three synonyms; not every synonym was recognised fully). Text-mined entries retrieved in this manner are circled in full lines. Desethylatrazine was not automatically recognised (no blue hyperlink present) but was curated and added in manually (blue dotted lines). The synonym "hydroxy" was automatically mapped (blue hyperlink, green dashed circle) but removed in the manual curation step as an artefact of the mapping.
All HSDB TPs extracted in this manner were added to a new suspect list S68 HSDBTPS 42 and full provenance of the curation is available on the Environmental Cheminformatics GitLab repository 43 .
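The automated part of this extraction, i.e. collecting hyperlinked synonyms from a text excerpt into a table for manual curation, can be illustrated with a toy record. Everything in this sketch is an assumption for illustration: the record layout (field names such as "StringWithMarkup", "Markup", "Start", "Length" and "Extra") only approximates the style of PubChem annotation data, the CIDs are placeholders, and the function is not the workflow used in this study.

```python
import json

# Toy annotation record; structure and CIDs are placeholders, not real data.
record_json = """
{"predecessor_cid": 1,
 "StringWithMarkup": [
   {"String": "Atrazine is metabolised to hydroxyatrazine.",
    "Markup": [
      {"Start": 0, "Length": 8, "Extra": "CID-1"},
      {"Start": 27, "Length": 15, "Extra": "CID-2"}
    ]}
 ]}
"""


def extract_pairs(record: dict) -> list:
    """Return one row per hyperlinked synonym that is not the predecessor."""
    predecessor = record["predecessor_cid"]
    rows = []
    for block in record["StringWithMarkup"]:
        text = block["String"]
        for markup in block.get("Markup", []):
            cid = int(markup["Extra"].split("-")[1])
            if cid == predecessor:
                continue  # skip mentions of the predecessor itself
            start = markup["Start"]
            synonym = text[start:start + markup["Length"]]
            rows.append({
                "predecessor_cid": predecessor,
                "successor_cid": cid,
                "synonym": synonym,
                "evidence": text,  # kept for provenance and manual curation
            })
    return rows


if __name__ == "__main__":
    for row in extract_pairs(json.loads(record_json)):
        print(row)
```

Keeping the full evidence text in each row reflects the need for manual curation described above, since artefacts (such as a bare "hydroxy" match) and unlinked TPs can only be resolved by reading the excerpt.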

Verification and Quantification using Reference Standards
All the pesticides at a Level-2a were selected for further verification via reference standards analysed with the same chromatographic parameters and procedures as for the sample analysis. Several reference standards came from the in house available ENTACT mixtures, obtained from participation in the EPA's Non-Targeted Analysis Collaborative Trial. 44 Retention times were considered a match if the difference was less than ± 0.2 min. Additional reference standards were purchased where possible (SM Table S2).
Where reference standards were available, the concentrations of the pesticides and TPs were quantified using an external calibration curve ranging from 1 ppb to 1000 ppb, spanning the linear dynamic range for the compounds quantified. Thermo Scientific TraceFinder™ Software (version 5.1) was used for automatic peak integration and generation of the calibration curve. Concentrations below 1 ppb were reported to be below the quantifiable range.
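The principle of external calibration can be sketched with an ordinary least-squares line through standards of known concentration. This is a generic illustration and does not reproduce the TraceFinder processing: the calibration levels and peak areas are invented, an unweighted linear fit is assumed, and the function names are hypothetical.

```python
def fit_line(x: list, y: list) -> tuple:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantify(area: float, slope: float, intercept: float,
             lower_limit: float = 1.0):
    """Back-calculate concentration in ppb; None if below the calibrated range."""
    conc = (area - intercept) / slope
    if conc < lower_limit:
        return None  # reported as below the quantifiable range
    return conc


if __name__ == "__main__":
    levels_ppb = [1.0, 10.0, 100.0, 1000.0]        # invented calibration levels
    areas = [2500.0, 20500.0, 200500.0, 2000500.0]  # invented peak areas
    slope, intercept = fit_line(levels_ppb, areas)
    print(quantify(100500.0, slope, intercept))
```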

Results
The numbers that will be explained in detail in the next sections are summarized in a table (SM Table S3), to provide an overview of the number of cases and/or compounds for each step of the workflow.

Tentatively Detected Pesticides
Shinyscreen was run with the 386 LUXPEST 29 suspects (SM Table S4) on river water samples from nine locations over six months and for two modes (positive and negative), comprising 20,844 cases per ionization mode (386 x 6 x 9; 41,688 in total) for the automated quality control protocol. In total, there were 3,006 cases deemed suitable for further identification with MetFrag, corresponding to 162 unique compounds (SM Table S5). Figure 5 illustrates the number of cases for each location and for each month.
For example, in April 2019 the river "Sûre" in Erpeldange revealed 44 cases that passed the quality check in Shinyscreen. These were subsequently analyzed and annotated with MetFrag to assign an identification confidence level.

Pesticide Annotation with MetFrag
The 3,006 cases were categorized into four different scenarios depending on their MoNA score, as shown in Figure 6.

Pesticide Transformation Products Suspect List
Out of the 386 compounds, 162 different pesticides were found (tentatively, at Level-2a or Level-3 confidence) in either one or more locations over six months. Since the manual curation of HSDB content is complex and time-consuming, only the 36 previously selected Level-2a pesticides (suspects with a MoNA score > 0.9) were selected (SM Table S6) for further retrieval of TP information from PubChem. Of the 36 pesticides, there were 30 that already had information in the "Transformations" section. In addition, 22 pesticides had further information in the HSDB Metabolism and Metabolites section, while no information was available for only 3 pesticides. There were 19 pesticides that had information in both the HSDB and "Transformations" section.
In the end, a new suspect list of 181 transformation products and their parent compounds was created, including the 36 parent compounds (the Level-2a cases identified earlier) and 173 TPs related to these 36 pesticides that were added in this step. Although the parent compounds were already analysed previously, they were retained for a direct comparison between the presence of the parent compounds and their TPs (see discussion). This table is given in the SM, Table S6.
After manual curation, the merged data file of TPs extracted from HSDB was added to Zenodo as HSDBTPS 42 and the newly generated information was also provided to PubChem as "Transformation" tables to update this section as well (also included in the Zenodo deposition). The HSDBTPS list is also available in CompTox. 45
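The coverage figures for the 36 pesticides are mutually consistent, as a short inclusion-exclusion check shows. The variable names below are chosen for illustration; the counts are those reported in the paragraph above.

```python
total_pesticides = 36
in_transformations = 30   # information in the "Transformations" section
in_hsdb = 22              # information in HSDB Metabolism and Metabolites
in_both = 19              # information in both sources

# Pesticides covered by at least one source (inclusion-exclusion)
with_any_information = in_transformations + in_hsdb - in_both
without_information = total_pesticides - with_any_information

print(with_any_information, without_information)
```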

Suspect Screening for the Pesticide TPs
Shinyscreen was run again for all samples with 181 pre-selected compounds (SM Table S7), resulting in a total of 19,548 cases. Of these, there were 1,275 cases in negative mode and 2,159 cases in positive mode that were able to pass the quality check. Since some suspects were detected in different locations in positive and negative ionization mode, these 3,434 cases corresponded to 99 transformation products (SM Table S8) and the 36 parent compounds (135 different compounds in total). The number of cases for each location and month is available in the SM, Figure S4.
The MS2 spectra of the 135 tentatively identified suspects were then processed using MetFrag with the same databases and scoring terms as before and the identification confidence levels were determined based on the MoNA scores (SM Figure S5). Out of the 3,434 cases, 1,190 achieved a MoNA score above 0.9, corresponding to eight unique additional TPs (SM Table S8).

Verification of the Tentative Candidates and Their Quantification
The 36 Level-2a pesticide identifications were selected for further confirmation efforts with reference standards (SM Table S9). Of these, 26 were verified using single standards and 10 compounds were verified with reference standards contained in the ENTACT mixtures (the work on the TPs had not yet been performed when this selection was made).
Out of the 36 parent compounds, there were 31 chemicals that achieved a Level-1, while five could not be confirmed (different retention times, see SM Table S9). Of the 31 Level-1 compounds, only 20 were present at quantifiable amounts (within the scope here), as presented in Fig. 7 (see also SM Table S10 and Table S11).
The classification and Luxembourgish permission information for the 20 quantified compounds are summarized in SM Figure S6.

Discussion
This work aims for a more dynamic experience of suspect screening in non-target environmental HR-MS measurements, using open cheminformatics approaches and tentative detections in samples, while using Luxembourgish river samples as an example. The discussion will look into how the coupling of parent and TP information can support interpretation using the example of terbutylazine, then look at the overall implications of these results for Luxembourg, before delving into the FAIRification of TP data and the implications for further efforts.

Example of Pesticide-TP Screening: Terbutylazine
The following example of terbutylazine and three TPs visualizes how the coupling of suspect screening for pesticides and transformation products can be automated and visualized in Shinyscreen. Figure 9 shows three different plots belonging to one parent compound (terbutylazine, top, suspect list ID N° 3) with three TPs, 2-hydroxyterbutylazine (ID N° 11), desethyl-2-hydroxyterbutylazine (ID N° 4), and desethylterbutylazine (ID N° 2).
The parent compound was found in the months May, July and September at the identification Level-2a, retention time of ~ 17.41 min, with two isobars found at ~ 16.00 and 14.63 min. These isobars are speculated to be other compounds in this case; MetFrag suggested for both the compound propazine, due to highest metadata scores with the selected scoring terms in CompTox (specifically due to higher toxicity concerns and some higher reference counts); propazine was also reported as a suspect by many in the 2015 NORMAN Collaborative Trial 46, although it has not been permitted for use for many years.
Interestingly, the use of PubChemLite with the optimized default scoring terms 26 resulted in terbutylazine appearing ahead of propazine in the metadata ranking; further addition of the "agrochemicals" category 26 helps up-prioritize the potentially most relevant alternative isobars for further consideration at a later stage (e.g. sebutylazine). The importance of the choice of the various CompTox metadata terms and the resulting consequences in interpretation are discussed in detail in Lai et al. 22 and thus not discussed further here.
One of the main TPs, desethylterbutylazine (ID N° 2, 4th chromatogram in Figure 9), involves the loss of the ethyl group and is detected at 15.6 min at high intensity in July and October. Since one ethyl is lost, a lower (but not dramatically lower) retention time than the parent would be expected on a reversed-phase column, thus the detection at 15.6 min is considered more plausible than other peaks reported at 9.0 min for other months. The fact that the TP peak does not occur at the same time as the parent rules out the possibility of in-source fragmentation from the parent. After the verification with reference standards, it became clear that the retention time of desethylterbutylazine is indeed 15.67 min (the isobar, simazine, was confirmed at an RT of 15.24 min, see SM Table S9). The third TP, desethyl-2-hydroxyterbutylazine (ID N° 4), is detected at 9 minutes in May, July, August and September (at Level-3), which coincides with parent detections plus another month where the parent was not detected. Since the chlorine has been replaced by an oxygen, combined with the loss of the ethyl group, the dramatic reduction of retention time relative to the parent is plausible, as both transformations increase the polarity and thus reduce the retention time. The last TP of terbutylazine is terbutylazine-2-hydroxy (ID N° 11), which also contains an oxygen instead of a chlorine. This compound was found in all months; since this TP can be a degradation product of different parent compounds (e.g. terbutylazine found at Level-3 and terbutryn found at Level-1, amongst others), it could be present due to transformation from both (see https://pubchem.ncbi.nlm.nih.gov/compound/135495928#section=Transformations).

Pesticides and TPs in Luxembourgish Surface Waters
The fact that half of the detected and quantified suspects are not permitted for use in Luxembourg (see SM Figure S6) will be investigated further by AGE. Several reasons could contribute to this: either these pesticides were allowed in the past and their presence is due to historical use; or these pesticides are applied without permission (considered unlikely based on the results here); moreover, five of the entries were TPs that are not permitted for use. Looking at the permission information of their parent compounds revealed that for some banned TPs (e.g. 2-hydroxyatrazine) the parent compound is banned as well (atrazine), but for others (e.g. desethylterbutylazine) the parent compound is permitted (terbutylazine). As an example, the low levels of atrazine detected here (< 100 ppt) are likely to be due to historical applications still seeping into the surface waters; fresh applications would likely yield higher levels.
As shown in Fig. 7 (all the concentrations are available in the SM, Table S10), the pesticide TP succinic acid was found in the highest concentrations (maximum concentration found: 773.52 parts per trillion = 0.77 µg/L) in the river samples. This high concentration is most probably due to the fact that this chemical has several "roles" in the environment and can come from both natural and anthropogenic sources. For instance, succinic acid is involved in several processes in the body (e.g., generated in mitochondria via the citric acid cycle) and is also a food additive 47; thus alternative sources are likely to be much higher contributors to the overall concentrations than this being a documented TP of the pesticides sulcotrione (present in the LUXPEST list but did not pass the pre-screening) and linuron (not present in the LUXPEST list). This shows the importance of having information about the multiple roles of chemicals available in an easily accessible and readable manner. The overall lowest concentrations were found for the compounds desethylatrazine, 2-hydroxyatrazine and simazine (minimal concentrations around 0.001 ng/L). Returning to the example from the section before (Sect. 4.1), desethylterbutylazine was confirmed in 8 out of 9 river samples (except for the river Alzette from Mersch-Berschbach), in all the 6 months (SM Table S10).
As shown in Fig. 8, the overall lowest average number of compounds were found in the rivers Eisch and Sûre, which is reassuring in the context of Luxembourg as about one-third of the drinking water originates in the river Sûre. 48 The temporal patterns (Fig. 8B) show that there is a spike in detections in late spring/beginning of the summer, with an additional smaller spike in September. The overall lowest average number of compounds was found in April, reflecting the expected seasonality of the pesticide application. All screening results presented here have been communicated to AGE for consideration in their subsequent monitoring efforts; while this article presents the results from April-October 2019, these collaborative non-target screening efforts are also still continuing.

Pre-screening and Annotation Workflow
During pre-screening, all the files were loaded into Shinyscreen, corresponding to a total of 41,688 cases and graphs to analyse (386 pesticides times two modes times six months times nine locations: 386 x 2 x 6 x 9 = 41,688). Manual inspection revealed that the majority of cases produced an empty graph, leading to the conclusion that most suspects were not present in the samples. This demonstrates the need for such a semi-automated procedure, since it makes visualizing and checking the experimental data very efficient and easy. In the end, 3,006 cases passed the quality checks and were retained, leading to a final set of 162 different tentatively identified compounds. This means that 42 % of the compounds that were screened with Shinyscreen may be present in at least one of the samples.
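The case count above can be reproduced with a short sketch. This is only an illustration of the bookkeeping, not part of Shinyscreen; the function name and the placeholder labels are assumptions, and only the counts (386, 2, 6, 9 and 162) are taken from the text.

```python
from itertools import product

# Counts taken from the text; the labels themselves are placeholders.
N_SUSPECTS = 386
MODES = ("positive", "negative")
N_MONTHS = 6
N_LOCATIONS = 9


def enumerate_cases(n_suspects, modes, n_months, n_locations):
    """Return an iterator with one (suspect, mode, month, location) tuple per case."""
    return product(range(n_suspects), modes, range(n_months), range(n_locations))


# Every combination is one extracted-ion chromatogram to inspect.
total_cases = sum(1 for _ in enumerate_cases(N_SUSPECTS, MODES, N_MONTHS, N_LOCATIONS))

# Share of screened suspects with at least one case passing quality control.
retained_compounds = 162
fraction_retained = retained_compounds / N_SUSPECTS

print(total_cases)                 # 41688
print(f"{fraction_retained:.0%}")  # 42%
```

Enumerating the combinations explicitly, rather than just multiplying, mirrors how each case corresponds to one graph that has to be checked.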
Some of these 162 compounds were detected in multiple locations, and comparing the retention times between locations revealed two general trends. The first trend is a subtle difference (e.g., ± 0.5 min) in the retention times, which is probably the consequence of fluctuations in the liquid chromatography. The second trend is a wide difference in retention times (several minutes), leading to the conclusion that only one of these signals could potentially belong to the suspect, whereas the other signals most likely belong to different (isobaric, i.e., same mass) substances. For example, Shinyscreen suggested that the compounds 3-hydroxybenzoic acid and 4-hydroxybenzoic acid (both isobaric) were present in the samples, with an automatically retrieved retention time of 14.89 min (the default behaviour extracts the retention time of the most intense peak). However, verification with reference standards showed that the compound in the sample was salicylic acid, since only the reference standard for this compound had a retention time of 14.9 min, while those of 3-hydroxybenzoic acid and 4-hydroxybenzoic acid differed (12.04 min and 10.83 min, respectively). Shinyscreen has subsequently been upgraded to offer more extensive isobar handling during pre-screening (release 1.0.0, 2nd April 2021); the MetFrag post-processing has been updated correspondingly and, as discussed above, the metadata scoring terms integrated into PubChemLite have also made data interpretation of relevant isobars both easier and more powerful 26 .
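The isobar example can be expressed as a simple retention time comparison. The sketch below is a hypothetical helper, not the Shinyscreen implementation; the 0.5 min tolerance is an assumption borrowed from the "± 0.5 min" example above, and the retention times are the values quoted in the text.

```python
# Assumed tolerance, borrowed from the "± 0.5 min" example in the text.
RT_TOLERANCE_MIN = 0.5

# Reference standard retention times (min) as quoted in the text.
reference_rt_min = {
    "salicylic acid": 14.90,
    "3-hydroxybenzoic acid": 12.04,
    "4-hydroxybenzoic acid": 10.83,
}


def match_isobars(observed_rt, references, tolerance=RT_TOLERANCE_MIN):
    """Return the reference standards whose retention time lies within
    `tolerance` minutes of the observed retention time."""
    return sorted(
        name for name, rt in references.items()
        if abs(rt - observed_rt) <= tolerance
    )


# Retention time retrieved automatically for the most intense peak.
matches = match_isobars(14.89, reference_rt_min)
print(matches)  # ['salicylic acid']
```

With these values, only salicylic acid falls within the tolerance, which matches the outcome of the verification with reference standards.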
During the analysis of the MetFrag results, the months, modes and locations were considered together. First, the MoNA score was investigated: out of the 3,006 cases, 719 obtained a very good, 118 a good, and 663 a poor MoNA score. Additionally, in 1,506 cases the MoNA score was equal to 0 (no spectrum matching or available in the library). Consequently, an identification of Level-2a can be achieved for 719 cases, while the remaining 2,287 cases attain Level-3 (Fig. 6). At the level of unique pesticides, out of the 162 pesticides, there are 140 that remain at an identification of Level-3, while 36 obtained a Level-2a based on MoNA scores and further metadata analyses (SM Table S5).
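The mapping from MoNA score category to identification level can be sketched as follows. The numeric MoNA thresholds behind "very good", "good" and "poor" are not given in this section, so the sketch works on the category labels only; the function name is hypothetical and the counts are those reported above.

```python
from collections import Counter

# Case counts per MoNA score category, as reported in the text.
mona_category_counts = {
    "very good": 719,
    "good": 118,
    "poor": 663,
    "no match": 1506,  # MoNA score equal to 0
}


def identification_level(category):
    """Only a very good library match is treated as Level-2a here;
    every other case stays at Level-3."""
    return "Level-2a" if category == "very good" else "Level-3"


level_counts = Counter()
for category, n_cases in mona_category_counts.items():
    level_counts[identification_level(category)] += n_cases

print(dict(level_counts))          # {'Level-2a': 719, 'Level-3': 2287}
print(sum(level_counts.values()))  # 3006
```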
For the TPs, 19,548 cases and graphs (181 suspects times two modes times six months times nine locations) were analyzed. Of these, 3,434 cases passed the quality check and were kept for further analysis. This led to a final number of 99 newly identified compounds (135 compounds in total − 36 known pesticides = 99 new compounds). Among these 99 compounds (i.e., excluding the 36 parent compounds), eight TPs had a very good, one a good, and nine a poor MoNA score.
The remaining 81 compounds (out of 99) had no spectrum available in MoNA, showing the importance of additional community contributions to open resources to help fill these data gaps in the future.
For the tentative identification with MetFrag, only the spectral-based scoring terms were investigated here, namely the MetFrag in silico fragmentation and primarily the MoNA similarity score. None of the additional metadata scores were used, as prioritization was done purely based on achieving a very good MoNA score for highest confidence. The work described here also helped contribute to the conceptual design of the PubChemLite for Exposomics collection, where the category of chemical (e.g. agrochemical/pesticide or pharmaceutical) can be used in interpretation and even scoring. The performance described elsewhere 26 demonstrated that the interpretation of results can be improved with this additional information, achieving up to 90 % annotation success for the agrochemicals (pesticides) in the benchmarking set. Efforts are underway to streamline the coupling of suspect + TP screening together with Shinyscreen, MetFrag and PubChemLite in a smooth workflow on the foundation of the work described here, including the collapsing of many "Cases" into unique compounds much earlier in the workflow.

Open Pesticide and Transformations Data
Out of the 386 selected pesticides, 196 are permitted and 169 are forbidden in Luxembourg (SM Table S4); the pesticides could be classified into six main categories (SM Figure S1). This information can be browsed in PubChem under LUXPEST at https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101 (SM Figure S2) and is incorporated into the individual records in PubChem (example in SM Figure S3). This information flow helps create the annotation categories that form the PubChemLite for Exposomics collection (see Schymanski et al. 26 Fig. 1) and provides PubChem users with additional expert knowledge for interpretation of their results. Ensuring this continual flow of information is a major motivating factor for increasing the FAIRness of datasets and thus for the upload of the datasets to different open access databases (CompTox, PubChem) and repositories (NORMAN-SLE, Zenodo), as well as for the integration of the classification (SM Figure S2) and regulation information in Luxembourg into PubChem.
Since the NORMAN-SLE compound lists are "FAIR" due to the Zenodo deposition with explicit license declaration, they can be used by PubChem directly to create automatic workflows to build the Transformations section; other users and resources are also able (and encouraged) to re-use this data as they wish. By adding chemical identifiers to the historical information retrieved from the HSDB via text-mining methods and adding this as a new suspect list to the NORMAN-SLE, the original source (HSDB) can be credited and the value-added data fed back into PubChem as transformations for improved automated retrieval in future screening activities, so that this information is now available in both human- and machine-readable forms.
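To illustrate how machine-readable transformations support automated retrieval, the sketch below looks up TPs for a set of parent compounds in a small predecessor/successor table. The rows are illustrative examples only (the succinic acid pairs are mentioned earlier in the text; the atrazine pairs are well-known transformations added for illustration) and are not copied from the HSDBTPS table; the function name and table layout are assumptions.

```python
from collections import defaultdict

# Illustrative (predecessor, successor) rows; NOT the actual HSDBTPS content.
transformations = [
    ("sulcotrione", "succinic acid"),
    ("linuron", "succinic acid"),
    ("atrazine", "desethylatrazine"),
    ("atrazine", "2-hydroxyatrazine"),
]


def tps_for_parents(rows, parents):
    """Collect the transformation products recorded for each requested parent."""
    by_parent = defaultdict(set)
    for predecessor, successor in rows:
        if predecessor in parents:
            by_parent[predecessor].add(successor)
    return {parent: sorted(tps) for parent, tps in by_parent.items()}


# Hypothetical set of high confidence parent hits to expand into TP suspects.
tp_suspects = tps_for_parents(transformations, {"atrazine", "sulcotrione"})
print(tp_suspects["atrazine"])     # ['2-hydroxyatrazine', 'desethylatrazine']
print(tp_suspects["sulcotrione"])  # ['succinic acid']
```

Restricting the lookup to high confidence parent hits keeps the TP suspect list small, in line with the two-step screening strategy described in this work.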
Several transformations tables have now been added to PubChem, including HSDBTPS as a part of this work. The manual curation of the text-mined information was the most time-consuming part of this process and was thus only performed on the 36 Level-2a pesticides that were selected from the first analysis due to their very good MoNA score. Of these, it was possible to generate transformation products for 33 compounds (no compounds were found in HSDB or the "Transformation" [...]
This study describes open cheminformatics approaches to screen for emerging contaminants (in this case pesticides) and their TPs in non-target HR-MS measurements. [...] annotation approaches such as MetFrag will pave the way for higher throughput screening of exposomics samples in many contexts, as showcased here for pesticides.
In terms of local outcomes, these efforts (and parallel efforts investigating other substance classes) are continuing and the results are being exchanged with AGE to help improve monitoring efforts and thus human and environmental health in Luxembourg, above and beyond the current EU requirements.

Supplementary Material
Two supplementary data files are provided: a document containing Figures S1 to S8, and an Excel file containing Tables S1 to S11. For details about the code, software, suspect list and raw file availability, see the Data Statement.

Figure 1
The newly created high-throughput suspect screening workflow, including experimental (top, grey) and computational steps. Both suspect and target screening were performed.

Figure 5
The results of pre-screening with Shinyscreen, showing how many pesticides passed the quality check for each sampling location and per month (positive and negative modes are visualized together).

Figure 6
The results of MetFrag spectra annotation. The graph represents the 3,006 cases (162 pesticides) regrouped according to the four MoNA score scenarios for the six months (positive and negative mode together).

Figure 7
[...] Concentration values that were below the respective quantification range were excluded. All compounds were measured in positive mode except for those marked with an asterisk, which were measured in negative mode.

Figure 8
The spatial (A) and temporal (B) distribution of the tentatively detected pesticides and transformation products as well as of the verified and quantified compounds. No samples were available for June.