This work aims for a more dynamic experience of suspect screening in non-target environmental HR-MS measurements, using open cheminformatics approaches and tentative detections in samples, while using Luxembourgish river samples as an example. The discussion will look into how the coupling of parent and TP information can support interpretation using the example of terbutylazine, then look at the overall implications of these results for Luxembourg, before delving into the FAIRification of TP data and the implications for further efforts.
4.1 Example of Pesticide-TP Screening: Terbutylazine
The following example of terbutylazine and three TPs visualizes how the coupling of suspect screening for pesticides and transformation products can be automated and visualized in Shinyscreen. Figure 9 shows three different plots belonging to one parent compound (terbutylazine, top, suspect list ID N° 3) with three TPs, 2-hydroxyterbutylazine (ID N° 11), desethyl-2-hydroxyterbutylazine (ID N° 4), and desethylterbutylazine (ID N° 2).
The parent compound was found in May, July and September at identification Level-2a, with a retention time of ~17.41 min and two isobars at ~16.00 and 14.63 min. These isobars are presumed to be other compounds; for both, MetFrag suggested propazine, which had the highest metadata scores with the selected scoring terms in CompTox (specifically due to higher toxicity concerns and some higher reference counts). Propazine was also reported as a suspect by many in the 2015 NORMAN Collaborative Trial46, although it has not been permitted for use for many years. Interestingly, the use of PubChemLite with the optimized default scoring terms26 resulted in terbutylazine appearing ahead of propazine in the metadata ranking; further addition of the "agrochemicals" category26 helps up-prioritize the potentially most relevant alternative isobars for further consideration at a later stage (e.g. sebutylazine). The importance of the choice of the various CompTox metadata terms and the resulting consequences for interpretation are discussed in detail in Lai et al.22 and thus not discussed further here.
One of the main TPs, desethylterbutylazine (ID N° 2, 4th chromatogram in Figure 9), involves the loss of the ethyl group and is detected at 15.6 min at high intensity in July and October. Since one ethyl group is lost, a lower (but not dramatically lower) retention time than the parent would be expected on a reverse phase column; the detection at 15.6 min is therefore considered more plausible than the other peaks reported at 9.0 min for other months. The fact that the TP peak does not elute at the same time as the parent rules out the possibility of in-source fragmentation of the parent. Verification with reference standards confirmed that the retention time of desethylterbutylazine is indeed 15.67 min (the isobar, simazine, was confirmed at an RT of 15.24 min, see SM Table S9). The second TP, desethyl-2-hydroxyterbutylazine (ID N° 4), is detected at 9 min in May, July, August and September (at Level-3), i.e. in the months where the parent was detected plus one further month where it was not. Since the chlorine has been replaced by an oxygen, in addition to the loss of the ethyl group, the dramatic reduction of retention time relative to the parent is plausible, as both transformations increase the polarity and thus reduce the retention time. The last TP of terbutylazine is 2-hydroxyterbutylazine (terbutylazine-2-hydroxy, ID N° 11), which also contains an oxygen instead of a chlorine. This compound was found in all months. Since this TP can be a degradation product of several parent compounds (e.g. terbutylazine, found at Level-3, and terbutryn, found at Level-1, amongst others), it could be present due to transformation from both, see https://pubchem.ncbi.nlm.nih.gov/compound/135495928#section=Transformations.
4.2 Pesticides and TPs in Luxembourgish Surface Waters
The fact that half of the detected and quantified suspects are not permitted for use in Luxembourg (see SM Figure S6) will be investigated further by AGE. Several reasons could contribute to this: these pesticides may have been allowed in the past, so that their presence is due to historical use; they may be applied without permission (considered unlikely based on the results here); and five of the entries were TPs that are not themselves permitted for use. Looking at the permission information of their parent compounds revealed that for some banned TPs (e.g. 2-hydroxyatrazine) the parent compound is banned as well (atrazine), but for others (e.g. desethylterbutylazine) the parent compound is permitted (terbutylazine). As an example, the low levels of atrazine detected here (< 100 ppt) are likely to be due to historical applications still seeping into the surface waters; fresh applications would likely yield higher levels.
As shown in Fig. 7 (all the concentrations are available in the SM, Table S10), the pesticide TP succinic acid was found at the highest concentrations in the river samples (maximum concentration found: 773.52 parts per trillion = 773.52 ng/L ≈ 0.77 µg/L). This high concentration is most probably due to the fact that this chemical has several "roles" in the environment and can come from both natural and anthropogenic sources. For instance, succinic acid is involved in several processes in the body (e.g., generated in mitochondria via the citric acid cycle) and is also a food additive47; thus alternative sources are likely to be much higher contributors to the overall concentrations than its documented formation as a TP of the pesticides sulcotrione (present in the LUXPEST list but did not pass the pre-screening) and linuron (not present in the LUXPEST list). This shows the importance of having information about the multiple roles of chemicals available in an easily accessible and readable manner. The overall lowest concentrations were found for the compounds desethylatrazine, 2-hydroxyatrazine and simazine (minimal concentrations around 0.001 ng/L). Returning to the example from the previous section (Sect. 4.1), desethylterbutylazine was confirmed in 8 out of 9 river samples (all except the river Alzette from Mersch-Berschbach), in all 6 months (SM Table S10).
As shown in Fig. 8, the overall lowest average number of compounds was found in the rivers Eisch and Sûre, which is reassuring in the context of Luxembourg as about one-third of the drinking water originates in the river Sûre.48
The temporal patterns (Fig. 8B) show a spike in detections in late spring and early summer, with an additional smaller spike in September. The overall lowest average number of compounds was found in April, reflecting the expected seasonality of pesticide application. All screening results presented here have been communicated to AGE for consideration in their subsequent monitoring efforts; while this article presents the results from April-October 2019, these collaborative non-target screening efforts are still continuing.
4.3 Pre-screening and Annotation Workflow
During pre-screening, all the files were loaded into Shinyscreen, corresponding to a total of 41,688 cases and graphs that were analysed (386 pesticides times two modes times six months times nine locations: 386 x 2 x 6 x 9 = 41,688). The manual inspection revealed that the majority of cases yielded an empty graph, leading to the conclusion that most suspects were not present in the samples. This demonstrates the need for such a semi-automated procedure, since it makes visualizing and checking the experimental data very efficient and easy. In the end, 3,006 cases passed the quality checks and were retained, leading to a final set of 162 different tentatively identified compounds. This means that 42 % of the compounds that were screened with Shinyscreen may be present in at least one of the samples.
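The case arithmetic above can be reproduced with a minimal Python sketch. The counts are taken from the text; the mode labels ("positive", "negative") and the function name are assumptions made purely for illustration and are not part of Shinyscreen.

```python
from itertools import product

# Dimensions as stated in the text; mode labels are illustrative assumptions.
N_SUSPECTS = 386
MODES = ("positive", "negative")
N_MONTHS = 6
N_LOCATIONS = 9


def count_cases(n_suspects, modes, n_months, n_locations):
    """Count every suspect x mode x month x location combination."""
    combos = product(range(n_suspects), modes, range(n_months), range(n_locations))
    return sum(1 for _ in combos)


total_cases = count_cases(N_SUSPECTS, MODES, N_MONTHS, N_LOCATIONS)

# Share of screened suspects that were tentatively identified at least once.
retained_compounds = 162
share_percent = round(100 * retained_compounds / N_SUSPECTS)

print(total_cases, share_percent)
```

Enumerating the combinations explicitly, rather than only multiplying the four numbers, mirrors the fact that each case corresponds to one graph that has to be inspected.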
Some of these 162 compounds were detected in multiple locations, and comparing the retention times between locations revealed two general trends. The first trend is a subtle difference (e.g., ± 0.5 min) in the retention times, which is probably the consequence of fluctuations in the liquid chromatography. The second trend is a wide difference in retention times (several minutes), leading to the conclusion that only one of these signals could potentially belong to the suspect, whereas the other signals most likely belong to different isobaric (i.e., same mass) substances. For example, Shinyscreen suggested that the compounds 3-hydroxybenzoic acid and 4-hydroxybenzoic acid (both isobaric) are present in the samples, with an automatically retrieved retention time of 14.89 min (the default behaviour extracts the retention time of the most intense peak). However, verification with reference standards showed that the compound in the sample was salicylic acid, since only the reference standard for this compound had a retention time of 14.9 min, while those of 3-hydroxybenzoic acid and 4-hydroxybenzoic acid differed (12.04 min and 10.83 min, respectively). Shinyscreen has subsequently been upgraded to offer more extensive isobar handling during pre-screening (release 1.0.0, 2nd April 2021); the MetFrag post-processing has also been correspondingly updated and, as discussed above, the metadata scoring terms integrated into PubChemLite have also made data interpretation of relevant isobars both easier and more powerful26.
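The isobar example can be illustrated with a minimal Python sketch that compares an observed retention time against those of reference standards. The retention times are taken from the text; the function name and the use of the ± 0.5 min fluctuation as a matching tolerance are assumptions for illustration, not the logic implemented in Shinyscreen.

```python
def match_by_retention_time(observed_rt, reference_rts, tolerance=0.5):
    """Return the reference standards whose retention time (min) lies
    within `tolerance` minutes of the observed retention time."""
    return [
        name
        for name, rt in reference_rts.items()
        if abs(rt - observed_rt) <= tolerance
    ]


# Retention times (min) of the three isobaric reference standards.
reference_rts = {
    "salicylic acid": 14.9,
    "3-hydroxybenzoic acid": 12.04,
    "4-hydroxybenzoic acid": 10.83,
}

# Retention time automatically retrieved for the sample peak.
matches = match_by_retention_time(14.89, reference_rts)
print(matches)
```

Under these assumptions only salicylic acid falls within the tolerance, while both suspects originally suggested lie several minutes away.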
During the analysis of the MetFrag results, the months, modes and locations were considered together. First, the MoNA score was investigated: out of the 3,006 cases, 719 obtained a very good, 118 a good, and 663 a poor MoNA score. Additionally, in 1,506 cases the MoNA score was equal to 0 (no spectrum matching or available in the library). Consequently, an identification of Level-2a can be achieved for 719 cases, while the remaining 2,287 cases attain Level-3 (Fig. 6). When looking at the level of unique pesticides, out of the 162 pesticides, there are 140 that remain at an identification of Level-3, while 36 obtained a Level-2a based on MoNA scores and further metadata analyses (SM Table S5).
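The case-level tally can be checked with a minimal Python sketch. The counts per MoNA score category come from the text; the rule that only a "very good" score yields Level-2a follows the description above, while the category keys and the function name are illustrative assumptions. The numeric score thresholds behind each category are not given here and are therefore not modelled.

```python
# Number of cases per MoNA score category, as reported in the text.
mona_counts = {"very good": 719, "good": 118, "poor": 663, "none": 1506}


def assign_level(category):
    """Map a MoNA score category to an identification confidence level."""
    return "Level-2a" if category == "very good" else "Level-3"


# Aggregate the case counts per identification level.
level_counts = {}
for category, n_cases in mona_counts.items():
    level = assign_level(category)
    level_counts[level] = level_counts.get(level, 0) + n_cases

total_cases = sum(mona_counts.values())
print(level_counts, total_cases)
```

The four categories sum to the 3,006 retained cases, and the three categories other than "very good" account for the 2,287 Level-3 cases.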
For the TPs, 19,548 cases and graphs were analyzed (181 compounds times two modes times six months times nine locations). Out of these, 3,434 cases passed the quality check and were kept for further analysis. This leads to a final number of 99 newly identified compounds (135 compounds in total − 36 known pesticides = 99 new compounds). With the 36 parent compounds excluded, there were eight TPs with a very good, one with a good, and nine with a poor MoNA score. The remaining 81 compounds (out of 99) had no spectrum available in MoNA, showing the importance of additional community contributions to open resources to help fill these data gaps in the future.
For the tentative identification with MetFrag, only the spectral-based scoring terms were investigated here, namely the MetFrag in silico fragmentation and primarily the MoNA similarity score. None of the additional metadata scores were used, as prioritization was done purely based on achieving a very good MoNA score for highest confidence. The work described here also helped contribute to the conceptual design of the PubChemLite for Exposomics collection, where the category of chemical (e.g. agrochemical/pesticide or pharmaceutical) can be used in interpretation and even scoring. The performance described elsewhere26 demonstrated that the interpretation of results can be improved with this additional information, achieving up to 90 % annotation success for the agrochemicals (pesticides) in the benchmarking set. Efforts are underway to streamline the coupling of suspect + TP screening together with Shinyscreen, MetFrag and PubChemLite in a smooth workflow on the foundation of the work described here, including the collapsing of many "Cases" into unique compounds much earlier in the workflow.
4.4 Open Pesticide and Transformations Data
Out of the 386 selected pesticides, 196 are permitted and 169 are forbidden in Luxembourg (SM Table S4) and could be classified into six main categories (SM Figure S1). This information can be browsed in PubChem under LUXPEST at https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101 (SM Figure S2) and this information is incorporated into the individual records in PubChem (Example in the SM Figure S3). This information flow helps create the annotation categories that form the PubChemLite for Exposomics collection (see Schymanski et al.26 Fig. 1) and provide PubChem users with additional expert knowledge for interpretation of their results. Ensuring this continual flow of information is a major motivating factor for increasing the FAIRness of datasets and thus the upload of the datasets to different open access databases (CompTox, PubChem) and repositories (NORMAN-SLE, Zenodo), as well as the integration of the classification (SM Figure S2) and regulation information in Luxembourg into PubChem. Since the NORMAN-SLE compound lists are “FAIR” due to the Zenodo deposition with explicit license declaration, they can be used by PubChem directly to create automatic workflows to build the Transformations section; other users and resources are also able (and encouraged) to re-use this data as they wish. By adding chemical identifiers to the historical information retrieved from the HSDB via text-mining methods and adding this as a new suspect list to the NORMAN-SLE, the original source (HSDB) can be credited, and the value-added data fed back into PubChem as transformations for improved automated retrieval in future screening activities, so that this information is now available in both human and machine-readable forms.
Several transformations tables have now been added to PubChem, including HSDBTPS as a part of this work. The manual curation involved with the text-mined information was the most time-consuming part of this process and was thus only performed on the 36 Level-2a pesticides that were selected from the first analysis due to their very good MoNA score. Of these, it was possible to generate transformation products for 33 compounds (no compounds were found in HSDB or the “Transformation” table in PubChem for the remaining three compounds). There were 22 entries from HSDB extracted and manually curated (files available from GitLab49), resulting in 226 new transformation reactions with full literature provenance, and five new structural records in PubChem (CIDs 146035700, 146035701, 146035702, 146035703 and 146037633). In the end, a total of 145 transformation products were added to the 36 pesticides, which results in a suspect list of 181 compounds. Since this work was performed, several other datasets have been added to the Transformations tables, including MetXBioDB50 from BioTransformer51, and it is highly likely that the number of pesticide TPs retrieved for screening would be higher now.
This work was only possible through the exchange of information between the NORMAN-SLE and PubChem and, at this pilot stage, willingness on both sides to develop unconventional workflows not originally foreseen for either resource. While the R scripts developed are certainly functional, several optimizations are possible. In hindsight, the created workflow with this integrated script helped the authors discover and upload relationships between pesticides and their TPs to PubChem as well as identifying areas to improve the information flow in the future. Future efforts are already underway to streamline this further based on this pilot project, to develop even more automated forms of this workflow and to ensure easy, fast and accurate suspect and TP list generation from their parent compounds. All data transfer between the NORMAN-SLE and PubChem includes full provenance to the original literature sources. Since all "Transformations" entries were based on existing suspect lists or resources, it is quite resource intensive to add existing knowledge involving only a few entries. As a result, a new list, REFTPS52 (currently only with very few entries) has been created to provide a pathway to add single or small numbers of transformations resulting from individual studies, such as 6PPD-quinone from Tian et al.53 Overall, these pilot efforts have already caught the interest of several other workflows and are being integrated into the open source HR-MS workflow patRoon18, amongst others.