Creating “PubChemLite” for Exposomics
Since a very large proportion of the PubChem database (> 60%) is sourced from purchasable screening libraries of chemical vendors, where chemicals are generally produced in relatively small amounts (e.g. mg) in a laboratory setting, the vast majority of these entries are highly unlikely to be detectable in either environmental or biological samples. Thus, instead of the current status quo of searching the entire PubChem database and using metadata scores to “up-prioritize” interesting candidates (i.e., processing tens of thousands of candidates per mass to obtain only tens to hundreds of interesting entries), the first step was to create relevant subsets of PubChem for more efficient queries. This was done by selecting relevant sections of the “PubChem Compound Table of Contents” (PubChem Compound TOC) Classification [33], as shown in Fig. 1. Further details are given in the Methods section.
Initially, two versions of PubChemLite were created. The environmental selection (PubChemLite tier0) was formed of the yellow-shaded categories in Fig. 1, shortened to “AgroChemInfo, DrugMedicInfo, FoodRelated, PharmacoInfo, SafetyInfo, ToxicityInfo, KnownUse”, whereas the exposomics selection (PubChemLite tier1) had the additional purple-shaded category, shortened to “BioPathway”, which contained the additional biological information categories relevant to metabolomics and exposomics. Entries were merged by InChIKey first block (the structural skeleton), and total patent counts and literature counts were calculated over the merged entries (full details in the Methods section). Each category was added as an additional column, in which each entry was assigned a (merged) count of the sub-categories; a total annotation count column was also added, summing the presence in top-level categories only (for further details, see Methods). The initial versions (20 November 2019 [34] / 14 January 2020 [35]) contained 315,843 / 316,810 entries in tier0 (environmental collection) and 361,976 / 363,911 entries in tier1 (exposomics). In other words, the 103 M entries of PubChem (at the time) were collapsed down to two datasets of approximately 316 K and 360 K compounds. An RMarkdown file to visualize the content (categories and subcategories) of PubChemLite as an interactive sunburst plot (for a static version see Fig. 2), using the 14 January 2020 tier1 version, is included as Additional File 1 and is also available on the ECI GitLab pages [36, 37]; further details are in the Methods section below.
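The merging step described above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming hypothetical record fields and a reduced set of two categories; the field names and counts are not the actual PubChemLite schema or data.

```python
from collections import defaultdict

# Hypothetical per-entry records; field names and values are illustrative,
# not the actual PubChemLite schema.
records = [
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "patents": 100, "literature": 10,
     "AgroChemInfo": 0, "DrugMedicInfo": 3},
    {"inchikey": "BSYNRYMUTXBXSQ-ZZZZZZZZZZ-N", "patents": 20, "literature": 2,
     "AgroChemInfo": 0, "DrugMedicInfo": 1},   # hypothetical stereo variant
    {"inchikey": "RZVAJINKPMORJF-UHFFFAOYSA-N", "patents": 50, "literature": 5,
     "AgroChemInfo": 0, "DrugMedicInfo": 2},
]

CATEGORIES = ["AgroChemInfo", "DrugMedicInfo"]  # reduced set, for illustration

def merge_by_first_block(records):
    """Collapse entries sharing an InChIKey first block (the first 14
    characters, i.e. the structural skeleton), summing counts per field."""
    merged = defaultdict(lambda: defaultdict(int))
    for rec in records:
        block = rec["inchikey"][:14]
        for field in ["patents", "literature"] + CATEGORIES:
            merged[block][field] += rec[field]
    # Total annotation count: presence in top-level categories only
    for counts in merged.values():
        counts["AnnoTypeCount"] = sum(1 for c in CATEGORIES if counts[c] > 0)
    return {block: dict(counts) for block, counts in merged.items()}

merged = merge_by_first_block(records)
```

Here the two stereo variants of the first skeleton collapse into one entry with summed patent and literature counts, mirroring how PubChem entries were merged down to the PubChemLite compound lists.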
A benchmark dataset of 977 de-duplicated compounds (see Additional File 2) was created by merging chemicals from previous evaluations [16, 25] (predominantly environmentally relevant), as described in the Methods. MetFrag was run with different versions of PubChemLite as well as CompTox (7 March 2019 release [38]) using comparable scoring terms. A summary of the results is shown in Fig. 3, including calculations both without (green) and with (blue) the use of MS/MS information (in silico fragmentation score and MS library matching scores). Further parameter details are given in the Methods section, with tables included in Additional File 3. Overall, CompTox and PubChemLite perform comparably; initially CompTox had fewer missing entries (grey shading) due to earlier concerted efforts to add compounds of environmental interest, including transformation products (these gaps may well be smaller with the new data release). The corresponding gaps in PubChemLite were closed progressively, as described in the next section, “Identifying and Filling Gaps in PubChem Annotation Content”. Furthermore, early results (see Additional File 3, Figures S1 and S2, Tables S1 and S2) showed that the two versions of PubChemLite, tier0 and tier1, performed almost identically even on environmental substances of interest, such that ultimately a single “PubChemLite” for exposomics was created, equivalent to tier1 plus the two additional categories shown in Fig. 1 [39]. Results from this version are also shown in Fig. 3.
The results in Fig. 3 show that, while annotation information alone leads to good ranking performance (~ 70–73% ranked first, dark green shaded results), the MS/MS information is essential for further improvements (~ 79–83% ranked first, dark blue shaded results); this is discussed further below. The PubChemLite results for the two initial versions (20 November 2019 and 14 January 2020) also clearly show that ~ 8% of the benchmark compounds were missing from PubChemLite. A detailed interrogation of the benchmark set of 977 reference standards from Eawag and UFZ revealed that, as commented on by the community over many years, detailed annotation information was missing for well-known, relevant transformation products in PubChem. This accounted for 37 of the 57 missing entries in the 14 January 2020 tier0 version and is discussed further in the next section.
Identifying and Filling Gaps in PubChem Annotation Content
During previous evaluations of MetFrag specifically [25], and of in silico identification approaches for HR-MS in general (e.g. during CASMI [16]), the focus has generally been on evaluating the methods themselves, aiming for objective evaluation. The use of identification approaches in typical real-life scenarios, however, often requires a degree of subjectivity to provide interpretation, not just identification. Thus, the material in this article should not be viewed as an evaluation of MetFrag itself (which has not changed), but rather as a demonstration of how improving the underlying database and associated functionality can improve outcomes for users (i.e. the ability to find relevant chemicals) in the context of exposomics. In other words, this has been an opportunity to investigate and improve the annotation content (i.e. information content beyond structural properties) in PubChem for exposomics.
As Fig. 3 reveals, 57 chemicals from the benchmark set were missing in the early versions of PubChemLite, many of which were well-known transformation products in environmental studies. Since adding annotation content also requires sufficient provenance and evidence to support the annotation, the NORMAN-SLE [29, 43], which now has its own Classification Browser [44] in PubChem (see Fig. 4), was browsed for suitable suspect lists containing annotation content. Initial efforts concentrated on list S60 (SWISSPEST19) [45], a list of pesticides and transformation products/metabolites documented by Kiefer et al. [46]. This list contained parent-transformation product mappings, plus links to information about agrochemical use (since the focus was on pesticides). The list was modified into a “predecessor / successor” mapping form (to avoid terminology clashes with other sections of PubChem) and added, with full provenance, into a new “Transformations” section in the individual PubChem records (see Fig. 5). Accompanying statements on “Agrochemical Transformations” were also added within the agrochemical sections, for example “Folpet has known environmental transformation products that include Phthalimide, Phthalamic acid, and Phthalic acid” [47]. The PubChemLite version created 22 May 2020 [48] included these new annotations, with fewer missing entries and slightly better ranks (see Fig. 3). Since this effort focused only on the agrochemicals (pesticides), the many pharmaceutical (and other) transformation products among the Eawag dataset were still missing. While these are all present in MassBank [49] (S1 in the NORMAN-SLE [50]), that dataset does not come with appropriate annotation content or provenance. Instead, the Supporting Information from Schollee et al. [51] provided suitable parent-TP mappings to create the predecessor-successor tables, which were merged with the Eawag classification information (with permission and support from Juliane Hollender) and added as list S66 [52]. This collection, together with list S68 HSDBTPS [53], resulted in the greater coverage of the June 2020 [48] and October 2020 [39] versions (see Fig. 3), with only 16 missing entries (15 in October) remaining. These remaining entries could not be clearly related to any specific NORMAN-SLE lists to add further annotation content at this stage, although annotation content is being added progressively in separate efforts, as is evident from the one fewer missing entry in October.
Leveraging Annotation Content in Exposomics
The results presented in Fig. 3 were based on rather generic metadata terms (literature counts, patent counts, total annotation counts). However, one aim of setting up PubChemLite was not only to merge several “useful” categories for exposomics, but to leverage the information within these categories (providing interpretation about candidates in candidate sets). The smallest annotation category in PubChemLite, the agrochemicals, was taken as an additional benchmarking dataset (1336 chemicals, 22 Jan 2020, see Additional File 4) to investigate the influence of database size and the additional scoring terms on the ranking results. Since this was intended to mimic an environmental investigation focused on detecting agrochemicals (i.e. a “suspect screening” approach [7]), the “agrochemical score”, i.e. how many agrochemical categories exist in PubChem for that chemical, was used as an additional scoring term in MetFrag (details in the Methods). The results are shown as the green entries in Fig. 6; the exact numbers are given in Additional File 3 (Table S3).
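The effect of such a category-based scoring term can be sketched as follows. This is not MetFrag's actual implementation; it is a minimal illustration of the general idea, in which each term is normalised by its maximum over the candidate list and the candidates are ranked by a weighted sum. The two candidates and their counts are invented for illustration.

```python
# Illustrative sketch of category-based candidate scoring (not MetFrag's
# actual code): normalise each term to [0, 1] over the candidate list,
# then rank by the weighted sum.
def rank_candidates(candidates, terms, weights):
    """Return candidates sorted best-first by a weighted sum of
    max-normalised scoring terms."""
    maxima = {t: max(c[t] for c in candidates) or 1 for t in terms}
    score = lambda c: sum(w * c[t] / maxima[t] for t, w in zip(terms, weights))
    return sorted(candidates, key=score, reverse=True)

# Invented example: a heavily cited pharmaceutical isomer versus a
# pesticide with agrochemical annotation content.
candidates = [
    {"name": "pharma isomer", "PubMedCount": 900, "PatentCount": 12000, "AgroChemInfo": 0},
    {"name": "pesticide",     "PubMedCount": 400, "PatentCount":  9000, "AgroChemInfo": 5},
]

# Literature/patent terms alone favour the heavily cited pharmaceutical:
by_generic = rank_candidates(candidates, ["PubMedCount", "PatentCount"], [1.0, 1.0])
# Adding the agrochemical category count as a third term flips the ranking:
by_agro = rank_candidates(candidates,
                          ["PubMedCount", "PatentCount", "AgroChemInfo"],
                          [1.0, 1.0, 1.0])
```

This mirrors the behaviour reported below: generic metadata terms can favour well-studied isomers with the same mass, while a topic-specific term pulls the candidates of interest to the top.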
With a full PubChem query and using only literature and patent information to score, only 58% of entries were correctly ranked in first place (not unexpected, as e.g. pharmaceuticals, industrial chemicals or even metabolites with the same mass may have larger literature or patent counts). When the database was restricted to the candidates in PubChemLite using the same scoring terms (literature and patent counts), this increased to 70%. Adding the Agrochemical Score improved this further to 79.2%, demonstrating the potential usefulness of individual category-based scoring terms to help select relevant chemicals for further verification. In terms of computational efficiency, the last 101 queries (entries 1236–1336) of the agrochemicals query took 11 min to complete with PubChemLite tier1 (query run 21 Jan 2020), while the equivalent query with the full PubChem database and scoring terms took 164 min (query run 26 Jan 2020). This amounts to approximately 6.5 s per query for PubChemLite versus 97 s per query for a full PubChem query (note: both queries were run without fragmentation).
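The per-query figures follow directly from the quoted wall-clock times:

```python
# Back-of-envelope check of the per-query timings quoted above:
# 101 queries (entries 1236-1336) took 11 min with PubChemLite tier1
# and 164 min with the full PubChem database.
n_queries = 1336 - 1236 + 1                     # 101 queries
pcl_seconds_per_query = 11 * 60 / n_queries     # ~6.5 s per query
full_seconds_per_query = 164 * 60 / n_queries   # ~97 s per query
speedup = full_seconds_per_query / pcl_seconds_per_query  # ~15x faster
```

In other words, restricting the candidate space gave roughly a fifteen-fold speed-up for this batch of queries.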
Since this is purely annotation-based scoring, it is imperative to use additional experimental information, such as fragmentation data, and further verification with reference standards before any claims of higher-confidence annotation are made [11]. To address this, the benchmarking dataset (n = 977) used above (with MS/MS information available) was subset according to the availability of information in the Agrochemical Information category (creating a subset of n = 318) and evaluated with scoring terms relevant to the annotation type, as shown in the blue entry in Fig. 6. This mimics, to a certain extent, a typical suspect screening workflow where the main interest is in finding and confirming pesticides in an environmental sample. As shown, adding MS/MS information (MetFrag in silico fragmentation plus the MoNA similarity score) increased the correctly ranked chemicals in first place to 90.6% for those agrochemicals that were also in the benchmarking set. If the database (in this case the PubChemLite tier0 12 Jun 2020 version) had been restricted to agrochemicals only, this would have risen to 94.3%, as some non-agrochemical isomers still outscored several entries based on the literature and patent values. The performance could not rise much above 94% with this dataset, however, since it contains multiple agrochemical isomers where the less-well-known (but often structurally related) isomers ranked lower because of less supporting metadata. For instance, for secbutylazine (CID 23712), the candidate terbutylazine (CID 22206) was ranked first and secbutylazine (CID 23712) third, while another isomer, propazine (CID 4937), was second. All three isomers were in the dataset.
In this case, both the in silico fragmenter and MoNA similarity scores captured these three isomers in the correct order (secbutylazine first, terbutylazine second, propazine third), showing that the experimental evidence is still crucial in distinguishing isomers, or in indicating whether they are indistinguishable on the given evidence. Terbutylazine was correctly ranked first for its corresponding entry (see Table 1).
Table 1
Candidate score distributions for the three isomers of formula C9H16ClN5 in the agrochemical dataset. Values for the correct candidate in each case are bolded. Only the scores for the top 5 candidates (of 37) are shown.
Name (CID) | Terbutylazine (22206) | Propazine (4937) | Secbutylazine (23712) |
MetFrag Scores | 4.96; 3.45; 2.77; 1.93; 1.59 | 4.46; 3.88; 2.27; 1.81; 1.58 | 4.96; 3.52; 2.78; 1.92; 1.57 |
Fragmenter Score | 351; 250; 351; 239; 126 | 247; 295; 251; 170; 106 | 398; 303; 403; 272; 135 |
MoNA Similarity | 0.959; 0.672; 0.987; 0.0; 0.0 | 0.638; 0.841; 0.661; 0.0; 0.0 | 0.971; 0.703; 0.998; 0.0; 0.0 |
PubMed Count | 282; 127; 0; 11; 1 | 282; 127; 0; 11; 1 | 282; 127; 0; 11; 1 |
Patent Count | 10935; 8900; 1990; 6636; 6861 | 10935; 8900; 1990; 6636; 6861 | 10935; 8900; 1990; 6636; 6861 |
Annotation Count | 5; 5; 4; 4; 5 | 5; 5; 4; 4; 5 | 5; 5; 4; 4; 5 |
AgroChemInfo | 5; 4; 3; 3; 3 | 5; 4; 3; 3; 3 | 5; 4; 3; 3; 3 |
Rank | 1 of 37 | 2 of 37 | 3 of 37 |
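This behaviour can be read directly off Table 1. The snippet below uses the secbutylazine query column, where the candidates are listed in final combined-score rank order (secbutylazine itself sits at combined rank 3, i.e. index 2): taken alone, each experimental score already places secbutylazine on top, even though the metadata-heavy terbutylazine won the combined ranking.

```python
# Values copied from Table 1, secbutylazine query column; candidates are
# in final combined-score rank order, so index 2 is secbutylazine itself
# (combined rank 3 of 37).
fragmenter = [398, 303, 403, 272, 135]        # in silico fragmenter scores
mona       = [0.971, 0.703, 0.998, 0.0, 0.0]  # MoNA similarity scores

# Index of the best candidate according to each experimental score alone:
best_by_fragmenter = fragmenter.index(max(fragmenter))
best_by_mona = mona.index(max(mona))
```

Both indices point at position 2, i.e. secbutylazine, confirming that the experimental terms favoured the correct candidate while the shared literature, patent and annotation counts pulled terbutylazine ahead in the combined score.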
Using this benchmarking dataset alone, taking PubChemLite and using the specific topic information for agrochemicals, most candidates were ranked 1st and the worst rank for a chemical was 3rd. Creating a similar pharmaceutical subset (as opposed to agrochemicals) using the “DrugMedicInfo” category yielded similar results (most ranked 1st, worst rank 3rd) using either DrugMedicInfo or PharmacoInfo as scoring terms (see Additional File 3, Figure S3). For a more generic category such as ToxicityInfo, most were ranked 1st or 2nd, but the worst rank was 12th, indicating that this term may be less selective (see Additional File 3, Figure S3). Using patent and literature information alone (over the entire benchmark set), the worst rank was 27th, with 11 entries missing entirely. Thus, even though this dataset is of limited size (977 entries), the results indicate that there is a good chance that the correct candidate will be among the top 3 using PubChemLite for highly specific categories such as agrochemicals or pharmaceuticals. On the other hand, more candidates will often have to be considered for less specific categories or questions (e.g. toxicity information), or when only the generic scoring terms are used. In the context of practical use of HR-MS for answering real-life questions, e.g. the presence of well-known chemicals in environmental or patient samples, considering only a few candidates (e.g. 1–3) versus hundreds or even thousands of candidates per mass is a great step forward for higher-throughput interpretation of non-target screening results and for reaching meaningful conclusions more quickly. It is expected that greater granularity in the annotation information will improve its interpretability and applicability in the future (for instance, toxicity information is currently often only “information is present” and not “the substance is toxic”); efforts are being made to achieve this (beyond the scope of the current article).
As a future perspective, the addition of extra information, such as partitioning information (e.g. logP, logKow or logD) and collision cross section (CCS) values, will also help in candidate selection in specific cases (although for isobars/isomers that are very similar, predicted values will often be very close). Efforts are currently underway to include XLogP3 [54] in future versions of PubChemLite, to integrate with the retention time model already present in MetFrag [25]. Further, an initial version of PubChemLite (14 January 2020, tier1) with CCS values contributed by CCSbase [55, 56] is also available on Zenodo [57] and in the MetFrag web version [26], and is currently being evaluated in separate work.