InSpectra – The platform
InSpectra is hosted online on a cloud platform that provides many advantages over offline solutions, including independence from the end user's computer, scalability, the ability to archive all data and metadata, and traceability of all processing. The web platform integrates a suite of open-source and open-access tools enabling the generation of multiple workflows (Fig. 1). Examples of such workflows and case studies using them are discussed in detail below (see section “Example Workflows”).
Online processing and scalability
All processes are performed on an online open-source platform; consequently, the user does not need to install any software, run a specific operating system, meet minimum system requirements, or pay licensing fees. The only requirement is a computer with a web browser and an internet connection capable of uploading the files to be processed. Currently all data is stored and processed within the cloud hosted by Amazon Web Services (AWS), although the platform could be moved to other cloud providers. InSpectra has been configured so that, regardless of the number of files in a job, an adequate number of optimised processing computers is started to perform the needed tasks. The number of computers started scales linearly with the number of files to be processed. This scalability allows the infrastructure to process hundreds of requests as efficiently as one, without downtime. This optimisation minimises the time spent processing a batch (i.e., a set of HRMS datasets), which in turn minimises the associated costs.
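The linear scaling of worker instances with batch size can be sketched as follows; the capacity values and the cap are illustrative assumptions, not InSpectra's actual provisioning parameters.

```python
import math

def workers_needed(n_files: int, files_per_worker: int = 4, max_workers: int = 200) -> int:
    """Number of processing instances to launch for a batch.

    Scales linearly with the number of files, with a cap to avoid
    runaway provisioning. All parameter values here are illustrative.
    """
    if n_files <= 0:
        return 0
    return min(max_workers, math.ceil(n_files / files_per_worker))
```

A batch of one file and a batch of hundreds then take roughly the same wall-clock time, since each worker handles a bounded share of the files.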
Archiving of data, metadata, and traceability of processing
InSpectra has an in-built archiving system to store all data, from raw instrument datafiles, metadata, and experimental conditions to the outputs of the individual tools within the workflow. This includes the processing parameters and the tools used, along with their versions. These files are stored in a repository and, depending on access requirements, can be moved to low-access infrastructure to minimise storage costs. The automatically recorded metadata include the parameters used for processing the data, the metadata of the HRMS files themselves (e.g., instrument used, brand, ionisation mode, etc., read directly from the raw HRMS datafiles), and the versions and inputs of the algorithms used while processing the files. The manually recorded metadata describe the samples themselves (e.g., sample matrix, location, time, sample preparation, etc.). These data are stored in a relational database, enabling rapid and easy analysis and further processing (see Fig. 2). This relational database is key to allowing the platform to be used for sharing and retrospective analysis of full-scan HRMS data as an early warning system for rapid detection of chemicals of emerging concern across the globe. The collection of such metadata (e.g., sample type and sample preparation steps) is currently hindered by the lack of a user-friendly graphical web interface, which will be addressed soon.
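A greatly simplified, hypothetical slice of such a relational schema is sketched below (using SQLite in place of MySQL for a self-contained example); all table and column names are illustrative, not InSpectra's actual schema.

```python
import sqlite3

# Hypothetical simplified schema linking samples, raw datafiles, and
# processing runs, so tool versions and parameters stay traceable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    matrix     TEXT,   -- e.g. 'stormwater'
    location   TEXT,
    collected  TEXT    -- ISO 8601 timestamp
);
CREATE TABLE datafile (
    file_id    INTEGER PRIMARY KEY,
    sample_id  INTEGER REFERENCES sample(sample_id),
    instrument TEXT,   -- read from the raw HRMS file
    ion_mode   TEXT
);
CREATE TABLE processing_run (
    run_id     INTEGER PRIMARY KEY,
    file_id    INTEGER REFERENCES datafile(file_id),
    tool       TEXT,   -- e.g. 'SAFD'
    version    TEXT,   -- stored for reprocessing/traceability
    parameters TEXT    -- serialised parameter set
);
""")
conn.execute("INSERT INTO sample VALUES (1, 'stormwater', 'site A', '2023-01-05T09:00')")
conn.execute("INSERT INTO datafile VALUES (1, 1, 'QTOF', 'positive')")
conn.execute("INSERT INTO processing_run VALUES (1, 1, 'SAFD', '0.8.0', '{}')")

# Joining the tables answers questions like "which tool version
# processed which matrix" in a single query.
row = conn.execute(
    "SELECT s.matrix, p.tool, p.version FROM processing_run p "
    "JOIN datafile d ON d.file_id = p.file_id "
    "JOIN sample s ON s.sample_id = d.sample_id"
).fetchone()
```

Because every run row carries the tool version and parameters, any result can be traced back or reproduced after an algorithm update.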
Tools and workflows
The algorithm description and the validation procedure are provided under the section Code Availability.
Use of Open-source tools
InSpectra was built using open-source algorithms which are available via Git repositories (i.e., Bitbucket), resulting in reproducible and transparent outputs and workflows. Such a level of transparency is often difficult to achieve, given that HRMS instrument vendor software is proprietary, closed source, and closed access. This black-box strategy hinders the objective and fair evaluation of existing algorithms as well as the direct comparison of their outputs. The algorithms used in InSpectra have been tested, validated, peer-reviewed, and published 29–32. The use of algorithms maintained on Bitbucket also means that updated (and often improved) versions of such algorithms are automatically integrated into InSpectra, providing users with access to state-of-the-art processing tools while providing the means for open collaboration. It also allows complete transparency for all parties to understand all parts of the data processing if they wish to do so.
The platform makes use of multiple languages and tools to facilitate a seamless data processing workflow, from conversion of raw HRMS data files to identification/annotation and statistical analysis. Python is used to connect the algorithms on the backend, as it is a well-supported language, is quick to code in, and supports a multitude of tools for communicating with other languages and computers. MySQL is used as the database, where all data, metadata, and result locations are stored. The different modules are mainly written in Julia, a dynamic language that is efficient at process-heavy tasks. However, it is important to note that InSpectra, as a modular platform, is not dependent on any particular language for processing: if an algorithm can run on UNIX or Windows, it can be integrated into InSpectra. ProteoWizard 21, used for HRMS data conversion, is written in C++, an extremely efficient language. Because Python has exceptionally strong application programming interfaces (APIs), the modules can be written in any language, such as R, MATLAB, C#, PHP, etc. All scripts for platform management and the database structures used can be made available upon request. For cases where a different combination of tools is needed for custom workflows, a local version of InSpectra can be deployed on local workstations and/or high-performance computing servers. This also covers cases where, due to data sensitivity, the data cannot be uploaded to commercial clouds (e.g., forensic laboratories).
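One common, language-agnostic way for a Python backend to orchestrate modules written in other languages is to launch each module as a separate process and pass parameters on the command line. The sketch below illustrates that pattern only; the command convention and parameter format are assumptions, not InSpectra's actual interface.

```python
import json
import shlex
import subprocess

def run_module(command: str, input_path: str, params: dict) -> subprocess.CompletedProcess:
    """Launch an external processing module (written in any language)
    as a child process, passing the input file path and a JSON-encoded
    parameter set as arguments. Illustrative convention only."""
    argv = shlex.split(command) + [input_path, json.dumps(params)]
    # check=True raises if the module exits with a non-zero status,
    # so failed steps are surfaced to the orchestrator.
    return subprocess.run(argv, capture_output=True, text=True, check=True)
```

Because the contract is just "executable + arguments", the module itself can be a Julia script, a compiled C++ binary, or anything else that runs on the host.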
Modularity
The steps included in the workflows constitute different modules of the platform, providing maximum flexibility in the potential workflows and tools to be used. The core algorithms of InSpectra are sourced directly from their respective Git repositories, which allows updates and fixes from collaborators to be automatically incorporated into InSpectra. Because the software versions are stored with the metadata, files can easily be reprocessed if a new update affects the quality of results. As new tools are developed for InSpectra, they can be added as separate modules to improve existing workflows or as additional workflows.
The main/core workflow in InSpectra includes data conversion, feature detection, componentisation, and identification steps. A brief description of these tools is provided below.
Conversion of raw HRMS datafiles
Once the HRMS datafiles are uploaded into the platform, they are converted into the mzXML 31 format. This was chosen as it is an open-source format and creates coherency between the many different vendor formats and InSpectra’s algorithms. Future versions of InSpectra will include the mzML 33 format to facilitate the use of data with ion mobility information. Currently ProteoWizard’s format conversion utility msConvert is used for HRMS data conversion 34. The parameters used for this conversion are stored in the database.
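A conversion step of this kind typically amounts to assembling an msConvert invocation per uploaded file. The helper below builds such a command using standard msConvert options (`--mzXML` for the output format, `-o` for the output directory); the exact parameter set InSpectra records alongside the conversion is not reproduced here.

```python
def msconvert_command(raw_path: str, out_dir: str) -> list:
    """Build an msConvert command converting a vendor raw file to
    mzXML. A minimal sketch: only the output-format and output-
    directory flags are shown."""
    return ["msconvert", raw_path, "--mzXML", "-o", out_dir]
```

The resulting list can be handed to a process launcher (e.g. `subprocess.run`) on a machine where ProteoWizard is installed, and the argument values stored in the database for traceability.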
Feature detection
Feature detection is used to obtain the MS1 information on the parent, adduct, isotope, and in-source fragment ions, for which the self-adjusting feature detection (SAFD) 29 algorithm was implemented. This algorithm performs feature detection by fitting a three-dimensional Gaussian to profile data, requiring no prior binning or centroiding. The current version of SAFD is capable of handling both profile and centroided data 32. As three-dimensional feature detection (i.e., profile mode) is more resource intensive than two-dimensional (i.e., centroided data) feature detection, SAFD benefits from InSpectra’s cluster computing capabilities. The SAFD algorithm takes an mzXML file and a set of parameters as inputs, comprising the maximum number of iterations, maximum and minimum peak width in the time domain, mass resolution of the instrument, minimum peak width in the mass domain, correlation threshold, minimum intensity, signal-to-noise ratio, and signal increment threshold. During the fitting of the three-dimensional Gaussian, the user-defined parameters (e.g., widths in the mass and time domains) are used only as a first guess and are subsequently adapted to the experimental data. The SAFD algorithm outputs a CSV file with the detected features, containing information on the retention time, mass, area, intensity, peak purity, and mass resolution. SAFD has been shown to produce more reliable results compared to XCMS 35, a state-of-the-art algorithm.
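The peak model SAFD fits can be written as a bivariate Gaussian over the m/z and retention-time dimensions. The function below is a didactic sketch of that model, not SAFD's actual implementation; parameter names are illustrative.

```python
import numpy as np

def gaussian_peak(mz, rt, apex_int, mz0, rt0, sigma_mz, sigma_rt):
    """Bivariate Gaussian peak model: intensity as a function of m/z
    and retention time, with apex (mz0, rt0), apex intensity apex_int,
    and widths sigma_mz / sigma_rt in the mass and time domains."""
    return apex_int * np.exp(
        -((mz - mz0) ** 2 / (2 * sigma_mz ** 2)
          + (rt - rt0) ** 2 / (2 * sigma_rt ** 2))
    )
```

Fitting adjusts `mz0`, `rt0`, `apex_int`, and the two widths to the measured profile data, which is why the user-supplied widths only need to be a reasonable first guess.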
Componentisation
Componentisation is used for grouping information belonging to unique chemical constituents, including adducts, isotopologues, and fragments (including in-source fragments). For this, the componentisation algorithm CompCreate 30 was used, since it can obtain both MS1 (i.e., parent, isotopes, adducts, and in-source fragments) and MS2 (i.e., fragments) information. The CompCreate algorithm can process data coming from both DDA and DIA approaches. Additionally, it has built-in processes for Sciex’s SWATH and multi-collision data types. The algorithm uses the MS1 features obtained during feature detection as potential precursor ions. For all these potential precursor ions, both the MS1 features and MS2 peaks are grouped based on the time difference between the retention times at the apex, Pearson’s correlation of the extracted ion chromatograms (i.e., peak shape check), and information specific to the ion type. For the latter, adducts are identified based on a database of frequently detected singly charged adducts in LC-HRMS experiments (e.g., M + Na) 36. Isotopes are detected based on the mass defect between the parent and potential isotope mass, whereas (in-source) fragments are further filtered based on the probability of the neutral loss (i.e., the mass difference between the fragment and parent ion). The CompCreate algorithm outputs a CSV file that contains both the generated components and un-grouped features as well as the spectral information at MS1 and MS2 levels.
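The co-elution part of such grouping can be sketched as a two-stage check: the apexes must fall within a retention-time tolerance, and the extracted ion chromatograms must correlate strongly. The thresholds below are illustrative, not CompCreate's defaults.

```python
import numpy as np

# Illustrative adduct mass shift: [M+Na]+ sits 21.981942 Da above
# [M+H]+ (Na+ minus a proton). CompCreate's adduct database is larger.
PROTON = 1.007276
NA_VS_H_SHIFT = 22.989218 - PROTON

def same_component(rt1, rt2, eic1, eic2, rt_tol=0.05, r_min=0.9):
    """Peak-shape check in the spirit of CompCreate: two ions are
    grouped only if their apex retention times co-elute within rt_tol
    (minutes) and their extracted ion chromatograms have a Pearson
    correlation of at least r_min. Thresholds are illustrative."""
    if abs(rt1 - rt2) > rt_tol:
        return False
    r = np.corrcoef(eic1, eic2)[0, 1]
    return bool(r >= r_min)
```

Ion-type-specific rules (adduct mass shifts, isotope mass defects, neutral-loss probabilities) would then be applied on top of this shape check to decide what role each grouped ion plays in the component.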
Library search
Library search is used for the identification of components or features based on similarity with database spectra. For this, InSpectra uses the Universal Library Search Algorithm (ULSA) 30. ULSA minimises the variability observed in the data caused by different acquisition conditions through spectral normalisation as well as the inclusion of multiple sources of information. Additionally, a complete library search is performed for the components, while for the remaining features a molecular formula assignment is performed based on the compounds present in the databases. For spectral matching of components, an initial search in the InSpectra database is performed based on the precursor ion mass, using the mass window associated with each component. On average this mass window ranges between ±10 mDa and ±30 mDa. For each of those spectra, a Final Score (quality of spectral match) is calculated based on seven different parameters, including the number of fragments matched in both user and reference spectra as well as the associated mass errors. The influence (i.e., weight) of each parameter can be specified by the user via a weight vector of seven values ranging between zero and one. For the features in the input list, molecular formula assignment is performed with the US EPA CompTox 37 database. Potential molecular formula matches are scored with a value between 0 and 1, depending on whether the mass difference between the measured precursor ion and the theoretical molecular formula mass is above or below the user-defined mass tolerance. ULSA outputs a list containing all potential candidate identifications or molecular formula assignments with their corresponding Final Score, as well as the list of matched fragments for the components and features, respectively. The output is stored on the platform and its metadata is stored and referenced in the database.
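The combination of seven parameter scores with a user-supplied weight vector can be sketched as a weighted sum; the exact definition of each parameter score is in the ULSA publication, and this is a structural sketch only.

```python
def final_score(parameter_scores, weights):
    """Combine seven per-parameter match scores with a user-supplied
    weight vector (each weight between zero and one) into one Final
    Score. A structural sketch of ULSA's weighting, not its exact
    parameter definitions."""
    if len(parameter_scores) != 7 or len(weights) != 7:
        raise ValueError("exactly seven parameters and seven weights expected")
    if not all(0.0 <= w <= 1.0 for w in weights):
        raise ValueError("weights must lie between zero and one")
    return sum(s * w for s, w in zip(parameter_scores, weights))
```

With all weights set to one and each parameter score in [0, 1], the Final Score ranges up to 7, which is consistent with thresholds being quoted out of 7 later in this section.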
InSpectra’s Library
The database of the library search is sourced from two resources. The first is the MassBank Project 38, which offers in vitro (experimental) spectra: 89,826 distinct spectra covering 15,059 unique compounds and 16,840 unique isomers, of which 25,935 entries have a recorded resolution of 7,500 or above. The library database also includes 700,000 in silico spectra (predicted via CFM-ID 39) from the EPA’s National Center for Computational Toxicology CompTox Chemicals Dashboard, with spectra predicted for EI-MS and ESI-MS/MS in both positive and negative ionisation modes 40. To our knowledge, InSpectra is the only platform capable of searching against such a large spectral library, providing researchers with access to these resources.
Statistical analysis
Identified features can be connected to each other across different samples, enabling direct spatial and temporal trend analysis of chemicals across different matrices. The identity-based alignment functionality currently implemented in InSpectra is essential for the detection of emerging chemical threats.
For unidentified features, the alignment can take place for samples analysed with the same method. InSpectra, through its database, can group the datasets measured via the same methods and align their feature lists and/or components. Similar types of trend analysis can be performed on such outputs, enabling further understanding of the covered chemical space of a sample set.
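Alignment of unidentified features across same-method samples can be sketched as a greedy grouping by m/z and retention-time tolerance. Both the tolerances and the greedy strategy below are illustrative assumptions, not InSpectra's exact algorithm.

```python
def align_features(feature_lists, mz_tol=0.005, rt_tol=0.1):
    """Greedy feature alignment sketch: features from samples acquired
    with the same method are merged into one group when both their m/z
    and retention time fall within tolerance of the group's first
    member. Returns one dict per aligned group."""
    aligned = []  # each entry: {"mz": ..., "rt": ..., "samples": {sample_idx: (mz, rt)}}
    for sample_idx, features in enumerate(feature_lists):
        for mz, rt in features:
            for group in aligned:
                if abs(group["mz"] - mz) <= mz_tol and abs(group["rt"] - rt) <= rt_tol:
                    group["samples"][sample_idx] = (mz, rt)
                    break
            else:
                # No existing group matched; start a new one.
                aligned.append({"mz": mz, "rt": rt, "samples": {sample_idx: (mz, rt)}})
    return aligned
```

The per-group `samples` map then supports trend analysis: a feature's intensity trajectory across time points or locations falls out of the groups directly.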
Finally, for unidentified features analysed via different methods, the current implementation of InSpectra is not able to perform any alignments. The next version of InSpectra will include a validated retention mapping algorithm to seamlessly connect the unidentified features generated via multiple acquisition methods 41.
Example workflows
InSpectra enables the user to combine multiple tools that each have their own functions and goals. A complete overview of paths or tools’ combinations can be seen in Fig. 1. However, to give a better idea of the overall process and possibilities, two frequently used workflows are described below.
Library search identification workflow
One of the most used NTA workflows is feature identification, which identifies known unknowns starting from raw data through to a list of identified features (i.e., spectra matched against a library; see Fig. 3). In this workflow, the HRMS files are converted to mzXML (a common open-source format) and then processed for feature detection using SAFD to obtain the MS1 information on the parent, adduct, isotope, and in-source fragment ions. Feature alignment then groups the features across multiple feature lists or component lists based on their retention time and m/z. The file then undergoes componentisation using CompCreate, which groups information belonging to unique chemical constituents. The componentised file is then searched against the InSpectra database using ULSA. The final step is statistical analysis, which offers multiple tools to analyse the results, either standalone or in the context of other stored processed files and their metadata, such as heatmaps and temporal and spatial trends via the identity-based alignment approach.
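The chaining of these steps can be sketched as a simple pipeline runner; the step functions below are placeholders standing in for the actual tools (msConvert, SAFD, alignment, CompCreate, ULSA) and only illustrate how each step's output file becomes the next step's input.

```python
def run_pipeline(raw_file, steps):
    """Thread each step's output artefact into the next step,
    mirroring how the workflow passes files between modules."""
    artefact = raw_file
    for name, step in steps:
        artefact = step(artefact)
    return artefact

# Placeholder steps: each just derives the next artefact's name.
steps = [
    ("conversion (msConvert)",        lambda f: f.replace(".raw", ".mzXML")),
    ("feature detection (SAFD)",      lambda f: f + ".features.csv"),
    ("feature alignment",             lambda f: f + ".aligned.csv"),
    ("componentisation (CompCreate)", lambda f: f + ".components.csv"),
    ("library search (ULSA)",         lambda f: f + ".candidates.csv"),
]
```

In the real platform each step would launch the corresponding module and record its parameters and version in the database before handing the output onward.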
Suspect screening workflow
Another commonly used NTA workflow is suspect screening, where only a targeted list of chemicals is searched for within the samples (see Fig. 4 for an overview). This algorithm extracts the MS1 and MS2 information specified in the suspect list from the raw data and generates a match factor between the user-provided spectrum and the experimentally measured one. This is a faster process than complete NTA, given that it focuses on specific mass channels (i.e., the monoisotopic masses of the suspect analytes). The suspect screening workflow also skips the classical componentisation step, which further speeds up the process; additionally, it is more sensitive than the NTA workflow due to its more targeted nature. This workflow generates a list of features with their potential structures, isotopic matching, number of matched fragments, and match factors. This information facilitates the analysts’ confidence assessment of the identifications. Additionally, to facilitate suspect screening within InSpectra, InChIKeys can be converted into suspect lists that are ready to be fed to the algorithm.
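A match factor between a suspect-list spectrum and a measured one is commonly computed as a cosine similarity over intensity-paired fragments; the sketch below illustrates that standard approach, and the actual match factor used by the suspect screening workflow may be defined differently.

```python
import math

def match_factor(user_spectrum, measured_spectrum, mz_tol=0.01):
    """Cosine-similarity sketch of a spectral match factor in [0, 1].
    Spectra are lists of (mz, intensity) pairs; peaks are paired
    within an m/z tolerance, and unmatched peaks on either side
    count against the score."""
    paired = []
    used = set()
    for mz_u, int_u in user_spectrum:
        best = None
        for j, (mz_m, _) in enumerate(measured_spectrum):
            if j not in used and abs(mz_u - mz_m) <= mz_tol:
                best = j
                break
        paired.append((int_u, measured_spectrum[best][1] if best is not None else 0.0))
        if best is not None:
            used.add(best)
    # Measured peaks with no suspect-list counterpart also count.
    for j, (_, int_m) in enumerate(measured_spectrum):
        if j not in used:
            paired.append((0.0, int_m))
    num = sum(a * b for a, b in paired)
    den = math.sqrt(sum(a * a for a, _ in paired)) * math.sqrt(sum(b * b for _, b in paired))
    return num / den if den else 0.0
```

A score near 1 indicates that the suspect's diagnostic fragments were all found at matching relative intensities, which is why missing fragments in the suspect list (as discussed below for simazine) can cause a non-detect.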
Demonstration of InSpectra Workflows
Applying the Library Search workflow to samples representing four different matrices (wastewater, stormwater, cow blood extracts, and cow serum extracts) resulted in a mixture of chemicals being tentatively identified, covering multiple classes including antimicrobials (e.g. 1,2-benzisothiazolin-3-one), food constituents (e.g. niacinamide), pharmaceuticals (e.g. ranitidine), and agrochemicals (e.g. simazine) (see Fig. 5). This demonstrates the capability of using a full NTA workflow for biological and environmental analysis, considering the data were acquired on two different vendor instruments using completely independent experimental conditions. The full NTA workflow (Library Search) outputs feature lists, extracted component lists, and candidate lists that could be used for future retrospective analysis and further evaluation. When the same Library Search workflow was applied to stormwater samples collected over a series of time points during multiple storm events but analysed within the same batch, similar chemicals were detected (Fig. 6). In the stormwater samples, the most dominant families of detected chemicals were agrochemicals, pharmaceuticals, and pharmaceutical transformation products (e.g. carbamazepine epoxide). The presence and frequency of detection of these chemicals in stormwater may indicate domestic sources contaminating stormwater, such as wastewater exfiltration. These results demonstrate the applicability of InSpectra as an early warning system for chemicals of emerging concern.
To demonstrate the Suspect Screening workflow, we used chemicals tentatively identified with the Library Search workflow to create a suspect list and processed the same files via the Suspect Screening workflow. In a direct comparison of the results of the two workflows, ~56% of the cases showed complete agreement between the two workflows, while ~20% of the cases were detected only by the Library Search workflow and ~16% only by Suspect Screening (Figs. 7 and 8). For the Library Search workflow we set a Final Score threshold of 4/7, while for Suspect Screening we set a Match Factor threshold of 0.4/1. It should be noted that for Library Search, the Final Score is dependent on the percentage of fragments matched; hence, if an entry has too few fragments, this may lead to a false positive identification. When the recorded signal for both the parent ion and the fragments has a signal-to-noise ratio larger than 5, both workflows are able to detect and identify the chemicals in the analysed samples (Fig. 9). As for low-intensity fragments, the fragment in the MassBank entry for carbofuran did not have high enough intensity to be distinguished as an analytical signal, resulting in a non-detect for the Library Search workflow, while the Suspect Screening workflow was able to detect this chemical in the sample. Finally, for simazine in stormwater, the Library Search workflow was able to detect this chemical whereas the Suspect Screening workflow fell short. In this case, a deeper investigation of the suspect list and the included fragments indicated that some diagnostic fragments were missing from our suspect list. These results showcase how the different workflows in InSpectra are able to extract essential information for the chemical characterisation of complex samples.