InSpectra – The platform
InSpectra is hosted online on a cloud platform that provides many advantages over offline solutions, including independence from the end user's computer, scalability, the ability to archive all data and metadata, and traceability of all processing. The web platform integrates a suite of open-source and open-access tools enabling the generation of multiple workflows (Fig. 1). Examples of such workflows and case studies using them are discussed in detail below (see section “Example Workflows”).
Online processing and scalability
All processes are performed on an online open-source platform; consequently, the user does not need to install any software, run a specific operating system, meet minimum system requirements, or pay licensing fees. The only requirement is a computer with a web browser and an internet connection capable of uploading the files to be processed. Currently all data is stored and processed within the cloud hosted by Amazon Web Services (AWS), although the platform could be moved to other cloud providers. InSpectra has been configured so that, regardless of the number of files in a job, an adequate number of optimised processing computers is started to perform the needed tasks. The number of computers started scales linearly with the number of files to be processed. This scalability allows the infrastructure to process hundreds of requests as efficiently as one, without downtime. This optimisation minimises the time spent processing a batch (i.e., a set of HRMS datasets), which in turn minimises the associated costs.
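The linear scaling of worker instances with batch size can be sketched as follows; the capacity values and the cap are illustrative assumptions, not InSpectra's actual provisioning parameters.

```python
import math

def workers_needed(n_files: int, files_per_worker: int = 4, max_workers: int = 200) -> int:
    """Number of processing instances to launch for a batch.

    Scales linearly with the number of files, with a cap to avoid
    runaway provisioning. All parameter values here are illustrative.
    """
    if n_files <= 0:
        return 0
    return min(max_workers, math.ceil(n_files / files_per_worker))
```

A batch of one file and a batch of hundreds then take roughly the same wall-clock time, since each worker handles a bounded share of the files.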
Archiving of data, metadata, and traceability of processing
InSpectra has an in-built archiving system to store all data, from raw instrument datafiles, metadata, and experimental conditions to the outputs of the individual tools within the workflow. This includes the processing parameters and the tools used, along with their versions. These files are stored in a repository and, depending on access requirements, can be moved to low-access infrastructure to minimise storage costs. The automatically recorded metadata include the parameters used for processing the data, the metadata of the HRMS files themselves (e.g., instrument used, brand, ionisation mode, etc., read directly from the raw HRMS datafiles), and the versions and inputs of the algorithms used while processing the files. The manually recorded metadata describe the samples themselves (e.g., sample matrix, location, time, sample preparation, etc.). These data are stored in a relational database, enabling rapid and easy analysis and further processing (see Fig. 2). This relational database is key to allowing the platform to be used for sharing and retrospective analysis of full-scan HRMS data as an early warning system for rapid detection of chemicals of emerging concern across the globe. The collection of such metadata (e.g., sample type and sample preparation steps) is currently hindered by the lack of a user-friendly graphical web interface, which will be addressed soon.
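A greatly simplified, hypothetical slice of such a relational schema is sketched below (using SQLite in place of MySQL for a self-contained example); all table and column names are illustrative, not InSpectra's actual schema.

```python
import sqlite3

# Hypothetical simplified schema linking samples, raw datafiles, and
# processing runs, so tool versions and parameters stay traceable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    matrix     TEXT,   -- e.g. 'stormwater'
    location   TEXT,
    collected  TEXT    -- ISO 8601 timestamp
);
CREATE TABLE datafile (
    file_id    INTEGER PRIMARY KEY,
    sample_id  INTEGER REFERENCES sample(sample_id),
    instrument TEXT,   -- read from the raw HRMS file
    ion_mode   TEXT
);
CREATE TABLE processing_run (
    run_id     INTEGER PRIMARY KEY,
    file_id    INTEGER REFERENCES datafile(file_id),
    tool       TEXT,   -- e.g. 'SAFD'
    version    TEXT,   -- stored for reprocessing/traceability
    parameters TEXT    -- serialised parameter set
);
""")
conn.execute("INSERT INTO sample VALUES (1, 'stormwater', 'site A', '2023-01-05T09:00')")
conn.execute("INSERT INTO datafile VALUES (1, 1, 'QTOF', 'positive')")
conn.execute("INSERT INTO processing_run VALUES (1, 1, 'SAFD', '0.8.0', '{}')")

# Joining the tables answers questions like "which tool version
# processed which matrix" in a single query.
row = conn.execute(
    "SELECT s.matrix, p.tool, p.version FROM processing_run p "
    "JOIN datafile d ON d.file_id = p.file_id "
    "JOIN sample s ON s.sample_id = d.sample_id"
).fetchone()
```

Because every run row carries the tool version and parameters, any result can be traced back or reproduced after an algorithm update.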
Tools and workflows
The algorithm description and the validation procedure are provided under the section Code Availability.
Use of Open-source tools
InSpectra was built using open-source algorithms which are available via Git repositories (i.e., Bitbucket), resulting in reproducible and transparent outputs and workflows. Such a level of transparency is often difficult to achieve, given that HRMS instrument vendor software is proprietary, closed source, and closed access. This black-box strategy hinders the objective and fair evaluation of existing algorithms as well as the direct comparison of their outputs. The algorithms used in InSpectra have been tested, validated, peer-reviewed, and published 29–32. The use of algorithms maintained on Bitbucket also means that updated (and often improved) versions of such algorithms are automatically integrated into InSpectra, providing users with access to state-of-the-art processing tools while providing the means for open collaboration. It also allows complete transparency for all parties to understand all parts of the data processing if they wish to do so.
The platform makes use of multiple languages and tools to facilitate a seamless data processing workflow, from conversion of raw HRMS data files to identification/annotation and statistical analysis. Python is used to connect the algorithms on the backend, as it is a well-supported language, is quick to code in, and supports a multitude of tools for communicating with other languages and computers. MySQL is used as the database, where all data, metadata, and result locations are stored. The different modules are mainly written in Julia, a dynamic language that is efficient at process-heavy tasks. However, it is important to note that InSpectra, as a modular platform, is not dependent on any particular language for processing: if an algorithm can run on UNIX or Windows, it can be integrated into InSpectra. ProteoWizard 21, used for HRMS data conversion, is written in C++, an extremely efficient language. Because Python has exceptionally strong application programming interfaces (APIs), the modules can be written in any language, such as R, MATLAB, C#, PHP, etc. All scripts for platform management and the database structures used can be made available upon request. For cases where a different combination of tools is needed for custom workflows, a local version of InSpectra can be deployed on local workstations and/or high-performance computing servers. This also covers cases where, due to data sensitivity, the data cannot be uploaded to commercial clouds (e.g., forensic laboratories).
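One common, language-agnostic way for a Python backend to orchestrate modules written in other languages is to launch each module as a separate process and pass parameters on the command line. The sketch below illustrates that pattern only; the command convention and parameter format are assumptions, not InSpectra's actual interface.

```python
import json
import shlex
import subprocess

def run_module(command: str, input_path: str, params: dict) -> subprocess.CompletedProcess:
    """Launch an external processing module (written in any language)
    as a child process, passing the input file path and a JSON-encoded
    parameter set as arguments. Illustrative convention only."""
    argv = shlex.split(command) + [input_path, json.dumps(params)]
    # check=True raises if the module exits with a non-zero status,
    # so failed steps are surfaced to the orchestrator.
    return subprocess.run(argv, capture_output=True, text=True, check=True)
```

Because the contract is just "executable + arguments", the module itself can be a Julia script, a compiled C++ binary, or anything else that runs on the host.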
Modularity
The steps included in the workflows constitute different modules of the platform, providing maximum flexibility in the potential workflows and tools to be used. The core algorithms of InSpectra are sourced directly from their respective Git repositories, which allows updates and fixes from collaborators to be automatically incorporated into InSpectra. Because the software versions are stored with the metadata, files can easily be reprocessed if a new update affects the quality of results. As new tools are developed for InSpectra, they can be added as separate modules to improve existing workflows or as additional workflows.
The main/core workflow in InSpectra includes data conversion, feature detection, componentisation, and identification steps. A brief description of these tools is provided below.
Conversion of raw HRMS datafiles
Once the HRMS datafiles are uploaded into the platform, they are converted into the mzXML 31 format. This was chosen as it is an open-source format and creates coherency between the many different vendor formats and InSpectra’s algorithms. Future versions of InSpectra will include the mzML 33 format to facilitate the use of data with ion mobility information. Currently ProteoWizard’s format conversion utility msConvert is used for HRMS data conversion 34. The parameters used for this conversion are stored in the database.
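A conversion step of this kind typically amounts to assembling an msConvert invocation per uploaded file. The helper below builds such a command using standard msConvert options (`--mzXML` for the output format, `-o` for the output directory); the exact parameter set InSpectra records alongside the conversion is not reproduced here.

```python
def msconvert_command(raw_path: str, out_dir: str) -> list:
    """Build an msConvert command converting a vendor raw file to
    mzXML. A minimal sketch: only the output-format and output-
    directory flags are shown."""
    return ["msconvert", raw_path, "--mzXML", "-o", out_dir]
```

The resulting list can be handed to a process launcher (e.g. `subprocess.run`) on a machine where ProteoWizard is installed, and the argument values stored in the database for traceability.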
Feature detection
Feature detection is used to obtain the MS1 information on the parent, adduct, isotope, and in-source fragment ions, for which the self-adjusting feature detection (SAFD) 29 algorithm was implemented. This algorithm performs feature detection by fitting a three-dimensional Gaussian to profile data, requiring no prior binning or centroiding. The current version of SAFD is capable of handling both profile and centroided data 32. As three-dimensional feature detection (i.e., profile mode) is more resource intensive than two-dimensional (i.e., centroided data) feature detection, SAFD benefits from InSpectra’s cluster computing capabilities. The SAFD algorithm takes an mzXML file and a set of parameters as inputs, comprising the maximum number of iterations, maximum and minimum peak width in the time domain, mass resolution of the instrument, minimum peak width in the mass domain, correlation threshold, minimum intensity, signal-to-noise ratio, and signal increment threshold. During the fitting of the three-dimensional Gaussian, the user-defined parameters (e.g., widths in the mass and time domains) are used only as a first guess and are subsequently adapted to the experimental data. The SAFD algorithm outputs a CSV file with the detected features, containing information on the retention time, mass, area, intensity, peak purity, and mass resolution. SAFD has been shown to produce more reliable results compared to XCMS 35, a state-of-the-art algorithm.
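The peak model SAFD fits can be written as a bivariate Gaussian over the m/z and retention-time dimensions. The function below is a didactic sketch of that model, not SAFD's actual implementation; parameter names are illustrative.

```python
import numpy as np

def gaussian_peak(mz, rt, apex_int, mz0, rt0, sigma_mz, sigma_rt):
    """Bivariate Gaussian peak model: intensity as a function of m/z
    and retention time, with apex (mz0, rt0), apex intensity apex_int,
    and widths sigma_mz / sigma_rt in the mass and time domains."""
    return apex_int * np.exp(
        -((mz - mz0) ** 2 / (2 * sigma_mz ** 2)
          + (rt - rt0) ** 2 / (2 * sigma_rt ** 2))
    )
```

Fitting adjusts `mz0`, `rt0`, `apex_int`, and the two widths to the measured profile data, which is why the user-supplied widths only need to be a reasonable first guess.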
Componentisation
Componentisation is used for grouping information belonging to unique chemical constituents, including adducts, isotopologues, and fragments (including in-source fragments). For this, the componentisation algorithm CompCreate 30 was used, since it can obtain both MS1 (i.e., parent, isotopes, adducts, and in-source fragments) and MS2 (i.e., fragments) information. The CompCreate algorithm can process data coming from both DDA and DIA approaches. Additionally, it has built-in processes for Sciex’s SWATH and multi-collision data types. The algorithm uses the MS1 features obtained during feature detection as potential precursor ions. For all these potential precursor ions, both the MS1 features and MS2 peaks are grouped based on the time difference between the retention times at the apex, Pearson’s correlation of the extracted ion chromatograms (i.e., peak shape check), and information specific to the ion type. For the latter, adducts are identified based on a database of frequently detected singly charged adducts in LC-HRMS experiments (e.g., M + Na) 36. Isotopes are detected based on the mass defect between the parent and potential isotope mass, whereas (in-source) fragments are further filtered based on the probability of the neutral loss (i.e., the mass difference between the fragment and parent ion). The CompCreate algorithm outputs a CSV file that contains both the generated components and un-grouped features as well as the spectral information at MS1 and MS2 levels.
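The co-elution part of such grouping can be sketched as a two-stage check: the apexes must fall within a retention-time tolerance, and the extracted ion chromatograms must correlate strongly. The thresholds below are illustrative, not CompCreate's defaults.

```python
import numpy as np

# Illustrative adduct mass shift: [M+Na]+ sits 21.981942 Da above
# [M+H]+ (Na+ minus a proton). CompCreate's adduct database is larger.
PROTON = 1.007276
NA_VS_H_SHIFT = 22.989218 - PROTON

def same_component(rt1, rt2, eic1, eic2, rt_tol=0.05, r_min=0.9):
    """Peak-shape check in the spirit of CompCreate: two ions are
    grouped only if their apex retention times co-elute within rt_tol
    (minutes) and their extracted ion chromatograms have a Pearson
    correlation of at least r_min. Thresholds are illustrative."""
    if abs(rt1 - rt2) > rt_tol:
        return False
    r = np.corrcoef(eic1, eic2)[0, 1]
    return bool(r >= r_min)
```

Ion-type-specific rules (adduct mass shifts, isotope mass defects, neutral-loss probabilities) would then be applied on top of this shape check to decide what role each grouped ion plays in the component.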
Library search
Library search is used for the identification of components or features based on similarity with database spectra. For this, InSpectra uses the Universal Library Search Algorithm (ULSA) 30. ULSA minimises the variability observed in the data caused by different acquisition conditions through spectral normalisation as well as the inclusion of multiple sources of information. Additionally, a complete library search is performed for the components, while for the remaining features a molecular formula assignment is performed based on the compounds present in the databases. For spectral matching of components, an initial search in the InSpectra database is performed based on the precursor ion mass, using the mass window associated with each component. On average this mass window ranges between ±10 mDa and ±30 mDa. For each of those spectra, a Final Score (quality of spectral match) is calculated based on seven different parameters, including the number of fragments matched in both user and reference spectra as well as the associated mass errors. The influence (i.e., weight) of each parameter can be specified by the user via a weight vector of seven values ranging between zero and one. For the features in the input list, molecular formula assignment is performed with the US EPA CompTox 37 database. Potential molecular formula matches are scored with a value between 0 and 1, depending on whether the mass difference between the measured precursor ion and the theoretical molecular formula mass is above or below the user-defined mass tolerance. ULSA outputs a list containing all potential candidate identifications or molecular formula assignments with their corresponding Final Score, as well as the list of matched fragments for the components and features, respectively. The output is stored on the platform and its metadata is stored and referenced in the database.
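The combination of seven parameter scores with a user-supplied weight vector can be sketched as a weighted sum; the exact definition of each parameter score is in the ULSA publication, and this is a structural sketch only.

```python
def final_score(parameter_scores, weights):
    """Combine seven per-parameter match scores with a user-supplied
    weight vector (each weight between zero and one) into one Final
    Score. A structural sketch of ULSA's weighting, not its exact
    parameter definitions."""
    if len(parameter_scores) != 7 or len(weights) != 7:
        raise ValueError("exactly seven parameters and seven weights expected")
    if not all(0.0 <= w <= 1.0 for w in weights):
        raise ValueError("weights must lie between zero and one")
    return sum(s * w for s, w in zip(parameter_scores, weights))
```

With all weights set to one and each parameter score in [0, 1], the Final Score ranges up to 7, which is consistent with thresholds being quoted out of 7 later in this section.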
InSpectra’s Library
The database of the library search is sourced from two resources. The first is the MassBank Project 38, which offers in vitro (experimental) spectra: 89,826 distinct spectra covering 15,059 unique compounds and 16,840 unique isomers, of which 25,935 entries have a recorded resolution of 7,500 or above. The library database also includes 700,000 in silico spectra (predicted via CFM-ID 39) from the EPA’s National Center for Computational Toxicology CompTox Chemicals Dashboard, with spectra predicted for EI-MS and ESI-MS/MS in both positive and negative ionisation modes 40. To our knowledge, InSpectra is the only platform capable of searching against such a large spectral library, providing researchers with access to these resources.
Statistical analysis
Identified features can be connected to each other across different samples, enabling direct spatial and temporal trend analysis of chemicals across different matrices. The identity-based alignment functionality currently implemented in InSpectra is essential for the detection of emerging chemical threats.
For unidentified features, the alignment can take place for samples analysed with the same method. InSpectra, through its database, can group the datasets measured via the same methods and align their feature lists and/or components. Similar types of trend analysis can be performed on such outputs, enabling further understanding of the covered chemical space of a sample set.
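Alignment of unidentified features across same-method samples can be sketched as a greedy grouping by m/z and retention-time tolerance. Both the tolerances and the greedy strategy below are illustrative assumptions, not InSpectra's exact algorithm.

```python
def align_features(feature_lists, mz_tol=0.005, rt_tol=0.1):
    """Greedy feature alignment sketch: features from samples acquired
    with the same method are merged into one group when both their m/z
    and retention time fall within tolerance of the group's first
    member. Returns one dict per aligned group."""
    aligned = []  # each entry: {"mz": ..., "rt": ..., "samples": {sample_idx: (mz, rt)}}
    for sample_idx, features in enumerate(feature_lists):
        for mz, rt in features:
            for group in aligned:
                if abs(group["mz"] - mz) <= mz_tol and abs(group["rt"] - rt) <= rt_tol:
                    group["samples"][sample_idx] = (mz, rt)
                    break
            else:
                # No existing group matched; start a new one.
                aligned.append({"mz": mz, "rt": rt, "samples": {sample_idx: (mz, rt)}})
    return aligned
```

The per-group `samples` map then supports trend analysis: a feature's intensity trajectory across time points or locations falls out of the groups directly.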
Finally, for unidentified features analysed via different methods, the current implementation of InSpectra is not able to perform any alignments. The next version of InSpectra will include a validated retention mapping algorithm to seamlessly connect the unidentified features generated via multiple acquisition methods 41.
Example workflows
InSpectra enables the user to combine multiple tools that each have their own functions and goals. A complete overview of paths or tools’ combinations can be seen in Fig. 1. However, to give a better idea of the overall process and possibilities, two frequently used workflows are described below.
Library search identification workflow
One of the most used NTA workflows is feature identification, which identifies known unknowns starting from raw data through to a list of identified features (i.e., spectra matched against a library; see Fig. 3). In this workflow, the HRMS files are converted to mzXML (a common open-source format) and then processed for feature detection using SAFD to obtain the MS1 information on the parent, adduct, isotope, and in-source fragment ions. Feature alignment then groups the features across multiple feature lists or component lists based on their retention time and m/z. The file then undergoes componentisation using CompCreate, which groups information belonging to unique chemical constituents. The componentised file is then searched against the InSpectra database using ULSA. The final step is statistical analysis, which offers multiple tools to analyse the results, either standalone or in the context of other stored processed files and their metadata, such as heatmaps and temporal and spatial trends via the identity-based alignment approach.
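The chaining of these steps can be sketched as a simple pipeline runner; the step functions below are placeholders standing in for the actual tools (msConvert, SAFD, alignment, CompCreate, ULSA) and only illustrate how each step's output file becomes the next step's input.

```python
def run_pipeline(raw_file, steps):
    """Thread each step's output artefact into the next step,
    mirroring how the workflow passes files between modules."""
    artefact = raw_file
    for name, step in steps:
        artefact = step(artefact)
    return artefact

# Placeholder steps: each just derives the next artefact's name.
steps = [
    ("conversion (msConvert)",        lambda f: f.replace(".raw", ".mzXML")),
    ("feature detection (SAFD)",      lambda f: f + ".features.csv"),
    ("feature alignment",             lambda f: f + ".aligned.csv"),
    ("componentisation (CompCreate)", lambda f: f + ".components.csv"),
    ("library search (ULSA)",         lambda f: f + ".candidates.csv"),
]
```

In the real platform each step would launch the corresponding module and record its parameters and version in the database before handing the output onward.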
Suspect screening workflow
Another commonly used NTA workflow is suspect screening, where only a targeted list of chemicals is searched for within the samples (see Fig. 4 for an overview). This algorithm extracts the MS1 and MS2 information specified in the suspect list from the raw data and generates a match factor between the user-provided spectrum and the experimentally measured one. This is a faster process than complete NTA, given that it focuses on specific mass channels (i.e., the monoisotopic masses of the suspect analytes). The suspect screening workflow also skips the classical componentisation step, which further speeds up the process; additionally, it is more sensitive than the NTA workflow due to its more targeted nature. This workflow generates a list of features with their potential structures, isotopic matching, number of matched fragments, and match factors. This information facilitates the analysts’ confidence assessment of the identifications. Additionally, to facilitate suspect screening within InSpectra, InChIKeys can be converted into suspect lists that are ready to be fed to the algorithm.
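A match factor between a suspect-list spectrum and a measured one is commonly computed as a cosine similarity over intensity-paired fragments; the sketch below illustrates that standard approach, and the actual match factor used by the suspect screening workflow may be defined differently.

```python
import math

def match_factor(user_spectrum, measured_spectrum, mz_tol=0.01):
    """Cosine-similarity sketch of a spectral match factor in [0, 1].
    Spectra are lists of (mz, intensity) pairs; peaks are paired
    within an m/z tolerance, and unmatched peaks on either side
    count against the score."""
    paired = []
    used = set()
    for mz_u, int_u in user_spectrum:
        best = None
        for j, (mz_m, _) in enumerate(measured_spectrum):
            if j not in used and abs(mz_u - mz_m) <= mz_tol:
                best = j
                break
        paired.append((int_u, measured_spectrum[best][1] if best is not None else 0.0))
        if best is not None:
            used.add(best)
    # Measured peaks with no suspect-list counterpart also count.
    for j, (_, int_m) in enumerate(measured_spectrum):
        if j not in used:
            paired.append((0.0, int_m))
    num = sum(a * b for a, b in paired)
    den = math.sqrt(sum(a * a for a, _ in paired)) * math.sqrt(sum(b * b for _, b in paired))
    return num / den if den else 0.0
```

A score near 1 indicates that the suspect's diagnostic fragments were all found at matching relative intensities, which is why missing fragments in the suspect list (as discussed below for simazine) can cause a non-detect.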
Demonstration of InSpectra Workflows
Applying the Library Search workflow to samples representing four different matrices (wastewater, stormwater, cow blood extracts, and cow serum extracts) resulted in a mixture of chemicals being tentatively identified, covering multiple classes including antimicrobials (e.g. 1,2-benzisothiazolin-3-one), food constituents (e.g. niacinamide), pharmaceuticals (e.g. ranitidine), and agrochemicals (e.g. simazine) (see Fig. 5). This demonstrates the capability of using a full NTA workflow for biological and environmental analysis, considering the data were acquired on two different vendor instruments using completely independent experimental conditions. The full NTA workflow (Library Search) outputs feature lists, extracted component lists, and candidate lists that could be used for future retrospective analysis and further evaluation. When the same Library Search workflow was applied to stormwater samples collected over a series of time points during multiple storm events but analysed within the same batch, similar chemicals were detected (Fig. 6). In the stormwater samples, the most dominant families of detected chemicals were agrochemicals, pharmaceuticals, and pharmaceutical transformation products (e.g. carbamazepine epoxide). The presence and frequency of detection of these chemicals in stormwater may indicate domestic sources contaminating stormwater, such as wastewater exfiltration. These results demonstrate the applicability of InSpectra as an early warning system for chemicals of emerging concern.
To demonstrate the Suspect Screening workflow, we used chemicals tentatively identified with the Library Search workflow to create a suspect list and processed the same files via the Suspect Screening workflow. In a direct comparison of the results of the two workflows, ~56% of the cases showed complete agreement between the two workflows, while ~20% of the cases were detected only by the Library Search workflow and ~16% only by Suspect Screening (Figs. 7 and 8). For the Library Search workflow we set a Final Score threshold of 4/7, while for Suspect Screening we set a Match Factor threshold of 0.4/1. It should be noted that for Library Search, the Final Score is dependent on the percentage of fragments matched; hence, if an entry has too few fragments, this may lead to a false positive identification. When the recorded signal for both the parent ion and the fragments has a signal-to-noise ratio larger than 5, both workflows are able to detect and identify the chemicals in the analysed samples (Fig. 9). As for low-intensity fragments, the fragment in the MassBank entry for carbofuran did not have high enough intensity to be distinguished as an analytical signal, resulting in a non-detect for the Library Search workflow, while the Suspect Screening workflow was able to detect this chemical in the sample. Finally, for simazine in stormwater, the Library Search workflow was able to detect this chemical whereas the Suspect Screening workflow fell short. In this case, a deeper investigation of the suspect list and the included fragments indicated that some diagnostic fragments were missing from our suspect list. These results showcase how the different workflows in InSpectra are able to extract essential information for the chemical characterisation of complex samples.