Reference data-set driven metabolomics


Abstract
Human untargeted metabolomics studies succeed in annotating only ~10% of molecular features. We therefore introduce reference data-driven analysis, which uses source data as a pseudo-MS/MS reference library to match against human metabolomics MS/MS data. We demonstrate this approach with food source data, which allows an empirical assessment of dietary patterns from untargeted data. The approach is broadly applicable and provides an additional layer of interpretability to metabolomics data.

Main
To understand the complexity of sequence data obtained in metagenomic or metatranscriptomic experiments, researchers interpret the data not only with databases of curated genes but also with reference data such as whole genomes (e.g. of microbes or viruses) or other reference sequence data sets with carefully curated metadata (e.g. developmental stage, tissue location or phenotype). 1-4 Such reference data-driven (RDD) analysis increases understanding of the structure and function of complex communities by leveraging matches between genes or transcripts of known and unknown origin. By analogy, MS/MS-based untargeted metabolomics data are interpreted by searching structural MS/MS libraries; however, reference data that include all the known and unknown MS/MS features are not yet leveraged to further improve the insights that can be obtained from untargeted metabolomics data. To enable RDD analysis of an MS/MS-based untargeted metabolomics experiment, instead of only searching MS/MS structural libraries, as has been done since the late 1970s, 5,6 RDD also searches against MS/MS spectra from well-curated data sets (Figure 1). The key difference is that the output reports contextualized information obtained from the source reference data sets. Source data include MS/MS spectra of multiple ion forms of both known and unknown molecules, including isotopes, adducts, in-source fragments, and multimers. 7,8 The curated reference data can be matched to human biospecimen data via direct matching of the MS/MS spectra or by more sophisticated approaches such as molecular networking. We have created a step-by-step tutorial on how to perform an RDD analysis using the GNPS ecosystem (https://ccms-ucsd.github.io/GNPSDocumentation/tutorials/rdd/). 9
To exemplify RDD, we created a food metabolomics reference data set, as there is an unmet need to retrospectively and empirically read out food and beverage information from human metabolomics data and to complement the current state-of-the-art mass spectrometry nutrition readout approaches, which target up to ~150-200 metabolites. 10,11 The food reference data set consists of untargeted metabolomics data and detailed metadata for ~3500 foods (Table S1). It contains 107,968 unique MS/MS spectra merged from a total of 1,907,765 spectra. These data are accessible through GNPS and archived in the NPJ-recommended repository MassIVE. The food source data can be expanded by creating additional data sets and depositing them in GNPS/MassIVE with their metadata.
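The direct-matching route described above can be illustrated with a minimal sketch. It is not the GNPS implementation: the greedy cosine score, the function names `cosine_score` and `rdd_match`, the dictionary layout of a spectrum, and the thresholds (`tol`, `min_score`, `min_peaks`) are all assumptions chosen for illustration. The point it shows is that a hit returns the curated metadata of the source sample rather than a compound name.

```python
from math import sqrt

def cosine_score(spec_a, spec_b, tol=0.02):
    """Greedy cosine similarity between two centroided MS/MS spectra.

    Each spectrum is a list of (m/z, intensity) pairs. Returns the score
    and the number of matched peak pairs. The tolerance is an assumption.
    """
    # All candidate peak pairs whose m/z values agree within the tolerance.
    pairs = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tol
    ]
    pairs.sort(reverse=True)  # strongest intensity products first
    used_a, used_b, dot, n_matched = set(), set(), 0.0, 0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:  # each peak is used once
            used_a.add(i)
            used_b.add(j)
            dot += product
            n_matched += 1
    norm = sqrt(sum(x * x for _, x in spec_a)) * sqrt(sum(x * x for _, x in spec_b))
    return (dot / norm if norm else 0.0), n_matched

def rdd_match(query, reference, precursor_tol=0.02, min_score=0.7, min_peaks=4):
    """Return (score, metadata) for every reference spectrum matching the query.

    `query` and each reference entry are dicts with hypothetical keys
    "precursor_mz" and "peaks"; reference entries also carry "metadata".
    """
    hits = []
    for ref in reference:
        if abs(ref["precursor_mz"] - query["precursor_mz"]) > precursor_tol:
            continue  # precursor masses disagree, skip the comparison
        score, n_matched = cosine_score(query["peaks"], ref["peaks"])
        if score >= min_score and n_matched >= min_peaks:
            hits.append((score, ref["metadata"]))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Because the reference entries keep their sample metadata (for example a food ontology label), an unknown feature can still be reported as "also observed in this food" even when no structural library contains it.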
For RDD, the food source data are subjected to molecular networking 14,15 together with human metabolomics data sets (Figure 2a). Using information on the controlled research diets of participants in a sleep and circadian study data set, 12 we were able to report whether a given food category was consumed and whether it agreed with the reported diet (Figure 2b). Of the 15 food categories, eight represented direct matches, three represented matches to fermented versions of the non-fermented foods consumed (e.g. yogurt instead of milk), and four were not documented as consumed during the study. Evidence of caffeinated beverage consumption was observed in only two individuals (in the first 48 h in one volunteer and once, in the middle of the study, in a second volunteer), consistent with the elimination of caffeinated beverages in the controlled diet. This demonstrates that RDD can be used to obtain diet information from untargeted metabolomics data and to monitor diet adherence in controlled diet studies.
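The comparison of detected food categories with the diet record can be sketched as a simple set classification. The function name `classify_matches`, the three output labels, and the `fermented_of` mapping are hypothetical and stand in for whatever food ontology a study actually uses.

```python
def classify_matches(detected, consumed, fermented_of):
    """Sort detected food categories by their agreement with the diet record.

    detected:     set of food categories with MS/MS matches in the samples
    consumed:     set of food categories listed in the controlled diet
    fermented_of: dict mapping a fermented category to its non-fermented
                  counterpart (hypothetical mapping, e.g. yogurt -> milk)
    """
    result = {"direct": set(), "fermented": set(), "undocumented": set()}
    for category in detected:
        if category in consumed:
            result["direct"].add(category)
        elif fermented_of.get(category) in consumed:
            # Matched the fermented version of a food that was consumed.
            result["fermented"].add(category)
        else:
            result["undocumented"].add(category)
    return result
```

Categories that land in the undocumented group are candidates either for non-adherence or for shared molecules between foods, so they would normally be inspected rather than taken at face value.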
We also tested mismatched food inventories in a crossover experiment in which US or Italian foods were matched against clinical cohorts from the same region and from the other region. Spectral match rates were 5-6% in the reciprocal (mismatched) tests, compared with 15-30% when the regional foods were used (Figure 2c, p=0.019). These observations show that the RDD concept is applicable to metabolomics and that RDD works optimally when the source data include regiospecific foods.
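A minimal sketch of how such a comparison might be computed is given below, assuming per-sample match rates have already been obtained for the matched and mismatched food inventories. The figure legend names Welch's t-test; the helper names `match_rate` and `welch_t` are illustrative, and only the test statistic and the Welch-Satterthwaite degrees of freedom are computed here (a p-value would additionally need the t distribution, for example from SciPy).

```python
from math import sqrt
from statistics import mean, variance

def match_rate(n_matched, n_total):
    """Fraction of a sample's MS/MS spectra that match the food inventory."""
    return n_matched / n_total

def welch_t(a, b):
    """Welch's unequal-variance t statistic and degrees of freedom.

    a, b: per-sample match rates for the two conditions
          (e.g. regional foods vs. foods from the other region).
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Welch's form is the natural choice when the two groups may differ in variance, as match rates from different cohorts and sample types plausibly do.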
Because RDD can be performed retrospectively, we co-analyzed the food reference data set with 28 public human data sets (Table S2, Figure 2d). RDD increased spectral usage by 5.1±3.3 fold over structural MS/MS library matches. The inclusion of region- or study-specific food data significantly contributed to the increase in spectral matches (Figure 2d; P=0.0028, Games-Howell test). With molecular networking, which can capture metabolized versions of molecules, spectral data usage increased by 6.8±3.5 fold. The data usage increased by 26.8±3.3% for stool data (P=2.8e-16, Games-Howell test), 27.5±5.2% for plasma data (P=0.0040) and 41±4.6% for other human data (P=0.00020). Further inclusion of connected nodes, representing potential metabolism via molecular transformations, results in a total increase of 43.7±3.1% (fecal; P=6.9e-10), 51.2±6.9% (plasma; P=2.8e-06), and 58.0±4.2% (other; P=1.4e-06) of MS/MS spectra that can now be leveraged as an empirical readout of diet (Figure 2d). To assess whether RDD can reveal dietary preferences, a data set of omnivores and vegans was analyzed. PCA of the spectral matches to diet revealed separation between the dietary preferences (Figure 2e), with more MS/MS matches to dairy, meat, and seafood (P=0.0021, 2.2e-10 and 7.7e-7, respectively) in the data from omnivores and more MS/MS matches to legumes, fleshy fruit, and vegetables in the data from vegans (P=2.2e-10, 0.0096 and 0.029, respectively; Figure 2f). Because many MS/MS spectra from foods may overlap, using only the unique MS/MS spectra can provide additional specificity (Figure 2g). When RDD was performed on an Alzheimer's disease population, 16 it revealed that individuals with lower diet diversity consumed more dairy, sugar, soda, and coffee and that this diet type was more prevalent in the Alzheimer's dementia group. This shows that RDD can be used to retrospectively stratify clinical studies based on diet composition.
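The notion of spectral usage at the three levels discussed above (structural library matches, added reference data matches, and expansion to connected network nodes) can be sketched as follows. This is a simplification under stated assumptions: spectra are represented by hypothetical integer identifiers, the molecular network is reduced to a list of edges between those identifiers, and expansion is limited to direct neighbours of reference-matched nodes. The names `spectral_usage` and `fold_increase` are illustrative.

```python
def spectral_usage(all_ids, library_hits, reference_hits, edges):
    """Fraction of a data set's MS/MS spectra that can be interpreted.

    all_ids:        identifiers of all MS/MS spectra in the human data set
    library_hits:   identifiers annotated by structural library search
    reference_hits: identifiers matching the reference (e.g. food) data
    edges:          (id, id) pairs from a molecular network
    """
    all_ids = set(all_ids)
    reference = set(reference_hits) & all_ids
    library = set(library_hits) & all_ids
    direct = library | reference
    # Nodes one edge away from a reference-matched node are treated as
    # potentially metabolized versions of reference molecules.
    neighbours = set()
    for u, v in edges:
        if u in reference:
            neighbours.add(v)
        if v in reference:
            neighbours.add(u)
    expanded = direct | (neighbours & all_ids)
    n = len(all_ids)
    return {
        "library": len(library) / n,
        "rdd": len(direct) / n,
        "rdd_network": len(expanded) / n,
    }

def fold_increase(usage, level="rdd"):
    """Fold change of a usage level relative to library matching alone."""
    return usage[level] / usage["library"]
```

Reporting both the absolute fraction and the fold change matters because a large fold change can arise from a very small library baseline.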
Going forward, data sets of personal care products, medications (not just active ingredients but also formulations), microbiota, microbial isolates, etc. might also be used as source reference data. Potential applications of RDD metabolomics include understanding diet and nutritional intake, medication use, consumption of illegal substances, environmental allergens, food ingredients and adulteration, and personal care products, to inform on potential exposures and health implications.
Methods, data and code availability, supporting tables 1 and 2 and supporting figure 3 are available as supporting information.

Figure 1 Retrospective reference data-driven analysis workflow. a. Traditional untargeted mass spectrometry analysis based on structural library matching. b. Analysis that integrates the use of reference MS/MS data. RDD can leverage known and unknown mass spectrometry features. Step 1: Dataset selection; Step 2: Library search against compound-based reference libraries; Step 3: Spectral alignment to identify related or identical features across samples; Step 4: Visualization of spectral similarity by molecular networking. Settings for spectral matching are set to ~1% FDR estimated by Passatutto. 13

Figure 2 RDD with food reference data. a. Food RDD analysis schema. b. Food spectral counts (1% FDR 13) observed in plasma from a sleep restriction and circadian misalignment study that controlled the diet of the participants. 12 Solid circles represent MS/MS matches to foods consumed during the study, whereas grey circles represent MS/MS matches to fermented versions of foods consumed. c. A crossover experiment between centenarian data from Italy and a sleep and circadian study from the US, for both fecal and plasma samples. Study region-specific foods consumed by those individuals (yes) vs a different set of study region-specific foods (no) (Welch's t-test). d. Library spectral matches (initial), spectral matches to region- or study-specific foods (SSF), spectral matches to the food reference data collected via the Global FoodOmics project (GFOP), both (SSF & GFOP), and expansion with molecular networking (Total). Left, stool data; middle, plasma data; right, other human biospecimens. Significant differences are determined by Welch's F-test. e. PCA of the food counts color coded by vegan (brown) vs omnivore data (green) using level 3 food ontology. f. Statistical analysis of the food spectral match level 3 ontology counts in relation to omnivore and vegan data (Wilcoxon test). g. Same as f. but with level 4 ontology and unique spectral counts. Animal products also contain detectable molecules from the animals' diets, resulting in spectral matches to vegan data.

Supplementary Files
This is a list of supplementary files associated with this preprint.
June2021SIRDDReferencedatabasedmetabolomicsNatureBiotechsubmission.docx
TableS220200701GlobalFoodOmicsmetadataanddescriptors.xlsx
ReportingSummary.pdf
EditorialChecklist.pdf