BUDDY: molecular formula discovery via bottom-up MS/MS interrogation

A substantial fraction of metabolic features remains undetermined in mass spectrometry (MS)-based metabolomics, and molecular formula annotation is the starting point for unraveling their chemical identities. Here we present bottom-up tandem MS (MS/MS) interrogation, a method for de novo formula annotation. Our approach prioritizes MS/MS-explainable formula candidates, implements machine-learned ranking and offers false discovery rate estimation. Compared with the mathematically exhaustive formula enumeration, our approach shrinks the formula candidate space by 42.8% on average. Method benchmarking on annotation accuracy was systematically carried out on reference MS/MS libraries and real metabolomics datasets. Applied on 155,321 recurrent unidentified spectra, our approach confidently annotated >5,000 novel molecular formulae absent from chemical databases. Beyond the level of individual metabolic features, we combined bottom-up MS/MS interrogation with global optimization to refine formula annotations while revealing peak interrelationships. This approach allowed the systematic annotation of 37 fatty acid amide molecules in human fecal data. All bioinformatics pipelines are available in a standalone software, BUDDY (https://github.com/HuanLab/BUDDY). BUDDY is a bottom-up tandem MS (MS/MS) interrogation method for de novo molecular formula annotation with significance estimation.

MS/MS reveals the structural information of chemicals by measuring the mass-to-charge ratios (m/z) of ions fragmented from selected precursor ions. MS/MS is essential in liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics for metabolite identification. A routine process is to match experimental MS/MS against reference MS/MS spectra [1][2][3][4] . However, given the great sensitivity of MS instruments, a considerable amount of uncharacterized chemical signatures ('dark matter') remains in untargeted metabolomics 5 . These unidentified features might present unique bioactivities and play critical roles in understanding biological mechanisms. Unfortunately, these features do not have available reference spectra in spectral databases or have not yet been reported in literature ('unknown unknowns' 6 ). As such, unknown annotation has become a challenging yet pivotal research topic in metabolomics, exposomics and others 1,[7][8][9][10][11][12][13][14] .
Molecular formula determination is a great starting point to illuminate unrecognized metabolites, as unambiguous molecular formula annotation can dramatically reduce the number of potential chemical candidates. A common practice for molecular formula annotation relies on mass searching against metabolome databases (for example, Human Metabolome Database (HMDB) 15 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 16 and Chemical Entities of Biological Interest (ChEBI) 17 ) or even larger chemical databases (for example, PubChem 18 and ChemSpider 19 ). However, this approach is substantially affected by the mass accuracy, limiting the search scope within existing databases and thus hindering the discovery of novel formulae. Moreover, it does not fully utilize available information such as the isotope pattern in MS1 (the first stage of MS scan) and MS/MS. MZmine first included isotope pattern matching, MS/MS fragmentation analysis and a combination of heuristic rules 20 in molecular formula determination 21 . SIRIUS annotates molecular formulae by producing all mathematically possible formulae 22 , decomposing isotope patterns 23 and integrating fragmentation trees 24 into MS/MS analysis. SIRIUS works in a top-down manner in which formula candidates are first generated using MS1 information, and MS/MS explanation comes last. Figure 1 illustrates the schematic workflows of top-down and bottom-up approaches for molecular formula annotation. The top-down approach (for example, SIRIUS 7 ) generates the entire potential candidate space using MS1 information, assesses the MS/MS interpretability for each formula candidate (for example, fragmentation tree 24 ) and ranks them by heuristically combining their MS1 and MS/MS scores. In comparison, our bottom-up approach has three distinct features, including (1) generation of an MS/MS-explainable candidate space, (2) machine learning-assisted candidate ranking and (3) FDR estimation.
Bottom-up MS/MS interrogation starts with decomposing the query MS/MS into multiple fragment-neutral loss pairs. The masses of the fragments and neutral losses are individually searched against the Bottom-up processing represents specifying individual base components separately and assembling them into a top-level system. Bottom-up approaches have been widely adopted in neighboring fields such as shotgun sequencing 25,26 and shotgun proteomics 27,28 , where sheared DNA strands and digested proteins are analyzed. Here we propose bottom-up MS/MS interrogation to enable accurate molecular formula determination with significance estimation 10,29 . While a couple of bioinformatics tools have integrated MS/MS analysis into their candidate ranking 7,9,21,24 (formula or structure candidates), we prioritize the significance of MS/MS in candidate generation. The 'bottom-up' methodological design dramatically shrinks the candidate searching space and allows for the broad discovery of unreported biochemically feasible formulae. Implementing machine-learned ranking (MLR) and false discovery rate (FDR) estimation enhances the annotation accuracy and enables significance control, respectively. Furthermore, we embed global optimization on an experimental basis, aiming to refine formula annotations of individual peaks while revealing peak-peak Article https://doi.org/10.1038/s41592-023-01850-x well-curated formula database archiving >3.5 million unique molecular formulae from 26 chemical databases (Methods) to produce subformula candidate sets. During the search, both even-electron and odd-electron (radical) ion species are considered by searching against either the hydrogen-adjusted formula database (for example, C 6 H 7 ) or the original neutral molecular formula database (for example, C 6 H 6 ). This could be unexpectedly crucial given that >60% of the reference spectra in the NIST Tandem Mass Spectral Library 2020 (NIST20) contain ≥10% odd-electron fragment ions 30 . Next, the subformula candidates for each fragment-neutral loss pair are stitched together to produce a unique formula candidate space in which all the formula candidates can explain this fragment-neutral loss pair at the valid subformula level. As such, each fragment-neutral loss pair adds to the size of the entire MS/MS-explainable candidate space as a new dimension. Candidate spaces contributed from all fragment-neutral loss pairs are merged, dereplicated and then subjected to SENIOR rules 31 to filter out chemically implausible formulae (for example, CH 5 ). The above processes (Methods and Extended Data Fig. 2) create a pooled space in which any candidate can explain at least one peak in the query MS/MS. MS1 isotope pattern matching is conducted optionally to generate MS1 isotope similarities. The following MLR assesses both the intrinsic properties of candidate formulae and their performance in MS/MS interpretation, and an MLR prediction score is assigned to each candidate. MLR scores are then converted into estimated posterior probabilities using Platt calibration 32 , and FDR estimates are calculated accordingly (Methods).
In our approach, searching the masses of fragments and neutral losses within the curated formula database allows us to focus on the molecular formulae that are composed of two database-archived subformula blocks. The process of stitching the subformulae of fragments and neutral losses together makes it possible to create novel precursor molecular formulae that extend beyond the current chemical space. Specifically, our bottom-up approach makes annotations within a space of trillions of both known and unknown molecular formulae, which are derived from >12 trillion (3.5 million × 3.5 million) possible subformula combinations.

Candidate space shrinkage
The first key feature of the bottom-up approach lies in its capability to drastically shrink the candidate space for lower computational cost and higher annotation accuracy. To demonstrate this, we performed a repository-scale candidate space comparison between the top-down and bottom-up approaches using the curated NIST20 library (21,976 unique chemicals; Methods). For each query MS/MS, we first generated its entire potential candidate space via mathematical mass decomposition 22 , which encompasses all the molecular formulae that match the precursor mass. The MS/MS-explainable candidate space was then produced using bottom-up MS/MS interrogation.
The results show that a narrower searching space can be obtained in 87.8% of total queries by prioritizing MS/MS-explainable candidates (Fig. 2a). Compared with the top-down approach, the bottom-up approach shrinks the candidate space by 4.2% to >99.9%, 42.8% on average (median of 40.8%). To further explore the candidate space shrinkage in the domain of precursor mass, we plotted the candidate counts of all queries in both spaces as in Fig. 2b. Overall, both candidate spaces expand as precursor mass increases, but the MS/MS-explainable candidate space grows at a drastically lower rate. As estimated by locally weighted scatterplot smoothing, the MS/MS-explainable candi date space is 7.9-fold narrower than the entire space at m/z 400, 12.2-fold at m/z 600 and 20.7-fold at m/z 800. A more obvious size difference between the two spaces can be observed at higher m/z. Even for m/z < 200, a statistical significance was found between the space sizes (P < 2.2 × 10 −16 ). However, it is noteworthy that the extent of candidate space shrinkage relies on spectral quality; unwanted noise or isotopic peaks can produce formula candidates that contaminate the MS/MS-explainable candidate space. To this end, we consider it significant to conduct MS/MS data preprocessing (Methods and Supplementary Notes 1 and 2) to obtain cleaner spectra before formula annotation.
The MS/MS-explainable candidate space can be shrunk further by applying cutoffs on MS/MS explanation qualities, such as the total fragment intensity explained (normalized, in percentage). However, the process of candidate space shrinkage is always accompanied by the risk of mistakenly disposing of correct answers outside the candidate space, which we term misdisposition. This means that there are extreme cases where the correct formula cannot explain a single fragment-neutral loss pair in its MS/MS. The chance of misdisposition remains as low as 0.05% (1 of 2,000 annotations). With a higher threshold of the total fragment intensity explained, the candidate  The chance of disposing of correct answers (that is, misdisposition rate) also increases with a higher cutoff but remains <2% when the cutoff of total explained fragment intensity is 50%.
Article https://doi.org/10.1038/s41592-023-01850-x count reduces further at the cost of a higher misdisposition rate (Fig. 2c). In our bottom-up approach, we do not include any additional MS/MS-related cutoffs for candidate removal as experimental MS/MS can be informative to varying extents or even be contaminated due to collision energy or precursor window width selections 33,34 . Hence, we prioritize a more accurate candidate ranking process while reserving a relatively reasonable fraction of formula candidates.

Accuracy of bottom-up molecular formula annotation
The  Article https://doi.org/10.1038/s41592-023-01850-x and is thus anticipated to have a higher chance of being correctly annotated. We inspected annotation accuracy as a function of spectral entropy for compounds of different m/z ranges (Fig. 3c). As expected, for spectra with m/z < 400, MS/MS spectral information does not add much to the formula prediction accuracy, as we are searching within a rather narrow candidate space. However, more diverse fragments are needed to enhance the annotation performance on MS/MS spectra with m/z ≥ 400, and the trend of this contribution is more obvious for even larger masses (m/z > 500). In this regard, we highly recommend the acquisition of information-rich MS/MS data for accurate downstream interpretations.
Bottom-up molecular formula annotation was also evaluated on experimental data using four public liquid chromatography-MS/MS (LC-MS/MS)-based metabolomics datasets collected from various sample matrices, MS instruments and ionization polarities (Supplementary Table 3). Ground truths were constructed via spectral matching (level 2a (ref. 38)), where a dot product score >0.7 and a minimum of 6 matched fragments were required. We identified 146, 194, 83 and 118 metabolites in the tested datasets shown in Fig. 3d, from top to bottom, respectively. BUDDY consistently outperforms SIRIUS for all datasets in both annotation accuracy and computational speed (Fig. 3e,  Supplementary Tables 3-7 and Supplementary Fig. 1). The reasons for the slightly worse performance in the American Gut Project 39 dataset (78.1%) are complex (Extended Data Fig. 4), including low MS/MS spectral quality, MS instrumental mass accuracy (only the American Gut Project dataset was collected on Quadrupole Time-of-Flight (Q-TOF) and not Orbitrap) and others. Additionally, we tested how MS/MS spectra aided in formula annotation and generated annotation results without using MS/MS (Fig. 3f); metabolic features without MS/MS data can also be annotated in BUDDY. In all cases, the employment of MS/MS was positive, and the most significant improvement occurred in the Chagas disease dataset (20.6%). Moreover, BUDDY offers the option to use meta-scores 40 (for example, appearance in chemical databases) for the bottom-up annotation. In reanalyzing the above metabolomics datasets with meta-scores, an increased annotation accuracy (up to 13.2%) was observed in all datasets (Fig. 3g).

Platt calibration and FDR estimation
MLR prediction scores are transformed into calibrated probabilities, on top of which FDR estimation can be achieved (Methods). Using MassBank 4 as an initial validation, we explored MLR score distributions of both correct and incorrect annotations (

Bottom-up MS/MS interrogation discovers unreported formulae
As an initial test of bottom-up MS/MS interrogation to unravel unrecognized metabolites, we collected MS/MS spectra of five novel compounds discovered in references 10,11 , three of which were confirmed by nuclear magnetic resonance experiments. In all cases, BUDDY ranked the correct molecular formulae as top 1 (Extended Data Fig. 6), with the estimated FDR ranging from 0.6% (C 33 H 49 NO 5 , m/z 540.3690) to 16.3% (C 14 H 20 N 4 O 2 S, m/z 309.1383).
As a repository-scale application, we retrieved MS/MS libraries of annotated recurrent unidentified spectra (ARUS) 41 collected from human plasma and urine samples. These MS/MS spectra were obtained in data-dependent acquisitions on an Orbitrap using either higher energy collision dissociation or ion trap fragmentation. Adduct annotation was performed using the National Institute of Standards and Technology (NIST) pipeline and initially included in the ARUS MS/MS libraries. MS1 isotopic data are not available. Data preprocessing led to 155,321 unidentified spectra (Methods). Overall, bottom-up MS/MS interrogation resolved 153,079 spectra (98.6%) and determined 5,191 formulae absent from HMDB or KEGG at <5% estimated FDR (Fig. 5a  and Supplementary Tables 8-11). In particular, we discovered 173 and 53 completely novel molecular formulae absent from PubChem in plasma and urine, respectively, 215 of which formulae were also absent from ChemSpider 19 . We manually inspected the MS/MS of a novel formula (m/z 674.4387, C 35 H 64 NO 9 P; Fig. 5b) and annotated its structure as PC(18:2/9:0(CHO)) using characteristic fragmentation patterns (Supplementary Note 4). Moreover, to demonstrate the biochemical feasibility of newly discovered formulae, we tried to connect them with existing formulae via common biochemical transformations ( Fig. 5c and Supplementary Table 12). Transformations with up to two steps can link 95.4% (plasma) and 100% (urine) of novel formulae to known formulae (Supplementary Table 13). The remaining unlinked formulae are large-mass formulae with >60 carbons and could all be represented as homologs of known lipids with different acyl side chains.
We additionally performed an orthogonal validation against an MS/MS matching-based structural analog search (Supplementary Note 5), and consistent formula annotation results were obtained on 97.8% of tested spectra at <5% estimated FDR (Supplementary Table 14).    We showcased an HMDB-absent annotation of glycine-conjugated 12-ketolithocholic acid (Fig. 5d). Searching this MS/MS against the entire public data repository using Mass Spectrometry Search Tool 42 , this bile acid derivative was detected broadly in 28 metabolomic datasets collected from humans, mice and bacteria. Another example of xenobiotic compounds detected in humans is depicted in Fig. 5e, where zolpidem is known as a hypnotic drug for patients with insomnia.

Experiment-specific global peak annotation
Inspired by references 11,13 , we extended the bottom-up MS/MS interrogation method beyond individual feature annotations up to the molecular network level to reveal peak-peak relationships on an experimental basis (Methods). In brief, the goal of global peak annotation is to select the optimal molecular network while considering both candidate posterior probabilities of individual features (node) and valid feature interrelationships (edge). This process updates the low-confidence peak annotations with the aid of MS/MS similarity-based peak mutual connections. Notably, BUDDY incorporates an additional step of experiment-specific mass deviation estimation to boost its annotation performance (Supplementary Note 6). We reanalyzed the tomato dataset ( Supplementary Fig. 3) using three mass tolerances (5, 10 and 15 ppm). Even when the mass tolerance tripled to 15 ppm, the top 1 annotation accuracy only dropped by 1.0%. This distinct feature allows BUDDY to capture the actual mass deviations better and, on the other hand, provide more accurate FDR estimates. Furthermore, we conducted an experiment-level formula annotation comparison between BUDDY global peak annotation and ZODIAC 13 on all the metabolomics datasets mentioned above. BUDDY global peak annotation outperforms ZODIAC, with higher annotation accuracy and much lower computational cost (Supplementary Table 15).

Application of BUDDY in untargeted metabolomics
As a complete application of BUDDY in untargeted metabolomics, we analyzed the NIST human fecal material standards dataset (MSV000086989), as shown in Fig. 6a. Of 6,215 extracted metabolic features, 213 features (3.4%) were identified (level 2a) via an MS/MS library search against NIST20 using a dot product score >0.7 and matched fragment count ≥6. BUDDY further annotated 5,733 unidentified features (92.2%); global peak annotation linked 5,092 annotated features (81.9%) to identified metabolites directly or indirectly, paving the way for downstream annotation propagation. At <5% estimated FDR, we discovered 134 formulae absent from HMDB and 215 formulae absent from KEGG (Extended Data Fig. 7).
Next, we investigated the impact of global peak annotation from two aspects, individual peak annotations (node) and peak interconnections (edge). The global peak annotation process updated individual peak annotations for 1,348 (21.7%) features (Fig. 6b). Further analyses indicate that the annotation-changed peaks have comparably larger m/z values than unchanged peaks. Also, there were more MS/MS-unassigned peaks (65.7%) in annotation-changed peaks than in unchanged peaks (45.0%). This agrees with our intuition that higher m/z features without MS/MS spectra tend to be annotated with low confidence and are thus prone to be corrected in global peak annotation. On the other hand, global peak annotation increased the total count of peak interconnections by 53.7%. Focusing on the edges with MS/MS similarity ≥0.6 using the modified cosine score 43 , 16.6% more MS/MS edges were constructed with updated formula annotations. We specifically showcased one annotation-changed peak in Fig. 6c. After global peak annotation, this peak was successfully connected to five identified metabolites and eight formula-annotated peaks via MS/MS edges. Corresponding biochemical transformations further reveal the structural insight, a dihydroxylated long-chain fatty acid, supported by subformula annotations of MS/MS fragments (Supplementary Fig. 4).
Finally, we systematically explored novel fatty acid amide (FAA) molecules in human feces. FAAs are essential signaling compounds that can be synthesized both endogenously and by the gut microbiota. Bacteria in the gut microbiota produce structural mimics of endogenous FAAs to interact with human cellular targets and thus modulate their hosts 44 . We generated 1,325 plausible FAAs using the fatty acids and amines present in the human gut 45 and searched their molecular formulae against BUDDY annotations. Putative FAAs were first refined via characteristic fragmentation patterns. Further confirmation was performed by matching MS/MS against reference MS/MS of the corresponding fatty acids or amines using the reverse dot product (Supplementary Table 16). This exploration resulted in 37 annotated FAA molecules (Fig. 6d), and nine were molecular formulae absent from HMDB and KEGG. We manually investigated the feature annotated as N-valeryl histamine, which has an MS/MS spectrum that shows a reverse dot product of >0.99 with histamine (Extended Data Fig. 7). The local molecular network surrounding N-oleoyl ethanolamine (Fig. 6d) helps further reveal other N-acyl ethanolamine analogs through MS/MS similarities and biochemical transformations. To examine whether FAA molecules are diet-related or indicate dietary preferences, we inspected their patterns in human feces between omnivores and vegans (27 samples per group). Using a fold change cutoff (>1.2) and statistical analysis (adjusted P < 1 × 10 −3 ), 30 FAA molecules were found to be significantly altered between omnivores and vegans (Fig. 6e). These findings show great potential for revealing the mechanisms of host-gut microbiota interactions and interindividual differences in dietary response from the perspective of small molecules.

Discussion
In bottom-up processing, individual base blocks are specified and assembled into a top-level system. Herein, we propose an MS/MS interpretation strategy in a bottom-up fashion, where each fragment ion and neutral loss pair offers a unique dimension to aid formula annotation. In the bottom-up approach, the MS/MS-explainable candidate space could be substantially narrower than the entire candidate space (for example, ~20-fold at m/z 800). This design considerably reduces the computational cost for large-mass features from hours to seconds per query. More importantly, incorporating MLR and FDR estimation significantly improves the performance of bottom-up interrogation compared with the top-down approach (up to a 68.7% accuracy increase) in uncovering both known and unknown molecular formulae while informing estimates of annotation confidence.
A more distinguished advantage of the bottom-up strategy is that it allows exploration of novel formulae beyond the current chemical space while focusing on biochemically feasible ones. Although metabolome databases such as HMDB and KEGG often serve as Article https://doi.org/10.1038/s41592-023-01850-x reference repositories for biochemical studies, we confidently discovered >5,000 database-unarchived molecular formulae in human samples and implied their potential biological interests. These novel formulae could also complement molecular-formula-oriented data acquisition approaches, such as HERMES 46 . At the same time, it remains controversial whether to integrate meta-scores (for example, citation frequency) in unraveling metabolic identities or not; meta-scores can help annotate known molecules 40 at the cost of potentially discarding novel metabolites 10 . To this end, we offer an option of meta-score inclusion in BUDDY.  Our MLR module was dedicatedly designed to automatically integrate MS1 and MS/MS scores and rank formula candidates. Careful consideration went into mimicking experimental conditions and improving the generalizability of trained models, including precursor m/z shifting and MS1 isotope simulation. Data augmentation employed in training spectra further enhances the model robustness.
Certainly, MS/MS spectral quality is essential for annotation performance. Contamination fragments diminish the annotation confidence by lowering MS/MS explanation coverage or, even worse, generating incorrect candidate formulae that misguide subsequent interpretation. In this regard, annotation of unidentified features may benefit from targeted MS/MS profiling 46 or more advanced data-driven spectral deconvolution approaches [47][48][49] . We also highly recommend that metabolomics practitioners consider mixed or ramped collision energies to collect information-rich experimental MS/MS spectra in an untargeted manner.
Efforts have been made to control FDR in MS/MS spectral matching-based metabolite identification 29,37,50 , yet the FDR estimation for unknown annotation 10 remains elusive. In our approach, Platt scaling monotonically maps MLR prediction scores onto a probability scale. However, note that accurate FDR estimates are expected only when experimental data quality is comparable to that of reference data. This means that both explicit MS1 isotope patterns and informative MS/MS spectra are responsible for reliable FDR estimates. We argue that accurate FDR control in untargeted metabolomics is still in its infancy, and both heuristic insights and statistical approaches should be involved.
Furthermore, we integrate bottom-up MS/MS interrogation with global optimization, aiming to synergistically resolve all the individual metabolic features on an experimental basis. For a more comprehensive and confident global annotation in untargeted metabolomics, combining orthogonal information, such as retention time (RT) [51][52][53] and collision cross-section 54 , will be considered for future improvements. Meanwhile, molecular networking should embrace a variety of MS/MS similarity measures 8,55-57 for structural analog searches to construct faithful connections among MS/MS spectra. Finally, in the pursuit of unambiguous structure annotation, the next generation of global networking may rely on a more convincing level of molecular substructures 58 beyond molecular formula. With the advent of cheminformatics tools for in silico generation of hypothetical metabolites 59,60 , we anticipate more structural insights into the undiscovered biosphere in recognition of a broad scope of unreported small molecules and their biological functions.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-023-01850-x. Article https://doi.org/10.1038/s41592-023-01850-x

Generation of MS/MS-explainable candidate space
We here describe the algorithmic details of generating MS/MSexplainable candidate space. A schematic illustration is shown in Extended Data Fig. 2. Let D be the curated neutral formula database; we prepare D +H and D −H based on the database D by respectively adding or removing a hydrogen atom from each neutral formula record, with all the monoisotopic masses recalculated.
Given an MS/MS spectrum with precursor mass M P and fragment masses {M F,1 …M F,i …M F,m }, let {M NL,1 …M NL,i …M NL,m } denote the masses of neutral losses, which can be easily calculated as M NL,i = M P − M F,i . For simplicity but without loss of generality, suppose that this MS/MS is positively ionized.
We first perform the mass searching process. Let M e be the mass of an electron. For each fragment i, . Given a predefined mass tolerance, we search mass M ′ F,i against database D +H (D −H if negatively ionized) to obtain candidate formula set C F,i,even in the case that fragment i is an even-electron ion. Simultaneously, we search M ′ F,i against the database D to obtain the candidate formula set C F,i,odd in the case that fragment i is an odd-electron (radical) ion. Similarly, the neutral loss mass M NL,i is searched against the databases D and D −H (D +H if negatively ionized) to obtain the candidate sets C NL,i,even and C NL,i,odd , respectively.
Next, we conduct the formula stitching, which adds a fragment formula and a neutral loss formula together to produce a precursor formula. For example, the formula stitching result of C 3 H 5 + and H 2 O is C 3 H 7 O + . Note that during the stitching process, even-electron fragment ions only pair with even-electron neutral losses, and odd-electron fragment ions only pair with odd-electron neutral losses. For each fragment i, we generate the formula space Ω i,even by stitching every candidate formula in the set C F,i,even with every candidate formula in the set C NL,i,even . Similarly, the formula space Ω i,odd is generated by iterating over all the combinations of C F,i,odd and C NL,i,odd . The formula spaces Ω i,even and Ω i,odd are then merged to produce the formula space Ω i , the space in which each candidate can explain the fragment i and its neutral loss at the formula level.
Finally, all formula spaces {Ω 1 …Ω i …Ω m } are combined into a pooled MS/MS-explainable candidate space, Ω. The space Ω is then dereplicated and subjected to SENIOR rules 31 to filter out chemically infeasible candidate formulae. This leads to the final MS/MS-explainable candidate space, ready to be further refined and ranked. Notably, databases D +Na or D +K will also be prepared and searched against if the query MS/MS is in the sodiated or potassiated adduct form to consider fragment ions in the form of [F + Na] + or [F + K] + .  2). Other settings regarding formula annotation include elemental restriction, chemical database restriction, elemental ratio restriction 20 and so on. Experiment-specific global peak annotation is modeled as an integer linear programming problem, and it is a parameter-free module. BUDDY supports the import of both single and batch queries (metabolic feature table, MGF or mzML file).
On the technical aspect, BUDDY was written in C# on the Universal Windows Platform. To speed up computation, BUDDY automatically invokes multiple processing cores and performs parallel programming. More details can be found in the user manual (https://github.com/HuanLab/BUDDY). A quick start guide, in the form of a YouTube video (https://www.youtube.com/watch?v=Ne_ Y0vZ0WKI), is also available.

Formula database construction and curation
To construct the neutral formula database, a total of 26 chemical repositories were downloaded, combined and curated. These chemical repositories cover small molecules of metabolites, lipids, natural products, xenobiotics, drugs, toxins, contaminants and so on ( Supplementary  Fig. 5 and Supplementary Table 17). Currently, the neutral formula database embedded in BUDDY has 3,514,066 unique valid molecular formula records in total, incorporating African Natural Products Database (2,157) (1,060); the unique molecular formula record count in each repository is shown in parentheses. All of the above-mentioned chemical repositories were downloaded after December 2020. For each repository, we retrieved the following chemical information if provided: chemical name, molecular formula, CAS number, PubChem CID, KEGG ID, HMDB ID, SMILES, InChI and InChIKey. The PubChem Identifier Exchange Service (https:// pubchem.ncbi.nlm.nih.gov/idexchange) was used for conversion among various chemical identifiers. Chemical elements were restricted to C, H, N, O, P, S, F, Cl, Br, I, Si, B, Se, Na and K. Charged and radical formulae were neutralized by a proton or hydrogen adjustment. Formulae with double-bond equivalent (DBE) values ≥−5 and monoisotopic masses ≤1,500 were reserved. All formula strings were normalized using the Hill system and dereplicated. The final neutral formula database serves as (1) the original formula searching database, (2) the candidate formula database when database restriction is applied and (3) the formula database for distribution analyses assisting MLR feature generation (see below).

Radical fragment ion
An important feature of BUDDY lies in its automatic recognition of radical fragment ions in MS/MS spectra. BUDDY shows no bias in recognizing odd-electron (radical) fragment ions and even-electron fragment ions. Similar approaches are also applied to neutral loss searching. Details are discussed in Supplementary Note 7.

MS1 isotope similarity
To compare experimental MS1 isotopes and theoretical isotopes of formula candidates, the isotope similarity calculation is implemented Nature Methods Article https://doi.org/10.1038/s41592-023-01850-x in BUDDY. Here we adopted and modified the isotope similarity algorithm from ref. 21 (Supplementary Fig. 6) for its broad scalability and low computational cost. We tested the validity of the isotope similarity algorithm (Supplementary Note 8 and Supplementary Fig. 7), and the results show that experimental isotopes have statistically significantly higher similarity scores with theoretical isotopes of groundtruth molecular formulae than other formulae generated within a 5-ppm mass tolerance.

MLR
In BUDDY, formula candidates are automatically ranked through MLR.
Here we describe the MLR feature design, training data preparation, data augmentation, model training and hyperparameter optimization. The above feature mapping process addresses the potential feature scaling differences among various MLR features. An additional consideration goes into that the training data might be insufficient to cover the entire range of values for every MLR feature, and this scaling process enables the mapping of extreme experimental data that have not been seen by the model onto the same space [0, 1] as the training data. More specifically, the broad range of all the possible unseen values is mapped onto a dense scale which indicates their rareness. This could enhance the model robustness and benefit the discovery of unknown molecular formulae.
Training data preparation. NIST20 was purchased from NIST through Isomass Scientific. The high-resolution NIST20 database contains 1,026,717 MS/MS spectra for 27,613 unique compounds. For model training purposes, NIST20 was curated as follows. First, we removed MS/MS spectra with isotopic adducts (for example, MS/MS spectra for M + 1 ions), multiply charged adducts or adducts containing chemical elements beyond the aforementioned elemental list (for example, [M + Li] + ). MS/MS spectra without InChIKey information were also discarded. To ensure structure-disjoint training, we performed spectral merging next (see refs. 7,9,13 for the importance of structure-disjoint training). For each unique compound in NIST20 (identified by InChIKey), we chose its most frequent adduct form and performed spectra merging for MS/MS collected under different collision energies. Both fragment m/z values and ion intensities were averaged. The above curation processes led to 25,456 positively ionized and 11,583 negatively ionized unique chemicals in NIST20.
Another training data preparation step was precursor m/z shifting. During the MLR prediction, the MLR feature 'precursor m/z error' is essential, and it is calculated as the absolute mass deviation between the experimental precursor m/z and the theoretical m/z of each candidate formula. Note that all precursor m/z values provided in NIST20 are theoretical values generated from the reference molecular formulae. These masses cannot be directly used for model training; the model would be trained that the candidate formula with the smallest absolute precursor m/z error (an absolute m/z error of 0) is the correct answer. Hence, we shifted the theoretical precursor m/z values within a specific range for experimental simulation. As described in ref. 23, mass deviations usually conform to the Gaussian distribution with a standard deviation that is 1/3 of the relative mass tolerance. However, we found lower real-case mass deviations, where the standard deviations of Gaussian curves are closer to 1/5 of the common relative mass tolerance for high-resolution MS data. We thus randomly sampled mass deviations from Gaussian curves within the mass tolerance range and shifted precursor m/z values from their theoretical values by the preselected mass deviations. The following mass tolerances were used: Orbitrap, 5 ppm; Q-TOF, 10 ppm. Each precursor mass was shifted to simulate experimental precursor masses. The MLR feature 'precursor m/z error' was then generated using the shifted precursor mass and the theoretical mass of each candidate formula.
Similarly, MS1 isotope simulation was performed to mimic the experimental conditions. We designed an MS1 isotope simulation workflow and validated its credibility (Supplementary Note 10, Supplementary Fig. 10 and Supplementary Table 21). These simulated isotopic patterns were later compared with theoretical isotopic patterns of candidate formulae to compute the isotope similarity scores used for model training.
Data augmentation. As MS/MS spectra from multiple collision energies in NIST20 were merged for unique chemical compounds before training, there can be many more fragments in the merged MS/MS than in the experimental MS/MS. Training data were thus augmented to enhance the robustness and generalizability of MLR models (Supplementary Note 11). For clarification, we illustrated the entire MS/MS preprocessing for model training in Supplementary Fig. 11.
Model training. MLR models were trained using LightGBM in the ML.NET framework. Both data subsampling and MLR feature subsampling were applied to avoid overfitting. Data subsampling fraction was set as 0.9 and performed every 5 iterations. MLR feature sub sampling fraction was set as 0.8. We set the learning rate as 0.005, and normalized discounted cumulative gain was used as the evaluation metric (Supplementary Note 12). The entire available training data were split into 80/20 for training and testing purposes. Notably, to ensure the structure-disjoint property, augmented spectra derived from the same reference MS/MS always belong to the same set-structures seen by the training data were hidden from the testing set. Hyperparameters were optimized via grid search. These hyperparameters include L2 regularization term weight, number of iterations, maximum bin count per feature and maximum number of leaves in one tree (Supplementary Table 22).

Platt calibration and FDR estimation
In classification problems, probability calibration helps scale the distorted class probability distributions predicted by various classification models so that uninterpretable model prediction scores can be converted into corresponding class probabilities. Similarly, our MLR task can be simplified as a binary classification problem to some extent by dividing all the candidate formulae into two groups, correct and incorrect answers. Thus, probability calibration can also be applied to an MLR score to estimate its posterior probability P(correct|MLR score), the probability that a candidate is correct given the MLR prediction score. Platt calibration 32 (or Platt scaling) is a common approach for probability calibration that learns a logistic regression model which maps scores X ∈ ℝ onto a scale of P∈[0,1]. We completed Platt calibration using the Platt calibrator of ML.NET in C#.
Article https://doi.org/10.1038/s41592-023-01850-x Furthermore, inspired by refs. 10,29, we estimated the FDR using posterior probabilities obtained by Platt calibration. FDR is defined as FDR = FP/ (TP + FP), where TP and FP stand for true positives and false positives, respectively. Given a score threshold x 0 , FDR reflects the percentage of incorrect hits among all the hits receiving scores ≥x 0 . In our case, considering the top N hits for a query metabolic feature that are associated with a certain score threshold, the total number of TP and FP is N (TP + FP = N). As the posterior probability can be interpreted as the expected value for a hit to be correct, we can estimate the number of TP within top N hits as the sum of their posterior probabilities N ∑ i=1 P i . FDR can thus be estimated as In practice, BUDDY will output both calibrated Platt probability and estimated FDR for each hit so users can export batch results and easily apply FDR control to focus on metabolic features with high-confidence predictions. However, we must point out that FDRs can be overestimated due to various reasons, including but not limited to chimeric experimental MS/MS spectra, insufficient MS/MS fragments for annotation and coeluting peaks that affect MS1 isotope information.

Experiment-specific global peak annotation
BUDDY incorporates a global optimization approach for peak annotation corrections within the same LC-MS experiment, aiming to reveal peakpeak relationships while reserving reasonable individual peak annotations 11,13 . Here a molecular network is constructed, where each peak is a node and every valid peak connection is an edge. Global peak annotation selects the optimal network considering both node scores and edge scores, and then some peak annotations generated by bottom-up MS/MS interrogation are updated. Notably, global peak annotation is completely free of meta-scores. Global optimization is achieved through integer linear programming via the OR-TOOLs package developed by Google. Linear constraints are set to guarantee a self-consistent network. More details are discussed in Supplementary Note 13. Tested on the NIST human feces dataset (6,215 peaks), global peak annotation took about 3 min on a personal computer (Intel i7-8700K CPU @ 3.70 GHz, Windows 10 64-bit operation system, 6 cores, 32 GB of RAM).

Candidate space shrinkage analysis
For candidate space shrinkage analysis, we used the curated positively ionized NIST20 library. MS/MS spectra with precursor masses larger than 1,000 Da were removed, as the entire candidate space generation for large-mass chemicals can be extremely computationally expensive (up to hours or even days for a single query). The chemical element set was restricted to CHNOPSFClBrI. We retained MS/MS spectra of [M + H] + . This resulted in 21,976 unique chemicals for candidate space shrinkage analysis. In both candidate spaces, formula candidates within 5 ppm of the query precursor mass were retrieved to represent analysis results from high-resolution MS. Trendlines in Fig. 2b were generated using locally weighted scatterplot smoothing.

Evaluation datasets
BUDDY was evaluated on diverse datasets, including publicly available reference MS/MS libraries, real metabolomics datasets and ARUS MS/MS libraries. spectra with precursor m/z deviations beyond the mass tolerance range were discarded. Furthermore, given the precursor mass (m/z) pre , we removed fragment ions with masses above (m/z) pre − 0.5 in each MS/MS spectrum to ensure precursor ion exclusion. We categorized the reference spectra of each MS/MS library into batches according to their MS instrument type and ion mode. Within each batch, we kept unique chemical compounds (identified by InChIKey) for structure-disjoint evaluation. Well-curated MS/MS libraries were written into separate MGF files that can be directly imported into BUDDY or SIRIUS (settings in Supplementary Note 3). Evaluation results are summarized in Supplementary Table 2. For chemical structural diversity analysis, the molecular complexity 62 and natural product-likeness score 63 were computed in Python using RDKit, an open-source cheminformatics software. All parameters were set as default. We randomly selected 1 million molecules in PubChem for computation.
Public LC-MS/MS datasets. We downloaded four metabolomics datasets on MassIVE, covering diverse sample types collected from different MS instruments in positive or negative ion modes (Supplementary Table 3). Data preprocessing was conducted as follows. Raw LC-MS/MS data were converted into mzML files using MSConvert by ProteoWizard 64 , with all converted files in the centroided mode. MS-DIAL 48 (v.4.70) was used to perform metabolic feature extraction and alignment, and aligned feature tables containing MS1 isotopes and MS/MS spectra were exported (see detailed parameters in Supplementary Note 14). To build up ground truths for formula determination, we conducted metabolite identification by searching experimental MS/MS against NIST20 using dot product 65 . We set the dot product score threshold as 0.7 and the minimum number of shared peaks (other than precursor ions) as 6 for high-confidence identification 29 . For repeatedly identified metabolites, we only reserved the feature with the highest dot product score. This curation led to a total of 541 identified metabolites (level 2a (ref. 38)) in four datasets. Specifically, in the Chagas disease dataset (murine large intestine samples), more large-mass lipid molecules were identified (m/z > 400 for 32.5% of identified compounds), which made it more challenging. The parameters used for method comparison between BUDDY and SIRIUS are detailed in Supplementary Note 15; parameters for the comparison between BUDDY global peak annotation and ZODIAC are listed in Supplementary Note 16.
For the application of BUDDY, we used NIST human fecal material standards (MSV000086989). Peak extraction was completed in MS-DIAL. We then performed blank removal by discarding features with average intensities lower than twice that of the method blank. The remaining features were imported into BUDDY for metabolite identification, bottom-up MS/MS interrogation and global peak annotation (settings in Supplementary Note 17). MS/MS spectra of putative FAAs were compared against the reference MS/MS of both their corresponding fatty acids and amines using the reverse dot product, and we reserved the features with a similarity score >0.7 in either comparison. 41 libraries archive recurring MS/MS spectra of unknown identity in specific biological samples. Currently available ARUS libraries were collected from human plasma and urine using Orbitraps in both ionization polarities (https://chemdata.nist. gov/dokuwiki/doku.php?id=chemdata:arus). We carried out spectral clustering for redundancy removal. Similar MS/MS spectra were identified by precursor mass match (5 ppm) and dot product score (≥0.95). For MS/MS spectra of the same identity, we reserved the spectrum with the most fragments. This led to a total of 45,803 unidentified spectra from plasma and 109,518 spectra from urine. Detailed settings for BUDDY can be found in Supplementary Note 18.

SIRIUS analysis
SIRIUS 7 (v.4.9.2) was downloaded from https://bio.informatik.uni-jena. de/software/sirius. For molecular formula determination, we set the Article https://doi.org/10.1038/s41592-023-01850-x Extended Data Fig. 1 | BUDDY graphical user interface. BUDDY offers an intuitive graphical user interface directly downloadable from GitHub. BUDDY can simultaneously process data files from different sources. For identified metabolites, MS/MS matching information and detailed chemical descriptions are provided to help check the identification quality and further investigate biological insights. If experiment-specific global peak annotation is performed, feature connections are listed with MS/MS similarity, formula difference, and connection description information.