OpenMS 3 enables reproducible analysis of large-scale mass spectrometry data

Mass spectrometry has become an indispensable tool in the life sciences. The new major version 3 of the computational framework OpenMS provides significant advancements regarding open, scalable, and reproducible high-throughput workflows for proteomics, metabolomics, and oligonucleotide mass spectrometry. OpenMS makes analyses from emerging fields available to experimentalists, enhances computational workflows, and provides a reworked Python interface to facilitate access for bioinformaticians and data scientists.


Additional Declarations: Yes, there is a potential competing interest. L.H. is an employee of Laminlabs. H.W. is an employee of Storm Therapeutics. L.B. is an employee of BioNTech SE. L.N. is an employee of OpenMS Consulting. O.K., T.S. and S.W. are officers in OpenMS Inc., a non-profit foundation which manages the international coordination of OpenMS development.
Version of Record: A version of this preprint was published at Nature Methods on February 16th, 2024. See the published version at https://doi.org/10.1038/s41592-024-02197-7.
The foundation for interoperability is the use of open standard file formats. Users can analyze their data with ready-to-go pipelines. Developers can use the OpenMS library to develop efficient new tools and algorithms in C++. OpenMS uses modern software engineering concepts emphasizing modularity, reusability, and extensive testing. The library contains an extensible GUI module using Qt, which forms the basis for a powerful data viewer 2. OpenMS offers native C++ compiler support on Windows (MSVC), Linux (Clang and GCC), and macOS (Clang). The permissive BSD license encourages academic and commercial use, facilitating collaborations and joint research. OpenMS is hosted on GitHub, where continuous integration/testing and code reviews ensure that contributions meet high-quality standards.
The SoftWipe 3 score of the whole library is 6.9, which puts OpenMS at rank 7 of 48 bioinformatics tools in terms of software quality (see Supplemental Material for details). Active communication between developers and with end users is a cornerstone of the project; this is facilitated by various online tools (Discord, Gitter, mailing lists, conference calls) and annual meetings. To help new users explore and quickly become productive with OpenMS, the website and documentation 4 (Figure S1) were modernized with the help of professional technical writers.
Built on top of the library, pyOpenMS gives access to algorithms and tools using the Python scripting language, enabling interaction with Python libraries for data science, machine learning, and data visualization. pyOpenMS is easy to install through pip or Conda, and scripts can be written by non-experts. This makes pyOpenMS more accessible to experimentalists and lowers the barrier of entry for the OpenMS ecosystem. Python support is highly useful for prototyping new algorithms and for easy deployment as web applications. In addition, pyOpenMS is a great teaching tool, with a revamped tutorial (Figure S2a) and with the Binder integration 5, which provides access to pyOpenMS using only a web browser (Figure S2b).
New developments in bottom-up proteomics include the quantMS 6 Nextflow workflow for label-free quantification that generates tables compatible with MSstats 7 for statistical analysis. Other additions are the protein-protein crosslinking search tools OpenPepXL 8 and OpenPepXLLF for isotopically labeled and label-free crosslinkers. Furthermore, existing tools (e.g., OpenSWATH) have been updated to support more efficient file formats (e.g., SQLite) and ion mobility analysis.
Recently, various tools and suites for top-down proteomics have been added to OpenMS. FLASHDeconv 9 enables accurate deconvolution of top-down MS datasets orders of magnitude faster than other state-of-the-art tools such as TopFD 10 and Xtract (Thermo Fisher Scientific). The latest version of FLASHDeconv 11 also features accurate deconvolution of MS/MS spectra, deconvolution false discovery rate (FDR) estimation, and deep-learning-based scoring. In addition, two new tools were introduced: FLASHIda 12 for instrument acquisition control in top-down proteomics and FLASHQuant 13 for label-free quantification of top-down MS data.
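To make the notion of deconvolution concrete, the following minimal Python sketch shows only the underlying arithmetic: converting m/z peaks with known charge states to neutral masses and grouping peaks that agree within a ppm tolerance. It is not the FLASHDeconv algorithm, which must also infer the charge states and isotope envelopes; the function names, the pre-assigned charges, and the 10 ppm tolerance are illustrative assumptions.

```python
PROTON_MASS = 1.007276  # Da, approximate mass of a proton

def neutral_mass(mz, charge):
    """Neutral mass of a positive-mode ion [M + zH]^z+ observed at m/z."""
    return charge * (mz - PROTON_MASS)

def group_by_mass(peaks, tol_ppm=10.0):
    """Group (mz, charge) peaks whose neutral masses agree within tol_ppm.

    Assumes charges are already assigned (hypothetical input); a real
    deconvolution tool has to determine them from the spectrum.
    """
    masses = sorted(neutral_mass(mz, z) for mz, z in peaks)
    groups = []
    for m in masses:
        # Compare with the last mass of the current group
        if groups and abs(m - groups[-1][-1]) / m * 1e6 <= tol_ppm:
            groups[-1].append(m)
        else:
            groups.append([m])
    return groups
```

For a 10 kDa proteoform observed at charges 9, 10, and 11, all three peaks collapse into a single mass group, which is the basic signal that charge deconvolution exploits.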
OpenMS 3 offers significant improvements and innovations for metabolomics, where identification and quantification from complex samples continue to pose challenges. For feature identification, OpenMS includes tools that use mass-to-charge and retention time information (FeatureFinderMetaboIdent) as well as formula and structural predictions using isotope patterns and fragmentation spectra of detected features (via SIRIUS and CSI:FingerID 14). One of the most widely used tools for untargeted metabolomics data exploration after feature detection is Feature-Based Molecular Networking 7,15 (FBMN) by GNPS. All necessary files can be generated using GNPSExport, a new OpenMS tool that creates a spectral library, a feature quantification table, and a meta value table. Furthermore, Ion-Identity Molecular Networking 16 (IIMN) is supported by providing the required table with adduct information. Following the identification of unknowns, OpenMS provides methods for accurate quantification of known components in complex samples. Methods include peak integration, retention time alignment, quantification and calibration curve fitting, and quality control. The combination of targeted and untargeted metabolomics methods offers a complete package for qualitative discovery and quantitative validation experiments 32.
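As an illustration of calibration curve fitting for targeted quantification, the sketch below fits a straight line to standards by ordinary least squares and inverts it to estimate the concentration of an unknown. This is a generic textbook approach under the assumption of a linear, unweighted response; it is not the OpenMS implementation, and the standards and peak areas are invented example values.

```python
def fit_calibration(concentrations, responses):
    """Least-squares fit of response = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(responses) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def quantify(response, slope, intercept):
    """Invert the calibration line to estimate a concentration."""
    return (response - intercept) / slope

# Hypothetical standards: concentration (uM) vs. integrated peak area
standards_conc = [1.0, 2.0, 5.0, 10.0]
standards_area = [2100.0, 4100.0, 10100.0, 20100.0]
slope, intercept = fit_calibration(standards_conc, standards_area)

# Estimate the concentration of an unknown from its integrated peak area
unknown_conc = quantify(8100.0, slope, intercept)
```

In practice, weighted or non-linear models and limits of quantification would also need to be considered; the sketch shows only the simplest case.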
OpenMS 3 provides several tools for quality control (QC) and monitoring. QC metrics can be calculated for each step of an experiment, from data acquisition (e.g., retention time shifts, mass calibration) to proteomics- and metabolomics-specific analyses (e.g., MS² identification rate, contamination and missed cleavages). These metrics can be exported to the HUPO-PSI mzQC format to enable sharing and comparability.
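Two of the metrics named above can be sketched in a few lines of Python. The definitions used here are simplified assumptions for illustration (tryptic cleavage after K or R unless followed by P; identification rate as identified spectra divided by acquired MS² spectra) and do not reproduce the OpenMS QC tools or the mzQC export.

```python
import re

def missed_cleavages(peptide):
    """Count internal tryptic sites: K or R followed by a residue other than P."""
    return len(re.findall(r"[KR](?=[^P])", peptide))

def identification_rate(n_identified, n_ms2):
    """Fraction of MS2 spectra that yielded an identification."""
    return n_identified / n_ms2 if n_ms2 else 0.0

def missed_cleavage_histogram(peptides):
    """Map number of missed cleavages -> number of peptides."""
    histogram = {}
    for peptide in peptides:
        mc = missed_cleavages(peptide)
        histogram[mc] = histogram.get(mc, 0) + 1
    return histogram
```

A high share of peptides with one or more missed cleavages would typically point to incomplete digestion, which is why this distribution is commonly tracked per run.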
OpenMS added the NucleicAcidSearchEngine 18 (NASE) as the first integration of an oligonucleotide mass spectrometry-focused tool into any computational MS library. NASE implements sequence database search for MS² spectra of RNA oligonucleotides, enabling the identification of RNA sequences and modifications. We demonstrated the utility of NASE in identifying post-transcriptional modifications of mRNAs, tRNAs, and rRNAs. Recently we included more digestion enzymes to enable multi-site-specific cleavage, added support for DNA in addition to RNA, and improved integration between NASE and other OpenMS tools.
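A sequence database search starts from an in-silico digest of the candidate sequences. The sketch below illustrates that step for RNA, assuming an enzyme that cleaves after every guanosine (the specificity of RNase T1, chosen here purely as an example) and allowing a configurable number of missed cleavages. The function name and parameters are hypothetical and do not reflect the NASE implementation.

```python
def digest_rna(sequence, cut_after="G", max_missed=0):
    """In-silico digest: cleave after each `cut_after` nucleotide.

    Returns all digestion products spanning up to `max_missed`
    missed cleavage sites, in order of their start position.
    """
    # Fully cleaved fragments
    fragments, start = [], 0
    for i, nucleotide in enumerate(sequence):
        if nucleotide == cut_after:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    # Join neighbouring fragments to model missed cleavages
    products = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            products.append("".join(fragments[i:j + 1]))
    return products
```

For example, `digest_rna("AUGCCGAUG", max_missed=1)` yields the three fully cleaved oligonucleotides plus the two products that each span one missed site. A search engine would then compare the masses of such products and their fragments against the observed spectra.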

The wider OpenMS ecosystem contains applications (workflows or tools) that are either based on OpenMS or interact closely with results produced by OpenMS. For example, the R package MSstats 7 enables statistical analysis and significance testing for many experimental designs. Triqler 19 is another compatible package for statistical analysis of fold changes and error probabilities. The Python package pmultiqc and the R package PTXQC 20 for computing and visualizing quality control (QC) metrics are also compatible with OpenMS. pyOpenMS seamlessly integrates with the mass spectrometry query language (MassQL 21), a Python package to capture complex mass spectrometry patterns. Furthermore, peak and feature maps can be exported to Pandas DataFrames for convenient downstream processing. In another example, the SmartPeak application wraps OpenMS methods for targeted and untargeted metabolomics and lipidomics workflows into a native C++ GUI and CLI application suitable for non-bioinformaticians and industry settings where automated workflows for Big Data are needed.
OpenMS is optimally suited for use in high-throughput environments such as core facilities or industry. The OpenMS tools can be assembled into pipelines by workflow managers such as KNIME 22, Galaxy 23, Snakemake 24, and Nextflow 25. Many tools are optimized for parallel execution on clusters or multi-core processors, making them highly scalable, both on local machines and in a cloud environment. OpenMS is easily deployable on high-performance clusters with containerized builds (e.g., Docker or Singularity).
Examples of such workflows include quantMS, an nf-core 31 workflow that combines identification and quantification of proteins in DDA and DIA mass spectrometry data and includes statistical analysis and quality control reports (Figure S4, middle panel). MHCquant 26 is another nf-core pipeline for the identification, quantification, and binding affinity prediction of MHC-bound peptides. bSLIM labeling 27 is an example of a novel analysis developed as a KNIME workflow using OpenMS nodes for MS signal processing. The latter two pipelines significantly extend the capabilities of OpenMS into the domains of immunopeptidomics and structural proteomics.
For untargeted metabolomics, the UmetaFlow 28 pipeline combines all the major metabolomics tools with the powerful feature detection and quantification capabilities of OpenMS and linkage with SIRIUS and GNPS, as well as data integration. It was developed in Jupyter Notebooks using pyOpenMS and independently in Snakemake, making it suitable for big data, easy to use, and reproducible. The DIAMetAlyzer 29 KNIME workflow takes a different approach by integrating DDA and targeted DIA analysis, which allows for false discovery rate estimation based on a target-decoy approach. It performs DDA-based candidate identification and constructs a target/decoy library, which is then used for DIA target extraction and statistical validation.
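The target-decoy principle mentioned above can be sketched generically: hits are ranked by score, the FDR at each threshold is estimated as the ratio of decoys to targets above it, and q-values are obtained as the running minimum of that estimate. The Python code below is a minimal textbook version under the assumption that higher scores are better and that target and decoy libraries are of equal size; it is not the DIAMetAlyzer implementation.

```python
def target_decoy_qvalues(hits):
    """Estimate q-values from (score, is_decoy) pairs; higher score = better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)

    # FDR estimate at each score threshold: decoys / targets so far
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: lowest FDR achievable at this threshold or a more lenient one
    q_values, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q)
            for (score, is_decoy), q in zip(ranked, q_values)]
```

Filtering the returned list for non-decoy entries with a q-value at or below a chosen cutoff gives the set of hits accepted at that FDR.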
More information on the aforementioned containers and workflows can be found in the Methods section.
As mass spectrometry continues to be used in larger and more diverse types of experiments, it is paramount that the software tools keep up with the technology. Since the last major OpenMS version 2 35, there has been a surge in utilizing mass spectrometry to study RNA modification, a blossoming in the field of metabolomics, and an increase in the usage of top-down methods in proteomics. The development of OpenMS has been geared toward providing tools for the analysis of these growth areas. The OpenMS team has improved code maintainability and accessibility for both the user and developer base by modernizing documentation (Figure S1, bottom left) and diversifying communication channels. Furthermore, the team has focused on the creation of bindings exposing the OpenMS core library to Python, and on containerization and support for a diverse array of workflow systems (Figure S4), to maintain scalability for increasingly complex datasets.

Online Methods
Improvements in the C++ library
OpenMS has traditionally integrated external tools into its ecosystem through wrappers that make their inputs, outputs, and parameters compatible with the rest of the software suite. A prime example of this is the integration of external peptide search engines. OpenMS 3 adds support for newer tools such as Comet and MSFragger and at the same time has deprecated support for tools that are no longer maintained, such as OMSSA. In other fields where adequate solutions were not available, new native OpenMS tools were implemented. These areas include top-down proteomics, protein-protein and protein-nucleotide crosslinking, and RNA oligonucleotide identification. This required the development of efficient algorithms for generating theoretical spectra, for matching and scoring, as well as the implementation of novel data structures with support for molecules beyond linear peptides, all of which are now available to developers (see Figure S3, bottom, for a visualization of a cross-linked peptide pair). Another addition to the library is the support for hyperfine isotopic distributions through the integration of IsoSpec 30 algorithms. The library now also makes use of modern features of C++17, in contrast to C++03 in OpenMS 2.0. Several additional algorithms were moved from the source code of specific tools into the library to allow access to complete tool functionality from Python bindings or other C++ tools.

pyOpenMS
pyOpenMS relies on Cython and autowrap to create the interface between C++ and Python. Since OpenMS 2.0, the development of autowrap 34 has continued as part of OpenMS. Additions include support for static functions, autogeneration of type stubs, and improvement of autogenerated docs for display with Sphinx. The pyOpenMS bindings have been expanded to support the majority of the library. pyOpenMS now includes conversion of core OpenMS data structures to Pandas DataFrames, which allows for integration with data science and machine learning packages. pyOpenMS scripts can be easily deployed as web applications using popular web frameworks (Figure S3, top).
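The generation and matching of theoretical spectra mentioned above can be illustrated with the simplest case, an unmodified linear peptide. The following pure-Python sketch computes singly charged b- and y-ion m/z values from monoisotopic residue masses and counts matches against observed peaks. It deliberately does not use the OpenMS or pyOpenMS API; the function names, the reduced residue table, and the 0.02 Da tolerance are assumptions for illustration.

```python
# Monoisotopic residue masses (Da) for a subset of amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "E": 129.04259, "K": 128.09496, "R": 156.10111,
}
WATER = 18.010565   # Da
PROTON = 1.007276   # Da

def theoretical_by_ions(peptide):
    """Singly charged b- and y-ion m/z values of an unmodified linear peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        prefix += mass
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        suffix += mass
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions

def count_matches(theoretical, observed, tol=0.02):
    """Number of theoretical m/z values with an observed peak within tol Da."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)
```

Cross-linked peptides or oligonucleotides need richer data structures than a single residue list, which is the motivation the text gives for the new library classes.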

Declarations
Figures