Identification of small molecules is a critical task in many areas of the life sciences. Recent advances in mass spectrometry have enabled the collection of tandem mass spectra of small molecules from hundreds of thousands of environments. To identify which molecules are present in a sample, one can search mass spectra collected from the sample against millions of molecular structures in small molecule databases. This task is challenging because it remains unclear how small molecules fragment in mass spectrometry. Existing approaches use domain knowledge from chemistry to predict the fragmentation of molecules. However, these rule-based methods fail to explain many of the peaks in mass spectra of small molecules. Recently, spectral libraries with tens of thousands of labelled mass spectra of small molecules have emerged, paving the way for learning more accurate fragmentation models for mass spectral database search. We present molDiscovery, a mass spectral database search method that improves both the efficiency and the accuracy of small molecule identification by (i) utilizing an efficient algorithm to generate mass spectrometry fragmentations, and (ii) learning a probabilistic model to match small molecules with their mass spectra. We show that our database search is an order of magnitude more efficient than state-of-the-art methods, which enables searching against databases with millions of molecules. A search of over 8 million spectra from the Global Natural Products Social Molecular Networking (GNPS) infrastructure shows that our probabilistic model can correctly identify nearly six times more unique small molecules than previous methods. Moreover, by applying molDiscovery to microbial datasets with both mass spectral and genomics data, we discovered the novel biosynthetic gene clusters of three families of small molecules.
Availability: The command-line version of molDiscovery and its web service through the GNPS infrastructure are available at https://github.com/mohimanilab/molDiscovery.