Enhanced correlation-based linking of biosynthetic gene clusters to their metabolic products through chemical class matching

doi:10.21203/rs.3.rs-1391827/v2

software

This work is licensed under a CC BY 4.0 License

Journal Publication

published 23 Jan, 2023

Read the published version in Microbiome →

You are reading this latest preprint version

Background

It is well-known that the microbiome produces a myriad of specialized metabolites with diverse functions. To better characterize their structures and identify their producers in complex samples, integrative genome and metabolome mining is becoming increasingly popular. Metabologenomic co-occurrence-based correlation methods facilitate the linking of metabolite mass fragmentation spectra (MS/MS) to their cognate biosynthetic gene clusters (BGCs) based on shared absence/presence patterns of metabolites and BGCs in paired omics datasets of multiple strains. Recently, these methods have been made more readily accessible through the online NPLinker framework. However, co-occurrence-based approaches usually result in too many candidate links to manually validate.

Results

To automatically reduce the long lists of potential BGC-metabolite links, we matched natural product (NP) ontologies that were previously developed independently for genomics and metabolomics, and developed NPClassScore: an empirical class matching score that we also implemented in the NPLinker framework. By applying NPClassScore to three paired omics datasets totalling 189 bacterial strains, we show that the number of links is reduced by 63% on average compared to using a co-occurrence-based strategy alone. We further demonstrate that 96% of experimentally validated links in these datasets are retained and prioritised when using NPClassScore.

Conclusion

The matched genome-metabolome class ontologies provide a starting point for selecting plausible candidate pairs of BGCs and MS/MS spectra based on their chemical compound classes. NPClassScore expedites genome/metabolome data integration, as relevant BGC-metabolite links are prioritised, and researchers are faced with substantially fewer proposed BGC-MS/MS links to manually inspect. We anticipate that our addition to the NPLinker framework will aid in the discovery of novel NPs and in understanding complex metabolic interactions in the microbiome.

Multi-omics

genome mining

genomics

metabolome mining

metabolomics

chemical compound classification

natural products

specialized metabolites

Complex microbial communities are nearly everywhere and rely on specialised metabolites to mediate host-microbe and microbe-microbe interactions. Such specialised metabolites, also called natural products (NPs), cover vast numbers of different scaffolds that constitute an incredible chemical diversity. Microbially derived NPs are also a prolific source for many types of drugs, such as antibiotics and anticancer agents [1]. This explains the drive to mine the microbiome for novel NPs and understand its largely unexplored and complex chemical interactions. Currently, the most common technique for analysing the microbial metabolome is liquid chromatography followed by tandem mass spectrometry (MS/MS or MS2), allowing for the discovery of NPs in complex mixtures [2, 3]. Especially in microbes, the biosynthetic pathways for synthesising NPs are often encoded by physically clustered sets of genes in the genome, known as biosynthetic gene clusters (BGCs). With the growing availability of genomic data in the last decade, multiple genome mining approaches for the identification of BGCs have appeared, such as antiSMASH and DeepBGC [4, 5].

Not only is the availability of metabolomic and genomic data growing independently, but paired datasets of both types of omics data are now also recorded in platforms such as the Paired Omics Data Platform (PoDP) [6]. Leveraging genomic and metabolomic data together facilitates rapidly assessing the novelty of metabolites and linking them to their producing organisms and biosynthetic loci [7, 8]. Identifying candidate biosynthetic genes for a given metabolite provides complementary structural information inferred from the genome and metabolome for structural elucidation. Whilst promising, such integrative omics mining is still challenging: despite community efforts like MIBiG [9], the number of validated paired data points available within one experiment, for which both the gene cluster and the MS/MS spectral data of its metabolites are recorded, generally remains low. This hampers training and validation. One route to partially solve this challenge is by focusing on well-known and well-understood natural product classes [7]; however, this would severely decrease the discovery potential for novel chemistry. Therefore, generalized methodologies are required that can identify links between BGCs and metabolites without the necessity for large amounts of highly specific genetic or biochemical labelled data.

Recently, NPLinker was developed, which provides a framework for the systematic linking of MS/MS spectra and BGCs [10]. Currently, NPLinker can integrate the output of molecular networking through Global Natural Products Social Molecular Networking (GNPS), which dereplicates and clusters MS/MS spectra into molecular families (MFs), with the output of BiG-SCAPE, which groups BGCs into gene cluster families (GCFs) [11, 12]. The MS/MS spectra or MFs and GCFs are used as inputs for two different scores: a co-occurrence-based score (originally devised by Doroghazi et al. [13]) that essentially considers if a GCF and a spectrum (or MF) occur together in the same strains, and a feature-based score that relies on comparison with public reference libraries. Unfortunately, the main co-occurrence-based score often produces a large list of potential links per GCF or spectrum, mainly because many BGCs are co-conserved in the same strains across long periods of evolutionary time [14]. This makes it hard to prioritise and find correct links, even when using reasonable cut-offs on the co-occurrence score. Hence, additional complementary scoring systems are needed to trim the list to a comprehensible length. The current feature-based score that is implemented in NPLinker only solves the problem of long candidate lists in the few cases when there is sufficient similarity to known BGCs and spectra, as it compares BGCs and spectra to entries in MIBiG, a repository of experimentally validated BGCs [9, 15]; BGCs and spectra are then linked if they both have high similarity to the same MIBiG entry. Hence, it is often still challenging to find links when prioritising for novel chemistry, even within well-known NP classes such as non-ribosomal peptides (NRPs) or polyketides (PKs).

Instead of using similarity to entries in public reference libraries, we can use chemical compound classification strategies to obtain general knowledge about the structure of an unknown NP, in the form of likely-occurring scaffolds or substructures, and use this knowledge to filter for (more) plausible candidates based on matching chemical compound classes. Over the last years, several general chemical compound classification ontologies have been constructed. One of these is ClassyFire: a hierarchical ontology consisting of superclass, class, and subclass categories and seven more detailed levels [16]. As an example, many PKs are classified at the superclass level as ‘phenylpropanoids and polyketides’, benzenoids, or lipids and lipid-like molecules. Recently, NPClassifier was developed specifically for NP classifications, taking both chemistry and biosynthetic pathways into account [17]. Compared to ClassyFire, it tailors the categorisation of NPs into seven major pathways, such as polyketides and ‘amino acids and peptides’, which are subsequently split into more detailed superclasses and classes, like macrolides and polyene macrolides. Furthermore, predicting these structure-based ontologies directly from MS/MS spectra has also become possible through CANOPUS and MolNetEnhancer [18, 19]. Similarly, chemical classifications can also be derived from genomic sequence, as antiSMASH uses class-specific detection rules to detect different types of BGCs, such as various types of PK synthases and NRP synthetases. However, a connection between structure- and genome-based classifications is currently still missing, as the ontologies from both classifications are not directly comparable.

To reduce the number of false positive BGC-MS/MS links in multi-omics analyses and thus accelerate NP discovery, we here introduce an automated approach to match these BGC and compound class ontologies. We used the matched class ontologies as a basis for the development of NPClassScore: NPLinker Class-based matching Score for linking BGCs and MS/MS spectra. We also implemented NPClassScore in the NPLinker framework as a new feature-based linking score with the main purpose of removing unlikely BGC-MS/MS links. To bridge the different class ontologies from genome and metabolome mining, we used the MIBiG repository that contains experimentally validated BGCs and their products: antiSMASH genome-based classifications and manually annotated genome-based MIBiG classifications of the BGCs were matched to the chemical structure ontologies NPClassifier and ClassyFire [9]. The matching was automated by counting the genome-metabolome ontology connections between all the class-terms for each MIBiG entry and using relative counts to assess the validity of matching class-terms for the genome- and structure-based ontologies (Fig. 1b-c). Here, we demonstrate that these automatically matched ontologies are sensible as well as effective in removing irrelevant candidate links between BGCs and MS/MS spectra, while prioritising previously verified BGC-MS/MS spectra-metabolite links in three paired omics datasets from actinobacteria and cyanobacteria.

Matching class ontologies between known BGC-structure pairs

Currently, there are 1,926 experimentally validated BGCs with their corresponding structures present in the MIBiG v2.1 repository [9]. We used the manually annotated MIBiG classes and antiSMASH 5 class predictions for the BGCs alongside NPClassifier and ClassyFire class assignments for the structures to count all interactions between biosynthetic and structural classes (Fig. 1a). Based on the prevalence in MIBiG, we can then infer which class terms match frequently between the different ontologies. We note that the NPClassifier ontology is designed with natural products in mind, thus taking both chemistry and biosynthetic context into account, which indeed leads to more direct matches between genome-based and structure-based classifications (Fig. 1b). For example, 76% of polyketide BGCs match to the NPClassifier ‘Polyketides’ pathway, most RiPPs and NRPs match to the ‘Amino acids and Peptides’ pathway, and 65% of terpene BGCs match to the ‘Terpenoids’ pathway. The most general superclass ontology of ClassyFire seems to be less suitable, as matches are more sparsely distributed across the different superclasses (Fig. S1). For instance, polyketide BGCs are distributed almost equally across five different ClassyFire superclasses. However, we also note that a certain degree of complementarity between the two chemical compound ontologies does exist, since, for example, 75% of NRPs match to the ‘Organic acids and derivatives’ superclass from ClassyFire, which is higher than for the NPClassifier pathway ontology.

At more detailed structure-based classification levels, like NPClassifier superclasses or classes, matches between genome- and structure-based classifications become more distributed as there are more options, and small distinctions in, for example, the structure-based ontologies are not reflected in current BGC classifications as done automatically by antiSMASH. For example, different type 1 polyketide synthase products, like products with the NPClassifier superclasses ‘Macrolides’, ‘Aromatic polyketides’, and ‘Linear polyketides’, still match to the generic type 1 polyketide synthase BGC class, resulting in matches that are less conclusive (Fig. S2-S3). Another difficulty for class matching is that many different hybrid classes exist, which makes it impossible to reach perfect matches between most classes. Some NPs consist of very complex tailored scaffolds for which a combination of different types of biosynthetic machinery is needed, resulting in complex MIBiG classes like ‘Polyketide-NRP-Other’ for bromoalterochromide A. Additionally, some MIBiG records have very loose cluster boundaries with flanking genes that can trigger erroneous antiSMASH rules, and may therefore lead to erroneous biosynthetic class assignments. In contrast, the chemical classifications are less affected by the presence of different structural scaffolds, as ClassyFire has a priority system to only consider the most important class-terms and NPClassifier will only return at most two terms for its class level. However, the presence of multiple functional groups can sometimes cause challenges in putting a structure in “one” chemical compound class. Nevertheless, both genome- and metabolome-based chemical compound classification systems seem to work well for most structures. Furthermore, the more characterised BGCs are deposited in MIBiG, the more these difficulties will average out, improving the usefulness of class matching. 
Similarly, depositing BGCs from a larger variety of classes will address biases in current data availability, as some classes, such as PKSs, are more abundant in the database than others.

NPClassScore can filter out many false positive links

Based on the matched genome-based and structure-based ontologies, we constructed NPClassScore: the NPLinker Class-based matching Score for linking BGCs and MS/MS spectra. We implemented NPClassScore in the NPLinker framework, where it can be used as an additional filtering step to assess the validity of a predicted link between a gene cluster family (GCF) and an MS/MS spectrum or molecular family (MF) [10]. NPClassScore consists of a scoring table for each of the 28 pairs of the matched genome- and structure-based ontology levels derived from MIBiG. The scores in the tables are made by dividing the counts for each class match by the total occurrence of either the genome-based class or the structure-based class (Fig. 1c; Table S1). This resulted in two types of scoring tables, one coming from the genome side and one from the metabolome side. NPClassScore takes the genome-based and structure-based classes from a proposed link as input, looks up the matching scores between these two classifications in the scoring tables, and reports the class match with the highest score from one of the scoring tables. Thus, the NPClassScore indicates how plausible the link is between a BGC and a possible product based on how often their classes match among BGCs from MIBiG and their experimentally validated metabolic products.
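The construction of the two table types can be sketched as follows. This is a minimal illustration, not the NPLinker implementation: the function names and the toy connections are hypothetical, and the "total occurrence" of a class is assumed here to be the number of connections in which that class-term takes part.

```python
from collections import Counter


def build_scoring_tables(connections):
    """Build genome-side and structure-side tables from (genome_class, structure_class) pairs."""
    connections = list(connections)
    pair_counts = Counter(connections)
    genome_totals = Counter(g for g, _ in connections)
    structure_totals = Counter(s for _, s in connections)
    # Relative counts: the same pair count, normalised from either side.
    genome_side = {(g, s): n / genome_totals[g] for (g, s), n in pair_counts.items()}
    structure_side = {(g, s): n / structure_totals[s] for (g, s), n in pair_counts.items()}
    return genome_side, structure_side


def best_class_match(genome_class, structure_class, genome_side, structure_side):
    """Return the higher of the two table scores for one class pair (0.0 if never observed)."""
    key = (genome_class, structure_class)
    return max(genome_side.get(key, 0.0), structure_side.get(key, 0.0))


# Toy connections, invented for illustration (not real MIBiG counts).
toy_connections = (
    [("Polyketide", "Polyketides")] * 3
    + [("Polyketide", "Terpenoids")]
    + [("Terpene", "Terpenoids")] * 3
)
genome_side, structure_side = build_scoring_tables(toy_connections)
print(best_class_match("Polyketide", "Polyketides", genome_side, structure_side))
```

In this toy case the same pair scores differently from the two sides, because the two class-terms have different total occurrences; taking the higher value mirrors the lenient behaviour described above.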

Predicted antiSMASH classes are directly used as input for NPClassScore and the general BiG-SCAPE classes are converted to MIBiG classes (Table S2-S3). In order to predict ClassyFire and NPClassifier ontologies from MS/MS spectra, we used predictions from CANOPUS and MolNetEnhancer within NPClassScore (Fig. 2) [18, 19]. CANOPUS is a command-line tool that is part of the SIRIUS framework and can very accurately predict compound classifications if the right fragmentation trees are calculated. We implemented CANOPUS to run within NPLinker, but because calculating fragmentation trees is time-consuming, it is only suitable for lower masses, below 850 Da. To also capture compound classifications for masses above 850 Da, we used MolNetEnhancer, which relies on propagating annotations between MS/MS spectra within MFs. Currently, MolNetEnhancer only provides ClassyFire predictions and has to be run on the GNPS platform, from which the results can be imported into NPLinker [12]. By default, NPClassScore also uses MolNetEnhancer when there is no CANOPUS prediction for an MS/MS spectrum with a mass below 850 Da, but CANOPUS and MolNetEnhancer can also be used separately.
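The default choice between the two prediction tools can be written as a small decision function. This is a sketch under assumptions: the function name, its arguments, and the representation of a prediction as an arbitrary object are hypothetical and do not reflect the NPLinker API.

```python
def choose_class_prediction(mass, canopus=None, molnetenhancer=None, mass_limit=850.0):
    """Pick which tool's class prediction to use for one spectrum.

    Returns (tool_name, prediction), or (None, None) if nothing is available.
    """
    # CANOPUS is only run for lower masses, so prefer it there when it produced a result.
    if mass < mass_limit and canopus is not None:
        return "CANOPUS", canopus
    # Otherwise fall back to MolNetEnhancer (masses above the limit, or no CANOPUS result).
    if molnetenhancer is not None:
        return "MolNetEnhancer", molnetenhancer
    return None, None


# Hypothetical predictions for a small and a large metabolite.
print(choose_class_prediction(400.0, canopus="Polyketides", molnetenhancer="Lipids"))
print(choose_class_prediction(1200.0, molnetenhancer="Organic acids and derivatives"))
```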

To assess how well NPClassScore removes improbable links, we used NPLinker on three paired omics datasets from the PoDP that consist of: 154 Streptomyces and Salinispora strains [20], 24 cyanobacterial strains [6], and 11 Nocardia strains [21] (Table S4). These are the largest datasets that contain multiple verified BGC-MS/MS-metabolite links in the PoDP, totalling 26 validated links across the three datasets. The three datasets will be referred to by their taxonomic descriptors. We analysed the three datasets separately using NPLinker and first used a co-occurrence-based strategy (standardized Metcalf score) to identify possible links between GCFs and MS/MS spectra, after which we used NPClassScore to filter the linked spectra per GCF [10, 13]. After filtering, the number of candidate MS/MS spectrum links per GCF decreased substantially for all datasets. In the Streptomyces/Salinispora dataset, the average number of candidate links per GCF decreased by 68% from 550 to 177. In the smaller Cyanobacteria and Nocardia datasets the number of candidate links per GCF decreased from 27 to 13, and from 206 to 64, representing decreases of 53% and 69%, respectively. Averaged over the three datasets, this constitutes a decrease in candidate links per GCF of 63% (Fig. 3a; Table S5). As the NPClassScore filtering depends on the chosen cut-off, we tried different cut-offs and decided on a cut-off of 0.25 as a default, as around this value there is a marked drop in the number of links per GCF for all datasets (Fig. S4-S6). This is also defensible from a theoretical perspective, as this cut-off means that a given class match should occur for at least 25% of the total occurrences of the class among MIBiG entries. 
Additionally, this threshold results in many more GCFs with manageable numbers of candidate MS/MS links, which can be analysed manually: in the large Streptomyces/Salinispora dataset, it yields 92 GCFs with 10 or fewer candidate links and 270 GCFs with 25 or fewer candidate links. In contrast, without filtering based on NPClassScore, only 5 GCFs would have fewer than 10 candidate links and only 42 GCFs would have fewer than 25 candidate links in the same dataset (Fig. 2b). Similar trends can be seen for the other two datasets (Fig. S7). Thus, using NPClassScore constitutes a real advantage for users as they can now realistically inspect a much larger percentage of predicted candidate links that are also more likely to be real.

Validated links are retained and get higher ranks after NPClassScore filtering

To assess whether the links retained by NPClassScore are sensible, we used our three selected paired omics datasets from the PoDP, in each of which several previously experimentally validated BGC-MS/MS spectral-chemical structure links had been recorded. In these datasets, we checked whether these validated links were retained and whether they gained a higher rank in the lists of predicted links upon NPClassScore application. The Streptomyces/Salinispora, Cyanobacteria and Nocardia datasets contain 11, 6 and 9 validated links on the PoDP, respectively. Out of the 26 total validated links, 2 could not be found due to a missing spectrum in the datasets. Of the remaining links, 23 out of 24 passed the default NPClassScore threshold, constituting an accuracy of 96% (Fig. 3c). Additionally, this supports our default NPClassScore threshold of 0.25, as substantial numbers of validated links are removed beyond this threshold (Fig. S8). As an example, from the Streptomyces/Salinispora dataset, we found a link between GCF 534 (present in 54 strains) and spectrum 89513 (present in 67 strains), representing the link for staurosporine based on the PoDP [6] (Fig. 3d). With a co-occurrence score of 9.0, this link was initially ranked second in the list of 100 potential links for GCF 534. After filtering using NPClassScore, 16 potential links were left, and spectrum 89513 was ranked first; its NPClassScore was 0.78 from the antiSMASH-ClassyFire-superclass scoring table, matching indole to Organoheterocyclic compounds. Similarly, our analysis retrieved the validated link for rosamicin: the rosamicin-biosynthesis-associated GCF 944 (present in 2 strains) was linked to spectra 130529 and 141312, each of which was present in 1 of the 2 strains (Fig. 3d). With a co-occurrence score of 8.7, the links with both spectra were ranked at a shared eighth position in the list of 275 candidate links. 
After filtering using NPClassScore, 38 candidate links were left in total, and both spectra were jointly ranked at the first position; their NPClassScore was 0.76 from the MIBiG-NPClassifier-pathway scoring table, matching Polyketide to Polyketides.

Of note, 5 out of 26 validated links did not pass the co-occurrence scoring threshold implemented in NPLinker, meaning an accuracy of 75% for the entire NPLinker workflow including NPClassScore. Most probably, for these links the clusterings of MS/MS spectra and BGCs into dereplicated MS/MS spectra and GCFs do not agree with each other, i.e., their MS/MS spectra are similar enough to be clustered together but their BGCs are not, or vice versa. An example that supports this hypothesis is nocobactin in the Nocardia dataset, where the actual link passed the NPClassScore cut-off with a score of 0.59, but the BGCs did not cluster together with our currently applied BiG-SCAPE cut-off, whereas the MS/MS spectra did cluster with the current settings. The 21 correctly retained validated links were not only retained, but also ranked higher in the lists of candidate links because false positive links were removed (Table 1). Out of 21 validated links, 12 are even ranked at the first position after NPClassScore filtering, compared to 5 links being ranked first when using just co-occurrence scoring. This shows that after NPClassScore filtering, the candidate links that are retained at high rankings are more reliable and worth exploring manually.

Table 1. All validated links from the three datasets as listed on the PoDP. The standardized Metcalf score and NPClassScore of all the links are stated as well as the rank of the verified link in the candidate list before and after NPClassScore filtering. The rank number may be shared with a number of other links due to their scores being the same which is indicated in parentheses. Retimycin and nocardimicin have no information as their MS/MS spectra could not be located in their respective datasets.

Name	Dataset	Rank NPClassScore (shared with n other links)	Rank Metcalf (shared with n other links)	Standardized Metcalf	NPClassScore
staurosporine	Streptomyces/Salinispora	1	2	9.0	0.78
rosamicin	Streptomyces/Salinispora	1 (6)	8 (38)	8.7	0.76
desferrioxamine	Streptomyces/Salinispora	1	1	9.5	0.36
rifamycin	Streptomyces/Salinispora	152	257	4.4	0.45
lomaiviticin	Streptomyces/Salinispora	30 (18)	381 (151)	3.0	0.96
arenimycin	Streptomyces/Salinispora	1 (4)	1 (12)	12.4	0.96
enterocin	Streptomyces/Salinispora	1 (8)	4 (81)	12.4	0.40
salinamide	Streptomyces/Salinispora	1 (40)	1 (84)	12.4	0.64
cyclomarin	Streptomyces/Salinispora	3 (49)	4 (83)	8.7	0.91
retimycin	Streptomyces/Salinispora	-	-	-	-
actinomycin	Streptomyces/Salinispora	1 (97)	1 (159)	12.4	0.76
anabaenopeptin	Cyanobacteria	2 (3)	3 (6)	4.2	0.93
micropeptin	Cyanobacteria	-	incorrectly discarded	-	1.00
kawaguchipeptin	Cyanobacteria	1 (37)	2 (39)	4.7	0.71
microcyclamide	Cyanobacteria	1 (13)	1 (18)	4.7	1.00
microcystin	Cyanobacteria	11 (4)	13 (8)	3.2	0.64
microcystin RR	Cyanobacteria	-	incorrectly discarded	-	0.64
mycobactin	Nocardia	1 (90)	2 (133)	3.2	0.64
mycobactin	Nocardia	incorrectly discarded	-	-	0.03
nocardimicin	Nocardia	-	-	-	-
nocobactin	Nocardia	-	incorrectly discarded	-	0.59
nocobactin	Nocardia	-	incorrectly discarded	-	0.59
nocobactin	Nocardia	-	incorrectly discarded	-	0.64
formobactin	Nocardia	1 (338)	4 (648)	2.12	0.64
nocardimicin	Nocardia	2 (255)	11 (427)	2.12	0.59
carboxynocobactin	Nocardia	1 (338)	4 (648)	2.12	0.46

Additionally, by using multiple classification ontologies and their multiple hierarchical levels at the same time, most BGCs and MS/MS spectra will have at least one match in the NPClassScore tables. This ensures that NPClassScore can almost always be used to assess the validity of a proposed link based on chemical class information. In this way, NPClassScore is used as a lenient filtering mechanism, as a potential link is already retained when there is a match in only one of the scoring tables that passes the threshold. We do note that CANOPUS and MolNetEnhancer give quite different chemical class predictions for the spectra in our datasets (Table S6-S7). Looking at the 6,606 spectra with predictions from both tools in the Streptomyces/Salinispora dataset, the ClassyFire superclass ‘Lipids and lipid-like molecules’ is predicted 2,686 times by MolNetEnhancer and 1,301 times by CANOPUS. Similarly stark differences can be seen for other superclasses, like ‘Organic acids and derivatives’ and ‘Phenylpropanoids and polyketides’. Although we show that NPClassScore can filter down the results and prioritise actual BGC-MS/MS spectrum links, the choice of software tool for structure-based ontology prediction will have a large influence on the final results.

We have taken a step forward in integrative genome-metabolome mining by automatically matching genome-derived and structure-based chemical compound class ontologies and implementing the empirical NPClassScore to assess possible BGC-MS/MS spectrum links. To facilitate its use, NPClassScore is implemented in the NPLinker framework, and the NPClassScore scoring tables can be further updated upon new releases of MIBiG. For now, we rely mostly on CANOPUS as a predictive software for chemical classes from MS/MS spectra with lower masses, and MolNetEnhancer for larger-sized molecules. In the future, other methods, like mass-spectral embeddings such as Spec2Vec and MS2DeepScore, might be better suited to deal with bigger metabolites and decrease run times [8, 22, 23]. Although we show that most validated links are retained by NPClassScore in three datasets that contain many of the validated links in the PoDP, the lack of verified genome-metabolome links remains one of the current bottlenecks to validate new methods. We anticipate that our method will help record more validated links to move this field forward. Integrative genome-metabolome mining is a complex problem that will require many different solutions and smart ways to integrate those. Combining and streamlining NPLinker with other novel linking methods, such as NPOmix, will be key for advancing this field to understand complex microbial communities and prioritise NP discovery [24]. Other possible routes are to implement additional feature-based scores, for example based on shared substructures inferred from the genomic and metabolomic data, the latter through data-driven approaches like MS2LDA and MotifDB [25, 26]. The class matching matrices developed here could also be used to select a reduced list of plausible candidates in structure databases for predicted BGCs based on matching classes. 
Our contribution will aid researchers in finding correct BGC-MS/MS spectrum links more easily, and will facilitate acceleration of efforts to connect metabolites to their producer strains and elucidate their roles in microbial ecosystems and the microbiome.

Matching class ontologies in MIBiG repository

All entries from MIBiG 2.0 were downloaded from mibig.secondarymetabolites.org in json format. The SMILES and MIBiG (sub)classes were extracted from the json files, and the predicted antiSMASH 5 classes for the MIBiG BGCs were retrieved from the MIBiG website. For each MIBiG entry, the ClassyFire and NPClassifier chemical compound classifications were retrieved through the GNPS API (ccms-ucsd.github.io/GNPSDocumentation/api/) using the SMILES or RDKit-generated InChIKeys. ClassyFire classes were retrieved until the subclass level. Genome- and structure-based classes were related to each other by splitting up hybrid classes and counting the connections between all class-terms, except for PKS-NRP hybrids that were grouped together at the MIBiG class level. The count of a pair of class-terms from the genome- and structure-derived ontologies is then used to express the match between the pair of class-terms. As different classes have different overall occurrences, we used the relative counts as a score to assess the validity for a match between a genome- and structure-based class. For each matching pair of class-terms, the score was calculated by dividing the count of the match by the total occurrence of the class-term, either starting from the genome-based class or from the structure-based class. From the 3 genome-based and 9 structure-based class levels, this created in total 54 scoring tables of two types, one coming from the genome side and one from the metabolome side, akin to NPLinker that can create a link between a BGC and spectrum or vice versa. Code and scoring tables are available at github.com/louwenjjr/mibig_classifications.
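The score calculation described above can be written compactly. The notation is introduced here for illustration and does not appear in the source: n(g, s) is the number of counted connections between genome-based class-term g and structure-based class-term s, and n(g) and n(s) are the total occurrences of each class-term.

```latex
\[
S_{\mathrm{genome}}(g, s) = \frac{n(g, s)}{n(g)},
\qquad
S_{\mathrm{structure}}(g, s) = \frac{n(g, s)}{n(s)}
\]
\[
\underbrace{3}_{\text{genome-based levels}} \times
\underbrace{9}_{\text{structure-based levels}} \times
\underbrace{2}_{\text{table types}} = 54 \ \text{scoring tables}
\]
```

Both scores share the same numerator, so they differ only when the two class-terms have different overall occurrences, which is why both table types are kept.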

Predicting chemical compound classifications from mass spectra

Currently, CANOPUS and MolNetEnhancer are used within NPLinker to predict ClassyFire and NPClassifier ontologies directly from MS/MS spectra. By default, CANOPUS is used for masses below 850 Da, and MolNetEnhancer for masses above 850 Da and for spectra without a CANOPUS prediction. CANOPUS and MolNetEnhancer can also be used separately. SIRIUS v4.9.3 is used to run CANOPUS directly on the mgf file resulting from GNPS molecular networking. By default, the ‘formula zodiac structure canopus’ setting and a ‘--maxmz’ cut-off of 850 are used. The CANOPUS results are related to the spectra and molecular families (MFs) with classifications_to_gnps.py from canopus_treemap (github.com/kaibioinfo/canopus_treemap). For the CANOPUS predictions, the ClassyFire cut-off is 0.5 and the NPClassifier cut-off is 0.33. All resulting ClassyFire classes are sorted by priority at each class level, while NPClassifier classes are sorted by their probabilities. To obtain classes for MFs, classes of all spectra belonging to an MF are counted at each level, and classes that occur in at least 20% of the spectra in the MF are kept. ClassyFire classes are again sorted by priority and NPClassifier classes by class occurrence in the MF. Currently, MolNetEnhancer has to be run externally, for example on the GNPS platform [12]. The output file ClassyFireResults_Network.txt can be downloaded from the GNPS platform and used directly as input for NPLinker.
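The propagation of spectrum-level classes to an MF at one class level can be sketched as below. The function name and input layout are hypothetical. Only the occurrence-based ordering described for NPClassifier is modelled; the ClassyFire priority ordering is not, and the alphabetical tie-break is an assumption added to make the output deterministic.

```python
from collections import Counter


def classes_for_family(spectrum_classes, min_fraction=0.2):
    """Keep classes found in at least `min_fraction` of the spectra in one molecular family.

    spectrum_classes: one collection of class names per spectrum, for a single class level.
    """
    n_spectra = len(spectrum_classes)
    if n_spectra == 0:
        return []
    # Count each class at most once per spectrum.
    counts = Counter(c for classes in spectrum_classes for c in set(classes))
    kept = [(c, n) for c, n in counts.items() if n / n_spectra >= min_fraction]
    # Most frequent first; ties broken alphabetically (assumption, for determinism).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [c for c, _ in kept]


# Toy molecular family with five spectra (invented for illustration).
toy_family = [
    ["Macrolides"],
    ["Macrolides"],
    ["Macrolides", "Linear polyketides"],
    ["Aromatic polyketides"],
    ["Macrolides"],
]
print(classes_for_family(toy_family))
```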

NPClassScore in NPLinker

The MIBiG scoring tables are used in NPLinker v1.2 (github.com/NPLinker/nplinker) by the new scoring method NPClassScore, which can create links between BGCs or GCFs and MS/MS spectra or MFs. Out of the 54 total scoring tables, 28 are used by NPClassScore, as some classes could not be predicted from BGCs or MS/MS spectra, like MIBiG-subclass and the is_glycoside class level from NPClassifier. Within NPClassScore, GCFs are annotated with an MIBiG class, based on their BiG-SCAPE class, and with the antiSMASH classes of their child BGCs if the class occurs in at least half of the BGCs in the GCF (Table S2-S3). Spectra and MFs get their ClassyFire and NPClassifier annotations from CANOPUS or MolNetEnhancer predictions. For all the different combinations of class ontologies, the scores are looked up in the scoring tables and returned from high to low, where the highest score is returned by NPClassScore and can be used to filter out candidate links after using other scoring methods such as co-occurrence-based scoring (Fig. 2). A demo notebook to perform NPClassScore linking within NPLinker is available at https://github.com/NPLinker/nplinker/blob/master/notebooks/npclassscore_linking/NPClassScore_demo.ipynb.
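The lookup across combinations of class ontologies can be sketched as follows. The function name and the dictionary layout are hypothetical and are not the NPLinker data model. The two table entries reuse the scores reported above for staurosporine and rosamicin, combined into a single toy link purely for illustration.

```python
def npclassscore_matches(gcf_classes, spectrum_classes, scoring_tables):
    """Collect all class matches for one candidate link, sorted from high to low score.

    gcf_classes: {genome_level: [class, ...]}
    spectrum_classes: {structure_level: [class, ...]}
    scoring_tables: {(genome_level, structure_level): {(genome_class, structure_class): score}}
    """
    matches = []
    for (g_level, s_level), table in scoring_tables.items():
        for g in gcf_classes.get(g_level, []):
            for s in spectrum_classes.get(s_level, []):
                score = table.get((g, s))
                if score is not None:
                    matches.append((score, f"{g_level}-{s_level}", g, s))
    matches.sort(key=lambda match: match[0], reverse=True)
    return matches


# Two single-entry toy tables; a real table holds one score per observed class pair.
toy_tables = {
    ("antiSMASH", "ClassyFire-superclass"): {
        ("indole", "Organoheterocyclic compounds"): 0.78,
    },
    ("MIBiG", "NPClassifier-pathway"): {
        ("Polyketide", "Polyketides"): 0.76,
    },
}
toy_gcf = {"antiSMASH": ["indole"], "MIBiG": ["Polyketide"]}
toy_spectrum = {
    "ClassyFire-superclass": ["Organoheterocyclic compounds"],
    "NPClassifier-pathway": ["Polyketides"],
}
matches = npclassscore_matches(toy_gcf, toy_spectrum, toy_tables)
best_score = matches[0][0] if matches else None
print(best_score, matches[0][1])
```

Returning the full sorted list rather than only the maximum keeps the information on which scoring table produced the top match.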

Preparing the datasets and running NPLinker

The molecular network for the Streptomyces/Salinispora dataset used in Crüsemann et al. was cloned to GNPS version 28.2. This dataset consists of MassIVE accessions MSV000078836, MSV000078839 and MSV000079284, containing data for 159 strains of Streptomyces and Salinispora (gnps.ucsd.edu/ProteoSAFe/status.jsp?task=9ba6f1296adb494db4dac117110a420a) [20]. From these accessions in the Paired-omics Data Platform (PoDP), the corresponding genomes were downloaded from NCBI if they had RefSeq or GenBank identifiers. antiSMASH 6 was run on the 104 retrieved genomes, and the output was merged with antiSMASH 3 data from the other 50 strains used in Crüsemann et al. [20]. The strain mappings file used was combined from the three PoDP accessions (github.com/NPLinker/nplinker/tree/master/notebooks/npclassscore_linking/crusemann_strain_mappings.csv). The Cyanobacteria and Nocardia datasets were accessed through their PoDP accessions, MSV000084950 and MSV000084771, respectively. The molecular networks listed in the PoDP were used, while antiSMASH 6 was run on the listed 24 and 11 genomes, respectively. CANOPUS was run for the three datasets within NPLinker with the aforementioned default settings, which took around 24 hours for the Streptomyces/Salinispora dataset and less for the other two datasets. MolNetEnhancer was run for the three datasets on GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=6a07ff87c7574329b397a779a716fc69, https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=bb106ef439094975a5035b5d7c7f762e, https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=1de521d0932f414682ef4052496ff8a6). NPLinker v1.2 was first run through the Docker version with a BiG-SCAPE cut-off of 0.3, after which the datasets were further analysed within Jupyter notebooks (https://github.com/NPLinker/nplinker/tree/master/notebooks/npclassscore_linking).
Standardized Metcalf scoring was used with a cut-off of 2.5, after which candidate links were filtered out if their NPClassScore was below a cut-off of 0.25. By using CANOPUS and MolNetEnhancer together, most MS/MS spectra were annotated with structure-based ontologies. Nevertheless, 1,988 MS/MS spectra across the datasets had neither a CANOPUS nor a MolNetEnhancer prediction (Table S4), and we discarded the candidate links that therefore lacked structure-based classes. We chose to do so because it did not hamper finding the validated links in the three datasets. When such links are not filtered out, the average decrease in candidate links per GCF is 49% (Table S8). This filtering can be switched on or off by toggling the .filter_missing_scores attribute of the NPClassScore scoring method within NPLinker.
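The two-step filtering described above can be summarised in a short, self-contained Python sketch. It does not call the NPLinker API: the CandidateLink record and the filter_links function are hypothetical, the filter_missing_scores parameter merely mirrors the attribute named in the text, and treating both cut-offs as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateLink:
    """Hypothetical record for one GCF to MS/MS spectrum candidate link."""
    gcf_id: str
    spectrum_id: str
    metcalf: float                # standardized Metcalf score
    class_score: Optional[float]  # None when no class prediction exists


def filter_links(links, metcalf_cutoff=2.5, class_cutoff=0.25,
                 filter_missing_scores=True):
    """Apply the co-occurrence cut-off first, then the class-matching cut-off."""
    kept = []
    for link in links:
        if link.metcalf < metcalf_cutoff:
            continue  # fails the co-occurrence-based step
        if link.class_score is None:
            if filter_missing_scores:
                continue  # no structure-based classes: discard the link
        elif link.class_score < class_cutoff:
            continue  # classes were predicted but do not match well enough
        kept.append(link)
    return kept
```

Calling filter_links(links, filter_missing_scores=False) keeps links that pass the Metcalf cut-off but have no class prediction, which corresponds to the alternative setting discussed above.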

Validating experimentally validated BGC-MS/MS spectrum links

To locate the BGCs for the validated links as listed in the PoDP, we used cblaster with default settings with the MIBiG BGC listed on the PoDP as query and the antiSMASH gbk files of each of the three datasets as database [27]. To identify the correct MS/MS spectra for the validated links, we found the MF, cluster or scan id as listed on the PoDP in our molecular networks. If that failed, for example because we ran a new molecular network for the Streptomyces/Salinispora dataset, we compared parent masses that occurred in the same strains as listed on the PoDP. No MS/MS spectrum could be found this way for retimycin and nocardimicin. For each validated link, the ranked position was recorded before and after filtering with NPClassScore, taking into account that ranks can be shared as multiple candidate links can have the same scores. We note that the Nocardia dataset exhibited quite low standardized Metcalf scores for the validated links, which is why we changed the standardized Metcalf threshold from 2.5 to 2 for this dataset for finding the validated links. This is probably due to the Nocardia dataset being the smallest out of the three datasets, as well as the incongruence between the BGC and MS/MS spectrum clustering cut-offs.
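Recording ranks that can be shared between equally scored links can be sketched as follows. The text does not state which tie convention was used, so this sketch assumes competition ranking (tied links share the best position and the next distinct score skips ahead); the function name and the example scores are hypothetical.

```python
def shared_ranks(scores):
    """Map each link id to its rank; links with equal scores share a rank.

    scores: {link_id: score}, where a higher score means a better link.
    Assumption: competition ranking, e.g. ranks 1, 1, 3, 4.
    """
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    ranks = {}
    previous_score = None
    current_rank = 0
    for position, (link_id, score) in enumerate(ordered, start=1):
        if score != previous_score:
            current_rank = position  # a new score starts a new rank
            previous_score = score
        ranks[link_id] = current_rank
    return ranks


# Rank of a validated link before and after class-based filtering
# (illustrative scores only).
before = {"validated": 2.6, "x": 3.1, "y": 3.1, "z": 2.7}
after = {k: v for k, v in before.items() if k not in {"y", "z"}}
rank_before = shared_ranks(before)["validated"]
rank_after = shared_ranks(after)["validated"]
```

In this toy example the validated link moves from rank 4 to rank 2 once two competing links are filtered out.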

Abbreviations

MS/MS: mass fragmentation spectra

BGC: biosynthetic gene cluster

NP: natural product

PoDP: paired omics data platform

GCF: gene cluster family

MF: molecular family

NPClassScore: NPLinker Class-based matching Score for linking BGCs and MS/MS-spectra

NRP: non-ribosomal peptide

PK: polyketide

OrgHetCyc: Organoheterocyclic compounds

Availability and requirements

Project name: NPClassScore implemented in NPLinker.

Project home page: https://github.com/NPLinker/nplinker

Operating system(s): Platform independent

Programming language: Python

Other requirements: See requirements.txt at https://github.com/NPLinker/nplinker.

License: Apache-2.0 License.

Any restrictions to use by non-academics: See license.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

All code used during this study is available on GitHub at https://github.com/NPLinker/nplinker and https://github.com/louwenjjr/mibig_classifications. A demo notebook to perform NPClassScore linking within NPLinker, and notebooks to replicate this study are available at https://github.com/NPLinker/nplinker/notebooks/npclassscore_linking.

All data used in this study is public. The MIBiG json files analysed during the current study are available in the MIBiG repository, https://dl.secondarymetabolites.org/mibig/mibig_json_2.0.tar.gz. The molecular networking results analysed during the current study are available in the GNPS platform, https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=9ba6f1296adb494db4dac117110a420a, http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=cefe9408d6e64d7691490c7ac796fbea, and http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a50dc9e2fc4f4a26aef72b0e05fdb123, for the Salinispora/Streptomyces, Cyanobacteria and Nocardia datasets, respectively. The RefSeq and GenBank accessions for the Salinispora and Streptomyces strains analysed during the current study are available in the following PoDP accessions, https://pairedomicsdata.bioinformatics.nl/projects/0ff7a302-49af-4130-a440-59e284d4d365.4, https://pairedomicsdata.bioinformatics.nl/projects/297c364c-b154-4edd-a7d5-68decf9effa2.4, and https://pairedomicsdata.bioinformatics.nl/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4. The RefSeq and GenBank accessions for the strains from the Cyanobacteria and Nocardia dataset are available in the following PoDP accessions, https://pairedomicsdata.bioinformatics.nl/projects/84b56cd3-218b-4e57-906f-a9003615ed07.2, and https://pairedomicsdata.bioinformatics.nl/projects/a4837d37-1df1-4153-b365-c06841574235.3.

Competing interests

MHM is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. JJJvdH is a member of the Scientific Advisory Board of NAICONS Srl., Milano, Italy.

Funding

All authors are grateful to the Netherlands eScience Center for financial support (ASDI eScience grant, ASDI.2017.030, and an Open eScience Call, NLESC.OEC.2021.002).

Authors' contributions

JJJvdH and MHM were responsible for the concept and supervision of the study. JJRL performed all analyses and implemented all the code. JJRL drafted the report, after which all authors contributed to revising the draft and completing the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank Andrew Ramsay for valuable insight and assistance in implementing NPClassScore in NPLinker, Catarina Loureiro for useful feedback, and both the genome and metabolome mining communities for their encouragement.

References

1. Atanasov AG, Zotchev SB, Dirsch VM, Orhan IE, Banach M, Rollinger JM, Barreca D, Weckwerth W, Bauer R, Bayer EA et al: Natural products in drug discovery: advances and opportunities. Nature Reviews Drug Discovery 2021, 20(3):200–216.
2. Jarmusch SA, van der Hooft JJJ, Dorrestein PC, Jarmusch AK: Advancements in capturing and mining mass spectrometry data are transforming natural products research. Natural Product Reports 2021, 38(11):2066–2082.
3. Beniddir MA, Kang KB, Genta-Jouve G, Huber F, Rogers S, van der Hooft JJJ: Advances in decomposing complex metabolite mixtures using substructure- and network-based computational metabolomics approaches. Natural Product Reports 2021, 38(11):1967–1993.
4. Blin K, Shaw S, Kloosterman AM, Charlop-Powers Z, van Wezel GP, Medema MH, Weber T: antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Research 2021, 49(W1):W29-W35.
5. Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, Durcak J, Wurst M, Kotowski J, Chang D et al: A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Research 2019, 47(18):e110-e110.
6. Schorn MA, Verhoeven S, Ridder L, Huber F, Acharya DD, Aksenov AA, Aleti G, Moghaddam JA, Aron AT, Aziz S et al: A community resource for paired genomic and metabolomic data mining. Nature Chemical Biology 2021, 17(4):363–368.
7. van der Hooft JJJ, Mohimani H, Bauermeister A, Dorrestein PC, Duncan KR, Medema MH: Linking genomics and metabolomics to chart specialized metabolic diversity. Chemical Society Reviews 2020, 49(11):3297–3314.
8. Louwen JJ, Van Der Hooft JJ: Comprehensive large-scale integrative analysis of omics data to accelerate specialized metabolite discovery. mSystems 2021, 6(4):e00726-00721.
9. Kautsar SA, Blin K, Shaw S, Navarro-Muñoz JC, Terlouw BR, van der Hooft JJJ, van Santen JA, Tracanna V, Suarez Duran HG, Pascal Andreu V et al: MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Research 2019, 48(D1):D454-D458.
10. Hjörleifsson Eldjárn G, Ramsay A, van der Hooft JJJ, Duncan KR, Soldatou S, Rousu J, Daly R, Wandy J, Rogers S: Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions. PLOS Computational Biology 2021, 17(5):e1008920.
11. Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, Kautsar SA, Tryon JH, Parkinson EI, De Los Santos ELC, Yeong M, Cruz-Morales P, Abubucker S et al: A computational framework to explore large-scale biosynthetic diversity. Nature Chemical Biology 2020, 16(1):60–68.
12. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y, Nguyen DD, Watrous J, Kapono CA, Luzzatto-Knaan T: Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology 2016, 34(8):828–837.
13. Doroghazi JR, Albright JC, Goering AW, Ju KS, Haines RR, Tchalukov KA, Labeda DP, Kelleher NL, Metcalf WW: A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nature Chemical Biology 2014, 10(11):963–968.
14. Chase AB, Sweeney D, Muskat MN, Guillén-Matus DG, Jensen PR, Ravel J: Vertical Inheritance Facilitates Interspecies Diversification in Biosynthetic Gene Clusters and Specialized Metabolites. mBio 2021, 12(6):e02700-02721.
15. Soldatou S, Eldjárn GH, Ramsay A, van der Hooft JJJ, Hughes AH, Rogers S, Duncan KR: Comparative Metabologenomics Analysis of Polar Actinomycetes. Marine Drugs 2021, 19(2):103.
16. Feunang YD, Eisner R, Knox C, Chepelev L, Hastings J, Owen G, Fahy E, Steinbeck C, Subramanian S, Bolton E: ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. Journal of Cheminformatics 2016, 8(1):1–20.
17. Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, Van Der Hooft JJ, Dorrestein PC, Gerwick WH, Cottrell GW: NPClassifier: A deep neural network-based structural classification tool for natural products. Journal of Natural Products 2021, 84(11):2795–2807.
18. Dührkop K, Nothias L-F, Fleischauer M, Reher R, Ludwig M, Hoffmann MA, Petras D, Gerwick WH, Rousu J, Dorrestein PC et al: Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nature Biotechnology 2021, 39(4):462–471.
19. Ernst M, Kang KB, Caraballo-Rodríguez AM, Nothias L-F, Wandy J, Chen C, Wang M, Rogers S, Medema MH, Dorrestein PC et al: MolNetEnhancer: Enhanced Molecular Networks by Integrating Metabolome Mining and Annotation Tools. Metabolites 2019, 9(7):144.
20. Crüsemann M, O'Neill EC, Larson CB, Melnik AV, Floros DJ, da Silva RR, Jensen PR, Dorrestein PC, Moore BS: Prioritizing Natural Product Diversity in a Collection of 146 Bacterial Strains Based on Growth and Extraction Protocols. Journal of Natural Products 2017, 80(3):588–597.
21. Männle D, McKinnie SMK, Mantri SS, Steinke K, Lu Z, Moore BS, Ziemert N, Kaysser L, Savage DF: Comparative Genomics and Metabolomics in the Genus Nocardia. mSystems 2020, 5(3):e00125-00120.
22. Huber F, Ridder L, Verhoeven S, Spaaks JH, Diblen F, Rogers S, van der Hooft JJJ: Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships. bioRxiv 2020:2020.2008.2011.245928.
23. Huber F, van der Burg S, van der Hooft JJJ, Ridder L: MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. Journal of Cheminformatics 2021, 13(1):84.
24. Leao TF, Wang M, da Silva R, van der Hooft JJ, Bauermeister A, Brejnrod AD, Glukhov E, Gerwick L, Gerwick WH, Bandeira N: A supervised fingerprint-based strategy to connect natural product mass spectrometry fragmentation data to their biosynthetic gene clusters. bioRxiv 2021.
25. Rogers S, Ong CW, Wandy J, Ernst M, Ridder L, van der Hooft JJJ: Deciphering complex metabolite mixtures by unsupervised and supervised substructure discovery and semi-automated annotation from MS/MS spectra. Faraday Discussions 2019, 218(0):284–302.
26. van der Hooft JJ, Wandy J, Barrett MP, Burgess KE, Rogers S: Topic modeling for untargeted substructure exploration in metabolomics. Proc Natl Acad Sci U S A 2016, 113(48):13738–13743.
27. Gilchrist CLM, Booth TJ, van Wersch B, van Grieken L, Medema MH, Chooi Y-H: cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinformatics Advances 2021, 1(1).


Additional file 1 – Supplementary information (NPClassScorerevisionSI20220617final.docx). Supplementary information listing all supplementary figures and tables.


Editorial decision: Major revision
27 Oct, 2022
Reviews received at journal
24 Sep, 2022
Reviewers agreed at journal
23 Sep, 2022
Reviewers invited by journal
23 Sep, 2022
Editor assigned by journal
21 Jun, 2022
Submission checks completed at journal
19 Jun, 2022
First submitted to journal
17 Jun, 2022
