Matching class ontologies between known BGC-structure pairs
Currently, there are 1,926 experimentally validated BGCs with their corresponding structures present in the MIBiG v2.1 repository [9]. We used the manually annotated MIBiG classes and antiSMASH 5 class predictions for the BGCs alongside NPClassifier and ClassyFire class assignments for the structures to count all interactions between biosynthetic and structural classes (Fig. 1a). Based on the prevalence in MIBiG, we can then infer which class terms match frequently between the different ontologies. We note that the NPClassifier ontology is designed with natural products in mind, thus taking both chemistry and biosynthetic context into account, indeed leading to more direct matches between genome-based and structure-based classifications (Fig. 1b). For example, 76% of polyketide BGCs match to the NPClassifier ‘Polyketides’ pathway, most RiPPs and NRPs match to the ‘Amino acids and Peptides’ pathway, and 65% of terpene BGCs match to the ‘Terpenoids’ pathway. The most general superclass ontology of ClassyFire seems to be less suitable, as matches are more sparsely distributed across the different superclasses (Fig. S1). For instance, polyketide BGCs are distributed almost equally across five different ClassyFire superclasses. However, we also note that a certain degree of complementarity between the two chemical compound ontologies does exist, since, for example, NRPs match for 75% to the ‘Organic acids and derivatives’ superclass from ClassyFire, which is higher than for the NPClassifier pathway ontology.
At more detailed structure-based classifications levels, like NPClassifier superclasses or classes, matches between genome-and structure-based classifications become more distributed as there are more options, and small distinctions in, for example, the structure-based ontologies are not reflected in current BGC classifications as done automatically by antiSMASH. For example, different type 1 polyketide synthase products, like products with the NPClassifier superclasses ‘Macrolides’, ‘Aromatic polyketides’, and ‘Linear polyketides’, still match to the generic type 1 polyketide synthase BGC class resulting in matches that are less conclusive (Fig. S2-S3). Another difficulty for class matching is the fact that many different hybrid classes exist that will make it impossible to reach perfect matches between most classes. Some NPs consist of very complex tailored scaffolds for which a combination of different types of biosynthetic machinery is needed, resulting in complex MIBiG classes like ‘Polyketide-NRP-Other’ for bromoalterochromide A. Additionally, some MIBiG records have very loose cluster boundaries with flanking genes that can trigger erroneous antiSMASH rules, and may therefore lead to erroneous biosynthetic class assignments. In contrast, the chemical classifications are less affected by the presence of different structural scaffolds, as ClassyFire has a priority system to only consider the most important class-terms and NPClassifier will only return at most two terms for its class level. However, the presence of multiple functional groups can sometimes cause challenges in putting a structure in “one” chemical compound class. Nevertheless, both genome and metabolome based chemical compound classification systems seem to work well for most structures. Furthermore, the more characterised BGCs will be deposited in MIBiG, the more these difficulties will average out and improve the class matching usefulness. Similarly, depositing BGCs from a larger variety of classes will address biases in current data availability, as some classes, such as PKSs, are more abundant in the database than others.
NPClassScore can filter out many false positive links
Based on the matched genome-based and structure-based ontologies, we constructed NPClassScore: the NPLinker Class-based matching Score for linking BGCs and MS/MS-spectra. We implemented NPClassScore in the NPLinker framework, where it can be used as an additional filtering step to assess the validity of a predicted link between a gene cluster family (GCF), and an MS/MS spectrum or molecular family (MF) [10]. NPClassScore consists of a scoring table for each of the 28 pairs of the matched genome-and structure-based ontology levels derived from MIBiG. The scores in the tables are made by dividing the counts for each class match by the total occurrence of either the genome-based class or the structure-based class (Fig. 1c; Table S1). This resulted in two types of scoring tables, one coming from the genome side and one from the metabolome side. NPClassScore takes the genome-based and structure-based classes from a proposed link as input, looks up the matching scores between these two classifications in the scoring tables, and reports the class match with the highest score from one of the scoring tables. Thus, the NPClassScore indicates how plausible the link is between a BGC and a possible product based on how often their classes match among BGCs from MIBiG and their experimentally validated metabolic products.
Predicted antiSMASH classes are directly used as input for NPClassScore and the general BiG-SCAPE classes are converted to MIBiG classes (Table S2-S3). In order to predict ClassyFire and NPClassifier ontologies from MS/MS spectra, we used predictions from CANOPUS and MolNetEnhancer within NPClassScore (Fig. 2) [18, 19]. CANOPUS is a command-line tool that is part of the SIRIUS framework and can very accurately predict compound classifications if the right fragmentation trees are calculated. We implemented CANOPUS to run within NPLinker, but as it depends on calculating fragmentation trees, especially time-wise, it is only suitable to be used for the lower masses, below 850 Da. To also capture compound classifications for masses above 850 Da, we used MolNetEnhancer, which relies on propagating annotations between MS/MS spectra within MFs. Currently, MolNetEnhancer only provides ClassyFire predictions and has to be run on the GNPS platform, from which the results can be imported into NPLinker [12]. As the default for NPClassScore, MolNetEnhancer is used as well when there is no CANOPUS prediction for an MS/MS spectrum with a mass below 850 Da, but CANOPUS and MolNetEnhancer can also be used separately.
To assess how well NPClassScore removes improbable links, we used NPLinker on three paired omics datasets from the PoDP that consist of: 154 Streptomyces and Salinispora strains [20], 24 cyanobacterial strains [6], and 11 Nocardia strains [21] (Table S4). These are the largest datasets that contain multiple verified BGC-MS/MS-metabolite links in the PoDP, totalling 26 validated links across the three datasets. The three datasets will be referred to by their taxonomic descriptors. We analysed the three datasets separately using NPLinker and first used a co-occurrence-based strategy (standardized Metcalf score) to identify possible links between GCFs and MS/MS spectra, after which we used NPClassScore to filter the number of linked spectra per GCF [10, 13]. After filtering, the number of candidate MS/MS spectrum links for all GCFs decreased substantially for all datasets. In the Streptomyces/Salinispora dataset, the average number of candidate links per GCF decreased by 68% from 550 to 177. In the smaller Cyanobacteria and Nocardia datasets the number of candidate links per GCF decreased from 27 to 13, and from 206 to 64, representing decreases of 53% and 69%, respectively. Averaging over the three datasets this constitutes an average decrease in candidate links per GCF of 63% (Fig. 3a; Table S5). As the NPClassScore filtering depends on the chosen cut-off, we tried different cut-offs and decided on a cut-off of 0.25 as a default, as around this value there is a marked drop in the number of links per GCF for all datasets (Fig. S4-S6). This is also defendable from a theoretical perspective, as this cut-off means that a given class match should occur for at least 25% of the total occurrences of the class among MIBiG entries. Additionally, this threshold results in many more GCFs with manageable numbers of candidate MS/MS links, which can be analysed manually: in the large Streptomyces/Salinispora dataset, it yields 92 GCFs with 10 or fewer candidate links and 270 GCFs with 25 or fewer candidate links. In contrast, without filtering based on NPClassScore, only 5 GCFs would have fewer than 10 candidate links and only 42 GCFs would have fewer than 25 candidate links in the same dataset (Fig. 2b). Similar trends can be seen for the other two datasets (Fig. S7). Thus, using NPClassScore constitutes a real advantage for users as they can now realistically inspect a much larger percentage of predicted candidate links that are also more likely to be real.
Validated links are retained and get higher ranks after NPClassScore filtering
To assess if the filtered links based on NPClassScore are sensible, we used our three selected paired omics datasets from the PoDP, in each of which several previously experimentally validated BGC-MS/MS spectral-chemical structure links had been recorded. In these datasets, we checked whether these validated links were retained and whether they gained a higher rank in the lists of predicted links upon NPClassScore application. The Streptomyces/Salinispora, Cyanobacteria and Nocardia datasets contain 11, 6 and 9 validated links on the PoDP. Out of the 26 total validated links, 2 could not be found due to a missing spectrum in the datasets. Of the remaining links, 23 out of 24 passed the default NPClassScore threshold, constituting an accuracy of 96% (Fig. 3c). Additionally, this confirms our default NPClassScore threshold of 0.25 as substantial numbers of validated links are removed beyond this threshold (Fig S8). As an example, from the Streptomyces/Salinispora dataset, we found a link between GCF 534 (present in 54 strains), and spectrum 89513 (present in 67 strains), representing the link for staurosporine based on the PoDP [6] (Fig 3d). With a co-occurrence score of 9.0 this link was initially ranked second in the list of 100 potential links for GCF 534. After filtering using NPClassScore, 16 potential links were left, and spectrum 89513 was ranked first; its NPClassScore was 0.78 from the antiSMASH-ClassyFire-superclass scoring table, matching indole to Organoheterocyclic compounds. Similarly, our analysis retrieved the validated link for rosamicin: the rosamicin-biosynthesis-associated GCF 944 (present in 2 strains) was linked to spectra 130529 and 141312, each of which were present in 1 of the 2 strains (Fig. 3d). With a co-occurrence score of 8.7, the links with both spectra were ranked at a shared eighth position in the list of 275 candidate links. After filtering using NPClassScore, 38 candidate links were left in total, and both spectra were jointly ranked at the first position; their NPClassScore scores were 0.76 from the MIBiG-NPClassifier-pathway scoring table, matching Polyketide to Polyketides.
Of note, 5 out of 26 validated links did not pass the co-occurrence scoring threshold implemented in NPLinker, meaning an accuracy of 75% for the entire NPLinker workflow including NPClassScore. Most probably, for these links the clustering of MS/MS spectra and BGCs into dereplicated MS/MS spectra and GCFs do not agree with each other, i.e., their MS/MS spectra are similar enough to be clustered together but BGCs are not, or vice versa. An example that supports this hypothesis is nocobactin in the Nocardia dataset, where the actual link passed the NPClassScore cut-off with a score of 0.59, but the BGCs did not cluster together with our currently applied BiG-SCAPE cut-off, whereas the MS/MS spectra did cluster with the current settings. Regarding the 21 correctly retained validated links, they were not only retained, but also ranked higher in the lists with candidate links due to removing false positive links (Table 1). Out of 21 validated links, 12 are even ranked at the first position after NPClassScore filtering, compared to 5 links being ranked first when using just co-occurrence scoring. This shows that after NPClassScore filtering, the candidate links that are retained at high rankings are more reliable and worth exploring manually.
Table 1. All validated links from the three datasets as listed on the PoDP. The standardized Metcalf score and NPClassScore of all the links are stated as well as the rank of the verified link in the candidate list before and after NPClassScore filtering. The rank number may be shared with a number of other links due to their scores being the same which is indicated in parentheses. Retimycin and nocardimicin have no information as their MS/MS spectra could not be located in their respective datasets.
Name
|
Dataset
|
Rank NPClassScore (shared with n other links)
|
Rank Metcalf (shared with n other links)
|
Standardized Metcalf
|
NPClassScore
|
staurosporine
|
Streptomyces/Salinispora
|
1
|
2
|
9.0
|
0.78
|
rosamicin
|
Streptomyces/Salinispora
|
1 (6)
|
8 (38)
|
8.7
|
0.76
|
desferrioxamine
|
Streptomyces/Salinispora
|
1
|
1
|
9.5
|
0.36
|
rifamycin
|
Streptomyces/Salinispora
|
152
|
257
|
4.4
|
0.45
|
lomaiviticin
|
Streptomyces/Salinispora
|
30 (18)
|
381 (151)
|
3.0
|
0.96
|
arenimycin
|
Streptomyces/Salinispora
|
1 (4)
|
1 (12)
|
12.4
|
0.96
|
enterocin
|
Streptomyces/Salinispora
|
1 (8)
|
4 (81)
|
12.4
|
0.40
|
salinamide
|
Streptomyces/Salinispora
|
1 (40)
|
1 (84)
|
12.4
|
0.64
|
cyclomarin
|
Streptomyces/Salinispora
|
3 (49)
|
4 (83)
|
8.7
|
0.91
|
retimycin
|
Streptomyces/Salinispora
|
-
|
-
|
-
|
-
|
actinomycin
|
Streptomyces/Salinispora
|
1 (97)
|
1 (159)
|
12.4
|
0.76
|
anabaenopeptin
|
Cyanobacteria
|
2 (3)
|
3 (6)
|
4.2
|
0.93
|
micropeptin
|
Cyanobacteria
|
-
|
incorrectly discarded
|
-
|
1.00
|
kawaguchipeptin
|
Cyanobacteria
|
1 (37)
|
2 (39)
|
4.7
|
0.71
|
microcyclamide
|
Cyanobacteria
|
1 (13)
|
1 (18)
|
4.7
|
1.00
|
microcystin
|
Cyanobacteria
|
11 (4)
|
13 (8)
|
3.2
|
0.64
|
microcystin RR
|
Cyanobacteria
|
-
|
incorrectly discarded
|
-
|
0.64
|
mycobactin
|
Nocardia
|
1 (90)
|
2 (133)
|
3.2
|
0.64
|
mycobactin
|
Nocardia
|
incorrectly discarded
|
-
|
-
|
0.03
|
nocardimicin
|
Nocardia
|
-
|
-
|
-
|
-
|
nocobactin
|
Nocardia
|
-
|
incorrectly discarded
|
-
|
0.59
|
nocobactin
|
Nocardia
|
-
|
incorrectly discarded
|
-
|
0.59
|
nocobactin
|
Nocardia
|
-
|
incorrectly discarded
|
-
|
0.64
|
formobactin
|
Nocardia
|
1 (338)
|
4 (648)
|
2.12
|
0.64
|
nocardimicin
|
Nocardia
|
2 (255)
|
11 (427)
|
2.12
|
0.59
|
carboxynocobactin
|
Nocardia
|
1 (338)
|
4 (648)
|
2.12
|
0.46
|
Additionally, by using multiple classification ontologies, and their multiple hierarchical levels at the same time, most BGCs and MS/MS spectra will have at least one match in the NPClassScore tables. This ensures that NPClassScore can almost always be used to assess the validity of a proposed link based on chemical class information. In this way, NPClassScore is used as a lenient filtering mechanism, as a potential link is already retained when there is a match in only one of the scoring tables that passes the threshold. We do note that CANOPUS and MolNetEnhancer both give quite different chemical class predictions for the spectra in our dataset (Table S6-S7). Looking at the 6,606 spectra with predictions from both tools in the Streptomyces/Salinispora dataset, the ClassyFire superclass ‘Lipids and lipid-like molecules’ is predicted 2,686 times by MolNetEnhancer and 1,301 times by CANOPUS. Similar stark differences can be seen for other superclasses like, ‘Organic acids and derivatives’ and ‘Phenylpropanoids and polyketides’. Although we show that NPClassScore can filter down the results and prioritise actual BGC-MS/MS spectrum links, the choice of software tool for structure-based ontology prediction will have a large influence on the final results.