The study of key marker genes together with the development of the “-omics” techniques has increased our understanding of marine biogeochemical cycles [14, 55]. Both metagenomics and amplicon approaches are commonly used to target functional genes and describe their ecological patterns. Amplicon sequencing is easy to implement, relatively cheap, and effective in capturing large numbers of variants. However, due to the high variability of protein-coding genes, primer biases are common and can result in the misrepresentation of the relative contribution of certain taxa. In contrast, metagenomics is PCR-free and less biased for functional gene analysis, but it generally retrieves fewer copies of marker genes and hinders the comprehensive study of functional groups with low abundances in the environment. For example, the nifH gene, a genetic marker of nitrogen-fixing populations, is usually studied based on amplicon sequencing since the number of variants retrieved with metagenomic surveys is too low [56, 57], if not undetectable. Likewise, most of the studies of aerobic anoxygenic phototrophic (AAP) bacteria are based on the partial amplification of the pufM gene (e.g., [17, 26–28, 31, 33]), while only few have approached their study based solely on metagenomics [34, 35]. Yet, the application of metagenomics allows the description of new metabolisms and new taxa in marine microbiomes, such as the discovery of new nitrogen fixation pathways in surface ocean heterotrophs [56], and the description of new AAP phylogroups previously not reported [34]. In this study we combined both approaches to unveil the biases of existing primers for the pufM gene, design new ones, and test the performance of different primer combinations in a variety of marine environments.
The debate regarding primer biases in the case of the pufM gene comes from long [26–28, 38]. A common point of these studies that employ the primer pair pufMF/pufM_WAW is the dominance of sequences from phylogroup K (Gammaproteobacteria) in AAP communities. Lehours et al., [29] reported > 80% of relative abundance of this phylogroup in samples from the Mediterranean Sea and considered possible primer biases favoring the amplification of the Gammaproteobacteria but discarded that it was the case since the same primer pair in Arctic waters led to low abundances of phylogroup K. Still, it is likely that arctic bacterial communities have a very different composition, due to temperature differences and the effect of ice melt in salinity. Gammaproteobacteria was also predominant in samples from the coastal Blanes Bay [27, 28]. Although the seasonal trends of communities retrieved both by metagenomics and amplicon sequencing had a similar tendency for the predominant groups [27], primers pufMF/pufM_WAW overestimated the contribution of phylogroup K and failed to amplify sequences from some groups that appeared in the metagenomic approach. Besides the prevalence of Gammaproteobacteria in marine AAP communities, most of the studies also reported the presence of members of the Alphaproteobacteria [26–28, 33, 38, 53, 58, 59]. Our results show that the number of mismatches of sequences affiliated to Gammaproteobacteria and some Alphaproteobacteria orders are low in comparison to the mismatches in phylogroups A, B, C and D, that are almost absent from amplicon-based studies (Fig. 1). In fact, these phylogroups were described for the first time following the shotgun metagenomic sequencing approach of Yutin et al., [34]. These phylogroups have no cultured representatives and even though they were abundant in samples from the Global Ocean Expedition (GOS, [34]), they have been barely retrieved in other studies, which might be explained by the high number of mismatches within the region of the commonly used pufMF primer (Fig. 1, 2b). On the contrary, the universal primers UniF, UniR and pufM_WAW [22] have a very good phylogenetic coverage (Fig. 1). While the reverse primer pufM_WAW have been extensively used in combination with pufMF, primers UniF/UniR have never been used in marine environments and were discarded in previous studies as they repeatedly failed in amplifying under different conditions [38] and authors unpublished observations). Both primers, pufMF and UniF, hybridize in the same region of the pufM gene (with 3-nucleotides shift between them), but UniF has 10 degenerate nucleotides, a very low GC content, and low melting temperature (Table 1), which might explain why it is problematic in vitro.
The primers designed in this study aimed at both improving the coverage of the commonly used ones and producing longer amplicons. While we did not succeed in the design of primers in the upstream region, we were able to produce an oligonucleotide with a higher hybridization ratio than pufMF, and a lower number of degenerate nucleotides than UniF, while maintaining the amplicon size of ~ 200 bp. The detailed analysis of the nucleotide composition of primer pufMF (Fig. 2b) showed that most of the mutations in the primer region can be associated to different phylogroups and generally represent silent mutations. The only exceptions are sequences from phylogroup C and D that have a histidine and a tryptophan respectively, instead of a tyrosine in the first position (of the primer region), and sequences from phylogroup D that encodes for a tyrosine instead of a phenylalanine in position five. We identified 7 nucleotides with problematic mismatches, represented in bold and underlined in Fig. 2, that were considered when redesigning the new primer pufMF_Y. Nevertheless, we cannot discard that other regions or combinations might work, nor that designs of primers that include both the pufL and the pufM gene might produce better tools.
Finding universally conserved regions in a functional gene is challenging and sometimes it is not possible to generate universal primers, as it happens, among others, with genes nirS and nirK (NO-forming nitrite reductase genes) that rely on the use of clade-specific primers [60] Although the use of universal primers is appropriate to describe microbial communities, even perfectly matched primers can exhibit preferential amplification, thus beyond the in-silico testing, analyses with environmental samples is also important for primer evaluation [61]. To test different primer combinations, we used samples from different marine environments (coastal vs open ocean) and from different seasons, to include the spatial and seasonal variability that exists in AAP assemblages, as well reported previously [26, 27, 31, 34]. The metagenomic assay provides a biases-free representation of the most abundant components of AAP communities in samples from Blanes Bay and Malaspina (Fig. 3b). Even though the number of copies retrieved is low, and some could not be taxonomically assigned, comparing these communities to those obtained through amplicon-sequencing is the best approach to analyze the biases of each primer combination. A previous analysis with samples of Blanes Bay already identified discrepancies in the taxonomic composition of communities with the methods, such as sequences from phylogroups A, B, C and L that were only retrieved by metagenomics and were absent in the amplicon approach [27]. In this study, using the same extracted DNA from BBMO samples, and different primer combinations, we have been able to amplify sequences from some of these phylogroups, which in fact constitute over 50% of the relative abundance in two samples (Fig. 3 de). Likewise, we obtained a great proportion of these groups in samples from Malaspina, that were completely missed with primers pufMF/pufM_WAW, as shown in Fig. 3c and in Gazulla et al., [26]. These results are consistent with the reports from Yutin et al., [34] and Cuadrat et al [35], in which they describe a high proportion of sequences affiliated to phylogroups A, B, C, and D in AAP communities from different marine environments. Noteworthy, we also retrieved some sequences taxonomically related to the new “Candidatus Luxescamonaceae” family, recently described by Graham et al., [36].
The alpha diversity estimates obtained with primers pufMF_Y/pufM_WAW and UniF/UniR (Fig. S1) surpassed by far the estimates described in samples of the BBMO [27] and the Malaspina Expedition [26] with primers pufMF/pufM_WAW. Interestingly, these previous studies had provided the most comprehensive datasets for AAP bacteria but were clearly underestimating AAP diversity. The taxonomic composition of samples from the metagenomic and the amplicon approaches with the new primer combinations was quite similar, especially for primers UniF/UniR, which reproduced with high fidelity the taxonomic composition observed with the metagenomes for almost all samples. In turn, primers pufMF_Y/pufM_WAW seem to represent quite well the communities from coastal waters, where Gammaproteobacteria from phylogroup K are abundant, yet however, they might still have a higher affinity for phylogroup K since in samples of the open ocean this group appears overrepresented to the detriment of others, such as phylogroup D and G (Rhodobacterales-like).
The outcome of amplification approaches is highly impacted by the phylogenetic resolution of the primers employed, but other factors such as the efficiency in terms of number of reads, or the microbial communities and the environment that is being analyzed, might also influence. Both pufMF_Y/pufM_WAW and UniF/UniR provided similar results and multiplied the number of reads obtained with the previous primers. While primers UniF/UniR were described several years ago, and their phylogenetic coverage was already proven to be good [22], primers pufMF/pufM_WAW showed better in vitro performance ([38], and authors unpublished observations) and in fact, in this study, amplification with UniF/UniR required troubleshooting during PCR amplification for Illumina sequencing (see Methods). In contrast, primers pufMF_Y/pufM_WAW offered good results in terms of yield (number of reads) and generated longer amplicons (203bp), which translated into more sequences and higher taxonomic resolution.
Overall, our results indicate that the taxonomic composition of primers pufMF/pufM_WAW is biased towards phylogroup K (Pseudomonadales) (Fig. 3c), that becomes overestimated in almost all samples. The same happens with Sphingomonadales-like and Rhizobiales representatives, and with a small cluster of Alphaproteobacteria sequences that could not be assigned to any specific order or phylogroup. Previous studies had reported high abundances of Gammaproteobacteria in the Mediterranean Sea [27, 28, 38], the Baltic Sea [53], the North Pacific Ocean [58], the Arctic Sea [59], the east coast of Australia [33] or the tropical and subtropical global ocean [26]. Since these studies analyzed AAP communities with the pufMF/pufM_WAW primers, it is likely that some of these results misrepresent the real composition of AAP communities, just as we have shown for samples of the Malaspina Expedition or Blanes Bay. Here, we show that the taxonomic structure of AAP communities is very different to what has been described in previous studies, at least for this subset of samples, and perhaps this conclusion can be extended to other environments. Yet, albeit the exposed primer biases and the overrepresentation of Gammaproteobacteria, the published previous results (e.g., the adaptation of different phylogenetic clades [31], their seasonal trends [27], or the ecological processes shaping AAP assemblages in the surface global ocean [26] should not be discarded because, when we compared the community structure retrieved with the three primer combinations, the samples clustered in a very similarly way (Fig. S2) and distance matrices were strongly correlated.
To conclude, we used an extensive pufM database to show the limitations of the forward primer pufMF and propose some alternatives to determine the composition and diversity of AAP communities in marine environments. We revised existing primers for the pufM gene in the literature, designed new ones and selected those with the best performance that were tested with environmental samples and benchmarked against metagenomics. We show that the phylogenetic coverage of primer pufMF is very low for some taxonomic groups and, as a result, amplification with this primer is biased towards phylogroup K (Gammaproteobacteria) and some orders of the Alphaproteobacteria class. Although Gammaproteobacteria are relevant components of AAP communities, several species with no culture representatives, like phylogroups A, B, C and D, have been entirely underrepresented in the past and are in fact dominating an important fraction of AAP assemblages. For future studies analyzing marine AAP bacteria, we recommend using primers pufMF_Y/pufM_WAW or UniF/UniR, to guarantee an unbiased representation of their taxonomic composition.