Influence of 16S rRNA reference databases in amplicon-based environmental microbiome research

The reference databases play a pivotal role in amplicon microbiome research, however these databases differ in the sequence content and taxonomic information available. Studies on mock community and human health microbiome have revealed the problems associated with the choice of reference database on the outcome. Nonetheless, the influence of reference databases in environmental microbiome studies is not explicitly illustrated. This study analyzed the amplicon (V1V3, V3V4, V4V5 and V6V8) data of 128 soil samples and evaluated the impact of 16S rRNA databases, Genome Taxonomy Database (GTDB), Ribosomal Database Project (RDP), SILVA and Consensus Taxonomy (ConTax), on microbiome inference. The analyses showed that the distribution of observed amplicon sequence variants was significantly different (P-value < 2.647e−12) across four datasets, generated using different databases for each amplicon region. In addition, the beta diversity was also found to be altered by different databases. Further investigation revealed that the microbiome composition inferred by various databases differ significantly (P-value = 0.001), irrespective of amplicon regions. This study, found that the core-microbiome structure in environmental studies is influenced by the type of reference database used. In summary, this present study illustrates that the choice of reference database could influence the outcome of environmental microbiome research.


Introduction
Environmental microorganisms render several ecosystem services but these organisms are often not cultivable in the laboratory conditions (Dick and Baker 2013;Steen et al. 2019) due to which the identity and the functional importance of several microorganisms remain undetermined. For example, the soil ecosystem consists of diverse organisms but only a small ratio of soil microbes could be cultivated (Janssen et al. 2002;Pham and Kim 2012) and thus the major proportion of soil microorganisms is still left unidentified. However, the unprecedented growth of next generation sequencing based microbiome approaches greatly assists the scientific community to understand the diversity of cultivable as well as non-cultivable environmental microorganisms (Handelsman 2004). In particular, the amplicon-based or marker-gene assisted microbiome approach has been popularly employed to study the diversity and role of microorganisms in different environments. In fact, the amplicon microbiome approach has been used in many global microbiome projects (Gilbert et al. 2014; Thompson et al. 2017;Delgado-Baquerizo et al. 2018;Bižić et al. 2020;Oliverio et al. 2020).
In a standard amplicon microbiome workflow, the phylogenetic marker such as the 16S rRNA (generally the hypervariable regions of 16S rRNA) is amplified and sequenced on high-throughput sequencing platform. Later, the amplicon sequences are processed and analyzed using a suitable reference database which transforms the sequence information into human readable units (microbial taxa) thereby the diversity and functions of microorganisms are analyzed. Thus, the reference database plays a pivotal role in microbiome analysis. Today, several 16S rRNA reference databases are available for microbiome research. Nonetheless, the sequence content and taxonomic information available in common databases differ (Balvočiūtė and Huson 2017) and the comparison of reference databases revealed fallible nature of different databases (Robeson et al. 2021). In particular, the analyses of mock community data using different reference databases illustrated discripancies in the taxonomy inference of microbial taxa (Park and Won 2018). The problems associated with the inference of oral microbiomes due to reference databases is well documented (Sierra et al. 2020). The earth microbiome project (EMP) also hinted at the problem of inconsistency that could arise when different reference databases are employed in the analyses (Thompson et al. 2017). The authors of the EMP study observed that the sequences generated from EMP samples match 46% and 45% of 16S rRNA sequences in Greengenes and SILVA databases, respectively, which suggests that the choice of database could alter the results of microbiome studies. Nonetheless, the influence of reference databases on the outcome of environmental microbiome studies is not explicitly illustrated. Thus, this study aimed to explore different 16S rRNA reference databases and find how these databases could affect the microbiome results.
This study included the amplicon data of four different 16S rRNA hypervariable regions, V1V3, V3V4, V4V5, and V6V8 to assess how the choice of reference database could affect the results of microbiome analyses. The data of four amplicon regions were used in the study, as different hypervariable regions of 16S rRNA have varying degrees of phylogenetic resolution which may influence the microbiome results (Yang et al. 2016;Johnson et al. 2019;Soriano-Lerma et al. 2020). Hence, analyzing the data of four different amplicon regions could help to generalize the finding of this study.

Amplicon data
The amplicon data of 128 soil samples for V1V3 (N = 32), V3V4 (N = 32), V4V5 (N = 32), and V6V8 (N = 32) hypervariable regions, generated by Soriano-Lerma et al. (Soriano-Lerma et al. 2020) were retrieved from the Sequence Read Archive (SRA) database (https:// www. ncbi. nlm. nih. gov/ sra) (Bio-Project Accession Number: PRJNA612815). The details of sampling methodology, location of samples, sample processing and amplicon sequencing procedure could be found in Soriano-Lerma et al. (2020). Briefly, Soriano-Lerma et al. (2020) selected eight locations along the pedogenic gradient in a Mediterranean calcareous mountain for soil sample collections. The samples were collected in four spatial replicates from each location. Further, the total DNA was isolated from all the 32 soil samples and the 16S rRNA amplicon regions, V1V3, V3V4, V4V5 and V6V8 were amplified. The amplicons were sequenced on the Illumina MiSeq platform and the dataset was submitted to SRA. The SRA accession numbers of samples are given in the Supplementary Table 1. The paired-end data of each sample as submitted in the SRA database was retrieved using the SRA toolkit.

Analysis of amplicon data
The amplicon sequencing data was analyzed on R version 3.6.3 using DADA2 tool (Callahan et al. 2016) which generates Amplicon Sequencing Variants (ASVs). Recent studies have highlighted that ASVs could infer the microbiome structure better than the conventional Operational Taxonomic Units (OTUs) (Callahan et al. 2016;Caruso et al. 2019) which is based on clustering approach. Thus, this study adopted ASVs to analyze the environmental microbiome data. Initially, the quality profiles of sequencing reads were visualized and the bad quality reads were filtered. Briefly, the first 25 bp at the beginning of the R1 and R2 reads were trimmed; the length of R1 reads of all the amplicon data was truncated to 280 bp whereas the length of R2 reads of V1V3, V3V4, V4V5, and V6V8 reads were truncated to 260 bp, 250 bp, 220 bp, and 210 bp, respectively. Further, the R1 and R2 reads were cleaned by setting the maxEE as 2.0 for both R1 and R2 reads for all the amplicon regions. After quality filtering of sequencing reads, the error rates for the dataset was estimated using the learnErrors command in DADA2 and the error rates were further used to clean the data. Finally, the cleaned R1 and R2 reads were merged and the non-chimeric sequences were obtained for downstream analyses.

Taxonomy analyses
The objective of this study is to evaluate the influence of 16S rRNA reference databases in microbiome analyses. Thus, the four reference databases, Genome Taxonomy Database (GTDB) (Parks et al. 2018(Parks et al. , 2020, Ribosomal Database Project (RDP) (Cole et al. 2014), SILVA (Quast et al. 2012) and Consensus Taxonomy (ConTax) (Liland et al. 2017), available on the IdTaxa online classifier were used for comparison. The taxonomy assignments were carried out based on the web-based IdTaxa classifier (Murali et al. 2018). The confidence threshold 70% was used for taxonomy assignment. Briefly, the dataset obtained from DADA2 for each amplicon region was analyzed using all the four reference databases, separately. The resulting datasets were used for downstream evaluations.

Downstream analyses
The downstream analyses were performed in R version 3.6.3 using the packages phyloseq (McMurdie and Holmes 2013), microbiome, IdTaxa (Murali et al. 2018), Decipher (Wright 2016), ggplot2 (Wickham 2016), tidyverse (Wickham et al. 2019), dplyr, ape (Paradis et al. 2004), vegan, hclust and venn. Primarily, the ASVs without known phylum were removed from the dataset. Also, the ASVs were removed, if the ASVs exist in only one sample and/or have less than 0.001 proportion of minimum sample depth in the dataset. The alpha and beta diversity was estimated based on the rarefied dataset. The minimum sequencing depth in the dataset was used as the rarefying depth. The Bray-Curtis measure was used for beta diversity calculations. The PCoA plots for each amplicon region were constructed using a subset of eight samples from the original dataset to avoid the problems in visualizing the plots. The core-microbiome structure was inferred based on the following criteria: the relative abundance of taxa is more than 0.001 with the prevalence of at least 10% of samples in the dataset. The permutational multivariate analysis of variance (PERMANOVA) test was used to assess the differences in microbial composition across different datasets. The Kruskal-Wallis test was used to evaluate the variations in the distribution of ASVs across datasets.

Results
A total number of 1,978,388, 1,491,453, 1,696,973, and 1,477,463 paired-end sequences were analyzed for V1V3, V3V4, V4V5, and V6V8 amplicon regions, respectively. The merged non-chimeric sequences (ASVs) were examined using different reference databases to understand the effect of reference databases on the outcome of microbiome studies.

Alpha and beta diversity
The observed ASVs were estimated using the rarefied datasets. The distributions of observed ASVs based on different reference databases for each amplicon region are shown in Fig. 1 The distribution of observed ASVs in different datasets based on different reference databases was found to vary significantly (P-value < 2.647e−12) for all the amplicon regions. It is noteworthy that the SILVA database retained a higher number of observed ASVs as compared to other databases. The beta diversity analysis based on the Bray-Curtis dissimilarity index was carried out to understand the relationship between samples. The rarefied datasets were used for beta diversity analysis. The PCoA plots for different amplicon regions are shown in Fig. 2. The relationship between samples was found to be affected by the amplicon regions. Importantly, the same samples analyzed using different databases were not clustered together in some of the amplicon data.

Taxonomy inference
The taxonomy of ASVs was assigned using four different databases, GTDB, RDP, SILVA and ConTax, separately and the results were compared. The results revealed that the taxonomic resolution of ASVs vary with different reference databases. For instance, the SILVA database identified the genus level taxonomy of 2987, 3178, 3040, and 3337 ASVs for V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. However, the GTDB, RDP, and ConTax databases could reveal the genus of only 846 to 1418, 973 to 1902, 1011 to 1558, and 1168 to 1628 ASVs for V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. The details of taxonomic inference of ASVs by different databases are given in Table 1. The proportions of ASVs with taxonomic information were found to vary significantly (P-value < 2.2e−16) across different datasets based on different reference databases.
Further, the composition of microbiome defined by different databases for each amplicon region was found to differ ( Supplementary Fig. 1). The discrepancies in the microbiome structure based on different databases was also examined using Bray-Curtis distance and the variations was noticed to be significant (PERMANOVA test; P-value = 0.001), irrespective of amplicon regions. The comparison results showed that SILVA has the tendency to annotate the taxonomy of more proportions of ASVs as compared to other databases. The genus level taxonomic inferences of ASVs by different databases are shown in Fig. 3 and the assessment results of order and family of ASVs by different reference databases are shown in Supplementary  Fig. 2. Core-microbiome structure The core-microbiome structure depending on different reference databases was also investigated. The number of core-microbiome taxa at various taxonomic levels inferred by different databases is shown in Fig. 4. The results illustrate that the preference of databases could impact the core-microbiome which is further supported by the comparison results of core-microbiome taxa. The structure of core-microbiome (class-level) and comparison of number of core-microbiome taxa inferred by different databases are shown in Fig. 5 and Supplementary Fig. 3, respectively. The results suggest that the SILVA database consistently infer a higher number of core-microbiome taxa as compared to other databases, irrespective of amplicon regions.

Discussion
In recent times, the amplicon based microbiome research has attained a great height in environmental research. One of the key components in microbiome analysis is the choice of reference databases which transform the sequence information into human readable format allowing us to understand microbial diversity and its role in the environment. Today, several 16S rRNA reference databases are available for microbiome analyses. However, the effect of reference databases of choice in environmental microbiome studies is not explicitly illustrated. Thus, this study aimed to evaluate the impact of 16S rRNA database preference on the outcome of environmental microbiome studies. The current analyses discerned that the outcome of microbiome Primarily, the effect of the database of preference on alpha diversity was examined by analyzing the observed ASVs (Fig. 1) and the investigation found that the distribution of observed ASVs can be influenced by databases being used, irrespective of amplicon regions. The beta diversity analyses also showed that the choice of database could influence the relationship among the samples (Fig. 2). The reason for observing the discrepancies in alpha and beta diversity for the same set of samples presumably due to difference in microbiome composition obtained using different databases. The datasets created in this study were primarily cleaned based on the phylum level information of ASVs, assigned by the respective reference database. The taxonomy assignment results clearly show that the degree of taxonomic inference of ASVs vary for different databases which resulted in variations in the number of ASVs across datasets during the downstream analyses. The alpha and beta diversity are based on the composition of sequence (sequencing depth)/ microbial taxa of the datasets (Lundin et al. 2012;Ramakodi 2021a). Thus, the variations in the number of ASVs in different datasets could be attributed to the discrepancies observed in alpha and beta diversity results.
The efficiency of reference databases on the taxonomic inference of ASVs were examined further. The analyses showed significant variations in the taxonomy assignment of ASVs by different databases. The comparison results highlighted that SILVA could infer the taxonomy of more ASVs as compared to other databases ( Fig. 3 and Table 1). Earlier analysis also found the SILVA database to provide better taxonomic resolution (Almeida et al. 2018). Further, the microbiome composition of samples was found to be influenced by the choice of reference database (Supplementary Fig. 1 and 2). The inconsistencies in inferring the taxonomy of ASVs by different databases could be due to the nature of reference databases. The database GTDB utilizes the phylogenomic information of genome sequences obtained from Ref-Seq and Genbank. The GTDB generally follows the List of Prokaryotic names withstanding in Nomenclature (LPSN) (Parte 2014) for the nomenclature of bacteria. The SILVA database is based on phylogenetic information of 16S rRNA and the taxonomic ranking of taxa is based on the Bergey's Taxonomic Outlines (Boone et al. 2001) and LPSN. The RDP database includes the 16S rRNA sequences from the International Nucleotide Sequence Database Collaboration (INSDC) (Arita et al. 2021) databases. The RDP database obtains the information from the Bacterial Nomenclature Up-to-Date (http:// www. dsmz. de/), the taxonomic roadmaps by Bergey's Trust (http:// www. berge ys. org/ outli nes. html) and LPSN for the taxonomic classification of bacteria and Archaea. The ConTax database is the subset of millions of classified 16S rRNA sequences obtained from various sources and created based on some degree of consensus on the classification. Thus, the differences in the nature of reference databases could be associated with the variations observed in the taxonomic assignments of ASVs by different reference databases. Other reasons associated with the inconsistent taxonomy assignment of ASVs by databases could be due to systematic errors that exist in various reference databases (Edgar 2018;Lydon and Lipp 2018).
The major goal of microbiome study is to infer the core-microbiome composition. Thus, this study also evaluated the influence of reference databases on the inference of core-microbiome structure. The comparison results indicated that the database SILVA infers larger number of core-microbiome taxa as compared to other databases (Fig. 4), irrespective of amplicon regions. Further, results indicate that the coremicrobiome structure in environmental studies could be altered by different reference databases (Fig. 5 and Supplementary Fig. 3). For example, in V1V3 Core-microbiome (Class-level) structure obtained through different reference databases vary, irrespective of amplicon region datasets, the phylum Chloroflexi, Planctomycetota, Verrucomicrobiota, Abditibacteriota, Entotheonellaeota and WS2 were identified only in the dataset analyzed by the SILVA database whereas the RDP database revealed the presence of Acidobacteria and Armatimonadetes in the dataset. Similar variations in the core-microbiome compositions due to reference databases were also observed in other amplicon regions. The microbiome data are known to be compositional in nature which means the content of datasets could influence the inference of core-microbiome (Gloor et al. 2017;Ramakodi 2021a, b). The results of this study explicitly reveal that the composition (ASVs) of datasets was influenced by the choice of reference databases and thus, the inconsistencies observed in the core-microbiome structure could be due to the compositional property of microbiome datasets in addition to the systematic errors that exist in different databases.

Conclusion
In microbiome analysis, the reference databases play a central role. Today, a variety of 16S rRNA reference databases are available on public platforms for the taxonomic inference of microbes but the field of microbiome research lacks gold standard for the selection of reference database. Thus, researchers often select a reference database that is easily accessible and/or used widely in the scientific publications. This study illustrates that the choice of reference database could influence the outcome of the analyses. The researchers should be aware of the fact that their results based on any particular reference database may not be conclusive. Hence, interpretation of taxa profiling in microbiome studies need careful considerations before arriving at a strong conclusion about the outcome of the study. Also, this study urges the need to develop proper guidelines for the selection of appropriate databases in environmental microbiome studies to obtain comparable results so as the problems caused by reference databases could be minimized.