Inuence of 16S rRNA Reference Databases in Amplicon-Based Environmental Microbiome Research

Purpose: The reference databases play a pivotal role in amplicon microbiome research but the sequence content and taxonomic information available in common reference databases differ. Studies on mock community and human health microbiome have revealed the problems associated with the choice of reference database on the outcome. Nonetheless, the inuence of reference databases in environmental microbiome studies is not explicitly illustrated. Methods: This study analyzed the amplicon (V1V3, V3V4, V4V5 and V6V8) data of 128 soil samples and evaluated the impact of 16S rRNA databases, Genome Taxonomy Database (GTDB), Ribosomal Database Project (RDP), SILVA and Consensus Taxonomy (ConTax), on microbiome inference. Results: The analyses showed that the distribution of observed amplicon sequence variants was signicantly different (P-value < 2.647e-12) across four datasets, generated based on different databases for each amplicon region. In addition, the beta diversity was also found to be altered by different databases. Further investigation revealed that the microbiome composition inferred by different databases vary signicantly (P-value=0.001), irrespective of amplicon regions. Importantly, the study found that the core-microbiome structure in environmental studies could be altered by the reference databases. Conclusion: In summary, this present study illustrates that the choice of reference database could inuence the outcome of environmental microbiome research.


Introduction
Environmental microorganisms render several ecosystem services but these organisms are often not cultivable in the laboratory conditions (Dick and Baker 2013; Steen et al. 2019) due to which the identity and the functional importance of several microorganisms remain undetermined. For example, the soil ecosystem consists of different varieties of organisms but only a small ratio of soil microbes could be cultivated (Janssen et al. 2002;Pham and Kim 2012) and thus the major proportion of soil microorganisms is still left unidenti ed. However, the next generation sequencing based microbiome analysis greatly assists the scienti c community to understand the diversity of cultivable as well as noncultivable environmental microorganisms (Handelsman 2004). In particular, the amplicon-based or marker-gene assisted microbiome approach has been popularly employed to study the diversity and role of microorganisms in different environments. In fact, the amplicon microbiome approach has been used in many global microbiome projects (Gilbert et  In a standard amplicon microbiome work ow, the marker-gene such as the 16S rRNA (generally the hypervariable regions of 16S rRNA) is ampli ed and sequenced on high-throughput sequencing platform.
Later, the amplicon sequences are processed and analyzed using a suitable reference database which transforms the sequence information into human readable units (microbial taxa) thereby the diversity and role of microorganisms could be delineated. Thus, the reference database plays a pivotal role in microbiome analysis. Today, several 16S rRNA reference databases are available for microbiome research. Nonetheless, the sequence content and taxonomic information available in common databases differ (Balvočiūtė and Huson 2017) and the comparison of reference databases revealed that the composition of databases could affect the results of microbiome studies (Robeson et al. 2020). In particular, the analyses of mock community data using different reference databases illustrated inconsistencies in the taxonomy inference of microbial taxa (Park and Won 2018). The problems associated with the inference of oral microbiomes due to reference databases is well documented (Sierra et al. 2020). The earth microbiome project (EMP) also hinted at the problem of inconsistency that could arise when different reference databases are employed in the analyses (Thompson et al. 2017). The authors of the EMP study observed that the sequences generated from EMP samples match 46% and 45% of 16S rRNA sequences in Greengenes and SILVA databases, respectively, which suggests that the choice of database could alter the results of microbiome studies. Nonetheless, the in uence of reference databases on the outcome of environmental microbiome studies is not explicitly illustrated. Thus, this study aimed to explore different 16S rRNA reference databases and nd how these databases could affect the microbiome results.
This study included the amplicon data of four different 16S rRNA hypervariable regions, V1V3, V3V4,

Analysis of amplicon data
The amplicon sequencing data was analyzed on R version 3.6.3 using DADA2 tool (Callahan et  adopted ASVs to analyze the environmental microbiome data. Brie y, the rst 25bp at the beginning of the R1 and R2 reads were trimmed; the length of R1 reads of all the amplicon data was truncated to 280bp whereas the length of R2 reads of V1V3, V3V4, V4V5, and V6V8 reads were truncated to 260bp, 250bp, 220bp, and 210bp, respectively. Further, the R1 and R2 reads were cleaned by setting the maxEE as 2.0 for both R1 and R2 reads for all the amplicon regions. Finally, R1 and R2 reads were merged and the non-chimeric sequences were obtained for downstream analyses.

Taxonomy analyses
The objective of this study is to evaluate the in uence of 16S rRNA reference databases in microbiome analyses. Thus, the four reference databases, Genome Taxonomy (Murali et al. 2018). The con dence threshold 70% was used for taxonomy assignment. Brie y, the dataset obtained from DADA2 for each amplicon region was analyzed using all the four reference databases, separately. The resulting datasets were used for downstream evaluations.

Downstream analyses
The downstream analyses were performed in R version 3.6.3 using the packages phyloseq (McMurdie  and Holmes  the ASVs without known phylum were removed from the dataset. Also, the ASVs were removed, if the ASVs exist in only one sample and/or have less than 0.001 proportion of minimum sample depth in the dataset. The alpha and beta diversity was estimated based on the rare ed dataset. The minimum sequencing depth in the dataset was used as the rarefying depth. The Bray-Curtis measure was used for beta diversity calculations. The PCoA plots for each amplicon region were constructed using a subset of eight samples from the original dataset to avoid the problems in visualizing the plots. The coremicrobiome structure was inferred based on the following criteria: the relative abundance of taxa is more than 0.001 with the prevalence of at least 10% of samples in the dataset. The permutational multivariate analysis of variance (PERMANOVA) test was used to assess the differences in microbial composition across different datasets. The Kruskal-Wallis test was used to evaluate the variations in the distribution of ASVs across datasets.

Results
A total number of 1,978,388, 1,491,453, 1,696,973, and 1,477,463 paired-end sequences were analyzed for V1V3, V3V4, V4V5, and V6V8 amplicon regions, respectively. The merged non-chimeric sequences (ASVs) were examined using different reference databases to understand the effect of reference databases on the outcome of microbiome studies.
The beta diversity analysis based on the Bray-Curtis dissimilarity index was carried out to understand the relationship between samples. The rare ed datasets were used for beta diversity analysis. The PCoA plots for different amplicon regions are shown in Figure 2. The relationship between samples was found to be affected by the amplicon regions. Importantly, the same samples analyzed using different databases were not clustered together in some of the amplicon data.

Taxonomy inference
The taxonomy of ASVs was assigned using four different databases, GTDB, RDP, SILVA and ConTax, separately and the results were compared. The results revealed that the taxonomic resolution of ASVs vary with different reference databases. For instance, the SILVA database identi ed the genus level taxonomy of 2987, 3178, 3040, and 3337 ASVs for V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. However, the GTDB, RDP, and ConTax databases could reveal the genus of only 846 to 1418, 973 to 1902, 1011 to 1558, and 1168 to 1628 ASVs for V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. The details of taxonomic inference of ASVs by different databases are given in Table 1. The proportions of ASVs with taxonomic information were found to vary signi cantly (P-value < 2.2e-16) across different reference databases.
Further, the composition of microbiome de ned by different databases for each amplicon region was found to differ (Supplementary Figure 1). The discrepancies in the microbiome structure based on different databases was also examined using Bray-Curtis distance and the variations was noticed to be signi cant (PERMANOVA test; P-value=0.001), irrespective of amplicon regions. The comparison results showed that SILVA has the tendency to annotate the taxonomy of more proportions of ASVs as compared to other databases. The genus level taxonomic inferences of ASVs by different databases are shown in Figure 3  Core-microbiome structure The core-microbiome structure depending on different reference databases was also investigated. The number of core-microbiome taxa at various taxonomic levels inferred by different databases is shown in Figure 4. The results illustrate that the preference of databases could impact the core-microbiome which is further supported by the comparison results of core-microbiome taxa. The structure of core-microbiome (class-level) and comparison of number of core-microbiome taxa inferred by different databases are shown in Figure 5 and Supplementary Figure 3, respectively. The results suggest that the SILVA database consistently infer a higher number of core-microbiome taxa as compared to other databases, irrespective of amplicon regions.

Discussion
In recent times, the amplicon based microbiome research attained a great height in environmental research. One of the key components in microbiome analysis is the reference databases which transform the sequence information into human readable format thereby the microbial diversity and their role in the environment could be inferred. Today, several 16S rRNA reference databases are available for microbiome analyses. However, the effect of reference databases of choice in environmental microbiome studies is not explicitly illustrated. Thus, this study aimed to evaluate the effect of 16S rRNA database preference on the outcome of environmental microbiome studies. The current analyses discerned that the outcome of microbiome studies could be in uenced by the choice of reference database.
Primarily, the effect of the database of preference on alpha diversity was examined by analyzing the observed ASVs ( Figure 1) and the investigation found that the distribution of observed ASVs to be in uenced by the databases, irrespective of amplicon regions. The beta diversity analyses also showed that the database could alter the relationship among the samples (Figure 2). Especially, the same samples analyzed using different databases were observed to be clustered with other samples instead of clustering with their respective samples. The reason for observing the discrepancies in alpha and beta diversity for the same set of samples could be due to the discripancies in the microbiome composition obtained using different databases. The datasets created in this study were primarily cleaned based on the phylum level information of ASVs, assigned by the respective reference database. The taxonomy assignment results clearly show that the degree of taxonomic inference of ASVs vary for different databases which resulted in variations in the number of ASVs across datasets during the downstream analyses. The alpha and beta diversity are based on the composition of sequence (sequencing depth)/ microbial taxa of the datasets (Lundin et al. 2012;Ramakodi 2021b). Thus, the variations in the number of ASVs in different datasets could be attributed to the discrepancies observed in alpha and beta diversity results.
The e ciency of reference databases on the taxonomic inference of ASVs were examined further. The analyses showed signi cant variations in the taxonomy assignment of ASVs by different databases. The comparison results showed that SILVA could infer the taxonomy of more ASVs as compared to other databases ( Figure 3 and Table 1). Earlier analysis also found the SILVA database to provide better taxonomic resolution (Almeida et al. 2018). Further, the microbiome composition of samples was found to be in uenced by the choice of reference database (Supplementary Figure 1 and 2). The inconsistencies in inferring the taxonomy of ASVs by different databases could be due to the nature of reference databases. The database GTDB utilizes the phylogenomic information of genome sequences obtained from RefSeq and Genbank. The GTDB generally follows the List of Prokaryotic names with Standing in Nomenclature (LPSN) ( The major goal of microbiome study is to infer the core-microbiome composition. Thus, this study also evaluated the in uence of reference databases on the inference of core-microbiome structure. The comparison results indicated that the database SILVA infers more number of core-microbiome taxa as compared to other databases (Figure 4), irrespective of amplicon regions. Further results indicate that the core-microbiome structure in environmental studies could be altered by different reference databases ( Figure 5 and Supplementary Figure 3). For example, in V1V3 datasets, the phylum Chloro exi, Planctomycetota, Verrucomicrobiota, Abditibacteriota, Entotheonellaeota and WS2 were identi ed only in the dataset analyzed by the SILVA database whereas the RDP database revealed the presence of Acidobacteria and Armatimonadetes in the dataset. Similar variations in the core-microbiome compositions due to reference databases were also observed in other amplicon regions. The microbiome data are known to be compositional in nature which means the content of datasets in uence the coremicrobiome (Gloor et al. 2017; Ramakodi 2021b, a). The results of this study clearly revealed that the composition (ASVs) of datasets was in uenced by the reference databases and thus, the inconsistencies observed in the core-microbiome composition could be due to the compositional property of datasets in addition to the systematic errors that exist in different databases.

Conclusion
In microbiome analysis, the reference databases play a central role. Today, a variety of 16S rRNA reference databases are available on public platforms for the taxonomic inference of microbes but the eld of microbiome research lacks gold standard for the selection of reference database. Thus, researchers often select a reference database that is easily accessible and/or used widely in the scienti c publications. This study illustrates that the choice of reference database could in uence the outcome of the analyses. The researchers should be aware of the fact that their results based on a reference database may not be conclusive. Hence, the problems caused by reference database in microbiome studies need careful considerations before arriving at a strong conclusion about the outcome of the study. Also, this study urges the need to develop proper guidelines for the selection of appropriate databases in environmental microbiome studies to obtain comparable results so as the problems caused by reference databases could be minimized.  Tables   Table 1. Details of taxonomic information assigned to ASVs in different datasets based on different reference databases