A total of 1,978,388, 1,491,453, 1,696,973, and 1,477,463 paired-end sequences were analyzed for the V1V3, V3V4, V4V5, and V6V8 amplicon regions, respectively. The merged, non-chimeric sequences (ASVs) were examined using different reference databases to assess how the choice of reference database affects the outcome of microbiome studies.
Alpha and beta diversity
The number of observed ASVs was estimated from the rarefied datasets. The distributions of observed ASVs obtained with the different reference databases are shown for each amplicon region in Figure 1. The results suggest that the number of observed ASVs depends on the database, irrespective of the amplicon region. The median numbers of observed ASVs for GTDB (V1V3: 36, V3V4: 234, V4V5: 330, and V6V8: 361), RDP (V1V3: 9, V3V4: 264, V4V5: 335, and V6V8: 325), SILVA (V1V3: 49, V3V4: 439, V4V5: 549, and V6V8: 580), and ConTax (V1V3: 26, V3V4: 205, V4V5: 251, and V6V8: 324) also varied across amplicon regions. The distributions of observed ASVs obtained with the different reference databases differed significantly (P-value < 2.647e-12) for all amplicon regions. Notably, the SILVA database retained a higher number of observed ASVs than the other databases.
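To make the quantity reported above concrete, the following is a minimal sketch of rarefying one sample to a fixed depth and counting its observed ASVs. It is an illustration only: the ASV names, the counts, the depth of 40 reads, and the function names `rarefy` and `observed_asvs` are hypothetical, and the rarefaction depth and software actually used in the study are not stated in this section.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample ASV counts without replacement down to `depth` reads."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand the count table into one label per read, then draw `depth` reads.
    reads = [asv for asv, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rarefied = {}
    for asv in rng.sample(reads, depth):
        rarefied[asv] = rarefied.get(asv, 0) + 1
    return rarefied


def observed_asvs(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


# Hypothetical count table for a single sample (not data from the study).
sample = {"ASV1": 50, "ASV2": 30, "ASV3": 15, "ASV4": 4, "ASV5": 1}
rarefied = rarefy(sample, depth=40)
print(observed_asvs(sample), observed_asvs(rarefied))
```

Because rare ASVs can be lost during subsampling, the observed count after rarefaction can only be equal to or lower than the count before it.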
Beta diversity was analyzed using the Bray-Curtis dissimilarity index, computed on the rarefied datasets, to understand the relationship between samples. The PCoA plots for the different amplicon regions are shown in Figure 2. The relationship between samples was found to be affected by the amplicon region. Importantly, in some of the amplicon datasets, the same samples analyzed using different databases did not cluster together.
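For reference, the Bray-Curtis dissimilarity between two abundance profiles is the sum of absolute differences in abundance divided by the sum of all abundances. The sketch below computes it for one hypothetical sample annotated with two hypothetical databases; the taxon names, counts, and the function name `bray_curtis` are assumptions for illustration and are not taken from the study.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two taxon -> count profiles."""
    taxa = set(a) | set(b)
    # Numerator: total absolute difference in abundance across all taxa.
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    # Denominator: total abundance in both profiles.
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    if total == 0:
        raise ValueError("both profiles are empty")
    return diff / total


# Hypothetical profiles of the same sample under two annotations:
# the second database assigns four reads to a different genus.
profile_db1 = {"GenusA": 6, "GenusB": 4}
profile_db2 = {"GenusA": 6, "GenusC": 4}

print(bray_curtis(profile_db1, profile_db1))  # identical profiles
print(bray_curtis(profile_db1, profile_db2))  # partially different profiles
```

In this toy case a single relabelled genus is enough to give a non-zero dissimilarity between two annotations of the same sample, which is the kind of effect that can separate identical samples in a PCoA plot.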
Taxonomy inference
The taxonomy of ASVs was assigned separately using four different databases (GTDB, RDP, SILVA, and ConTax), and the results were compared. The results revealed that the taxonomic resolution of ASVs varies with the reference database. For instance, the SILVA database identified the genus-level taxonomy of 2987, 3178, 3040, and 3337 ASVs for the V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. In contrast, the GTDB, RDP, and ConTax databases resolved the genus of only 846 to 1418, 973 to 1902, 1011 to 1558, and 1168 to 1628 ASVs for the V1V3, V3V4, V4V5, and V6V8 amplicons, respectively. The details of the taxonomic inference of ASVs by the different databases are given in Table 1. The proportions of ASVs with taxonomic information varied significantly (P-value < 2.2e-16) across the reference databases.
Further, the composition of the microbiome defined by the different databases was found to differ for each amplicon region (Supplementary Figure 1). The discrepancies in microbiome structure between databases were also examined using the Bray-Curtis distance, and the differences were significant (PERMANOVA test; P-value = 0.001), irrespective of the amplicon region. The comparison showed that SILVA tends to assign taxonomy to a larger proportion of ASVs than the other databases. The genus-level taxonomic inferences of ASVs by the different databases are shown in Figure 3, and the order- and family-level inferences are shown in Supplementary Figure 2.
Core-microbiome structure
The dependence of the core-microbiome structure on the reference database was also investigated. The number of core-microbiome taxa inferred by the different databases at various taxonomic levels is shown in Figure 4. The results illustrate that the choice of database can affect the core microbiome, which is further supported by the comparison of core-microbiome taxa between databases. The class-level structure of the core microbiome and the comparison of the number of core-microbiome taxa inferred by the different databases are shown in Figure 5 and Supplementary Figure 3, respectively. The results suggest that the SILVA database consistently infers a higher number of core-microbiome taxa than the other databases, irrespective of the amplicon region.
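One common way to define a core microbiome is by prevalence: a taxon is "core" if it is detected in at least a given fraction of samples. The sketch below applies that definition to hypothetical class-level profiles. The prevalence threshold of 0.75, the sample data, and the function name `core_taxa` are assumptions for illustration; this section does not state the definition or threshold used in the study.

```python
def core_taxa(samples, min_prevalence=0.75):
    """Return taxa detected in at least `min_prevalence` of the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    n_samples = len(samples)
    detections = {}
    for profile in samples:
        for taxon, abundance in profile.items():
            if abundance > 0:  # count presence, not abundance
                detections[taxon] = detections.get(taxon, 0) + 1
    return {t for t, c in detections.items() if c / n_samples >= min_prevalence}


# Hypothetical class-level count profiles for four samples.
samples = [
    {"Bacilli": 120, "Clostridia": 40, "Bacteroidia": 10},
    {"Bacilli": 90, "Clostridia": 55},
    {"Bacilli": 70, "Clostridia": 0, "Bacteroidia": 25},
    {"Bacilli": 150, "Clostridia": 30, "Gammaproteobacteria": 5},
]

print(sorted(core_taxa(samples)))
```

Under a prevalence-based definition like this one, a database that leaves more ASVs unassigned at a given rank would be expected to yield fewer detectable taxa and therefore a smaller core set.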