Genome sequences and annotation
Genome sequences of all Bifidobacterium strains were downloaded from the Pathosystems Resource Integration Center (PATRIC) and the National Center for Biotechnology Information (NCBI) databases on March 14th, 2018 (n = 497). Duplicate sequences were removed from further analysis. We identified the hosts for each of the strains by searching the PATRIC and NCBI databases or associated publications (n = 449). Based on the concatenation of 107 core genes (see phylogenetic analysis below for details), we removed sequences with many gaps in the core genes from further analysis and only kept unique strains (n = 400). The vast majority of the strains in the databases were derived from human hosts followed by primates, cattle, pigs and bees. For strains isolated from humans (n = 272), we assigned each strain to the most specific category possible, acknowledging that some categories are subsets of other categories: infant (n = 117), adult (n = 20), human blood (n = 13), human milk (n = 10), urogenital (n = 9), elderly (n = 5), child (n = 4), probiotic (n = 3), oral (n = 2), human unspecified (n = 89). Child refers to 2–6 years old while infant usually refers to children anywhere from birth to 1 year old (or reported as infant in their respected studies). A subset of 60 human-derived strains from diverse environments were used for genomic comparisons based on their descriptive isolation source (Additional file 1)..
To compare strains among hosts, we focused on a subset of 129 bifidobacteria strains. These strains included the majority of the non-human bifidobacteria strains in addition to a subset of human strains from adult and infant feces (n = 13), blood (n = 1), vagina (n = 1), and mouth (n = 1). The categories were the following: primate (n = 18), human (n = 16), cattle (n = 15), pig (n = 16), bee (n = 16), rodent (n = 12), probiotic (n = 8), wastewater (n = 7), rabbit (n = 7), chicken (n = 6), other mammals (n = 4; including giraffe, hippopotamus, llama, and wallaby), dairy products (n = 3), soil-plant-associated (n = 1). We recognize that not all the host categories are at the same phylogenetic level (Additional file 2)..
To ensure uniform annotation, we reannotated all the genomes using Prodigal v2.6.3 in Normal Mode to predict Open Reading Frames (ORF) (49). We then used Prokka v1.13 (50) to annotate the sequences.
Phylogenetic analysis
Multilocus phylogenetic trees were constructed using the bcgTree pipeline (51) with the protein fasta files (.*faa) derived from Prodigal v2.6.3. Each of the genome sequences was searched for 107 conserved single-copy genes defined by Dupont et al. 2012 (52) using hmmsearch v3.1b2. The extracted genes were then each aligned using muscle v3.8.31 (53) and polished using Gblocks v0.91b (54) by eliminating poorly aligned areas. The 107 genes were then concatenated, and a phylogenetic tree was built using RAxML v8.2.10 with PROTGAMMABLOSUM62 substitution model and 100 rapid Bootstrap searches (55). We visualized the phylogenetic trees using the iTOL v3 interactive tool (56).
Comparative genomic analysis
We next tested whether some of the variation in the traits encoded by bifidobacteria genomes could be explained by the host or environment from which they were isolated. We used the genome size values provided by the PATRIC metadata to compare the genome size among isolates. For human-derived strains we used the same 60 sequences used in the phylogenetic analysis since they were carefully chosen to encompass variable human environments and tried to keep similar samples sizes when possible between categories; however, for the comparison among multiple hosts and environments we used a subset of the 129 strains to keep sample sizes the same for each category (n = 6); hence, we did not include isolates from the dairy, mammal, and soil categories since their sample sizes were less than 6 strains.
The pan-genome and gene ontology of the 129 selected bifidobacteria strains were established with Roary v3.12.0 (57) using the annotated genome assemblies obtained from Prokka v1.13 (.gff files). To account for the relatively high diversity of this genus, we used a 50% sequence identity for the blastp cutoff (58). The Roary software was able to detect core genes (present in 99%–100% of the strains), soft core genes (present in 95%–99% of the strains), shell genes (present in 15%–95% of the strains), and cloud genes (present in 0%–15%). The presence-absence table given by Roary, depicting the 26,905 gene clusters, was curated by deleting the following genes: core genes present in all 129 strains (minus 352 = total: 26,553), singletons (minus 10,967 = total: 15,586), genes with an average sequence per isolate higher than 1, due to splitting errors (minus 189 = total: 15,397), and genes with hypothetical annotation with no identifiable gene name (minus 9,000 = total: 6,397). The final table containing 6,397 accessory genes was converted into a matrix for further comparisons between core genes and phylogenetic distance against accessory gene composition. We used Phandango (59) to construct the pan-genome alignment by incorporating the RAxML inferred tree and the presence-absence table given by Roary.
Toassess the abundance (number of genes) and diversity (number of different genes) of amino acid biosynthesis genes, the automatic annotation server Ghostkoala was used to obtain gene function assignments based on the KEGG Orthology (60). To identify the CAZymes encoded in each genome, we used the dbCAN2 meta server based on the CAZy database updated on July 13th, 2018 (61,62). The input files for the webserver were protein fasta files (.*faa) derived from Prodigal v2.6.3. This server has the option to utilize three tools to predict CAZymes: i) HMMER search against the dbCAN HMM (hidden MArkov model) database; ii) DIAMOND search against pre-annotated CAZyme sequence database; iii) Hotpep search against the CAZyme short peptide database. We used all three tools at the default parsing thresholds and only considered the CAZymes found by all three tools.
Statistical Analyses
We used ANOSIM in PRIMER–6 Software (63) to test whether the isolation source categories were associated with phylogenetic relatedness and accessory genes of the bifidobacteria strains. To test for a correlation between the similarity in accessory and core gene content, we used the Relate test in PRIMER–6. We used the Tree and reticulogram REConstruction (T-REX) web server (64) to create the distance matrices used in the ANOSIM and Relate tests using the Netwick phylogenetic tree from RAxML. We assessed normality of data using Shapiro-Wilk normality test and its variance with Levene’s test incorporated in RStudio version 1.1.453. To account for the non-normal data and non-equal sample sizes, we used the Kruskal-Wallis (with a calculated significance level of p > 0.05) and Dunn’s post hoc tests (RStudio version 1.1.453) to compare genome size, amino acid biosynthesis genes, and CAZymes between the different strains belonging to varying hosts and environments.To construct heatmaps and boxplots, RStudio version 1.1.453 (http://www.rstudio.com/) was implemented and to help with the optimization of the images created, Adobe® Acrobat® Pro 2017 was used.