Construction genome-based strain types and gene composition profiles to characterize metagenomic samples
To predict the phylogenetic features of L. reuteri in metagenome samples, we profiled the composition of genome-based strain types (GSTs). For this, a species-specific reference genome (SRG), phylogenetic tree and Kraken [10] database were built using complete genomes from the EzBioCloud database [11] (see Additional File 1: Table S1), as described in Figure 1 (a). We created the L. reuteri SRG by concatenating 1,158 core genes from eight complete genomes identified by Roary [12], resulting in an 1,097,896 bp long sequence. All available 151 L. reuteri (as of April 2020) strains isolated from humans, rodents, pigs, poultry, herbivores (goats, sheep, cows and horses) and food sources (Additional File 1: Table S1) were aligned against the SRG using MUMmer [13]. The resulting multiple sequence alignments were used to infer a maximum likelihood phylogenetic tree using RAxML [14]. We then clustered the tree into 20 GSTs by merging adjacent clades until a maximum all-against-all pairwise distance of 20,000 single nucleotide variations (SNVs) was reached. A reference Kraken database was built by core gene sequences from representative genomes of each GST (Additional File 2: Figure S1).
Moreover, we inferred functional features of L. reuteri based on gene composition profiles, which indicate the absence and presence of genes in each sample. As described in Figure 1 (b), a reference pan-genome database was constructed for L. reuteri using the coding sequences (CDSs) from 151 genomes. The database is comprised of a collection of 20,014 L. reuteri-specific gene clusters, which were obtained by using a 90% DNA similarity and 90% alignment coverage threshold. These clusters include 1,149 core ones found in over 95% of the genomes, 8,926 singletons.
These reference databases were used to characterize the metagenome samples by GST abundance and gene composition (Figure 1 (c)). We profiled each sample by searching its reads against our GST database using Kraken [10] and estimated the abundance using Bracken [15]. The gene composition was profiled through assembly using MEGAHIT [16], genes were predicted using Prodigal [17] and annotated using MMseqs2 [18].
Evaluation of the profile estimation using synthetic samples
We evaluated the accuracy of GST classification at the read-level and the composition-level using synthetic samples. The synthetic samples were created using InSilicoSeq [19] with three different complexity levels: four low complexity (LC), four middle complexity (MC) and two high complexity (HC) containing either five or ten randomly selected GSTs, or all twenty GSTs, respectively. The precision at the read-level, defined as the proportion of correct assignments in GST and its ancestors to the total number of assignments, was achieved an average precision of 95.68%, 95.16% and 92.39% in the LC, MC and HC samples, respectively (Additional File 1: Table S2). We also measured the accuracy of the composition-level classification by computing the Pearson correlation coefficient between the estimated and true abundance, achieving 0.9937, 0.9879 and 0.9729 on average in the LC, MC and HC samples, respectively (Additional File 3: Figure S2).
Moreover, the accuracy of gene composition profiling was measured using true positive rate (TPR) and F1 score. We simulated metagenomic samples from twelve reference genomes using InSilicoSeq [19] (see Additional File 4: Figure S3 (a) legend) at four different coverage levels (1×, 5×, 10× and 20×) and constructed gene profiles from these samples. We obtained TPRs of ≥ 75% at 5× coverage, and ≥ 90% at 10× and 20× coverage (Additional File 4: Figure S3 (a)), and F1 scores of ≥ 80% at 5× coverage, and ≥ 90% at 10× and 20× coverage (Additional File 4: Figure S3 (b)). In the case of the real metagenomic samples, the coverage was 23.38× on average, with a standard deviation of 22.41 (Additional File 4: Figure S3 (c)).
Genome-based strain typing to characterize host-associative L. reuteri populations in the real metagenomic gut samples
Using the GST abundance profiles created from the real metagenomic gut samples, the samples were found not to contain a mixture of similarly-distributed GSTs but a few abundant GSTs, which showed association with the host origin (Figure 2).
Since distinct host-specific lineages could be found from the same host origin, the host species were divided into “host groups” based on the most abundant GST (“dominant GST”) in the metagenomic samples. We identified six host groups: one “Pig” group, four “Mouse” groups and one “Dog” group (Figure 2 (a)-(d)). The dominant GSTs of each host group were GST 16 in “Pig” samples (n = 82), GST 9 in “Mouse 1” samples (n = 22), GST 19 in “Mouse 2” samples (n = 20) , GST 6 in “Mouse 3” samples (n = 9), GST 20 in “Mouse 4” samples (n = 6) and GST 1 in “Dog” samples (n = 9) (Figure 2 (e), Additional File 5: Figure S4). Except for the dog samples for which isolated genome sequences were unavailable in the reference database, the host group of the metagenome samples was associated with the isolation sources of the dominant GSTs of the samples (Additional File 2: Figure S1).
These dominant GSTs were consistently found in the placement of 85 medium-to-high quality MAGs into the phylogenetic tree (Figure 4). Unlike the GST profiles, the MAG placements represented the phylogenetic relationship between L. reuteri in the samples and reference strains. For example, the MAGs from the pig samples were completely placed into the GST 16 clade (Figure 4 (a)), whereas those from the dog samples formed their own clade outside the reference GST 1 clade (Figure 4 (d)).
The relative abundance of the dominant GSTs was different in each host group: the median abundance in “Pig” samples, “Mouse 1” to “Mouse 4” samples and “Dog” samples was 87%, 90%, 68%, 68%, 66% and 50%, respectively (Additional File 5: Figure S4). This abundance indicated that a single dominant GST occupied more than half of the L. reuteri population despite the variation in the abundance. However, we additionally found non-dominant GSTs with a relative abundance above 10%. “Mouse 2” and “Mouse 4” samples contained 18% of GST 20 and 17% of GST 19, respectively. Not only the isolation sources of the dominant GSTs but the non-dominant GSTs coincided with the host groups of the samples (Additional File 2: Figure S1).
Furthermore, we checked how distinct each L. reuteri population between host groups was by performing a PERMANOVA [20] test. It revealed that the GST abundance of the samples was significantly different from those of other samples included in the different host groups (Additional File 6: Figure S5).
Functional features of L. reuteri associated with host origin
Phylogenetically we could measure the host-association of L. reuteri using our GST profiles. Furthermore, we also wanted to see if the host-specificity was reflected in the functional profiles. To investigate this, we picked host-specific genes (HSGs) from our gene composition profiles.
We detected significant differences in L. reuteri gene composition between the host groups using the PERMANOVA test (Additional File 7: Figure S6). A set of 4,128 host-specific genes, including 2,172 “Pig” HSGs, 540 “Mouse 1” HSGs, 644 “Mouse 2” HSGs, 354 “Mouse 3” HSGs, 207 “Mouse 4” HSGs and 211 “Dog” HSGs were identified using Fisher’s exact test, and assigned to clusters of orthologous groups (COG) [21] annotation (Figure 3 (a), Additional File 1: Table S3). These HSGs mainly belonged to four functional categories: 1) Replication, recombination and repair, 2) transcription, 3) transport and metabolism of various macromolecules and ions and 4) cell wall/membrane/envelope biogenesis.
We hypothesized that many HSGs might not be represented by our reference pan-genome database. Therefore, we predicted 132,255 CDSs from 85 MAGs and searched them against the pan-genome database. On average, the MAGs contained 60.7% of genes were also found in the gene profiles of the corresponding samples; 10.4 % of them were found in the pan-genome database but not in their corresponding gene profiles and 28.9% of them could not be matched to the pan-genome database. These 17,669 unmatched sequences were clustered into 12,648 gene clusters, indicating that 38.72% of gene sequences could not be annotated through searching against the pan-genome database.
By performing Fisher’s exact test, we identified a set of 1,914 HSGs, and 419 of them were newly found in MAGs. These novel HSGs were annotated based on the eggNOG [22] database (Additional File 1: Table S4), and the proportion of COG functional categories of all HSGs was computed, as represented in Figure 3 (b). We compared the functional structures of HSGs based on the pan-genome to those based on both pan-genomes and MAGs, finding that the ten most abundant categories were conserved despite some differences in the ratio (Figure 3). However, if the host-specific reference genome was absent, a relatively high percentage of HSGs were newly identified from the MAGs. About 37% of dog HSGs were exclusively found in the MAGs, while 2% of pig HSGs and 20% of mouse HSGs were only found in the MAGs (Additional File 1: Table S5).
From the detailed functional description in Additional File 1: Table S3 and Table S4, we observed that transposases, integrases and ABC transporters were found to be host-specific in all host groups. However, some gene functions were not identified from all host groups; for example, host-specific urea amidohydrolases were observed only in the “Mouse 1” and “Mouse 2” groups. These host-specific functions, especially those related to mobile elements and biofilm formation, would reflect differences in the host gut environment and the adaptation mechanism of L. reuteri to their host, which consistently supports the observation of previous studies with isolated strains [6, 7].