Sequencing, quality control and systematic effects
RE-RRS was used to sequence 3,971 New Zealand and 508 Australian sheep rumen samples. These samples were split into eight Groups according to recorded parameters associated with each sample, i.e., age and diet of the sheep at the time of rumen sample collection, length of time the sheep was off feed before the rumen sample was taken, and country the animal lived in (Table 1 and Additional File 1). Most individuals had samples collected as part of multiple Groups, with 3.3 ± 1.4 samples per individual for New Zealand sheep and 1 ± 0 samples per individual for Australian sheep. The largest Groups were GAS and GLS, representing adults and lambs, respectively, that were on a grass (ryegrass-based pasture) diet and whose samples were collected after a short time off feed (2-4 hours). Each Group was further separated into Cohorts, with individuals from the same flock and age who had their rumen samples collected over the same 1-3 day period allocated to the same Cohort. More information regarding the samples can be found in Additional File 1.
Table 1: Number of rumen samples and sheep by Group
Group
|
Dieta
|
Ageb
|
Time off Feedc
|
Feeding
|
Number of Samples
|
Number of Sheep
|
GLS
|
Grass
|
Lamb
|
Short
|
ad libitum
|
1074
|
1074
|
GAS
|
Grass
|
Adult
|
Short
|
ad libitum
|
1080
|
1080
|
GLL
|
Grass
|
Lamb
|
Long
|
ad libitum
|
186
|
93
|
GAL
|
Grass
|
Adult
|
Long
|
ad libitum
|
94
|
94
|
LLS
|
Pellets (L)
|
Lamb
|
Short
|
ad libitum
|
985
|
985
|
LLL
|
Pellets (L)
|
Lamb
|
Long
|
2 × MEMd
|
376
|
188
|
MAS
|
Pellets (M)
|
Adult
|
Short
|
ad libitum
|
176
|
176
|
AUS
|
Australian
|
Adult
|
Short
|
1.5-1.6 × MEMd
|
508
|
508
|
Total
|
4479
|
1708e
|
a Grass: ryegrass-based pasture. Pellets (L): lucerne pellets. Pellets (M): maintenance pellets. Australian: Chaffed lucerne and cereal hay.
b Animal considered an Adult at ~15 months of age.
c Duration between access to feed and rumen sampling: Short = 2-4 hours; collected in the South Island of New Zealand or in Australia. Long = 15-16 hours; collected in the North Island of New Zealand.
d MEM: Metabolizable energy requirements for maintenance.
e Some individuals had rumen samples collected as part of multiple Groups, although all Australian sheep only had one rumen sample.
Table 2 summarizes metrics from sequencing, quality control and metagenome profiling based on the RB and RF pipelines described by Hess et al. (12), and Additional File 2 contains these metrics split by Group. The RB pipeline involved aligning reads to a reference database, consisting of the Hungate1000 Collection (13) plus four Quinella genomes (14), and assigning reads at the genus level. The RF pipeline involved generating a set of tags: unique 65 bp reads, starting from the cut site, that are present in at least 25% of samples (12). Tags were generated using the full set of samples (Tag Set = “All”) for Table 2 and generated within each Group in Additional File 2. Read lengths ranged from 40 to 92 bp for all samples after quality control and removal of the adapter and barcode sequences. The proportion of reads assigned (assignment rate) for the RB approach was significantly lower than the assignment rate for the RF approach (p-value < 2.2 × 10-16, paired t-test); however, 195 of the 4479 samples (4%) had a RB assignment rate that was marginally higher than the RF assignment rate when considering tags generated on the full set of samples. Most of these samples (171/195) were from the AUS Group, which are expected to be different to the other Groups due to country, diet, and breed of the sheep. The Australian samples made up ~11% of the dataset, so any microbes that are unique to the Australian dataset would not be captured in the set of tags, given that tags need to be present in 25% of samples. When considering only tags generated within the AUS Group, there were only 7 that had a greater proportion of reads assigned using the RB than RF approaches.
Table 2: Sequencing statistics across all samples
Metric
|
Parameter
|
Mean ± SD
|
Post-QC sequencing a
|
Reads per sample (thousands)
|
713 ± 215
|
Average read length (bp)
|
riation that library explai
|
riation that library explai
Median read length
|
91.6 ± 0.5
|
Percent of reads ≥ 65bp
|
79.3 ± 3.3%
|
Reference-based
|
Assignment rate b
|
20.8 ± 3.5%
|
Reference-free
|
Number of tags c (thousands)
|
186
|
Assignment rate c
|
29.2 ± 3.8%
|
a After demultiplexing and trimming; QC: Phred quality score of at least 20 and read length of at least 40 bp
b Percent of reads that were assigned to the reference database, consisting of the Hungate1000 Collection plus four Quinella genomes.
c Percent of reads that were assigned to a tag, i.e., unique 65 bp reads, starting from the cut site, that are present in at least 25% of all samples
All sequencing metrics were similar across the different Groups (Additional File 2). Assignment rates for the RB pipeline ranged from 18.4% (LLS) to 23.7% (AUS). The number of tags ranged from 135k (MAS) to 264k (GAS), and the RF assignment rates ranged from 33.6% (LLS, 206k tags) to 39.6% (LLL, 228k tags). In total there were 571,479 tags identified in at least one of the Groups.
Systematic effects
Biological and systematic effects were evaluated for their contribution to variation in the metagenome profiles using the New Zealand samples (Table 3 and Additional File 3). Systematic effects relate to factors associated with the sample processing and sequencing approach. The effects of Library (i.e., the set of ~370 samples that were sequenced simultaneously) explained a significant percent of sample-to-sample variation in the number of reads, and the proportion of reads assigned using the RB and RF approaches. The Library effect was much larger for the proportion of reads assigned using the RF approach than for the other two profiling metrics (Table 3). The library effect explained 2.4 ± 5.2% of the variation in log10 relative abundances of each of the microbes from the RB profiles (Additional File 3). The genus Quinella was a major outlier in terms of the proportion of the variation that library explained (25.4%) and, with this genus removed, the library effect explained 2.0 ± 4.3% of the variation in log10 relative abundances.
Table 3: Percent Variance Explained by Systematic and Host-Specific Biological Factors in New Zealand Sheep Rumen Samples
Profiling Metric
|
Group a
|
Cohort b
|
Library c
|
BatchShelf d
|
Well e
|
BRR f
|
AOD g
|
BDEV h
|
Number of Reads
|
3.74***
|
3.79***
|
4.08***
|
13.94***
|
16.44***
|
0.07
|
0.01
|
0.02
|
RB_Prop
|
15.18***
|
22.67***
|
6.66***
|
7.25**
|
10.36
|
0.14
|
0.05
|
0.30
|
RF_Prop
|
8.77***
|
28.27***
|
34.00***
|
7.66**
|
10.23
|
0.42
|
0.01
|
0.09
|
a Group = the combination of diet, age, time off feed, and country
b Cohort = the set of samples that were collected during the same period
c Library = the set of samples that were sequenced simultaneously
d BatchShelf = the sample freeze-drying and grinding batch, and freeze-drying shelf
e Well = the position on the DNA extraction plate
f BRR = birth rear rank, i.e., the combination of the number of lambs born to the dam at the time the lamb was born, and the number of lambs that were reared
g AOD = age of dam at time the individual was born
h BDEV = birth date deviation within contemporary group
P-values from an F-test: * < 0.05; ** < 0.001; *** < 0.00001
Batch and shelf were related to the process of freeze drying the samples, and the combined BatchShelf effect was found to explain a significant percentage of the variation of all sequencing metrics, with the largest variation being explained in the total number of reads followed by proportion of reads assigned using the RF approach (Table 3). BatchShelf effects explained 6.1 ± 1.6% of the variation in log10 relative abundances of the 61 genera in the reference database, with a significant effect for 25 genera (Additional File 3). The largest BatchShelf effect was observed for Methanobrevibacter.
Well effects refer to the position of each sample on the plate during DNA extraction and library preparation and explained the most variation for the number of reads. Well explained approximately 10.3% of the variation in the proportion of reads for the RB and RF pipelines, although this was not significant (Table 3). Part of the reason Well effects explained a large proportion of the variation of these sequencing metrics is because there were 96 levels of this effect. Out of all the genera in the reference database, Well effects were only significant for Quinella, Staphylococcus and Succinivibrio, explaining 19.4%, 12.5% and 11.9% of the variation in the number of reads assigned, respectively (Additional File 3).
Biological effects
Group (the combination of age, feed, time off feed and country) and Cohort (the set of animals from the same flock that were sampled during the same 1-3 day period) effects explained a significant proportion of the variation in the number of reads, proportion of reads assigned using the RB and RF approaches (Table 3), as well as the log10 relative abundances of all taxa in the reference database (Additional File 3). Of the three profiling metrics, Group effects explained the most variation in the proportion of reads assigned using the RB approach. Cohort effects explained over 20% of the variation for each of the profiling metrics. Group effects explained the most variation in Selenomonas (60.6%), Anaerovibrio (43.8%) and Sarcina (41.4%), while Cohort effects explained the most variation in Ruminobacter (49.1%), Cellulomonas (37.9%) and Succinivibrio (33.2%).
Biological effects of birth rear rank (BRR), age of dam (AOD) and birth date deviation (BDEV) did not have significant effects on the number of reads or the proportion of assigned reads using either the RB or RF approaches (Table 3). BRR, AOD or BDEV explained a small amount of the variation in log10 relative abundance for some of the taxa, but the largest effect was that of BRR on the log10 relative abundance of Sarcina, and this explained only 1.15% of the variation (Additional File 3).
Reference-free tags
Comparison by Group
The main set of tags that was used for the analyses in this paper is the set of “All” tags – those generated using all 4,479 samples. This tag set enables comparison between all samples for the given analyses. We were also interested in how similar the tags from different Groups would be. Therefore, we generated tag sets within each of the groups (i.e., tags present in 25% of samples within each Group), as well as a tag set for the New Zealand samples. Figure 1 compares the similarity of the tag sets generated from the samples in each the different Groups (i.e., Group-specific tag sets). A hierarchical clustering analysis demonstrated that the Australian samples cluster separately from the New Zealand (NZ) samples, which is to be expected due to the different environments and diets between the two countries (Figure 1A). The NZ samples were further separated by diet, followed by the time off feed. The lowest clades, representing the most similar tag sets, separated Groups by the age of the sheep sampled (i.e., lamb or adult), within diet and time off feed. A comparison of the tag sets between NZ and Australia (Figure 1B) showed that most tags were different between NZ and Australian samples, however there were still over 70,000 tags in common between the two countries. Over 117,000 tags were shared between tag sets from sheep that were grazing pasture (Figure 1C). Consistent with the hierarchical analysis in Figure 1A, more tags were found in common between GAS and GLS Groups than any other pair (Figure 1C). A comparison of tag sets between Groups that were fed a pellet diet (Figure 1D) showed that tags from sheep fed the same type of pellets (lucerne pellets) at the same age (lambs) were more similar than the Group of adults fed a different type of pellet (maintenance). Overall, these results show that diet, age, and time off feed when sampled are major drivers of the rumen metagenome.
Reference-free tags
Comparison by Group
The main set of tags that was used for the analyses in this paper is the set of “All” tags – those generated using all 4,479 samples. This tag set enables comparison between all samples for the given analyses. We were also interested in how similar the tags from different Groups would be. Therefore, we generated tag sets within each of the groups (i.e., tags present in 25% of samples within each Group), as well as a tag set for the New Zealand samples. Figure 1 compares the similarity of the tag sets generated from the samples in each the different Groups (i.e., Group-specific tag sets). A hierarchical clustering analysis demonstrated that the Australian samples cluster separately from the New Zealand (NZ) samples, which is to be expected due to the different environments and diets between the two countries (Figure 1A). The NZ samples were further separated by diet, followed by the time off feed. The lowest clades, representing the most similar tag sets, separated Groups by the age of the sheep sampled (i.e., lamb or adult), within diet and time off feed. A comparison of the tag sets between NZ and Australia (Figure 1B) showed that most tags were different between NZ and Australian samples, however there were still over 70,000 tags in common between the two countries. Over 117,000 tags were shared between tag sets from sheep that were grazing pasture (Figure 1C). Consistent with the hierarchical analysis in Figure 1A, more tags were found in common between GAS and GLS Groups than any other pair (Figure 1C). A comparison of tag sets between Groups that were fed a pellet diet (Figure 1D) showed that tags from sheep fed the same type of pellets (lucerne pellets) at the same age (lambs) were more similar than the Group of adults fed a different type of pellet (maintenance). Overall, these results show that diet, age, and time off feed when sampled are major drivers of the rumen metagenome.
Taxonomies of reference-free tags
Tags were assigned to taxa based on comparison against the Ovis aries 3.1 genome assembly (OAR 3.1), the RB reference database (consisting of genomes from the Hungate1000 Collection (13) plus four Quinella genomes (14)) and GenBank (15) to discover what additional organisms were being captured by the RF approach beyond the RB approach. Table 4 shows the proportion of tags that were able to be assigned at the various taxonomic levels for the different sets of tags explored in this study. Most tag sets had approximately 40% of tags assigned at any taxonomic level, and the set of tags that were present in All Groups (i.e., tags present in all eight Groups) had the highest assignment rate, with taxonomy assigned for 47.7% of tags. Between 30% and 35% of tags were typically assigned at the genus level, with the greatest rate for those in All Groups at 41.8%.
Table 4: Percent of tags that are assigned at different taxonomic levels for the different tag sets
Tag Seta
|
Number
of Tags
|
Kingdom
|
Phylum
|
Class
|
Order
|
Family
|
Genus
|
Species
|
GLS
|
246833
|
40.9%
|
37.7%
|
37.0%
|
36.7%
|
35.1%
|
34.1%
|
22.9%
|
GAS
|
264319
|
41.4%
|
38.2%
|
37.5%
|
37.3%
|
35.8%
|
34.8%
|
22.5%
|
GLL
|
231903
|
42.5%
|
39.3%
|
38.6%
|
38.4%
|
37.0%
|
36.1%
|
22.9%
|
GAL
|
215178
|
43.3%
|
40.4%
|
39.7%
|
39.5%
|
38.0%
|
37.2%
|
22.7%
|
LLS
|
206473
|
38.6%
|
35.7%
|
34.9%
|
34.7%
|
33.1%
|
32.3%
|
20.7%
|
LLL
|
227719
|
37.7%
|
35.1%
|
34.4%
|
34.1%
|
32.6%
|
31.8%
|
20.9%
|
MAS
|
135153
|
40.2%
|
37.0%
|
36.3%
|
36.0%
|
34.6%
|
33.8%
|
21.7%
|
AUS
|
150687
|
41.5%
|
38.1%
|
37.4%
|
37.2%
|
35.9%
|
35.1%
|
20.7%
|
All
|
186302
|
42.0%
|
38.7%
|
38.0%
|
37.8%
|
36.3%
|
35.5%
|
22.4%
|
NZ
|
203396
|
41.4%
|
38.3%
|
37.5%
|
37.3%
|
35.8%
|
35.0%
|
22.4%
|
All Groups
|
21119
|
47.7%
|
44.7%
|
44.0%
|
43.9%
|
42.4%
|
41.8%
|
21.7%
|
All Tags
|
571479
|
38.8%
|
36.0%
|
35.2%
|
34.9%
|
33.5%
|
32.6%
|
21.7%
|
a Tags present in 25% of samples in the GLS, GAS, GLL, GAL, LLS, LLL, MAS and AUS Groups; all samples (All) or New Zealand samples (NZ). All Groups is the set of tags that are present in the tag sets of all the eight Groups. All Tags is the set of tags that are present in the tag sets for at least one of the eight Groups.
Figure 2 shows the taxonomic assignment of tags based on comparison against the GenBank and Hungate1000 Collection genomes, as well as alignment against the sheep genome (OAR 3.1). Less half the tags were able to be assigned at any taxonomic level and most of those were assigned to the domain Bacteria, followed by Eukaryota and then Archaea. Within Bacteria, most tags were assigned to the phyla Bacteroidetes and Firmicutes. The majority of Eukaryota tags were from the phylum Chordata, of which 34.5% were assigned to a mammalian genome. Archaeal tags were dominated by methanogens, with 75% of tags assigned to Methanobacteria, followed by Methanomicrobia at just under 10%.
Relative abundances of bacteria and archaea
Microbial profiles were generated for the RF metagenome profiles by considering only bacterial and archaeal taxa from the tags that were assigned at the genus level in the previous section. Figure 3 shows the relative abundances of the RB and RF microbial profiles at the family level, with those with an abundance of <1% in all Groups combined into the “Other” category for visualization purposes. Overall, the microbial profiles were similar, however, there were some differences. There were four families that were present in the RB plot (Figure 3A) that were absent in the RF plot (Figure 3B): Peptostreptococcaceae, Spirochaetaceae, Streptococcaceae and Veillonellaceae. Peptostreptococcaceae, Spirochaetaceae and Veillonellaceae had low abundance and only just met the 1% threshold for the RB plot, but were below the 1% threshold for the RF approach and are captured in the “Other” category of Figure 3B. Streptococcaceae was most abundant in the AUS Group of samples for the RB pipeline and therefore it is possible that it was omitted in the RF pipeline due to Streptococcaceae tags not reaching the 25% prevalence threshold when considering the full set of samples. Rikenellaceae and Streptomycetaceae were present in the RF plot (Figure 3B) but not the RB plot (Figure 3A). This was because these families are not captured within the reference database used for the RB analysis. The “Other” category made up a much larger portion of the microbial profile for each Group in the RF plot (Figure 3B) than the RB plot (Figure 3A), as there were a large number of additional families that are present at low levels (<1% abundance) in the RF microbial profiles.
Prevotellaceae was the most abundant family in all Groups, representing 52% to 66% of RB microbial profiles and 43% to 64% of the RF bacterial or archaeal microbial profiles (Figure 3). Lachnospiraceae was the second most abundant family, representing 10% to 15% of RB microbial profiles and 6% to 9% of RF bacterial or archaeal profiles. Consistent with the comparison of the tag sets between Groups (Figure 1), profiles for lambs and adults on the same diet were very similar to each other (Figure 3), with the largest differences being driven by time off feed and diet. This indicates that the differences in profiles between Groups encompass both the prevalence and abundance of taxa. Pellet diets (LLS, LLL and MAS) tended to have the highest proportions of archaea, with the profiles from grass fed sheep taken a long time off feed (GAL and GLL) having the lowest abundance of archaea.
Relationships between samples from different Groups
Reference-based relationships
Network diagrams were generated using log10 normalized (Figures 4A) and Cohort-adjusted (Figures 4B) RB microbial profiles to visualize the impact of diet, age, and time off feed on these two normalization methods. A set of individuals were from sheep selection lines, selected for high or low methane emissions (16-18), which allowed us to evaluate whether samples clustered based on selection line (Figures 4C and 4D). A clear effect of Group was present for log10 normalized RB microbial profiles, and this effect was largely driven by the different diets the individuals were fed when the samples were collected (Figure 4A), with no clear signal present based on methane selection line (Figure 4C). This was unexpected, given that sheep divergent for methane emissions have different rumen microbial profiles (2). However, additional normalization of each taxon within Cohort successfully removed the effects of diet, age, and time off feed (Figure 4B), resulting in clearer separation of samples based on methane selection line (Figure 4D).
Reference-free relationships
Network diagrams were also generated from RF metagenome profiles. Figure 5 shows network diagrams generated from tags generated on all samples (Tag Set “All” from Table 4), which accounted for 29.2 ± 3.8% of reads, ranging from 25.8% (AUS) to 31.6% (GLS; Additional File 2). The log10 normalized network diagrams (Fig. 5A and C) showed a similar pattern to those generated from the RB microbial profiles, but the RF diagrams showed that samples from the same diet were more tightly clustered and samples from different diets were more distinct from each other. Cohort adjustment of the RF profiles (Fig. 5B and 5D) resulted in reduced signal due to the Group effect, although some clustering in the samples due to diet remained even after the Cohort adjustment was performed. Despite the Group effect still being present, samples tended to cluster by methane selection line after Cohort adjustment (Fig. 5D).
Network diagrams were produced from metagenome profiles using the genus-level assignments of the set of “All” RF tags (Figure 6) to investigate if this would reduce the clustering by Group in the Cohort-adjusted microbial profiles and allow the methane selection line signal to come through. These profiles represented 12.1 ± 2.3% of reads, ranging from 10.5% (LLS) to 13.5% (GLL; Additional File 2). The log10 normalized profiles clustered based on Group (Figure 6A), consistent with RB and RF methods using all tags (Figures 4A and 5A). However, the separation between samples on different diets was not as extreme as for the network diagrams generated using “All” tags. Clustering tags based on taxonomic assignment at the genus level successfully removed any signal due to diet, age or time of feed that was present in the set of “All” tags (Figure 5B), and methane selection line samples clustered primarily by line (Figure 6D), as was found in the RB network diagrams (Figure 4).
Network diagrams were produced from RF metagenome profiles generated using the “All Groups” tag set (i.e., tags were identified in all of the eight Groups) (Figure 7) to evaluate whether this was another method that could reduce the clustering by diet that was observed in the Cohort-adjusted RF profiles (Figure 5B). These metagenome profiles represented 8.1 ± 1.8% of reads, ranging from 7.1% (GAS) to 10.6% (MAS) (Additional File 2). These metagenome profiles (Figure 7) resulted in very similar patterns to those for the RB profiles (Figure 4) and the RF profiles from tags assigned taxonomy (Figure 6) with log10 normalized RF profiles clustering by Group and Cohort, and adjusted RF profiles primarily clustering by methane selection line.
Heritability and repeatability estimates
Reference-based estimates
Heritabilities and repeatabilities for each of the genera in the reference database were estimated based on the set of samples from sheep fed grass (GAS, GLS, GLL, GAL), the set of samples from sheep fed lucerne pellets (LLL and LLS) and all New Zealand samples (Figure 8 and Additional File 4: Supplemental Table 7). The largest heritability estimates for each set of samples were: Succiniclasticum (Grass, 0.20 ± 0.04), Oscillibacter (Lucerne Pellet, 0.21 ± 0.05) and Desulfovibrio (NZ, 0.14 ± 0.02), and the largest repeatability estimates for each set of samples were: Succiniclasticum (Grass, 0.27 ± 0.03; NZ, 0.17 ± 0.02) and Quinella (Lucerne Pellet, 0.34 ± 0.05). There were 10 genera that were within the top 6 most heritable genera for at least one set of samples: Bifidobacterium, Cellulomonas, Corynebacterium, Desulfovibrio, Kandleria, Olsenella, Oscillibacter, Quinella, Succiniclasticum and Succinivibrio.
The correlation of heritability estimates between each of set of samples (Grass and Lucerne Pellet, Grass and NZ, Lucerne Pellet and NZ) was higher than the correlation between repeatability estimates for the same set of samples (Figure 8 and Additional File 4: Supplemental Table 7). The correlation in estimates between Grass and Lucerne Pellet sets was 0.44 for heritability and 0.42 for repeatability, while the correlation in estimates for the Grass and NZ sets was much higher at 0.85 for heritability and 0.79 for repeatability. The correlation in estimates between Lucerne Pellet and NZ sets was slightly lower than those between the Grass and NZ sets, at 0.71 for heritability and 0.65 repeatability.
Reference-free estimates
Heritabilities and repeatabilities were also computed for the RF approach using the “All” tag set with taxonomy assignment at the genus level. This was done for the same sets of samples as the RB heritability and repeatability analyses. Heritability and repeatability estimates tended to be highest for the Lucerne Pellet set and lowest for the NZ set (Figure 9, Additional File 4: Supplemental Table 8). There were 13 genera that were within the top 6 most heritable genera for at least one set of samples: Apis (Eukaryota), Buceros (Eukaryota), Dictyostelium (Eukaryota), Epidinium (Eukaryota), Ichthyophthirius (Eukaryota), Maricaulis (Bacteria), Panicum (Eukaryota), Pelodictyon (Bacteria), Perkinsus (Eukaryota), Quinella (Bacteria), Rhizophagus (Eukaryota), Strongyloides (Eukaryota) and Succiniclasticum (Bacteria). Quinella and Succiniclasticum were the only two of these genera that were present in the reference database used for the RB approach.
As with the RB profiles, the correlation of heritability estimates between the sets of samples was higher than the repeatability estimates from the same set of samples. The correlation of estimates between the Grass or Lucerne Pellet sets was 0.34 for heritability and 0.33 for repeatability. The correlation of estimates between the Grass and NZ sets was the same as the RB profiles for heritability at 0.85 and slightly lower than the RB profiles for repeatability at 0.72. The correlation in heritability and repeatability estimates between Lucerne Pellet and NZ sets was again slightly lower than the estimates between Grass and NZ sets, at 0.63 and 0.48, respectively.