Functional analysis was completed for all 71 ruminal Butyrivibrio and Pseudobutyrivibrio genomes used in this study using EggNOG-mapper (Table 1; Fig. 2). These data show very little variability at high-level functional categories between the genomes, and highlight that all the strains have a large proportion of ‘unknown’ genes (i.e. those not annotated by EggNOG-mapper). Overall, strains have predominant functions relating to carbohydrate transport and metabolism (average 8.91%), cell wall, membrane and envelope genesis (average 8.05%), and amino acid transport and metabolism (average 7.24%). The percentage of these functions stay fairly constant across all strains (standard deviation 0.80%, 0.78% and 0.50% respectively), although some exceptions to this can be seen. Specifically, strain YAB3001 has 188 genes of 3095 (6.07% of total) annotated as being part of an amino acid transport and metabolism pathway, compared with strain NOR37 which has 183 of 2184 (8.40% of total) attributed to the amino acid transport and metabolism pathway. The average percentage for this category across all strains is 7.24%. The general consistency in high-level gene function can also be seen when the functional categories are compared at a genus level (Supplementary Fig. 1).
Gene and Genome-level similarity
Despite the broad similarity seen in high-level functional categories (Fig. 2 and Supplementary Fig. 1), there may be substantial differences at the gene level. In order to evaluate gene level similarities for the 71 strains, phylogenetic trees based on 16S rDNA and 40 conserved marker genes were first constructed to examine phylogenetic relatedness (Supplementary Fig. 2, Fig. 3 respectively). The 40 marker phylogeny revealed six species/clades, containing: 1. P. ruminis, 2. P. xylanivorans, 3. B. fibrisolvens, 4. Butyrivibrio sp., and 5. B. hungatei, and 6. B. proteoclasticus, although it should be noted that P. ruminis and P. xylanivorans are phylogenetically very close, as are B. hungatei, and B. proteoclasticus (Fig. 3). Similar patterns were seen in the 16S rDNA based phylogenetic tree, although the clustering was less distinct in some instances (Supplementary Fig. 2). Similar groupings can be seen again in the 3D scatter plot (Fig. 4), with P. ruminis and P. xylanivorans clustering very closely as well as B. hungatei and B. proteoclasticus.
In order to compare the strains on a genome level Average Nucleotide Identity (ANI) was calculated using PyANI . These data show that the vast majority of strains have less than 75% nucleotide identity (Supplementary Fig. 3). Several small clusters can be seen with nucleotide identity greater than 75%. Groups formed by PyANI are somewhat similar to those seen in the 40 marker tree (Fig. 3), although anomalies can be seen. When dendrograms based on genome coverage are plotted, the data show that many of the strains have <50% coverage, meaning that <50% of the genome can be compared due to a high level of dissimilarity (Supplementary Fig. 4). Due to this reason the strength of the comparative data is weak when coverage is <50%, which in itself shows the level of dissimilarity across strains.
These data as a whole are conflicting in terms of enhancing the taxonomic assignment of the ruminal Pseudobutyrivibrio and Butyrivibrio. Nonetheless, the 40 marker tree allowed resolution of all 5 previously classified Butyrivibrio and Pseudobutyrivibro spp. to varying extents, whilst identifying another possible Butyrivibrio clade/species. The similarity between the 40 marker data and the 16S rDNA and 3D GC content/Genome Size/Number genes made it a logical choice on which to base further analysis.
The 40 marker phylogenetic tree suggests that Butyrivibrio spp. NC3005, MC2013 and TB are actually B. fibrisolvens; Butyrivibrio spp. INlla14, XBB1001, VCB2006, AE3009, MC2021, FCS006, NC2007, VCB2001, XPD2006, AE2032, FCS014, AE2015 and Su6 are actually B. proteoclasticus; Butyrivibrio spp. INlla18, INlla21, NK4153, AE2005, AE3003, M55, LB2008 and YAB3001 are actually B. hungatei; Pseudobutyrivibrio spp. MD2005, OR37, LB2011, JW11 and C4 are actually P. ruminis; Pseudobutyrivibrio spp. AR14, NOR37, YE44, ACV2 and 49 are actually P. xylanivorans; Butyrivibrio spp. NC2002, WCD3002, MB2005, AE3006, VCD2006, WCD2001, OB235, AD3002, AC2005, XPD2022, FC2001, AE3004, LC3010 and WCE2006 actually constitute a potentially new species. As such, all subsequent analyses were completed based on these 6 species/clades.
In order to further scrutinise this taxonomic ambiguity, the pangenomes were investigated using Spine software  based on the 6 species/clades identified. A range of Spine cut-off parameters (% accessory definition, the number of genomes a gene that has to be found within to be considered core and % nucleotide identity) were assessed to ensure optimal non-biased parameters were chosen for downstream analysis (Supplementary Fig. 5 and 6). These data show that, as expected, the higher the core definition percentage is, the greater the proportion of accessory genes. There appears to be a sharp increase in the percentage of the genome that is defined as accessory when the core definition setting is increased from 60 to 70% in P. ruminis, and 70 to 90% in B. fibrisolvens. All other groups show a more consistent estimate of the accessory genome regardless of the core definition percentage used (Supplementary Fig. 5). Contrastingly, as the gene nucleotide identity percentage cutoff increases, there is little change in the proportion of accessory genes observed (Supplementary Fig. 6). It is challenging to define the ‘correct’ cut-off parameters for pangenome analysis, which ultimately correspond to the biological situation. For downstream analysis the parameter of 100% core definition was used, i.e the gene has to be present in all genomes (as per 30), as well as the default 85% gene nucleotide identity cut-off. The total number of core genes per clade decreases, as does the core/pangenome ration, with increasing stringency (Table 2).
At a 100% core definition, B. fibrisolvens has an average core GC content of 44.94%, and an accessory GC content of 40.57%. B. hungatei has 44.73% and 41.04% for average core and accessory GC contents respectively, B. proteoclasticus 45.23% and 42.70%, Butyrivibrio sp. 45.14% and 41.58%, P. ruminis 43.39% and 39.06%, and P. xylanivorans 43.69% and 39.12%. Pseudobutyrivibrio strains had a greater difference than Butyrivibrio strains, with an average difference of 4.32% compared to 3.38%. The greatest difference was seen in strains of P. xylanivorans, with an average 4.57% difference in GC content between core and accessory genome, and the least difference is in B. proteoclasticus with 2.52%. Core genes across each taxon appear to have a higher GC% than their respective accessory genomes (with an average of 44.57% in the core genome, and 40.95% in the accessory at a core definition of 90%) (Supplementary Table 1).
Functional annotation of species/clades, when split into core and accessory genomes (Fig. 5A and B) shows greater functional diversity within the accessory genome. Invariably, a large proportion of the core genome appears to be dedicated to translation, ribosomal structure, and biogenesis. B. fibrisolvens appears to dedicate the largest proportion of its core genome to this category. This can be seen to a much lesser extent in the accessory genomes, with no particular functional role dominating to the same extent. Functional categories appear to be relatively evenly distributed throughout all six accessory genomes. The most evident outlier in the core genome is the B. fibrisolvens core, which has 223 genes in the “Translation, ribosomal structure and biogenesis” category, which is 70.35% of all its core genes. The average for this category in the core genome is 53.18%. The largest category is that of proteins with unknown function, which is much more prevalent in the accessory genome of each taxon. The average composition of unknown genes in the core genomes is 1.70%, whilst for the accessory genome this is 24.65%. B. fibrisolvens seems to have the least diverse core genome, having genes from only 8 of the 21 functional categories, followed by B. proteoclasticus with 9. B. fibrisolvens and B. proteoclasticus also have the fewest core genes annotated, with 317 and 502 respectively. B. hungatei has 989, Butyrivibrio sp. 1445, P. xylanivorans 1541, and P. ruminis 1635.
ClustAGE plots show high levels of genomic dissimilarity within species level groups in terms of Accessory Genomic Elements (AGEs), with each concentric ring representing a genome in the clade (shown by the key), and the AGE positioning in the circle representing their size (Supplementary Fig. 7-12). The plot for B. fibrisolvens shows gene fragments being absent in many genomes in places. This is particularly clear on occasion, for example, at the 750 kbp mark, with only four genomes out of 11 (AB2020, MC2013, NC3005, and D1) possessing an AGE here (Supplementary Fig. 7). In all species level groups, the most dissimilarity seems to be in smaller gene fragments (Supplementary Fig. 7-12). It should be noted that the minimum AGE size represented in each plot is 1500 bp.
Evolution of the Butyrivibrio and Pseudobutyrivibrio genera
In order to evaluate gene ancestry and evolution OrthAgogue  was used to identify orthologous gene affiliations. Orthologous gene distribution on a clade/species level shows that the vast majority of orthologous gene clusters are shared by all species level groups (870) (Fig. 6). As a genus, Pseudobutyrivibrio has more common orthologous genes than Butyrivibrio, with 343 and 223 genes respectively. B. fibrisolvens has the most unique orthologous genes with 143, followed by P. xylanivorans with 132, B. hungatei with 121, Butyrivibrio sp. with 90, P. ruminis with 27 and B. proteoclasticus with 17 (Fig. 6). Comparison on a genus level clearly shows that both genera share the majority of their orthologous genes, with 870 clusters of orthologous genes (COGs) being common to the two. Pseudobutyrivibrio has more unique COGs, with 595 whilst Butyrivibrio has 251 unique COGs (Supplementary Fig. 13). Out of all taxa, Butyrivibrio sp. shares the most amount of COGs across all its members with 1771, followed by P. xylanivorans with 1674, B. hungatei with 1602, P. ruminis with 1585, B. proteoclasticus with 1562 and B. fibrisolvens with 1502 (Supplementary Fig 14-19). At taxon level, B. proteoclasticus has the most inparalogous clusters with 920. This is followed by Butyrivibrio sp. (778), B. fibrisolvens (460), B. hungatei (438), P. ruminis (305), and P. xylanivorans (259) (Supplementary Table 2). No inparalogous genes were shared across all species. B. proteoclasticus had the genome with the most inparalogs, with strain FCS014 having 178. B. fibrisolvens had the fewest total inparalogous genes, with strain D1 having 53, the most of any member of that clade/species (Supplementary Table 2). Further analysis showed that the majority of accessory genes were orthologs; 892 orthologs, 45 inparalogs and 80 co-orthologs, being 87.71%, 4.42%, and 7.87% respectively of a total of 1017 accessory homologous genes. Conversely the majority of core genes were annotated as inparalogous in terms of their evolution; 10 orthologs and 240 inparalogs, with 0 co-orthologs. This equates to 4.0% and 96.0% respectively of a total of 250 core homologous genes. Combined, this gives an overall total of 1269 genes. Of these, 10 are core orthologs (0.79%), 242 are core inparalogs (19.07%), 45 are accessory inparalogs (3.55%), 80 are accessory co-orthologs (6.30%), and 892 are accessory orthologs (70.29%).
Glycosyl hydrolase haplotypes and evolution
Functional annotation of the GH families possessed by each strain showed a lot of similarity based on GH families and their abundances (Fig. 7; Supplementary Fig. 20-25). However, the GH family-level phylogenetic trees show limited gene clustering and GHs from the same family and within the same strain have a tendency to cluster together, as can be seen on pruned versions of the trees (Supplementary Fig. 26 and 27). These phylogenetic trees therefore show that the genera Butrivibrio and Pseudobutyrivibrio possess a high degree of within GH family enzyme isoforms. Irrespective, GH3 was the most abundant with 690 genes present, followed by GH13 with 681, GH43 with 543, GH2 with 463, and GH5 with 216. Homolog types across the GH2 genes show very few inparalogs with most genes being co-orthologous or orthologous across all species/clades (Fig. 8). Evolutionary distribution across GH3 is fairly similar (Fig. 9), although there is a greater proportion of inparalogous genes, which also appear to be more evenly distributed across the species/clades. Only 6 inparalogs have been annotated in the family GH5 (Fig. 10), and 34 co-orthologs, which is comparatively few compared with the higher number of orthologs. The co-orthologs form two broad clusters, and are only found in the Butyrivibrio sp. and B. proteoclasticus clades. There are no co-orthologous genes found in B. fibrisolvens, and very few in B. proteoclasticus. GH13, interestingly, has more inparalogs than co-orthologs, with 70 and 48 respectively (Fig. 11). There is overlap here, with many of these genes being annotated as orthologs, inparalogs and co-orthologs simultaneously which is a function of the algorithms used by Orthagogue. GH43 has a high proportion of co-orthologs, and fewer inparalogs (Fig. 12). The co-orthologs appear to be distributed across the entire tree, but with a higher density towards the far end of the tree. For each of the five GH families, the majority of genes were found in one of the 20 metatranscriptome datasets within the Shi et al. (2014) dataset (Fig 8-12; Suppl. Excel 1). For GH2, 67.17% of genes were found to be expressed, for GH3 62.87%, GH5 56.28%, GH13 66.57%, and GH43 59.41%. Of these, many had an RPKM (Reads Per Kilobase of transcript, per Million mapped reads) value of over 1. These data illustrated that the GH isoforms discovered are not an anomaly of the assembly and are actively expressed.