Phylogenetic systematics of Butyrivibrio and Pseudobutyrivibrio genomes illustrate possession of open genomes rich in orthologous accessory genes with an abundance of carbohydrate-active enzyme isoforms

We show, using both a 16S rDNA and 40 gene marker phylogenetic tree, that 6 species, namely 1. P. ruminis, 2. P. xylanivorans, 3. B. brisolvens, 4. Butyrivibrio sp., 5. B. hungatei, and 6. B. proteclasticus likely exist. Pangenome analysis at 100% core denition showed a high abundance of accessory genes (91.50 to 99.34%) compared with core genes (0.66 to 8.50%), illustrating possession of very open genomes. Across the 71 genomes, 870 COGs (clusters of orthologous genes) were shared by all taxa, suggesting evolution through speciation from a common ancestor. Further analysis of Carbohydrate-Active Enzymes (CAZymes) genes show that most are within the accessory genome and orthologous in descent with numerous within-family CAZyme isoforms apparent, CAZyme family tree lineages show that these isoforms largely group according to the 6 species, suggesting extensive horizontal gene transfer within these families.

(91.50 to 99.34%) compared with core genes (0.66 to 8.50%), illustrating possession of very open genomes. Across the 71 genomes, 870 COGs (clusters of orthologous genes) were shared by all taxa, suggesting evolution through speciation from a common ancestor. Further analysis of Carbohydrate-Active Enzymes (CAZymes) genes show that most are within the accessory genome and orthologous in descent with numerous within-family CAZyme isoforms apparent, CAZyme family tree lineages show that these isoforms largely group according to the 6 species, suggesting extensive horizontal gene transfer within these families.

Conclusions
We show the extensive genomic variation found within Butyrivibrio, and to a lesser extent, Pseudobutyrivibrio. and demonstrate the existence of a new Butyrivibrio species. The Butyrivibrio and Pseudobutyrivibrio genomes are very open with very low % core genomes and high % accessory genomes., and possess a number of GH isoforms that we hypothesise facilitate metabolic plasticity and resilience under dietary perturbations. This study utilizes all currently available genomes and consequently provides a major advancement in our understanding of these important anaerobic bacteria.

Background
The de nition of 'species' in bacteria or archaea is contentious, with some believing that the search for a single, natural way to divide bacteria into species is futile [1,2]. Originally, in the late 19th century, the most important characteristics in terms of taxonomic markers were morphology, growth requirements and pathogenic potential. At the beginning of the 20th century, more biochemical and physiological markers were added to this list, followed by chemotaxonomy, numerical taxonomy, and DNA-DNA hybridisation in the mid-late 20th century. More recently, we have also used genotypic analyses, multilocus sequence analyses, average nucleotide identity, whole genome analyses etc [3,4]. 16S rDNA became a popular metric in the 1980's, with recommendations made that organisms sharing greater than 98% 16S rDNA should be classi ed as a single species [5]. This was further developed to whole genome alignments [6] and phylogenetic clustering [7], nonetheless both of these molecular-based tools faced scrutiny for their seemingly arbitrary cutoff values [8]. 16S rDNA was also criticised on the basis that only a single gene is used as a point of comparison [9]. As a result of the extensive variation in bacterial classi cation, it is inevitable that some degree of subjectivity is seen with respect to taxonomy, and consequently the same group of organisms can be sorted and arranged in many different ways [3,10].
More recently, pangenomic analyses have also been suggested as potential methods for de ning prokaryotic species [11]. For example, Moldovan and Gelfand (2018;12) propose a new procedure for the de nition of bacterial species that combines phylogeny and pangenomes. In their model, they use both a strict species de nition and a weak de nition. The strict de nition states that a species should be monophyletic in a sequence-based tree and should be composed of a genetically similar strain set, and should contain the maximum amount of strains that satisfy both of the previous criteria. The weak de nition allows the species to be either monophyletic (with paraphyletic clades potentially being subspecies) or polyphyletic. Furthermore, Goryunov et al. (2015, 13) use a nucleotide pangenome (that is, a set of aligned nucleotide sequences of orthologous gene fragments covering all genomes) to create phylogenies of mosses, highlighting the potential usefulness of pangenomes in taxonomy once more.
Likewise, orthologous gene analysis has been used for phylogenetic purposes when taxonomic discrimination is challenging [14], whilst also providing information on gene evolution and therefore strain evolution.
The rumen microbiome is an example of a taxonomically ambiguous environment, with horizontal gene transfer being rife due to the contained nature of the rumen, and the intense proximity that it provides [15]. Consequently, our understanding of both the taxonomy and function of the constituent microbes remains vague as it is in constant ux. Recently, our understanding has been enhanced through the Hungate collection [16], which comprise 501 rumen microbial genomes, most of which are bacterial. More recent metataxonomic studies, including the global rumen census, show that a core rumen microbiome can be found in various ruminant breeds across varying geographical locations [17]. These studies show that Prevotella, Butyrivibrio, and Ruminococcus, as well as unclassi ed Lachnospiraceae, Ruminococcaceae, Bacteroidales, and Clostridiales make up the "core bacterial microbiome" across breeds and geographical locations [17]. Nonetheless, the taxonomy of Butyrivibrio and Pseudobutyrivibrio remains a topic of debate.
Butyrivibrio were rst described in 1956 and using classical morphological and biochemical taxonomy were described as being motile, Gram negative, slightly curved rods that produced large amounts of butyric acid via glucose fermentation [18,19]. Despite this initial description, they were later found to be Gram positive, possessing derivatives of teichoic acid in their cell membrane [20]. It was noted upon discovery that there was extensive variation within the genus, and suggested that this variation may lead to di culties in de ning species-speci c patterns, despite butyrate production being a commonality [18,19,21]. The rst isolates were called Butyrivibrio brisolvens, named after their importance in the digestion of brous constituents in ruminant feed [18,19], enabled by their vast array of carbohydrate active enzymes [22]. From here, newly isolated strains were routinely classi ed as B. brisolvens despite morphological variation and vast genetic diversity [23] until 1976, when B. crossotus, a predominantly human isolate (also being found in the rumen on occasion in low abundance), was rst described [24].
Currently there are 4 Butyrivibrio species which reside in the rumen: namely, B. brisolvens, B. crossotus, B. hungatei and B. proteoclasticus (Fig. 1) [18,24,25]. In 2008, B. proteoclasticus was reclassi ed, originally being Clostridium proteoclasticum, on the basis of phylogenetic placement, DNA GC content, and physiological traits [25]. In 1996, a non-succinate-fermenting bacterium, was isolated that closely resembled B. brisolvens but was found to vary su ciently based on 16S rDNA, GC content and cellular fatty acid content, which was named Pseudobutyrivibrio ruminis [26]. Later on in 2003 another species of Pseudobutyrivibrio was taxonomically classi ed as P. xylanivorans based upon fermentation characteristics, DNA GC content and 16S rDNA dissimilarity to Butyrivibrio spp. [23]. Despite these additional species being named, the Butyrivibrio and Pseudobutyrivibrio still possess untapped phylogenetically and genetic diversity [27]. The challenges of de ning prokaryotic taxonomy in general are also compounded by the diverse approaches applied to bacterial systematics [12].
The aims of this study were to re-investigate phylogeny, gene-level functional divergence and evolution in predominantly ruminal Butyrivibrio and Pseudobutyrivibrio using all publically available genomes. This study also investigated gene-centric evolution in the ruminal Butyrivibrio and Pseudobutyrivibrio in relation to their pangenomes. Many of the bacterial genomes in the Hungate collection are from the genus Butyrivibrio and Pseudobutyrivibrio, which makes this study timely, and enables a paradigm shift in our fundamental understanding of these genera.

Functional similarity
Functional analysis was completed for all 71 ruminal Butyrivibrio and Pseudobutyrivibrio genomes used in this study using EggNOG-mapper (Table 1; Fig. 2). These data show very little variability at high-level functional categories between the genomes, and highlight that all the strains have a large proportion of 'unknown' genes (i.e. those not annotated by EggNOG-mapper). Overall, strains have predominant functions relating to carbohydrate transport and metabolism (average 8.91%), cell wall, membrane and envelope genesis (average 8.05%), and amino acid transport and metabolism (average 7.24%). The percentage of these functions stay fairly constant across all strains (standard deviation 0.80%, 0.78% and 0.50% respectively), although some exceptions to this can be seen. Speci cally, strain YAB3001 has 188 genes of 3095 (6.07% of total) annotated as being part of an amino acid transport and metabolism pathway, compared with strain NOR37 which has 183 of 2184 (8.40% of total) attributed to the amino acid transport and metabolism pathway. The average percentage for this category across all strains is 7.24%. The general consistency in high-level gene function can also be seen when the functional categories are compared at a genus level ( Supplementary Fig. 1).
Gene and Genome-level similarity Despite the broad similarity seen in high-level functional categories ( Fig. 2 and Supplementary Fig. 1), there may be substantial differences at the gene level. In order to evaluate gene level similarities for the 71 strains, phylogenetic trees based on 16S rDNA and 40 conserved marker genes were rst constructed to examine phylogenetic relatedness ( Supplementary Fig. 2, Fig. 3 respectively). The 40 marker phylogeny revealed six species/clades, containing: 1. P. ruminis, 2. P. xylanivorans, 3. B. brisolvens, 4. Butyrivibrio sp., and 5. B. hungatei, and 6. B. proteoclasticus, although it should be noted that P. ruminis and P. xylanivorans are phylogenetically very close, as are B. hungatei, and B. proteoclasticus (Fig. 3). Similar patterns were seen in the 16S rDNA based phylogenetic tree, although the clustering was less distinct in some instances ( Supplementary Fig. 2). Similar groupings can be seen again in the 3D scatter plot (Fig. 4), with P. ruminis and P. xylanivorans clustering very closely as well as B. hungatei and B. proteoclasticus.
In order to compare the strains on a genome level Average Nucleotide Identity (ANI) was calculated using PyANI [28]. These data show that the vast majority of strains have less than 75% nucleotide identity ( Supplementary Fig. 3). Several small clusters can be seen with nucleotide identity greater than 75%. Groups formed by PyANI are somewhat similar to those seen in the 40 marker tree (Fig. 3), although anomalies can be seen. When dendrograms based on genome coverage are plotted, the data show that many of the strains have <50% coverage, meaning that <50% of the genome can be compared due to a high level of dissimilarity ( Supplementary Fig. 4). Due to this reason the strength of the comparative data is weak when coverage is <50%, which in itself shows the level of dissimilarity across strains.
These data as a whole are con icting in terms of enhancing the taxonomic assignment of the ruminal Pseudobutyrivibrio and Butyrivibrio. Nonetheless, the 40 marker tree allowed resolution of all 5 previously classi ed Butyrivibrio and Pseudobutyrivibro spp. to varying extents, whilst identifying another possible Butyrivibrio clade/species. The similarity between the 40 marker data and the 16S rDNA and 3D GC content/Genome Size/Number genes made it a logical choice on which to base further analysis.

Pangenomics
In order to further scrutinise this taxonomic ambiguity, the pangenomes were investigated using Spine software [29] based on the 6 species/clades identi ed. A range of Spine cut-off parameters (% accessory de nition, the number of genomes a gene that has to be found within to be considered core and % nucleotide identity) were assessed to ensure optimal non-biased parameters were chosen for downstream analysis (Supplementary Fig. 5 and 6). These data show that, as expected, the higher the core de nition percentage is, the greater the proportion of accessory genes. There appears to be a sharp increase in the percentage of the genome that is de ned as accessory when the core de nition setting is increased from 60 to 70% in P. ruminis, and 70 to 90% in B. brisolvens. All other groups show a more consistent estimate of the accessory genome regardless of the core de nition percentage used ( Supplementary Fig. 5). Contrastingly, as the gene nucleotide identity percentage cutoff increases, there is little change in the proportion of accessory genes observed ( Supplementary Fig. 6). It is challenging to de ne the 'correct' cut-off parameters for pangenome analysis, which ultimately correspond to the biological situation. For downstream analysis the parameter of 100% core de nition was used, i.e the gene has to be present in all genomes (as per 30), as well as the default 85% gene nucleotide identity cutoff. The total number of core genes per clade decreases, as does the core/pangenome ration, with increasing stringency (Table 2).
At a 100% core de nition, B. brisolvens has an average core GC content of 44.94%, and an accessory GC content of 40.57%. B. hungatei has 44.73% and 41.04% for average core and accessory GC contents respectively, B. proteoclasticus 45.23% and 42.70%, Butyrivibrio sp. 45.14% and 41.58%, P. ruminis 43.39% and 39.06%, and P. xylanivorans 43.69% and 39.12%. Pseudobutyrivibrio strains had a greater difference than Butyrivibrio strains, with an average difference of 4.32% compared to 3.38%. The greatest difference was seen in strains of P. xylanivorans, with an average 4.57% difference in GC content between core and accessory genome, and the least difference is in B. proteoclasticus with 2.52%. Core genes across each taxon appear to have a higher GC% than their respective accessory genomes (with an average of 44.57% in the core genome, and 40.95% in the accessory at a core de nition of 90%) (Supplementary Table 1).
Functional annotation of species/clades, when split into core and accessory genomes ( Fig. 5A and B) shows greater functional diversity within the accessory genome. Invariably, a large proportion of the core genome appears to be dedicated to translation, ribosomal structure, and biogenesis. B. brisolvens appears to dedicate the largest proportion of its core genome to this category. This can be seen to a much lesser extent in the accessory genomes, with no particular functional role dominating to the same extent. Functional categories appear to be relatively evenly distributed throughout all six accessory genomes. The most evident outlier in the core genome is the B. brisolvens core, which has 223 genes in the "Translation, ribosomal structure and biogenesis" category, which is 70.35% of all its core genes. The average for this category in the core genome is 53.18%. The largest category is that of proteins with unknown function, which is much more prevalent in the accessory genome of each taxon. The average composition of unknown genes in the core genomes is 1.70%, whilst for the accessory genome this is 24.65%. B. brisolvens seems to have the least diverse core genome, having genes from only 8 of the 21 functional categories, followed by B. proteoclasticus with 9. B. brisolvens and B. proteoclasticus also have the fewest core genes annotated, with 317 and 502 respectively. B. hungatei has 989, Butyrivibrio sp. 1445, P. xylanivorans 1541, and P. ruminis 1635.
ClustAGE plots show high levels of genomic dissimilarity within species level groups in terms of Accessory Genomic Elements (AGEs), with each concentric ring representing a genome in the clade (shown by the key), and the AGE positioning in the circle representing their size (Supplementary Fig. 7-12). The plot for B. brisolvens shows gene fragments being absent in many genomes in places. This is particularly clear on occasion, for example, at the 750 kbp mark, with only four genomes out of 11 (AB2020, MC2013, NC3005, and D1) possessing an AGE here ( Supplementary Fig. 7). In all species level groups, the most dissimilarity seems to be in smaller gene fragments (Supplementary Fig. 7-12). It should be noted that the minimum AGE size represented in each plot is 1500 bp.

Evolution of the Butyrivibrio and Pseudobutyrivibrio genera
In order to evaluate gene ancestry and evolution OrthAgogue [31] was used to identify orthologous gene a liations. Orthologous gene distribution on a clade/species level shows that the vast majority of orthologous gene clusters are shared by all species level groups (870) (Fig. 6). As a genus, Pseudobutyrivibrio has more common orthologous genes than Butyrivibrio, with 343 and 223 genes respectively. B. brisolvens has the most unique orthologous genes with 143, followed by P. xylanivorans with 132, B. hungatei with 121, Butyrivibrio sp. with 90, P. ruminis with 27 and B. proteoclasticus with 17 ( Fig. 6). Comparison on a genus level clearly shows that both genera share the majority of their orthologous genes, with 870 clusters of orthologous genes (COGs) being common to the two. Pseudobutyrivibrio has more unique COGs, with 595 whilst Butyrivibrio has 251 unique COGs ( Supplementary Fig. 13). Out of all taxa, Butyrivibrio sp. shares the most amount of COGs across all its members with 1771, followed by P. xylanivorans with 1674, B. hungatei with 1602, P. ruminis with 1585, B. proteoclasticus with 1562 and B. brisolvens with 1502 ( Supplementary Fig 14-19). At taxon level, B. proteoclasticus has the most inparalogous clusters with 920. This is followed by Butyrivibrio sp. (778) Table 2). Further analysis showed that the majority of accessory genes were orthologs; 892 orthologs, 45 inparalogs and 80 co-orthologs, being 87.71%, 4.42%, and 7.87% respectively of a total of 1017 accessory homologous genes. Conversely the majority of core genes were annotated as inparalogous in terms of their evolution; 10 orthologs and 240 inparalogs, with 0 co-orthologs. This equates to 4.0% and 96.0% respectively of a total of 250 core homologous genes. Combined, this gives an overall total of 1269 genes. Of these, 10 are core orthologs (0.79%), 242 are core inparalogs (19.07%), 45 are accessory inparalogs (3.55%), 80 are accessory co-orthologs (6.30%), and 892 are accessory orthologs (70.29%).

Glycosyl hydrolase haplotypes and evolution
Functional annotation of the GH families possessed by each strain showed a lot of similarity based on GH families and their abundances ( Fig. 7; Supplementary Fig. 20-25). However, the GH family-level phylogenetic trees show limited gene clustering and GHs from the same family and within the same strain have a tendency to cluster together, as can be seen on pruned versions of the trees (Supplementary Fig. 26 and 27). These phylogenetic trees therefore show that the genera Butrivibrio and Pseudobutyrivibrio possess a high degree of within GH family enzyme isoforms. Irrespective, GH3 was the most abundant with 690 genes present, followed by GH13 with 681, GH43 with 543, GH2 with 463, and GH5 with 216. Homolog types across the GH2 genes show very few inparalogs with most genes being co-orthologous or orthologous across all species/clades (Fig. 8). Evolutionary distribution across GH3 is fairly similar (Fig. 9), although there is a greater proportion of inparalogous genes, which also appear to be more evenly distributed across the species/clades. Only 6 inparalogs have been annotated in the family GH5 (Fig. 10), and 34 co-orthologs, which is comparatively few compared with the higher number of orthologs. The co-orthologs form two broad clusters, and are only found in the Butyrivibrio sp. and B. proteoclasticus clades. There are no co-orthologous genes found in B. brisolvens, and very few in B. proteoclasticus. GH13, interestingly, has more inparalogs than co-orthologs, with 70 and 48 respectively (Fig. 11). There is overlap here, with many of these genes being annotated as orthologs, inparalogs and co-orthologs simultaneously which is a function of the algorithms used by Orthagogue. GH43 has a high proportion of co-orthologs, and fewer inparalogs (Fig. 12). The co-orthologs appear to be distributed across the entire tree, but with a higher density towards the far end of the tree. For each of the ve GH families, the majority of genes were found in one of the 20 metatranscriptome datasets within the Shi et al. (2014) dataset (Fig 8-12; Suppl. Excel 1). For GH2, 67.17% of genes were found to be expressed, for GH3 62.87%, GH5 56.28%, GH13 66.57%, and GH43 59.41%. Of these, many had an RPKM (Reads Per Kilobase of transcript, per Million mapped reads) value of over 1. These data illustrated that the GH isoforms discovered are not an anomaly of the assembly and are actively expressed.

Discussion
The role that Butyrivibrio and Pseudobutyrivibrio play in the rumen is not yet fully understood, however, they are known to be heavily involved in the metabolism of a wide variety of carbohydrates [23,32] proteins [33], and lipids [34]. Indeed, Butyrivibrio and Pseudobutyrivibrio dedicate a large proportion of their genetic capacity to the breakdown and reassembly of complex polysaccharides, with the resulting simple sugars undergoing fermentation to produce butyrate, a major source of energy for the ruminant [16,22,32,35]. In this study we show that Butyrivibrio and Pseudobutyrivibrio are genetically highly diverse, but can be classi ed within six species/clades, namely B. brisolvens, B. hungatei, B. proteoclasticus, Butyrivibrio sp., P. rumins and P. xylanivorans. These genera also contain very open genomes with a major proportion of the genes being part of the accessory genome as opposed to the core genome, as shown by the low core/pangenome ratio values. We also show that these bacterial genera possess a diversity of CAZymes and numerous gene haplotypes within each CAZyme family which we hypothesise may provide metabolic plasticity during dietary uctuations. This study delves into the fundamental taxonomy, ecology and evolution of the Butyrivibrio and Pseudobutyrivibrio at a level not possible before the recent increase in available genomes [16].
Based on strain level comparisons, functional categories appear to be relatively consistent across all genomes. It should be of no surprise that a large proportion of each genome (an average of 23.94% across each genome) has been annotated as "Function unknown", given that it is not uncommon for genomes to contain genes of unknown function, and our understanding of gene function tends to be lacking leading to poor annotations. For example, when the "minimal bacterial genome" of Mycoplasma mycoides was produced (that is, the smallest set of genes within that species that is capable of supporting life), 23.95% of that genome was composed of genes with an unknown function [36]. Butyrivibrio and Pseudobutyrivibrio are well known carbohydrate-degrading bacteria [23,26,37,38], due to their expansive CAZyme repertoire. Therefore, it is unsurprising that the "Carbohydrate transport and metabolism" functional SEED category is the most enumerate in the strains of Butyrivibrio and Pseudobutyrivibrio studied. "Cell wall/membrane/envelope biogenesis" and "Amino acid transport and metabolism" categories were also highly prevalent, likely due to the fact that they contain many housekeeping genes. Given that these strains all share the same environment, it is logical that they ful l the same broadly functional tasks, and therefore have similar functional annotations.
A phylogenetic tree based on 40 conserved gene markers [39,40] revealed groups that approximate to classical species, with the exception of P. ruminis strain CF1b, and another species/clade of Butyrivibrio was also evident. 16S rDNA phylogeny on fewer strains performed by Kasperowicz et al. (2009;41) showed that the CF1b strain groups closely with the type strain P. ruminis A12-1, which is concurrent with our own 16S rDNA ndings. Whilst 16S rDNA analysis is thought to be a reliable means of establishing distant relationships between organisms due to it's high information content, conserved nature, and universal distribution [42], high levels of diversity have been found within the 16S rDNA genes of certain genomes [43,44]. The 40 marker genes used are universal, single copy genes that are highly conserved and appear to maintain a constant rate of horizontal transfer; as a result of this, using these 40 markers are thought to provide a more resolved comparison [40]. Although we conclude that Butyrivibrio and Pseudobutyrivibrio can be classi ed within six species/clades, namely B. brisolvens, B. hungatei, B. proteoclasticus, Butyrivibrio sp. (a potentially new species/clade), P. ruminis, and P. xylanivorans, we note that P. ruminis and P. xylanivorans are phylogenetically very close as are B. hungatei and B. proteoclasticus. B. hungatei forms a paraphyletic clade, and it could therefore be suggested that it is a sub-species of B. proteoclasticus [12].
The increasing number of available bacterial genomes has allowed further research into microbial population genomics [45], which has revealed extensive intraspeci c variability in prokaryotic genome content, and led to the coining of terms such as "pangenome" (all the gene families that have been found in the species as a whole), "core genome" ('essential' gene families that are found in all members sequenced thus far), and "accessory genome" ('dispensable' genes that are not in every genome) [46]. This, alongside the analysis of orthologous genes (those derived from speciation events) and paralogous genes (those derived from duplication events) can give an in-depth insight into the taxonomy and evolutionary divergence of a population. Gabaldón and Koonin (2013;47) stated that orthology is the most accurate way of describing differences and similarities in the composition of genomes from different species because orthologs by de nition trace back to an ancestral gene that was present in a common ancestor of the compared species. Pangenome analysis of our strains showed that B. brisolvens as a species/clade possesses the lowest percentage of core genes (2.45%), and P. ruminis the highest (10.38%), with both values being comparatively low illustrating that the genera have 'open' pangenome, meaning that as additional strains are introduced, the pangenome increases in size [29].
The high level of genetic variation already known in Butyrivibrio (23,24,48) increases the probability of newly introduced genes being accessory. It is also likely that B. brisolvens has the highest genotypic diversity, given the historic tendency to classify newly discovered strains of Butyrivibrio as B. brisolvens on the basis of common phenotypic and metabolic characteristics, despite their vast diversity and genetic relatedness [23,49]. Core genomes, when viewed on a species/clade basis, share a high proportion of genes involved in translation, ribosomal structure, and biogenesis. This is likely due to the fact that a core genome is thought to comprise essential gene families, i.e. housekeeping genes, whilst the accessory genome is more likely to encode genes that confer functional variation [46]. GC% is consistently higher in the core genomes of each of the 71 strains. Whilst intra-genomic heterogeneity is common [50], it does not explain the evident divide between core and accessory genes. It has been suggested that GC rich segments can occur as a result of biased gene conversion (BCG) following recombination, whereby DNA repair of mismatched bases holds a bias towards GC nucleotides [51]. If this is assumed to be correct, this GC bias in core genes could be explained by their retention over accessory genes, which are more readily lost and exchanged, and are generally less conserved [52]. The longer these core genes are retained, the more they will be subjected to DNA repair, resulting in an increasing amount of GC bases being incorporated into the core genome. The BCG model is, however, contentious, with GC content being thought to play a role in several cellular processes such as replication and expression regulation [53]. Similarly, it has been suggested that there may be a universal mutation bias towards AT nucleotides as a result of single nucleotide polymorphisms (SNPs), suggesting that there may be pressure to retain high GC content [54. 55] state that there is a bias towards genes that are highly expressed having a higher GC content, indicating that core genes may be more highly expressed than accessory genes.
The vast majority of Clusters Of Orthologous Genes (COGs) are shared across all six species/clades, with 870 clusters being common for all of them. Both Pseudobutyrvibrio species/clades share a high proportion of clusters, with 343 COGs being shared between the two. The four Butyrivibrio groups share slightly fewer at 223, and B. brisolvens appears to have the most clade-unique COGs with 143. This could, again, be attributed to intraspecies genomic diversity. When the number of orthologs, inparalogs and co-orthologs are separated into core and accessory, the core genomes appear to be largely composed of inparalogs (242), only 10 orthologs, and no co-orthologs. Conversely, 45 inparalogs, 892 orthologs, and 80 co-orthologs were annotated as being in accessory genomes. It is possible that the core genome, which is thought to contain genes that are essential for life, has such a high number of inparalogs due to a high transcription demand. Generally, organisms with multi-copy genes yield higher expression levels of these proteins [56]. This is supported by Hutchison et al. (2016, 36), who state that it is common for bacteria to possess multiple copies of genes that are involved in essential, or quasi-essential, functions, and that these genes may or may not be paralogs. Despite this, it is thought that there is a positive selection towards deletion of super uous genes given that a smaller genome facilitates faster replication times [57].
The high proportion of orthologs in the accessory genome could potentially be explained by the presence of multigene families (MGF) within the accessory genome. Walsh and Stephan (2008, 58) state that gene families are assumed to be derived from a common ancestor, meaning that they could be classi ed as orthologous. This could be linked to the tendency for microorganisms in complex environments possessing multiple isomers of the same gene. This confers not only competitive advantage, but also a level of environmental robustness that facilitates metabolism of a wide variety of substrates in a changing environment [59]. It is also logical that many orthologs will be found in the accessory genome, as they're more likely to be single copy within a strain (given that they do not by de nition undergo withingenome duplication, unlike inparalogs), and are therefore more likely to be lost from a genome than inparalogs. In being annotated as an ortholog, a given gene is unlikely to have any recent within-genome duplications (or else it would be annotated as an inparalog), and yet in being core it must be present in 100% of genomes. These criteria, when applied simultaneously, are restrictive. Dagan and Martin (2006;60) state that core orthologous genes should account for approximately one percent of genes within a bacterial population. Given the already small collection of core genes found in Butyrivibrio and Pseudobutyrivibrio, it should not be surprising that these genes account for slightly less than this, at 0.79%, although this is confounded by the fact that many genes were annotated as being both orthologous and inparalogous.
Glycosyl hydrolases (GH) are involved in the breakdown of carbohydrates, including many plant polymers, and are broken down into 111 families (61; http://www.cazy.org/) on the basis of amino acid similarity [62]. Given that ruminants are well known for their ability to degrade diverse plant polymers at high rates, it should not be a surprise that they possess a vast array of GH enzymes. The rumen microbiome is exposed to strong diet-driven selection pressures, meaning that they must constantly compete for available sources of nutrition during dietary uctuations [63]. The clustering that can be seen in all GH families whereby clusters of enzymes within the same species/clade, form a similar clustering to the 40 marker trees, is typical of a multigene family, i.e. a group of genes that have arisen from a common ancestor by duplication. These genes therefore have similar functions and a high level of sequence similarity [64]. This is supported by the annotation of many of the GHs as orthologs. Strain P. ruminis CF1b is again the exception here, grouping with P. xylanivorans more closely, suggesting that it may actually belong with the P. xylanivorans species/clade. Although species/clade level clustering can be seen, a wide variety of GH isoforms is evident. It is not uncommon for extensive sequence variation to be found within a bacterial family, with the resulting enzymes having different substrate speci cities and yielding different products [65]. Ohta (2008;66) further state that many multigene families are present in large numbers within a genome due to an increased demand for their gene product, with genes either being clustered or dispersed throughout the genome. The clustered genes may retain overlapping functions whilst the dispersed genes may diverge to a greater extent. This may again be implicated in the tendency for bacteria in functionally demanding environments, such as the rumen, to possess a vast array of functional isomers allowing resilience under dietary perturbations [59]. The fact that such a large proportion of the GHs were found within the Shi et al. (2014; 67) dataset con rms that they are actively expressed within the rumen and not artefacts of the genome assembly.

Conclusions
In conclusion, this study provides the most in-depth dataset on the phylogenetic systematics and evolution of the ruminal Butyrivibrio and Pseudobutyrivibrio to date. This study highlights the extensive genomic variation found within Butyrivibrio, and to a lesser extent, Pseudobutyrivibrio. We also demonstrate the existence of outlier strains within the existing taxonomy in terms of phylogeny, GC content, genome size, and ANI and suggest the existence of a new Butyrivibrio species. The Butyrivibrio and Pseudobutyrivibrio genomes are also very open with very low % core genomes and high % accessory genomes. Despite genomic variation, taxa appear to retain broadly similar high-level functional pro les, and possess a number of GH isoforms that we hypothesise facilitate metabolic plasticity and resilience under dietary perturbations.

Genomes used in this study
Seventy genomes of Butyrivibrio/Pseudobutyrivibrio isolates were obtained from the Hungate 1000 collection (JGI; 16) and one additional strain, Pseudobutyrivibrio xylanivorans MZ8 (obtained from the Rowett Research Institute, University of Aberdeen), was genome sequenced in this study (Table 1) and submitted to GenBank (BioProject number PRJNA563299). The sample was then sent for sequencing to MicrobesNG (https://microbesng.uk/), where it was sequenced on the Illumina HiSeq 2500 platform, using 2x250bp paired-end reads and with x30 coverage. Following sequencing, the data were put through MicrobesNG's standard analysis pipeline, which included strain identi cation by Kraken [68], de novo assembly of the reads by SPAdes [69]. All 71 genomes were re-annotated using Prokka V1.12 [70] via the Galaxy platform [71] with a similarity e-value cut-off of 1×10 -6 to ensure analytical consistency. The 16S rDNA sequences for these genomes were obtained from the Prokka annotations.

Pangenomics
Pangenomic analysis was done according to the species level groups found on the 40 marker tree, i.e. all strains listed as B. brisolvens underwent pangenomic analysis together, all B. hungatei strains underwent separate analysis, et cetera. Core and accessory genomic fragments were identi ed from the Prokka annotated genomic sequences (.ffn les) using Spine V0.3.1 (Ozer et al., 2014, http://vfsmspineagent.fsm.northwestern.edu/index_age.html; 29) with default parameters (clustering at 85% similarity and identifying core genes when present in 100% of the genomes analysed). A range of other de ned parameters (70-100% similarity and present in 50-100% of genomes) were used to check which parameters are optimal to use beforehand. Accessory elements were visualised using ClustAGE V0.8 and ClustAGE Plot [29] for each group (as identi ed from the phylogenetic analysis described previously). The minimum accessory genomic element (AGE) size to represent in the ClustAGE Plot was set to 1500 bp due to le size restrictions when using ClustAGE Plot.
Core and accessory fasta les were combined into the core genome and the accessory genome for their respective taxa and uploaded to EggNOG [75] to determine genomic subsystem allocation. A stacked histogram was then made using these data to compare core and accessory functionality on a taxon level.
Core and accessory gene GC% content were taken from the Spine statistics les for each run of Spine also.

Gene Evolution
Putative gene orthology within the 71 genomes was determined using OrthAgogue [31] using the amino acid sequences of Prokka-annotated genes (.faa) from all 71 genomes. OrthAgogue was run with the parameters "-b -e 6" , which set the E-value cut-off to 10 -6 (this was also completed using an E-value cutoff of 10 -5 and data obtained was similar; data not shown) and forced OrthoMCL [76] emulation. All other OrthAgogue parameters were default. The clusters identi ed by OrthAgogue were turned into binary data, indicating either the presence or absence of a cluster of orthologous and inparalogous genes in each strain, after which the lists of clusters were uploaded to UpSet [77] to visualise the intersections. All groups were selected to be visualised, and the number of intersections was set to 60.

Carbohydrate-active enzymes (CAZymes)
CAZymes were identi ed using the dbCAN metaserver [78] and annotated using the "mRNAs/CDSs/Metagenomes or short DNA seqs" option, running HMMER with default setting of E-Value < 1e-15, coverage > 0. 35. GH sequences were extracted using Samtools V1.9 and each GH family was aligned using the Clustal Omega online server. A tree for each family was inferred using IQ tree V1.6.10, and visualised using iToL. Coloured ranges were put onto the trees using the iTOL template sheets to display clade colourings, and OrthAgogue and dbCAN annotations were compared to determine which GHs were orthologs, co-orthologs and inparalogs. These homolog-based annotations were denoted by the concentric coloured rings surrounding the trees, again using the iToL templates. Stacked histograms were produced showing only the most abundant GH families; in the case of the histogram with all 71 genomes displayed, and the histogram showing all 6 taxa, only GH families with over 100 total instances across all genomes were displayed. For the taxon speci c histograms, this number was 10.

Metranscriptome analysis
To check expression of the identi ed CAZyme isoforms in the rumen, and check whether these genes are expressed in vivo conditions, twenty publicly available metatranscriptomic datasets were taken from National Centre for Biotechnology Information Sequence Read Archive, under the accession number SRA075938 [67]. Datasets were composed of 150 bp paired end reads from the illumina Hiseq 2000 sequencer [67]. Fastq les were processed with multiqc [79] and reads were trimmed from 150 bp to 110 bp using trimmomatic software version 0.36 [80]. Reads were aligned to the Hungate rumen genome dataset using bowtie2 version 2.3.0 [81] using the settings "--very-sensitive-local" which allowed soft trimming of the reads and a relaxed alignment; and "-k 497". This produced SAM les, which were converted to BAM les using SAMtools version 1.9 [82]. SAMtools version 1.9 was used to lter all and the best alignment position for each read using the ag option "-F 260". For each of the resultant 20 BAM les FeatureCounts (from the subread package version 2.0.0) [83]  Chronological identi cation and classi cation of Butyrivibrio and Pseudobutyrivibrio.      Proportions of the most abundant glycosyl hydrolase (GH) families found in 71 strains of Butyrivibrio and Pseudobutyrivibrio. GH families annotated by dbCan metaserver [78].