A Genomic Census of the Chicken Gut Microbiome using Metagenomics and Culture

Background. The chicken is the most abundant food animal in the world. However, despite its importance, the chicken gut microbiome remains largely undefined. Here, we exploit culture-independent and culture-dependent approaches to deliver a genomic census of this complex microbial community. Results. We performed metagenomic sequencing of fifty chicken faecal samples from two breeds and analysed these, alongside all (n=582) relevant publicly available chicken metagenomes, to cluster over 20 million non-redundant genes and to construct over 5,500 metagenome-assembled bacterial genomes. In addition, we recovered nearly 600 bacteriophage genomes. This represents the most comprehensive view of the chicken gut-associated microbiome to date, encompassing dozens of novel candidate bacterial genera and hundreds of novel candidate species. To provide a stable, clear and memorable nomenclature for novel species, we devised a scalable combinatorial system for the creation of hundreds of well-formed Latin binomials. We cultured and genome-sequenced bacterial isolates from faeces, documenting thirty novel species, together with three species from the genus Escherichia, including the newly named species Escherichia whittamii. Conclusions. Our metagenomic and culture-based analyses provide new insights into the bacterial, archaeal and bacteriophage components of the chicken gut microbiome. The resulting datasets expand the known diversity of the chicken gut microbiome and provide a key resource for future high-resolution taxonomic and functional studies on the chicken gut microbiome.


Background
The domestic chicken is the most abundant bird and most abundant food animal on Earth, accounting for a larger fraction of the planet's biomass than all species of wild birds combined [1]. Consumption of chicken meat is growing faster than any other type of meat and is seen as a cheaper, healthier, low-carbon alternative to meat from mammalian livestock [2,3]. Chicken eggs remain a nutritious, affordable food across the globe [4].
The chicken gastrointestinal tract is home to a complex community of microbes and their genes (the chicken gut microbiome) that underpins links between diet, health and productivity in poultry, as evidenced by the ability of antibiotics to promote growth in chicks [5]. This microbial community also acts as a source of pathogens associated with disease in birds or in humans (including Campylobacter, Salmonella, and Escherichia coli), as well as providing a reservoir of antimicrobial resistance genes [6,7,8].
Previous studies of this community have documented a rich variety of microorganisms (dominated by bacteria, but including viruses, archaea and microbial eukaryotes) and have shown that the taxonomic composition of this community varies with age, breed and disease status [9,10]. However, these earlier efforts have largely relied on analyses of molecular barcodes (in particular short 16S rRNA gene sequences), which fail to provide species-level resolution, are unable to detect viruses and reveal nothing about the genome sequences, population structures or functional repertoires of microbial species [11].
opportunistic pathogen of humans [21]-accounted for >10% of sequences in ten birds, corroborating a recent report of high abundance of 16S rRNA gene sequences from this organism obtained from the chicken caecum [22].
Bracken assigned sequences to over a hundred bacteriophage genomes, predominantly phages infecting members of the Enterobacteriaceae assigned to the families Myoviridae and Podoviridae. Particularly noteworthy was the high abundance of reads in some samples from two distinct bacteriophages that prey on E. coli: phiEcoM-GJ1, a lytic bacteriophage isolated in Canada from pig sewage [23], which accounted for 6.7% of reads in a single sample; and phAPEC8, a lytic bacteriophage with a large 147kb genome, isolated from a Belgian poultry farm, which accounted for 10% of reads in a single sample and for >1% of reads in three others [24].
Although these k-mer-based analyses can provide interesting insights into taxonomic diversity within the chicken gut, we quickly realised that they provide an incomplete and misleading picture of this important microbiome for several reasons: (1) they often report the presence of highly implausible organisms (for example, Kraken 2 reported the presence of human pathogens such as Shigella flexneri and Plasmodium falciparum that are simply not credible in this context on clinical grounds); (2) as with studies on 16S rRNA gene sequences, they fail to provide genomic data or insights into the functional diversity or population structure of the microbial species that they identify; and (3) they rely on a reference database and so can only report previously known organisms and can never uncover "unknown unknowns".
The scale of the problem of unknown diversity is clear from the observation that nearly three quarters (73%) of sequence reads from our chicken samples cannot be confidently classified by Kraken 2 to species level and more than half of the reads (54%) cannot be classified at all and are simply designated as "Unassigned" (Fig. 2a). We therefore sought to extend our understanding of this community through two powerful reference-free approaches: assembly-based metagenome analyses and high-throughput culture.

Metagenomic assembly uncovers a wealth of viral diversity
Assembly of metagenomic sequences is a reference-free approach that involves aligning and merging short sequence reads into long contiguous sequences (contigs), which can then be ordered into larger scaffolds that include sequence gaps.
To confirm the presence of bacteriophages inferred through the reference-based analysis and to identify novel viral genomes, we assembled sequence reads from our fifty chicken faecal samples into scaffolds. Scaffold sequences ≥10kb were analysed with VirSorter, a program designed to detect viral signals in microbial sequence data to find novel viruses [25].
VirSorter identified 184 of our chicken faecal scaffolds as Category 1 ("most confident") bacteriophage sequences and identified an additional 1,840 scaffolds as Category 2 ("likely") bacteriophage sequences.
This set was de-replicated to 1,455 genomes using similarity thresholds of 95% ANI over 70% of the genome (Table S5). BLASTN analysis revealed that only ten of these bacteriophage genomes showed high similarity (percentage identity > 70%; query coverage > 50%) to known phages at the nucleotide level (Table S6).
These included close relatives of the two phages (phiEcoM-GJ1 and phAPEC8) found to be highly abundant in the Bracken analyses (Fig. 2b). Interestingly, more than one genus of coliphage (e.g. Jilinvirus, Phapecoctavirus, or Gamaleyavirus) was often detected in the same sample, along with an abundance of reads from their predicted prey (Escherichia), suggesting interesting dynamics in phage-host and phage-phage interactions (Fig. 2c; Table S7).
Of the remaining 1,445 unclassified bacteriophage genomes, nearly 600 either encoded an obvious terminase region or were circular and as such were suggested as being near-complete. Classification of these genomes revealed that all were predicted to belong to the order Caudovirales of tailed phages, with the majority belonging to the family Siphoviridae (n=429), but we also found representatives from the Myoviridae (n=87) and Podoviridae (n=27), plus some bacteriophages unclassified at family level (n=28) (Table S8).

Remarkable microbial genome diversity in the chicken gut
Next, we subjected our samples to computational binning, a process of grouping contigs/scaffolds on the basis of sequence composition and depth of coverage into discrete population bins representing metagenome-assembled genomes (MAGs). However, to carry out a definitive survey of bacterial and archaeal diversity in the chicken gut microbiome, in addition to analysing the fifty faecal samples mentioned and before we started the binning, we retrieved all publicly available chicken gut metagenomic datasets, to create an expansive dataset representing >630 samples, drawn from ten studies and twelve countries (Belgium, China, France, Germany, Italy, Malaysia, Netherlands, Poland, Spain, The Gambia, UK, USA) (Figure S1a/S1b; Table S9).
Sequence assembly and binning on all these samples generated 5,595 MAGs that passed our quality threshold of ≥80% completion and ≤5% contamination (Figure S1c). Of these, 3,131 could be considered high-quality draft genomes, with >90% completion and <5% contamination, as judged by recently published criteria (Table S10) [26]. Genome sizes of the MAGs ranged from ~0.5 to 6.4 Mbp, while GC content ranged from 24% to 73%.
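The two quality tiers described above can be sketched in a few lines of Python. This is a minimal illustration of the stated thresholds only, not the pipeline used in the study; the class `Mag`, the function `quality_tier`, the tier labels and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mag:
    name: str
    completeness: float   # percent, e.g. as estimated by a tool such as CheckM
    contamination: float  # percent


def quality_tier(mag):
    """Apply the thresholds quoted in the text to a single MAG."""
    # High-quality draft: >90% completion and <5% contamination.
    if mag.completeness > 90 and mag.contamination < 5:
        return "high-quality draft"
    # Study threshold: >=80% completion and <=5% contamination.
    if mag.completeness >= 80 and mag.contamination <= 5:
        return "pass"
    return "fail"


# Hypothetical bins, for illustration only.
mags = [
    Mag("bin_001", 97.2, 1.1),
    Mag("bin_002", 84.0, 5.0),
    Mag("bin_003", 72.5, 0.4),
]
tiers = {m.name: quality_tier(m) for m in mags}
print(tiers)
```

Keeping the two tiers in one function makes it explicit that every high-quality draft genome also passes the broader study threshold.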
Then, we grouped the MAGs into metagenomic species (MGSs). Initially, this involved de-replicating MAGs at the widely accepted 95% average nucleotide identity (ANI) for defining bacterial and archaeal species and 99% ANI for defining bacterial and archaeal strains [27,28]. De-replication of MAGs at 95% ANI resulted in 846 clusters representing bacterial and archaeal species, while de-replication at 99% ANI resulted in 2,182 clusters, representing strains. However, to improve recovery of MAGs, MGSs and associated gene sets, we used gene correlations to identify species-representative genes and then applied hierarchical clustering to co-occurring genes across the samples. This allowed us to identify additional genes from the core genome of a species, even when they show divergent nucleotide compositions (such as genes from genomic islands and plasmids) [18]. Similarly, using canopy clustering [29], we could identify commonly occurring species of low abundance. Using these approaches, we were able to identify an additional seven MGSs (Table S11).
Of the 853 de-replicated bacterial metagenomic species, 321 represented previously delineated species catalogued in publicly available databases (Table S13). Following direct comparison, a further 165 metagenomic species had been previously identified by Glendinning et al [14], with these sequences not currently available in public archives. However, only 158 of our metagenomic species possess validly published names based on Latin binomials.
We performed a search of PubMed with the species name and "chicken", leaving aside the 33 species named by Glendinning et al [14]. This suggested that our study provides the first evidence in chickens for the majority (81/125) of these species (Table S14). Examples include: Jeotgalicoccus halophilus, first isolated from the traditional fermented seafood, Jeotgal [31]; Aliicoccus persicus, first isolated from a hypersaline lake [32]; and Bacteroides reticulotermitis, first isolated from the gut of a termite [33].
We found that 310 of our metagenomic species could be assigned a taxonomy only at the level of genus and so represent novel candidate species. A further 56 species could be assigned a taxonomy only at the level of family and, after AAI clustering at 60%, were assigned to 36 novel candidate genera. One candidate bacterial species could be assigned a taxonomy only at the level of order (Oscillospirales) and so represents a new family.
Three MAGs were assigned to the domain Archaea. One represents the species Methanobrevibacter woesei, which is already known to inhabit the chicken gut [34], while the other two represent novel species within the genera Methanocorpusculum and UBA71.

Linnaean binomials for hundreds of new candidate species
Linnaeus first proposed the assignment of Latin binomials to provide a universal nomenclature for biological species [35]. The International Code of Nomenclature of Prokaryotes (ICNP) sets the rules for naming prokaryotic species [36], but currently precludes the valid publication of names of uncultivated organisms, represented by MAGs or other sequences. Furthermore, high-throughput generation of MAGs and of sequence-based taxonomies for bacteria, such as the GTDB [30], is often assumed to preclude the detailed attention usually given to one-by-one construction of Linnaean binomials. As a result, most uncultured taxa, as well as many taxa defined on sequence-based criteria, have been assigned unstable, confusing and hard-to-remember alphanumerical identifiers.
To provide a stable, clear and memorable nomenclature for novel and/or previously unnamed bacterial and archaeal species from the chicken gut, we exploited the provision within the ICNP for naming uncultivated taxa via Candidatus assignments, which, although provisional, provide the scientific community with well-formed Latin binomials [37,38]. However, this prompted us into an unprecedented effort to create hundreds of new names for the purpose of this single research study, an effort that required us to devise a scalable combinatorial system for the creation of binomials. Here, we made extensive combinatorial use of around twenty Latin and Greek roots pertaining to poultry (avi-, galli-, pulli-, alektryo, ptero, kotto-, ornitho-), intestines (intestini-, entero-), faeces (faec-, kakke, merd-, kopro-, excrement-) or microbial life (-monas, -bacterium, -microbium, -coccus, -bacillus, -bium, -cola), twinned with addition of these roots (singly or in tandem) and/or prefixes (allo, hetero, meta-, para-, crypto-) to existing genus names, to create over 200 Candidatus genus names (Table S15). An additional source of diversity stemmed from repetitive use of around forty Candidatus species epithets built from similar roots, which when combined with genus names gave us a total of over 600 distinctive new binomials.
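The combinatorial idea can be illustrated with a short Python sketch. It simply joins a few of the roots quoted above with a few of the microbial-life suffixes; it is not the procedure used to build Table S15, it ignores the rules of Latin and Greek word formation, and the function `candidate_genera` is hypothetical. Some generated strings may coincide with names already in use, so a real workflow would also need to check each name against existing nomenclature.

```python
from itertools import product

# Roots quoted in the text; the selection and the joining rule are illustrative.
HOST_ROOTS = ["avi", "galli", "pulli", "ornitho"]
MICROBE_SUFFIXES = ["monas", "bacterium", "coccus", "bacillus", "cola"]


def candidate_genera(roots, suffixes):
    """Join every root with every suffix and capitalise the result."""
    names = {(root + suffix).capitalize() for root, suffix in product(roots, suffixes)}
    return sorted(names)


genera = candidate_genera(HOST_ROOTS, MICROBE_SUFFIXES)
print(len(genera), "candidate strings, e.g.", genera[:3])
```

Even this toy version shows why the approach scales: four roots and five suffixes already yield twenty distinct strings, and each additional root or prefix multiplies the pool.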

Taxonomic diversity of cultured bacterial isolates
To extend our metagenomics analyses, we applied culture-based methods to six faecal samples that appeared species-rich in Kraken 2 analyses and in so doing obtained 282 isolates from aerobic culture (~80% of isolates) and anaerobic culture (~20% of isolates) (Table S16). All isolates underwent genome sequencing on the Illumina platform and phylogenetic analysis to enable taxonomic assignment. The resulting chicken gut culture collection was found to contain 56 genera, 93 species and 162 strains drawn from five phyla. These included thirty novel species, with all novel species confirmed to originate from a monophyletic group through phylogenetic analysis against all available reference genomes of their respective genus (Figure S2). Curiously, there was no overlap between the species that we obtained and those reported by Medvecky et al [16], suggesting that we are far from exhausting the set of species that can be cultured from this habitat. As with the metagenomic species, all novel or previously unnamed genera and species from cultured isolates were assigned Linnaean binomials (Table 1; Table S17). Species-level ANI clustering of all MAGs and all cultured isolates according to phylum is provided in Figure S4.
Interestingly, alongside ten cultured isolates of the well-characterised species Escherichia coli, we recovered three isolates from Escherichia marmotae (a species recently described in Himalayan marmots [39]). As previously reported, the E. marmotae strains cluster closely with the Escherichia Clade V [40,41], so all members of this clade should be considered members of this species (Fig. 4a, Table S18). Further analysis of the GTDB species designated Escherichia sp001660175 (https://gtdb.ecogenomic.org/searches?s=al&q=sp001660175) confirmed that this species forms a monophyletic lineage that corresponds to Clade II, among the cryptic environmental clades described by Whittam and his colleagues [42], which was subsequently documented in birds [43]. As Clade II is comparable in divergence to the other Escherichia spp. and cryptic clades, we have therefore assigned the Linnaean binomial Escherichia whittamii to designate a new species (Table 1), honouring the outstanding contribution of Thomas S. Whittam to the study of Escherichia spp. [44].
We found that only sixteen species were common to our cultured isolates and our MGS. Subsequent sequence mapping allowed us to detect a further two cultured species at ≥1x coverage in at least one metagenomic sample (Fig. 4b; Table S19). The genomes from cultured isolates were on average 20% larger than the corresponding MAG sequences retrieved from the same source sample (Table S20), which is in line with the completeness threshold of 80% we adopted in quality assurance of the MAGs. However, when we performed detailed gene content analyses on three species abundant in both cultured and metagenomic datasets, namely Lactobacillus reuteri (with the synonym Limosilactobacillus reuteri), Escherichia coli (including the synonym Escherichia flexneri) and Enterococcus faecium, we found that >99% of the genes from the core genomes and nearly half of the genes in the accessory genomes of cultured species were represented in at least one MAG. These observations suggest that our high-quality MAGs are sufficiently complete to warrant Candidatus names.
When we analysed our chicken faecal metagenomes with a Kraken 2 database derived from genomes representing our candidate metagenomic and cultured species, this yielded a considerable improvement in the number of reads that can be classified through rapid phylogenetic profiling (Fig. 4c).

Distribution of microbial species
An analysis of the distribution of 820 MGSs across the entire metagenomic dataset revealed marked variation between samples, with not a single species present at ≥1x coverage in all samples and only 39 species present in >90% of samples, although 441 species were present in >50% of samples at ≥1x coverage (Table S21). At ≥1x coverage, co-occurrence of nearly 300 species (n=295) was identified across all 10 BioProjects (Fig. 5a), with no species identified in all BioProjects at ≥10x coverage (Fig. 5b).
Among the species with high coverage, frequency is clearly linked to BioProject. Although species quantification curves showed that the number of species identified increased rapidly with the number of samples, species discovery appeared to plateau at approximately 230 species after including only 50 metagenomes (Figure S2a). Only two species appeared to be restricted (at ≥1x coverage) to just a single sample: Aliarcobacter thereius and Candidatus Avibacteroides faecavium. Correlation clustering confirmed structure in the data linked to BioProject (Figure S2b); for example, the BioProject from the study by Glendinning et al [14] clearly shows enhancement of clostridial species compared to other BioProjects, which reflects the fact that samples in that study were sourced from chicks with no post-hatching contact with an adult bird. However, the BioSamples do not appear to cluster by country (Figure S2c) and show only limited clustering by sample site (Figure S2d). Unfortunately, there is insufficient metadata for other potentially important factors, such as breed, age or diet, to draw conclusions on how these might influence clustering.
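The prevalence figures above amount to counting, for each species, the fraction of samples in which it reaches a coverage threshold. A minimal sketch of that calculation follows; the function `prevalence` and the coverage values are hypothetical and are not drawn from Table S21.

```python
def prevalence(coverage, min_cov=1.0):
    """Fraction of samples in which each species reaches min_cov coverage."""
    return {
        species: sum(c >= min_cov for c in covs) / len(covs)
        for species, covs in coverage.items()
    }


# Hypothetical coverage values for three species across four samples.
coverage = {
    "species_A": [12.0, 3.5, 1.0, 0.2],
    "species_B": [0.0, 0.4, 0.9, 15.0],
    "species_C": [2.0, 2.0, 2.0, 2.0],
}
prev = prevalence(coverage)
# Species detected in more than half of the samples at >=1x coverage.
widespread = [s for s, p in prev.items() if p > 0.5]
print(prev, widespread)
```

Raising `min_cov` to 10.0 reproduces the stricter ≥10x comparison on the same data structure.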

Discussion
Given the dominance of chickens in the planetary biomass, the chicken gut microbiome ranks as one of the most abundant microbial communities on the planet. Here, we have exploited two complementary approaches, metagenomics and culture, to create an extensive catalogue of genes, genomes and isolates from this important ecosystem. Our work illustrates the value of combining culture-dependent and culture-independent approaches in analysing microbiomes.
We have clearly demonstrated the advantages of shotgun metagenomic sequencing, when applied to the chicken gut microbiome, providing catalogues of genes and genome sequences that take us well beyond what can be achieved using 16S ribosomal RNA gene sequences. Similarly, the current study is much wider in scope than the previous study by Glendinning and colleagues [14], not only including analyses of viral genomes and cultured isolates, but also incorporating MAGs built from data not just from that study but from all publicly available metagenomic datasets. Furthermore, the limited overlap between bacterial species represented among our cultured isolates and in our MGS reinforces the utility of the combined approach. Nonetheless, the substantial co-linearity between genomes obtained by the two approaches, and with those from another similar metagenomic study [14], confirms the reliability of our binning approaches.
We were surprised to find such a remarkable phylogenetic diversity within this commonplace livestock ecosystem, diversity that rivals that associated with the human gut. Our work has more than doubled the number of bacterial species known to reside in the chicken gut and has resulted in the creation of an unprecedented number of new Candidatus species. By including well-formed Latin binomials with the genomes we have uploaded into public repositories, we have ensured that the new proposed names and associated sequences will be integrated into commonly used online taxonomies and databases and will provide a stable taxonomic nomenclature for future studies. In addition, we have provided proof-of-principle for a scalable approach to Linnaean nomenclature that could be applied to species recovered from other metagenomic assembly projects.
Given that we did not recover by culture some of the organisms that appear most abundant by metagenomics, there is clearly scope for additional culture-based investigations, using a wider range of cultural conditions, perhaps drawing on the precedent of the Human Microbiome Project to create and target a list of the "most-wanted-for-culture" organisms documented by metagenomics [45]. The fact that novel metagenomic species are still being recovered from human gut datasets that include tens of thousands of metagenomes [12], twinned with the promise of novel long-read and proximity-capture approaches to metagenome analyses [46], makes it clear that our attempts here to analyse all currently available chicken gut metagenomes provide far from the last word on microbial diversity in this abundant and important ecosystem. Nonetheless, the availability of so many novel genes, genomes and species represents a substantial step forward.

Conclusions
The extensive catalogue of genes, genomes and isolates we have created here substantially improves the coverage of the chicken gut microbiome in the public databases and will make it possible to profile sequences from the chicken gut much more rapidly, easily and comprehensively, providing a valuable resource that lays the groundwork for future comparative and intervention studies. However, as the goal of this study is to provide a genomic census of the chicken gut, we have refrained from making comparisons here between the microbial communities found in samples from different sites or from birds belonging to different breeds or reared under different conditions, because there are so many confounding variables. Thus, it would be misleading to compare caecal and faecal samples when they come from different birds raised in different ways on different continents. The only meaningful way to make such a comparison would be to analyse caecal and faecal material from the very same bird, an analysis that is outside the scope of this study.
This study also sets a provocative precedent (relevant not just to animal microbiomes, but to studies on all microbiomes), assigning well-formed Latin binomials to hundreds of metagenomic species in a scalable alternative to the automated use of bland, unstable, user-unfriendly alphanumerical designations. Drawing on the precedent set by the current study, we have recently extended this approach to encompass creation of more than a million new names for Bacteria and Archaea [47]. Thus, the time is now ripe to bring Linnaeus right into the heart of microbiome studies.

Sixty faecal samples were collected from the Lohmann Brown laying hens and thirty samples from the Silkie hens (six and three samples per day, respectively, for ten days). Freshly evacuated faeces from individual birds were collected in sterile containers and immediately stored at -20 °C. Samples were then transferred to the laboratory for culture or DNA extraction. DNA was extracted using the DNeasy PowerSoil kit (Qiagen), following the manufacturer's instructions, and then stored at -20 °C.

Sequencing and subsequent workflow
Workflow from this point forward is summarised in Fig. 1. The fifty samples yielding >20 ng DNA were processed according to the Low Input, Transpose Enabled (LITE) library construction pipeline [48] before being subjected to paired-end (2x150bp) metagenomic sequencing on the Illumina Novaseq 6000 platform. Bioinformatics analyses were performed on the Earlham Institute's High Performance Computing cluster and on the Cloud Infrastructure for Microbial Bioinformatics [49]. Sequences were assessed for quality using FastQC Version 0.11.8 and trimmed using Trimmomatic Version 0.36, configured to a minimum read length of 40, "leading" and "trailing" settings of 3 (SLIDINGWINDOW:4:20) [50,51]. Metagenomic sequences for all samples have been uploaded to the Sequence Read Archive under BioProject ID PRJNA543206.

Reference-based metagenomic analysis
An initial analysis of our chicken faecal sequences using the Kraken 2 taxonomic classifier [52] was performed on custom databases representing the domestic chicken genome (GenBank assembly accession GCF_000002315.6) and the food plants Triticum aestivum (wheat), Aegilops tauschii (diploid progenitor of the D genome of hexaploid wheat) and Glycine max (soy bean): GenBank assembly accessions GCF_001957025.1, GCA_900519105.1, GCA_000004515.4. Kraken 2 revealed that 8% (±16%) of reads originated from the chicken and at least 19% (±21%) originated from the diet. These sequences were filtered from our dataset and excluded from subsequent analyses by keeping only reads 'Unclassified' by Kraken 2 after comparison with each database in turn.
The remaining dataset underwent taxonomic profiling using Kraken 2 against a microbial database built from all complete/representative archaeal, bacterial, fungal, protozoan, viral and UniVec_Core sequences in RefSeq [53] in January 2020. Bracken [17] was used to estimate taxon abundance from the Kraken 2 profiles, accepting only those taxa with ≥1000 assigned reads. Bracken-database files were generated using "bracken-build" on our microbial database and visualised using KronaTools [54].
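The host and diet depletion step can be pictured as follows. Kraken 2 writes one tab-delimited line per read, beginning with "C" (classified) or "U" (unclassified) followed by the read identifier; reads are retained only if they remain unclassified against every depletion database. The sketch below is an illustration under those assumptions rather than the commands used in the study; the function `unclassified_ids` and the example lines are hypothetical.

```python
def unclassified_ids(kraken_lines):
    """Collect IDs of reads flagged 'U' in Kraken 2 per-read output."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the C/U flag, column 2 the read identifier.
        if len(fields) >= 2 and fields[0] == "U":
            ids.add(fields[1])
    return ids


# Toy output from two successive runs (chicken database, then diet database).
chicken_run = [
    "C\tread1\t9031\t150\t9031:116",
    "U\tread2\t0\t150\t0:116",
    "U\tread3\t0\t150\t0:116",
]
diet_run = [
    "U\tread2\t0\t150\t0:116",
    "C\tread3\t4565\t150\t4565:116",
]

# Keep only reads left unclassified by every depletion database.
kept = unclassified_ids(chicken_run) & unclassified_ids(diet_run)
print(sorted(kept))
```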
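The read-count cut-off applied to the Bracken output can be sketched as a simple table filter. The column names below follow the standard Bracken output format; which column the ≥1000 threshold was applied to is not stated in the text, so the use of `new_est_reads` is an assumption, and the function name and toy rows are hypothetical.

```python
import csv
import io

HEADER = ["name", "taxonomy_id", "taxonomy_lvl", "kraken_assigned_reads",
          "added_reads", "new_est_reads", "fraction_total_reads"]
# Toy rows for illustration only.
ROWS = [
    ["Escherichia coli", "562", "S", "50000", "2000", "52000", "0.96654"],
    ["Enterococcus faecium", "1352", "S", "700", "100", "800", "0.01487"],
    ["Lactobacillus reuteri", "1598", "S", "900", "100", "1000", "0.01859"],
]
report = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def taxa_above_threshold(handle, min_reads=1000, column="new_est_reads"):
    """Return names of taxa whose read count in `column` is >= min_reads."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["name"] for row in reader if int(row[column]) >= min_reads]


kept_taxa = taxa_above_threshold(io.StringIO(report))
print(kept_taxa)
```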

Metagenomic assembly
We searched the NCBI BioProjects database (https://www.ncbi.nlm.nih.gov/bioproject/) in November 2019 with the term "chicken gut microbiome" and then selected nine publicly available projects that contained at least one metagenomic sequence dataset >1GByte in size: PRJEB33338, PRJNA193217, PRJNA291299, PRJNA375762, PRJNA415593, PRJNA417359, PRJEB22062, PRJNA543206, PRJNA385038, PRJNA616250. Only four of these studies were linked to research publications at the time of publication [14,15,55,56]. To assist assembly, all shotgun metagenomic reads were quality-filtered by removing reads shorter than 70% of the maximum expected read length (100 bp, 250 bp for MiSeq data), reads with an estimated accumulated error >2.5 with a probability of ≥0.01 [57] or with an observed accumulated error >2, and reads with >1 ambiguous position. If base quality dropped below 20 in a window of 15 bases at the 3' end, or if the accumulated error exceeded 2, reads were trimmed. All these filter steps are integrated in sdm [58]. Reads mapping to the chicken genome and diet were removed from the metagenomic data as described previously, classifying reads with Kraken 2 against custom databases built on the aforementioned genomes.
Sequence datasets from our fifty samples, together with 582 samples from the selected BioProjects, were assembled using MegaHIT [59] under the option "--k-list 25,43,67,87,101,127". To avoid artefacts that sometimes result from co-assembly of sequences from different samples and different sources, we performed individual assemblies on each sample, with the exception of BioProject PRJNA417359. For that BioProject, as multiple metagenomic samples had been sourced from different tissues of the same individual bird, we co-assembled reads from the 120 BioSamples from that project.
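A simplified version of the read-level filters can be written as below. It covers only the length, ambiguous-base and observed accumulated-error criteria, taking the accumulated error as the sum of per-base error probabilities implied by the Phred scores; the probabilistic estimate and the 3' window trimming are omitted. This is not the sdm implementation, and the function names and default arguments are hypothetical.

```python
def accumulated_error(phred_scores):
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_filter(seq, phred_scores, max_len=100, min_frac=0.7,
                  max_err=2.0, max_ambiguous=1):
    """Simplified read filter based on the criteria quoted in the text."""
    # Remove reads shorter than 70% of the maximum expected read length.
    if len(seq) < min_frac * max_len:
        return False
    # Remove reads with more than one ambiguous position.
    if seq.upper().count("N") > max_ambiguous:
        return False
    # Remove reads whose observed accumulated error exceeds the limit.
    return accumulated_error(phred_scores) <= max_err


good = passes_filter("A" * 100, [30] * 100)    # accumulated error about 0.1
short = passes_filter("A" * 60, [30] * 60)     # below 70 bases
noisy = passes_filter("A" * 100, [10] * 100)   # accumulated error about 10
print(good, short, noisy)
```

For MiSeq data the same function would be called with `max_len=250`.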

Bacteriophage identification and characterisation
Scaffold sequences from the MegaHIT assemblies of our fifty samples that were ≥10kb were analysed with VirSorter v1.0.5 with the "-db 2" option to identify viral genomes [25]. VirSorter Category 1 and 2 scaffold sequences were collapsed at 95% nucleotide identity over 70% of the sequence length using CD-Hit Est v4.6.1 [60]. Classification of bacteriophage sequences relied on nucleotide searches using BLASTN against the NCBI NT database (Completed April 2020) and protein searches using Kaiju Version 1.7.3 against the RefSeq database (Completed April 2020) [61]. Only bacteriophage genomes with BLASTN hit E-Value <0.05, percentage identity >70% and query coverage >50% were selected as reliable hits.
A taxonomic assignment was drawn from the highest scoring BLASTN (or in rare cases BLASTP) hit ranked by query cover and percentage ID. Synteny between predicted coliphages and their respective reference genomes was visualised using EasyFig [62]. Escherichia bacteriophage coverage per sample was determined using Anvi'o v6.1 [63] using default parameters and visualised in R using the Pheatmap package (https://www.rdocumentation.org/packages/pheatmap). Remaining viral genomes were filtered for completeness, retaining those that were circular and encoded a complete terminase gene (as predicted by VirSorter). Taxonomic assignments to family were performed on viral genomes using Demovir (https://github.com/feargalr/Demovir).
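The hit-selection rule described above (filter on E-value, identity and query coverage, then rank by query cover and percentage ID) can be sketched as follows. The `Hit` record, the function names and the example hits are hypothetical; parsing of real BLASTN output is left out.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    evalue: float
    pident: float  # percentage identity
    qcov: float    # percentage query coverage


def is_reliable(hit):
    """Thresholds quoted in the text for a reliable BLASTN hit."""
    return hit.evalue < 0.05 and hit.pident > 70 and hit.qcov > 50


def best_hit(hits):
    """Highest-ranked reliable hit by query cover, then percentage ID."""
    reliable = [h for h in hits if is_reliable(h)]
    return max(reliable, key=lambda h: (h.qcov, h.pident), default=None)


# Hypothetical hits for one phage scaffold.
hits = [
    Hit("phage_A", 1e-50, 92.0, 48.0),  # fails the coverage threshold
    Hit("phage_B", 1e-80, 81.5, 76.0),
    Hit("phage_C", 1e-60, 95.0, 63.0),
]
top = best_hit(hits)
print(top.subject if top else "no reliable hit")
```

Scaffolds for which `best_hit` returns `None` correspond to the genomes left unclassified at the nucleotide level.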

Gene catalogue
Complete genes identified by Prodigal v2.6.1 [64] were clustered at 95% nucleotide identity using CD-HIT-Est v4.6.1 [60]. Incomplete genes were then mapped to this complete gene list using Bowtie2 v 2.3.4.1 [65] and any mapping at 95% nucleotide identity was incorporated into the relevant gene cluster. Finally, genes representing the forty conserved marker genes defined by Mende et al [66] were clustered separately and then merged with the existing set of gene clusters. We thus obtained a gene catalogue of >20 million genes, defined as non-redundant at 95% average nucleotide identity. The final gene catalogue was uploaded to FigShare (https://doi.org/10.6084/m9.figshare.13116809.v1).

Abundance estimates of contigs and genes
Prodigal [64] was applied in metagenome-mode to all contigs from the MegaHIT assemblies. Unfiltered reads from each sample were mapped against their respective assembly to provide an estimate of contig and gene abundance using Bowtie2 [65] with the options "--no-unal --end-to-end --score-min L,-0.6,-0.6". Samtools 1.3.1 was used to sort and index all resulting Bam files [67]. Only reads with mapping quality >20, >95% nucleotide identity and >75% overall alignment length were retained. BEDTools v2.21.0 [68] was used to create depth profiles from the Bam files. These depth profiles were then translated with rdCover (https://github.com/hildebra/rdCover) into average coverage (in a 50 bp window) per contig or per gene predicted from each contig. Bam files were translated to abundances using the "jgi_summarize_bam_contig_depths" script from the MetaBAT 2 package [69].
Gene abundances were linked to their respective gene clusters and originating samples. Redundant genes representing the same orthologue were removed.
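The alignment filter and the windowed coverage summary described in this section can be illustrated with the following sketch. It is not the rdCover implementation; the function names are hypothetical, and identity and aligned length are expressed as fractions for convenience.

```python
def keep_alignment(mapq, identity, aligned_fraction):
    """Retain reads with MAPQ >20, >95% identity and >75% aligned length."""
    return mapq > 20 and identity > 0.95 and aligned_fraction > 0.75


def windowed_mean_depth(depths, window=50):
    """Average per-base depth over consecutive windows of `window` bases."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy per-base depth profile for a 75 bp contig.
depths = [2] * 50 + [4] * 25
profile = windowed_mean_depth(depths)
print(profile)
```

The final window is averaged over however many bases remain, so short contig ends are not padded with zeros.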
Species-level clusters were formed using a combination of two distinct approaches. One approach removed redundancy between samples by pre-clustering bins if ≥30% of their genes overlapped with a higher-quality bin to create a set of pre-MGS bins. Lower-quality bins (>60% completeness and <10% contamination) were also included in the analysis but were not used to form new species clusters. To recover prokaryotic species usually obscured using single-sample assemblies and conventional binning techniques, we refined all species bins into "hcl-clusters" using gene correlations and hierarchical clustering, as described by Hildebrand et al [18]. We chose genes occurring in ≥10% of all associated MAGs as representatives for each pre-MGS bin and used these to fish for additional co-occurring genes from the gene catalogue, using a threshold of >0.75 Pearson correlation and >0.85 Spearman's rho to identify gene co-occurrences within this core gene set. We then merged MetaBAT 2 bins, canopy bins and co-occurring genes into our species bins. We used the presence of 40 known single-copy marker genes, without duplicates, as a quality criterion in selection of sub-clusters, before extracting the final set of MGS gene representatives using MATAFILER (https://github.com/hildebra/MATAFILER). The final collection of MGS bins (canopy clusters + hcl-clusters) was re-assessed for contamination and completeness using CheckM [70], so that we could be confident that each bin represents a single species. A second approach de-replicated all MAGs at 95% average nucleotide identity (ANI) (species-level) and 99% ANI (strain-level) using dRep Version 2.0 [71] and only species not identified in approach one were added to the resulting non-redundant species catalogue. A single representative MAG for each novel species cluster was uploaded to NCBI SRA under BioProject PRJNA543206 and all MAGs generated were uploaded to FigShare (https://doi.org/10.6084/m9.figshare.13116809.v1).
CompareM Version 0.1.1 (https://github.com/dparks1134/CompareM) was used to calculate average amino acid identity between novel genera.

Taxonomy of metagenomic species
We used the Genome Taxonomy Database Toolkit (GTDB-Tk Release 95) to perform taxonomic assignments on strain-level dereplicated MAGs [72]. In addition, genes from each MGS were analysed using GTDB-Tk (Release 95) and the proGenomes resource [73], and underwent k-mer-based taxonomic profiling using Kraken 2. In assigning taxonomy, we allowed GTDB assignments to take precedence: only when no GTDB taxonomy was available did we adopt taxonomies assigned by proGenomes and Kraken 2 and, then, only where genus and family assignments from these sources matched. When exploiting the taxonomy assigned according to genes from metagenomic species, we applied a least-common-ancestor approach to unplaced taxa at higher taxonomic levels. Species distribution analyses were conducted using the Vegan package in R [74], before visualisation using the ggplot2 [75] and Pheatmap R packages (https://www.rdocumentation.org/packages/pheatmap). Pan-genome analysis was conducted using Roary v3.11.2 and visualised using the roary2svg.pl script [76]. Comparison of our derived metagenomes with those of Glendinning et al. [14] was performed at 95% ANI using dRep and visualised using the web tool BioVenn [77].
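The precedence rule for taxonomic assignment can be written down as a short sketch. It assumes each tool's result has already been parsed into a dictionary of rank names (a hypothetical representation, not the native output of GTDB-Tk, proGenomes or Kraken 2), and it encodes only the rule stated above: GTDB first, otherwise proGenomes and Kraken 2 only where their family and genus agree.

```python
def assign_taxonomy(gtdb, progenomes, kraken2):
    """Pick a taxonomy for one metagenomic species.

    Each argument is either None (no assignment) or a dict with rank keys
    such as "family" and "genus". This layout is an assumption made for
    illustration only.
    """
    # GTDB takes precedence whenever it provides an assignment.
    if gtdb:
        return {**gtdb, "source": "GTDB"}

    # Fall back to the other two sources only if both are present and
    # agree at family and genus level.
    if progenomes and kraken2:
        agree = all(
            progenomes.get(rank) is not None
            and progenomes.get(rank) == kraken2.get(rank)
            for rank in ("family", "genus")
        )
        if agree:
            return {
                "family": progenomes["family"],
                "genus": progenomes["genus"],
                "source": "proGenomes+Kraken2",
            }

    # No usable assignment: the taxon stays unplaced at these ranks.
    return None
```

Taxa for which this returns no assignment would then be handled by the least-common-ancestor step, which is not sketched here.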

Bacterial culture
To estimate species richness and diversity, the Phyloseq package of R [74] was applied to the output from Bracken [17] on all of our chicken faecal metagenomic datasets. The six faecal samples that showed the highest species richness and taxonomic diversity were selected for culture-based studies. Frozen faecal samples were thawed and vortexed, and two 0.5 g aliquots (one processed aerobically, the other anaerobically) from each sample were suspended in 5 ml PBS. Each aliquot was vortexed until homogenised, before performing serial dilutions in duplicate down to 1 × 10⁻⁵. Processing of samples for aerobic and anaerobic culture was identical, except that, for anaerobic culture, all culture media, diluent and consumables were pre-reduced to anaerobic conditions for at least 24 hours before faecal samples were processed in a Whitley A95TG workstation.
For dilutions 10⁻³ to 10⁻⁵, 200 µl was plated directly on to a set of three agar plates for each culture medium (Brain Heart Infusion; Columbia Blood Agar; yeast extract, casitone and fatty acid) with or without vancomycin supplementation at a concentration of 6 µg/ml (Table S1). Cultures were incubated at 37°C for 72 hours in their respective conditions before assessment of colony growth. Well-isolated colonies were picked according to colonial morphotype, distinguished by colour, shape and size, before being restreaked on to the growth medium from which they were sourced to confirm purity. Individual colonies were subsequently used to inoculate 2 ml of broth based on the source culture medium and incubated at 37°C for a further 24 hours before bacterial DNA extraction. All isolates were archived at -80°C in glycerol at 20% concentration.

Genome sequencing and analysis
Genomic DNA was extracted using a DNeasy UltraClean DNA isolation kit according to the manufacturer's instructions (Qiagen, Hilden, Germany). DNA was quantified using a Qubit® fluorometer (Invitrogen, CA, USA) high-sensitivity assay, before dilution to the required concentration in RNase-free water and purification on AMPure XP beads (Beckman Coulter). Sequencing libraries were prepared from 0.5 ng/µl of RNA-free genomic DNA. A total of 282 isolates were included for genomic sequencing using the Nextera-XT DNA sample preparation kit (Illumina) and whole-genome sequencing performed using the Illumina NextSeq sequencing platform, generating paired-end reads (2 x 150 bp).
Paired-end reads were quality-assessed and trimmed using FastQC and Trimmomatic as described above. Trimmed reads were assembled into scaffolds using SPAdes version 3.13.1 [78]. Scaffolds shorter than 500 bp were discarded from analysis. Genome contamination and completeness were assessed using CheckM version 1.0.13. To confirm assembly quality, only genomes conforming to all the following criteria were included in further analysis: (i) scaffold N50 of >20 kbp; (ii) 90% of assembled bases at >5x read coverage; (iii) completeness of >95%; (iv) contamination of <5%; (v) complete 16S rRNA gene sequence.
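The five inclusion criteria can be expressed as a simple filter. The sketch below assumes that per-genome summary values (scaffold lengths, fraction of bases above 5x coverage, CheckM completeness and contamination, and presence of a full-length 16S rRNA gene) have already been computed by the upstream tools. The function and argument names are hypothetical.

```python
MIN_SCAFFOLD_BP = 500  # scaffolds shorter than this are discarded first


def n50(lengths):
    """Length L such that scaffolds of length >= L hold half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def passes_qc(scaffold_lengths, frac_bases_over_5x,
              completeness, contamination, has_full_16s):
    """Apply criteria (i) to (v); percentages are given on a 0-100 scale."""
    kept = [length for length in scaffold_lengths if length >= MIN_SCAFFOLD_BP]
    return (
        n50(kept) > 20_000                # (i) scaffold N50 > 20 kbp
        and frac_bases_over_5x >= 0.90    # (ii) 90% of bases at > 5x coverage
        and completeness > 95.0           # (iii) completeness > 95%
        and contamination < 5.0           # (iv) contamination < 5%
        and has_full_16s                  # (v) complete 16S rRNA gene
    )
```

For example, `passes_qc([30000, 25000, 400], 0.95, 98.0, 1.2, True)` drops the 400 bp scaffold, computes an N50 of 30,000 bp from the remainder and accepts the genome.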

Genome sequence taxonomic assignment
Barrnap Version 0.9 (https://github.com/tseemann/barrnap) was applied to all genomes that passed the quality filters to extract full-length 16S rRNA gene sequences. These were then compared to NCBI 16S rRNA gene sequences from RefSeq genomes using the NCBI's web-based BLASTN facility [79]. 16S rRNA gene sequences that showed an identity of <98.7% to known sequences were assigned to novel species, following the conservative approach in the proposed minimal standards [80]. We used ReferenceSeeker Version 1.6.2 [81] to determine average nucleotide identity (ANI) and conserved DNA values compared to RefSeq bacterial genomes (completed March 2020) [53]. Genomes that showed ANI ≤95% and conserved DNA ≤69% to the closest relative were designated novel species. The Genome Taxonomy Database Toolkit (GTDB-Tk Release 89) was used to perform taxonomic assignments on isolate genomes [72]. Genomes were clustered at 95% and 99% ANI using dRep [71], before selection of a single representative isolate per species. Where a genome previously designated as novel clustered with a genome of assigned taxonomy, this taxonomy was then applied to the previously designated 'novel' genome. Final taxonomic assignments were based on genome-based ANI values derived from RefSeq and GTDB, with GTDB assignments taking precedence. A single representative genome for each novel or renamed species cluster was uploaded to NCBI SRA under BioProject PRJNA543206 and all genomes, alongside their respective 16S rRNA gene sequences, were uploaded to FigShare (https://doi.org/10.6084/m9.figshare.13234556).
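The novelty thresholds and the re-labelling of provisionally novel genomes can be sketched as follows. The thresholds are those quoted above. The two checks are reported separately because the text does not state how they are combined. The function names, the dictionary layout and the choice to leave clusters with conflicting names untouched are assumptions made for illustration.

```python
# Thresholds quoted in the text, all expressed as percentages.
MAX_16S_IDENTITY = 98.7
MAX_ANI = 95.0
MAX_CONSERVED_DNA = 69.0


def novelty_flags(identity_16s, ani, conserved_dna):
    """Report the 16S-based and genome-based novelty checks separately."""
    return {
        "novel_by_16s": identity_16s < MAX_16S_IDENTITY,
        "novel_by_genome": ani <= MAX_ANI and conserved_dna <= MAX_CONSERVED_DNA,
    }


def propagate_taxonomy(cluster):
    """Re-label provisional 'novel' genomes within one 95% ANI cluster.

    cluster maps genome ID -> species name, or None for a genome that was
    provisionally designated novel. If exactly one assigned name occurs in
    the cluster, it is applied to every member; otherwise the cluster is
    returned unchanged (an assumption, since the text does not cover
    conflicting names).
    """
    names = {name for name in cluster.values() if name is not None}
    if len(names) != 1:
        return dict(cluster)
    (name,) = names
    return {genome: name for genome in cluster}
```

A genome at exactly 95% ANI and 69% conserved DNA is still flagged as novel by the genome-based check, because both thresholds are inclusive, whereas a 16S identity of exactly 98.7% is not flagged.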

Phylogenetic analysis
For phylogenetic analysis of all MGS and genome-sequenced isolates, we used PhyloPhlAn v3.0.58 [82] with the "diversity high" setting and a proteome input predicted from all genome sequences using Prodigal v2.6.1 [64]. Diamond v0.9.34 [83] was used to perform a search against 400 universal PhyloPhlAn markers. MAFFT v.7.271 [84] was used to perform multiple sequence alignment before refinement with trimAl v.1.4 [85] and reconstruction into trees using FastTree v2.1 and RAxML v. 8.2.12 [86,87]. All trees were visualised and manually annotated using the online iTOL v1.4 platform [88]. Trees were scrutinised to confirm that species and genera were monophyletic. Phylogeny for all cultured genomes unassigned at species level was confirmed, as previously described, against all available reference proteomes of the respective genus downloaded from NCBI.
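The monophyly check can be made concrete with a small sketch. It assumes a rooted tree represented as nested tuples with leaf names as strings, which is a toy representation rather than the Newick output of FastTree or RAxML. A group is treated as monophyletic when some clade contains exactly its members. The function names are hypothetical.

```python
def collect_clades(node, out):
    """Append the leaf set of every clade under node to out; return node's set."""
    if isinstance(node, str):
        leaves = frozenset([node])
    else:
        leaves = frozenset().union(
            *(collect_clades(child, out) for child in node)
        )
    out.append(leaves)
    return leaves


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree holds exactly the group members."""
    clades = []
    collect_clades(tree, clades)
    return frozenset(group) in clades


# Toy example: two genomes of genus A group together, so the genus is
# monophyletic; A1 and B1 do not form a clade without A2.
toy_tree = ((("A1", "A2"), "B1"), ("C1", "C2"))
```

On an unrooted tree the outcome depends on where the root is placed, so a check of this kind is only meaningful once an outgroup or other rooting has been chosen.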
To investigate the phylogenetic placement of cultured isolates designated as Escherichia marmotae and Escherichia sp001660175 by GTDB, we constructed a core genome phylogenetic tree. The genomes from cultured isolates were compared to genomes representing the full diversity of the genus Escherichia. Three Salmonella genomes were included as an outgroup. The genome sequences were aligned using Mugsy [89], and alignment blocks conserved across all genomes were concatenated to produce a core genome alignment. A phylogenetic tree was constructed by maximum likelihood with 100 rapid bootstrap replicates, using the general time reversible model of nucleotide substitution with gamma correction for rate heterogeneity, as implemented in RAxML version 8.2.12 [87].

Authors' contributions
RG and AR contributed to the study design, processing, analysis and interpretation, and to manuscript preparation. EFN, SJ, AS and MA contributed to sample collection and processing. NH, DB and KG contributed to sample processing. NFA and EMA contributed to the data analysis. FH contributed to the study design, data analysis and interpretation, and manuscript preparation. MW contributed to the study design, data analysis and interpretation, and manuscript preparation. AO contributed to the naming of the candidate taxa and to the manuscript preparation. MJP conceived the study and contributed to its design, coordination and manuscript preparation. RML, DLH, IP and MG contributed to the collection and processing of the samples and metadata. RC contributed to data analysis and interpretation and manuscript preparation. All authors read and approved the final manuscript.

Table 1. Protologues for new taxa cultured from chicken faeces
DESCRIPTION OF ACINETOBACTER PECORUM SP. NOV.
(pe.co'rum. M.L. gen. pl. pecorum of flocks of sheep, birds etc., as this species has also been isolated from sheep) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [30] has given this species the alphanumerical designation sp001647535 and the two other genomes assigned to this species are derived from sheep isolates [90].
The Type Strain is Sa1BUA6, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 42.9% and the genome size is 3,209,341 base pairs. Further information can be found in the Methods and in Additional File 1.
DESCRIPTION OF ARTHROBACTER GALLICOLA SP. NOV.
(gal.li'co.la. L. masc. n. gallus a cock; N.L. suff. -cola an inhabitant of; N.L. masc. or fem. n. gallicola an inhabitant of the chicken) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [30] has applied the designation Arthrobacter_B to this genus.
The Type Strain is Sa2CUA1, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 65.5% and the genome size is 3,679,471 base pairs. Further information can be found in the Methods and in Additional File 1.
DESCRIPTION OF ARTHROBACTER PULLICOLA SP. NOV.

pullicola an inhabitant of young chickens)
A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [30] has applied the designation Arthrobacter_B to this genus.
The Type Strain is Sa2BUA2, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 65.7% and the genome size is 3,726,732 base pairs. Further information can be found in the Methods and in Additional File 1.

DESCRIPTION OF BACILLUS NORWICHENSIS SP. NOV.
(nor.wich.en'sis. N.L. masc. adj. norwichensis pertaining to the English city of Norwich, where the organism was isolated) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [30] has applied the designation Bacillus_AM to this genus.
The Type Strain is Sa1BUA2, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 40.2% and the genome size is 4,696,597 base pairs. Further information can be found in the Methods and in Additional File 1.
DESCRIPTION OF BREVIBACTERIUM GALLINARUM SP. NOV.
(gal.li.na'rum. L. pl. gen. n. gallinarum of hens) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence.
The Type Strain is Re57, which has been submitted for deposition in