A Genomic Blueprint of the Chicken Gut Microbiome

Background: The chicken is the most abundant food animal in the world. However, despite its importance, the chicken gut microbiome remains largely undened. Here, we exploit culture-independent and culture-dependent approaches to deliver a genomic blueprint of this complex microbial community. Results: We performed metagenomic sequencing of fty chicken faecal samples from two breeds and analysed these, alongside all (n=582) relevant publicly available chicken metagenomes, to cluster over 20 million non-redundant genes and to construct over 5,500 metagenome-assembled bacterial genomes. In addition, we recovered nearly 600 bacteriophage genomes This represents the most comprehensive view of the chicken gut associated microbiome to date, encompassing dozens of novel candidate bacterial genera and hundreds of novel candidate species. Keen to provide a stable, clear and memorable nomenclature for novel species, we devised a scalable combinatorial system for the creation of hundreds of well-formed Latin binomials. We cultured bacterial isolates from faeces to deliver 282 whole genome sequences, incorporating thirty novel species, together with three species from the genus Escherichia, including the newly named species Escherichia whittamii. Conclusions: Our metagenomic and culture-based analyses provide new insights into the bacterial, archaeal and bacteriophage components of the chicken gut microbiome. The resulting datasets expand the known diversity of the chicken gut microbiome and provides a key resource for future high-resolution taxonomic and functional studies on the chicken gut microbiome.


Background
The domestic chicken is the most abundant bird and most abundant food animal on Earth, accounting for a larger fraction of the planet's biomass than all species of wild birds combined [1]. Consumption of chicken meat is growing faster than any other type of meat and is seen as a cheaper, healthier, lowcarbon alternative to meat from mammalian livestock [2,3]. Chicken eggs remain a nutritious, affordable food across the globe [4].
The chicken gastrointestinal tract is home to a complex community of microbes and their genes-the chicken gut microbiome-that underpins links between diet, health and productivity in poultry, as evidenced by the ability of antibiotics to promote growth in chicks [5]. This microbial community also acts as a source of pathogens associated with disease in birds or in humans-including Campylobacter, Salmonella, and Escherichia coli-as well as providing a reservoir of antimicrobial resistance (AMR) genes [6][7][8] .
Previous studies of this community have documented a rich variety of microorganisms (dominated by bacteria, but including viruses, archaea and microbial eukaryotes) and have shown that the taxonomic composition of this community varies with age, breed and disease status [9,10]. However, these earlier efforts have largely relied on analyses of molecular barcodes (in particular short 16S sequences), which Metagenomic assembly We searched the NCBI BioProjects database [25] in November 2019 with the term "chicken gut microbiome" and then selected eight publicly available projects that contained at least one metagenomic sequence dataset >1GByte in size (PRJEB33338, PRJNA193217, PRJNA291299, PRJNA375762, PRJNA415593, PRJNA417359, PRJEB22062, PRJNA543206, PRJNA417359, PRJNA385038). We also analysed an additional dataset derived from chicken caecal samples collected in the Gambia [26].
All shotgun metagenomic reads were quality-ltered by removing reads shorter than 70% of the maximum expected read length (100 bp, 250 bp for miSeq data), an estimated accumulated error >2.5 with a probability of ≥0.01 [27] or with an observed accumulated error >2, or >1 ambiguous position to assist assembly. If base quality dropped below 20 in a window of 15 bases at the 3' end, or if the accumulated error exceeded 2, reads were trimmed. All these lter steps are integrated in sdm [28]. Reads mapping to the chicken genome and diet were removed from the metagenomic data as described previously, classifying reads with Kraken 2 [29] against custom databases built on the aforementioned genomes.
Sequence datasets from our fty samples-together with 582 samples from the selected BioProjectswere assembled using MegaHIT [30] under the option "--k-list 25,43,67,87,101,127". Keen to avoid artefacts that sometimes result from co-assembly of sequences from different samples and different sources, we generally performed individual assemblies on each sample. However, we noted that in BioProject PRJNA17359 multiple metagenomic samples had been sourced from different tissues of the same individual bird, so we co-assembled reads from the 120 BioSamples from that project.

Bacteriophage identi cation and characterisation
Scaffold sequences from the MegaHIT assemblies of our fty samples that were ≥10kb were analysed with VirSorter v1.0.5 with the "-db 2" option to identify viral genomes [31]. VirSorter Category 1 and 2 scaffold sequences were collapsed at 95% nucleotide identity over 70% of the sequence length using CD-Hit Est v4.6.1 [32]. Classi cation of bacteriophage sequences relied on nucleotide searches using BLASTN against the NCBI NT database (Completed April 2020) and protein searches using Kaiju Version 1.7.3 against the RefSeq database (Completed April 2020) [33]. Only bacteriophage genomes with BLASTN hit E-Value <0.05, percentage identity >70% and query covering >50% were selected as reliable hits.
A taxonomic assignment was drawn from the highest scoring BLASTN (or in rare cases BLASTP) hit ranked by query cover and percentage ID. Synteny between predicted coliphages and their respective reference genomes were visualised using EasyFig [34]. Escherichia bacteriophage coverage per sample was determined using Anvi'o v6.1 [35] using default parameters and visualised in R using the Pheatmap package [36]. Remaining viral geomes were ltered for completeness, retaining those that were circular and encoded a complete terminase gene (as predicted by VirSorter). Taxonomic assignments to family were performed on viral genomes using Demovir [37].

Gene catalogue
Complete genes identi ed by Prodigal v2.6.1 [38] were clustered at 95% nucleotide identity using CD-HIT-Est v4.6.1 [32]. Incomplete genes were then mapped to this complete gene list using Bowtie2 v 2.3.4.1 [39] and any mapping at 95% nucleotide identity were incorporated into the relevant gene clusters. Finally, genes belonging to the forty conserved marker genes de ned by Mende et al [40] were clustered separately and then merged with the existing set of gene clusters. We thus obtained a gene catalogue of >20 million genes, de ned as non-redundant at 95% average nucleotide identity. Abundance estimates of contigs and genes Prodigal [38] was applied in metagenome-mode to all contigs from the MegaHIT assemblies. Un ltered reads from each sample were mapped against their respective assembly to provide an estimate of contig and gene abundance using Bowtie2 [39] with the options "--no-unal--end-to-end -score-min L, -0.6,-0.6". Samtools 1.3.1 was used to sort and index all resulting Bam les [41]. Only reads with mapping quality >20, >95% nucleotide identity and >75% overall alignment length were retained. BEDTools v2.21.0 [42] was used to create depth pro les from the Bam les. These depth pro les were then translated with rdCover [43] into average coverage (in a 50 bp window) per contig or per gene predicted from each contig. Bam les were translated to abundances using the "jgi_summarize_bam_contig_depths" script from the MetaBAT 2 package [44].
Gene abundances were linked to their respective gene clusters and originating samples. Redundant genes representing the same orthologue were removed.
Species-level clusters were formed using a combination of two distinct approaches. One approach removed redundancy between samples by pre-clustering bins if ≥30% of their genes overlapped with a higher-quality bin to create a set of pre-MGS bins. Lower-quality bins (>60% completeness and <10% contamination) were also included in the analysis but were not used to form new species clusters. To recover prokaryotic species usually obscured using single-sample assemblies and conventional binning techniques, we re ned all species bins into "hcl-clusters" using gene correlations and hierarchical clustering, as described by Hildebrand et al [45]. We chose genes occurring in ≥10% of all associated MAGs as representatives for each pre-MGS bin and used these to sh for additional co-occurring genes from the gene catalogue, using a threshold of >0.75 Pearson correlation and >0.85 spearman rho to identify gene co-occurrences within this core gene set. We then merged MetaBAT 2 bins, canopy bins and co-occurring genes into our species bins. We used the presence of 40 known single-copy marker genes, without duplicates, as a quality criterion in selection of sub-clusters, before extracting the nal set of MGS gene representatives using MATAFILER [48]. The nal collection of MGS bins (canopy clusters + hclclusters) was re-assessed for contamination and completeness using CheckM [47], so that we could be con dent that each bin represents a single species. A second approach dereplicated all MAGs at 95% average nucleotide identity (ANI) (species-level) and 99% ANI (strain-level) using dRep Version 2.0 [49] and only species not identi ed in approach one were added to the resulting non-redundant species catalogue. A single representative MAG for each species cluster was uploaded to NCBI SRA under BioProject PRJNA543206. CompareM Version0.1.1 [47] was used to calculate average amino acid identity between novel genera.

Taxonomy of metagenomic species
We used the Genome Taxonomy Database Toolkit (GTDB-Tk Release 95) to perform taxonomic assignments on strain-level dereplicated MAGs [50]. In addition, genes from each MGS were analysed through GTDB-Tk (Release 95), proGenomes resource [51] and underwent k-mer-based taxonomic pro ling using Kraken 2 [21]. In assigning taxonomy, we allowed GTDB assignments to take precedenceonly when no GTDB taxonomy was available would we adopt taxonomies assigned by ProGenomes and Kraken 2 and, then, only where genus and family assignments from these sources matched. When exploiting the taxonomy assigned according genes from metagenomic species, we applied a leastcommon-ancestor approach to unplaced taxa at higher taxonomic levels. Species distribution analyses were conducted using the Vegan package in R [52], before visualisation using ggplot2 [53] and Pheatmap R packages [36]. Pan-genome analysis was conducted using Roary v3.11.2 and visualised using the roary2svg.pl script [54]. Comparison of our derived metagenomes with those of Glendinning et al. [15] was performed at 95% ANI using dRep and visualised using web-tool BioVenn [55] [49].

Bacterial culture
To estimate species richness and diversity, the Phyloseq package of R [52] was applied to the output from Bracken [23] on all of our chicken faecal metagenomic datasets. The six faecal samples that showed highest species richness and taxonomic diversity were selected for culture-based studies. Frozen faecal samples were thawed, vortexed and two 0.5g aliquots (once processed aerobically, the other anaerobically) from each sample were suspended in 5ml PBS. Since the culture work used here is solely exploratory and in no way used to de ne the chicken gut microbiome as a whole, impacts of sample storage on bacterial recovery rate will provide negligible in uence the on nal conclusions. Each aliquot was vortexed until homogenised, before performing serial dilutions in duplicate down to 1 x 10 -5 . Processing of samples for aerobic and anaerobic culture was identical, except that, for anaerobic culture, all culture media, diluent and consumables were pre-reduced to anaerobic conditions for at least 24 hours before faecal samples were processed in a Whitley A95TG workstation.
For dilutions 10 -3 -10 5 , 200µl was plated directly on to three agar plates of broad culture medium with or without vancomycin supplementation at a concentration of 6µg/ml (Additional File 1, Supplementary Table 1). Cultures were incubated at 37°C for 72 hours in their respective conditions before assessment of colony growth. Well-isolated colonies were picked according to colonial morphotype distinctive in colour, shape and size, before being re-streaked on to the growth medium from which they were sourced to con rm purity. Individual colonies were subsequently used to inoculate 2ml of broth based on the source culture medium, incubated at 37°C for a further 24 hours before bacterial DNA extraction. All isolates were archived at -80°C in glycerol at 20% concentration.

Genome sequencing and analysis
Genomic DNA was extracted using a DNeasy UltraClean DNA isolation kit according to the manufacturer's instructions (Qiagen, Hilden, Germany). DNA was quanti ed using a Qubit ® uorometer (Invitrogen, CA, USA) high-sensitivity assay, before dilution to the required concentration in RNase-free water and puri cation on AMPure XP beads (Beckman Coulter). Sequencing libraries were prepared from 0.5ng/µl of RNA free genomic DNA. A total of 282 isolates were included for genomic sequencing using the Nextera-XT DNA sample preparation kit (Illumina) and whole-genome sequencing performed using the Illumina NextSeq sequencing platform, generating paired-end reads (2 x 150bp). Reads were uploaded to the Sequence Read Archive under Bioproject ID PRJNA543206.
Paired-end reads were quality-assessed and trimmed using FastQC and Trimmomatic as described above [19,20]. Trimmed reads were assembled into scaffolds using SPAdes version 3.13.1 [56]. Scaffolds shorter than 500 bp were discarded from analysis. Genome contamination and completeness was assessed using CheckM version 1.0.13 [47]. To con rm assembly quality, only genomes conforming to all the following criteria were included in further analysis: (i) scaffold N50 of >20 kbp (ii) 90% of assembled bases at > 5x read coverage (iii) completeness of > 95% (iv) contamination of < 5% (v) complete 16S rRNA gene sequence.

Genome sequence taxonomic assignment
Barrnap Version 0.9 [57] was applied to all genomes that passed the quality lters to extract full-length 16S rRNA gene sequences. These were then compared to NCBI 16S rRNA gene sequences from RefSeq genomes using the NCBI's web-based BLASTN facility [58]. 16S rRNA sequences that showed an identity of <98.7% to known sequences were assigned to novel species, using the conservative approach of [59]. We used ReferenceSeeker Version 1.6.2 [60] to determine average nucleotide identity (ANI) and conserved DNA values compared to RefSeq bacterial genomes (Completed March 2020) [22]. Genomes that showed ANI ≤95% and conserved DNA ≤69% to the closest relative were designated novel species. The Genome Taxonomy Database Toolkit (GTDB-Tk Release 89) was used to perform taxonomic assignments on isolate genomes [50]. Genomes were clustered at 95% and 99% ANI before selection of a single representative isolate per species using dREP [49]. Where a genome previously designated as novel clustered with a genome of assigned taxonomy, this taxonomy was then applied to the previously designated 'novel' genome. Final taxonomic assignments were based on genome-based ANI values derived from RefSeq and GTDB -with GTDB assignments taking precedence.

Phylogenetic analysis
For phylogenetic analysis of all MGS and genome sequenced isolates we used Anvi'o v6.1 [61] to retrieve protein sequences associated with the set of 40 conserved marker genes for metagenomic species described above and representative genomes of dereplicated cultured species [40]. We performed a multiple sequence alignment on concatenated sequences of the marker gene products from each species using Muscle v3.8.31 [62]. We then built an approximate-maximum-likelihood phylogenetic tree using FastTree v2.1 [63] and visualised used the online iTOLv1.4 platform for visualisation and manual annotation [64]. This was used to con rm that species and genera were monophyletic.
To investigate the phylogenetic placement of cultured isolates designated as Escherichia marmotae and Escherichia sp001660175 by GTDB, we constructed a core genome phylogenetic tree. The genomes from cultured isolates were compared to genomes representing the full diversity of the genus Escherichia (Additional File 1, Supplementary Table 5). Three Salmonella genomes were included as an outgroup. The genome sequences were aligned using Mugsy [65], and alignment blocks conserved across all genomes were concatenated to produce a core genome alignment. A phylogenetic tree was constructed by maximum likelihood with 100 rapid bootstrap replicates, using the general time reversible model of nucleotide substitution with gamma correction for rate heterogeneity, as implemented in RAxML version 8.2.12 [66].

Reference-based pro ling documents novel diversity
We collected faecal samples from fty chickens reared in the UK belonging to two breeds: Lohman Browns (n=30) and Silkies (n=20). Short-read sequencing of fty faecal samples generated a metagenomic dataset in excess of a billion paired-end reads or three hundred billion base pairs (Additional File 1, Supplementary Table 2).
In recent years, bioinformatics methods have been developed for ultrafast, highly accurate phylogenetic pro ling of metagenomic sequences that rely on matching short sequences (k-mers) to a reference database built from sequenced genomes [21]. Such methods can assign sequences within a taxonomic hierarchy that extends from domains to species and can then quantify the representation of each taxon within a sample. Although these methods have been used in a very focused way to detect Campylobacter in chickens [67], they have yet to be applied to a global analysis of microbial diversity in the chicken gut microbiome. We therefore initially analysed the faecal samples using the kmer-based program Kraken 2, followed by re ned phylogenetic analysis using the allied program Bracken [23] (Additional File 1, Supplementary Table 3).
Unsurprisingly, Kraken 2 and Bracken assigned sequence reads from the faecal samples to all three domains of life, as well as to viruses (Fig. 2a, Additional File 1, Supplementary Table 4), although relative abundance assignments show that bacteria predominate in this environment. Sequences were assigned to a wide range of bacterial phyla, including the three expected as predominant in the vertebrate gut (Bacteroidetes, Firmicutes, Proteobacteria), but also including over twenty additional phyla (Fig. 2b).
Searches of the PubMed database with each phylum name and the term "chicken" reveal that round half of these have been previously documented in the chicken gut. However, at least a dozen appear to be novel in this setting, including the Aqui cae, Balneolaeota, Calditrichaeota, Chlorobi, Dictyoglomi, Fibrobacteres, Gemmatimonadetes, Ignavibacteriae, Kiritimatiellaeota, Lentisphaerae, Nitrospirae, and the Thermodesulfobacteria.
When we rank-ordered the species identi ed by Bracken according to maximum abundance in any one sample, we found, as expected, that species of Lactobacillus dominated among the top twenty most abundant organisms. However, we found that two species of Escherichia-Escherichia coli and Escherichia marmotae-accounted for ≥ 5% of reads in nearly half of the samples (23/50) and in two samples, accounted for more than 50%. Such monodominance of the gut microbiome by bacterial species has been described in diseased humans [45,68], but is surprising in the context of healthy poultry. We also noted a high relative abundance of the recently described chicken pathogen Gallibacterium anatis [69] in most birds (with ve birds showing >5% reads assigned to this organism), despite their healthy status. Similarly, Fusobacterium mortiferuman opportunistic pathogen of humans [70]-accounted for >10% of sequences in ten birds, corroborating a recent report of high abundance of 16S sequences from this organism obtained from the chicken caecum [71].
Bracken assigned sequences to over a hundred bacteriophage genomes, predominately phages infecting members of the Enterobacteriaceae assigned to the families Myoviridae and Podoviridae. Particularly noteworthy was the high abundance of reads in some samples from two distinct bacteriophages that prey on E. coli: phiEcoM-GJ1-a lytic bacteriophage isolated in Canada from pig sewage [72]-which accounted for 6.7% reads in a single sample and phAPEC8-a lytic bacteriophage with a large 147kb genome, isolated from a Belgian poultry farm-which accounted for 10% of reads in a single sample and for >1% of reads in three others [73].
Although we were pleased to see that these k-mer-based analyses can provide interesting insights into taxonomic diversity within the chicken gut, we quickly realised that they provide an incomplete and misleading picture of this important microbiome for several reasons: (1) they often report the presence of highly implausible organisms-for example, Kraken 2 reported the presence of human pathogens such as Shigella exneri and Plasmodium falciparum that are simply not credible in this context on clinical grounds; (2) as with 16S studies, they fail to provide genomic data or insights into the functional diversity or population structure of the microbial species that they identify and; (3) they rely on a reference database and so can only report previously known organisms and can never uncover "unknown unknowns".
The scale of the problem of unknown diversity is clear from the observation that nearly three quarters (73%) of sequence reads from our chicken samples cannot be con dently classi ed by Kraken 2 to species level and more than half of the reads (54%) cannot be classi ed at all and are simply designated as "Unassigned" (Fig. 2a). We therefore sought to extend our understanding of this community through two powerful reference-free approaches: assembly-based metagenome analyses and high-throughput culture.

Metagenomic assembly uncovers a wealth of viral diversity
Assembly of metagenomic sequences is a reference-free approach that involves aligning and merging short sequence reads into long contiguous sequences (contigs), which can then be ordered into larger scaffolds that include sequence gaps.
Keen to con rm the presence of bacteriophages inferred through the reference-based analysis and to identify novel viral genomes, we assembled sequence reads from our fty chicken faecal samples into scaffolds. Scaffold sequences ≥10kb were analysed with VirSorter-a program designed to detect viral signals in microbial sequence data to nd novel viruses [31].
VirSorter identi ed 184 of our chicken faecal scaffolds as Category 1 ("most con dent") bacteriophage sequences and identi ed an additional 1,840 scaffolds as Category 2 ("likely") bacteriophage sequences.
This was de-replicated to 1,455 genomes using similarity thresholds of 95% ANI over 70% of the genome (Additional File 2, Supplementary Table 1). BLASTN analysis revealed only ten of these bacteriophage genomes showed high similarity (percentage identity > 70%; query covering > 50%) to known phages at the nucleotide level (Additional File 2, Supplementary Table 2). These included close relatives of the two phages (phiEcoM-GJ1 and phAPEC8) found highly abundant in the Bracken analyses (Fig. 3).
Interestingly, more than one genus of coliphage (e.g. Jilinvirus, Phapecoctavirus, or Gamaleyavirus) was often detected in the same sample, along with an abundance of reads from their predicted prey (Escherichia) suggesting interesting dynamics in phage-host and phage-phage interactions ( Fig. 3; Additional File 2, Supplementary Table 3).
Of the remaining 1,445 unclassi ed bacteriophage genomes, nearly 600 encoded either an obvious terminase region or were circular and as such were suggested as being near-complete. Classi cation of these genomes revealed all genomes were predicted to belong to the order Caudovirales of tailed phages, with the majority belonging to the family Siphoviridae (n=429), but we also found representatives from the Myoviridae (n=87) and Podoviridae (n=27), plus some bacteriophages unclassi ed at family level (n=28) (Additional File 2, Supplementary Table 4).
Remarkable microbial genome diversity in the chicken gut Next, we subjected our samples to computational binning-a process of grouping contigs/scaffolds on the basis of sequence composition and depth of coverage into discrete population bins representing metagenome-assembled genomes (MAGs). However, to carry out a de nitive survey of bacterial and archaeal diversity in the chicken gut microbiome-in addition to analysing the fty faecal samples mentioned and before we started the binning-we retrieved all publicly available chicken gut metagenomic datasets, to create an expansive dataset representing >630 samples, drawn from ten studies and twelve countries (Belgium, China, France, Germany, Italy, Malaysia, Netherlands, Poland, Spain, The Gambia, UK, USA) (Figure S1a/S1b; Additional File 3, Supplementary Table 1).
Sequence assembly and binning on all these samples generated 5,595 MAGs that passed our quality threshold of ≥80% completion and ≤5% contamination ( Figure S1c). Of these 3,131 could be considered high-quality draft genomes, with >90% completion and <5% contamination, as judged by recently published criteria (Additional File 3, Supplementary Table 2) [74]. Genome sizes of the MAGs ranged from 0.5 to 6.4 Mbp, while GC content ranged from 24% to 73%.
Then, we grouped the MAGs into metagenomic species (MGSs). Initially, this involved de-replicating MAGs at the widely accepted 95% average nucleotide identity (ANI) for de ning bacterial and archaeal species and 99% ANI for de ning bacterial and archaeal strains [75,76]. De-replication of MAGs at 95% ANI resulted in 846 clusters representing bacterial and archaeal species, while de-replication at 99% ANI resulted in 2182 clusters, representing strains. However, to improve recovery of MAGs, MGSs and associated gene sets, we used gene correlations to identify species-representative genes and then applied hierarchical clustering to co-occurring genes across the samples. This allowed us to identify additional genes from the core genome of a species, even when they show divergent nucleotide compositions (such as genes from genomic islands and plasmids) [45]. Similarly, using canopy clustering [46], we could identify commonly occurring species of low abundance. Using these approaches, we were able to identify an additional seven MGSs (Additional File 3, Supplementary Table 3).
Of the 853 de-replicated bacterial metagenomic species, 321 represented previously delineated species catalogued in publicly available databases (Additional File 3, Supplementary Table 4). Following direct comparison, a further 165 metagenomic species had been previously identi ed by Glendenning et al [15], with these sequences not currently available in public archives. However, only 158 of our metagenomic species possess validly published names based on Latin binomials (Additional File 4, Supplementary  Tables 1 and 2).
We performed a search of PubMed with the species name and "chicken", leaving aside the 33 species named by Glendenning et al [15]. This suggested that our study provides the rst-evidence-in-chickens for the majority (81/125) of these species (Additional File 4, Supplementary Table 3). Examples include: Jeotgalicoccus halophilus, rst isolated from the traditional fermented seafood, Jeotgal [77]; Aliicoccus persicus, rst isolated from a hypersaline lake [78]; and Bacteroides reticulotermitis, rst isolated from the gut of a termite [79]. A search of PubMed with the species name and the terms "humans" and infection" suggests that nearly 40% of these known species have been associated with human infection, highlighting the role of the chicken gut as a source of zoonotic pathogens.
We found that 310 of our metagenomic species could be assigned a taxonomy only at the level of genus and so represent novel candidate species. A further 56 species could be assigned a taxonomy only at the level of family and, after AAI clustering at 60%, were assigned to 36 novel candidate genera. One candidate bacterial species could be assigned a taxonomy only at the level of order (Oscillospirales) and so represent a new family.
Three MAGs were assigned to the domain Archaea. One represents the species Methanobrevibacter woesei-which is already known to inhabit the chicken gut [80]-while the other two represent novel species within the genera Methanocorpusculum and UBA71.

Linnaean binomials for hundreds of new candidate species
Linnaeus rst proposed the assignment of Latin binomials to provide a universal nomenclature for biological species [81]. The International Code of Nomenclature of Prokaryotes (ICNP) sets the rules for naming prokaryotic species [82], but currently precludes the publication of names of uncultivated organisms, represented by MAGs or other sequences. Furthermore, high-throughput generation of MAGs and of sequence-based taxonomies for bacteria, such as the GTDB [50] is often assumed to preclude the detailed attention usually given to one-by-one construction of Linnaean binomials. As a result, most uncultured taxa, as well as many taxa de ned on sequence-based criteria, have been assigned unstable, confusing and hard to-remember alphanumerical identi ers.
Keen to provide a stable, clear and memorable nomenclature for novel and/or previously unnamed bacterial and archaeal species from the chicken gut, we exploited the provision within the ICNP for naming uncultivated taxa via Candidatus assignments, which, although provisional, provide the scienti c community with well-formed Latin binomials [83,84]. However, this prompted us into an unprecedented effort to create hundreds of new names for the purpose of this single research study-an effort that required us to devise a scalable combinatorial system for the creation of binomials. Here, we made extensive combinatorial use of around twenty Latin and Greek roots pertaining to poultry (avi-, galli-, pulli-, alektryo, ptero, kotto-, ornitho-), intestines (intestini-entero-), faeces (faec-, kakke, merd-, kopro-, excrement-) or microbial life (-monas, -bacterium, -microbium, -coccus, -bacillus, -bium, -cola)-twinned with addition of these roots (singly or in tandem) and/or pre xes (allo, hetero, meta-, para-, crypto-) to existing genus names-to create over 200 Candidatus genus names. An additional source of diversity stemmed from repetitive use of around forty Candidatus species epithets built from similar roots, which when combined with genus names gave us a total of over 600 distinctive new binomials.

Taxonomic diversity of cultured bacterial isolates
To extend our metagenomics analyses, we applied culture-based methods to six faecal samples that appeared species-rich in Kraken 2 analyses and in so doing obtained 282 isolates from aerobic culture (~80% of isolates) and anaerobic culture (~20% of isolates) (Additional File 5; Supplementary Table 1).
All isolates underwent genome sequencing on the Illumina platform and phylogenetic analysis to enable taxonomic assignment. The resulting chicken gut culture collection was found to contain 56 genera, 93 species and 162 strains drawn from ve phyla. These included thirty novel species. As with the metagenomic species, all novel or previously unnamed genera and species from cultured isolates were assigned Linnaean binomials (Table 1; Additional File 5, Supplementary Table 2).
Interestingly, alongside ten cultured isolates of the well-characterised species Escherichia coli, we recovered three isolates from Escherichia marmotae (a species recently described in Himalayan marmots [85]). As previously reported [86], the E. marmotae strains cluster closely with the Escherichia Clade V [87] , so all members of this clade should be considered members of this species (Fig. 6, Additional File 5; Supplementary Table 3). Further analysis of the GTDB species designated Escherichia sp001660175 [88] con rmed that this species forms a monophyletic lineage that corresponds to the Clade II, among the cryptic environmental clades described by Whittam and his colleagues [89], which was subsequently documented in birds [90]. As Clade II is comparable in divergence to the other Escherichia spp. and cryptic clades, we have therefore assigned the Linnaean binomial Escherichia whittamii to designate a new species (Table 1), honouring the outstanding contribution of Thomas S. Whittam to the study of Escherichia spp. [91].
We found that only sixteen species were common to our cultured isolates and our MGSs. Subsequent sequence mapping allowed us to detect a further two cultured species at ≥1x coverage in at least one metagenomic sample (Fig. 5; Additional File 5, Supplementary Table 4), The genomes from cultured isolates were on average 20% larger than the corresponding MAG sequences retrieved from the same source sample (Additional File 5, Supplementary Table 5), which is in line with the completeness threshold of 80% we adopted in quality assurance of the MAGs. However, when we performed detailed gene content analyses on three abundant species in both cultured and metagenomic datasets -Lactobacillus reuteri (with the synonym Limosilactobacillus reuteri), Escherichia coli (including the synonym Escherichia exneri) and Enterococcus faecium-we found that >99% of the genes from the core genomes and nearly half of the genes in the accessory genomes of cultured species were represented in at least one MAG ( Figure S2). These observations suggest that our high-quality MAGs are su ciently complete to warrant Candidatus names.
We analysed our chicken faecal metagenomes with a Kraken 2 database derived from genomes representing our candidate metagenomic and cultured species, this yielded a considerable improvement in the number of reads that can be classi ed through rapid phylogenetic pro ling ( Figure S3).

Distribution of microbial species
An analysis of the distribution of 820 MGSs across the entire metagenomic dataset revealed marked variation between samples, with not a single species present at ≥1x coverage in all samples and only 39 species present in >90% of samples-although 441 species were present in >50% of samples at ≥1x coverage ( Fig. 7; Additional File 5, Supplementary Table 6).
Among the species with high coverage, frequency is clearly linked to Bioproject. Although species quanti cation curves showed that the number of species identi ed increased rapidly with the number of samples, species discovery appeared to plateau at approximately 230 species after including only 50 metagenomes ( Figure S4). Only two species appeared to be restricted (at ≥1x coverage) to just a single sample: Aliarcobacter thereius and Candidatus Avibacteroides faecavium. Correlation clustering con rmed structure in the data linked to BioProject ( Figure S5) -for example, the BioProject from the study by Glendenning et al [15] clearly shows enhancement of clostridial species compared to other BioProjects, re ected samples sourced from chicks with no post-hatching contact with an adult bird.
However, the BioSamples do not appear to cluster by country and unfortunately metadata for other potentially important factors, such as breed, age or diet, is not adequate enough to draw conclusions on how these might in uence clustering.

Discussion
Given the dominance of chickens in the planetary biomass, the chicken gut microbiome ranks as one of the most abundant microbial communities on the planet. Here, we have exploited two complementary approaches-metagenomics and culture-to create an extensive catalogue of genes, genomes and isolates from this important ecosystem. Our work illustrates the value of combining culture-dependent and culture-independent approaches in analysing microbiomes.
We have clearly demonstrated the advantages of shotgun metagenomic sequencing, and allied bioinformatics approaches [92], when applied to the chicken gut microbiome, providing catalogues of genes and genome sequences that takes us well beyond what can be achieved using 16S ribosomal RNA gene sequences. However, the limited overlap between bacterial species represented among our cultured isolates and in our MGS reinforces the utility of the combined approach. Nonetheless, the substantial colinearity between genomes obtained by the two approaches-and with those from another similar metagenomic study [15]-con rms the reliability of our binning approaches.
We were surprised to nd such a remarkable phylogenetic diversity within this commonplace livestock ecosystem-diversity that rivals that associated with the human gut. Our work has more than doubled the number of bacterial species known to reside in the chicken gut and has resulted in the creation of an unprecedented number of new Candidatus species. By including well-formed Latin binomials with the genomes we have uploaded into public repositories, we have ensured that the new proposed names and associated sequences will be integrated into commonly used online taxonomies and databases [22,93] and will provide a stable taxonomic nomenclature for future studies. In addition, we have provided proofof-principle for a scalable approach to Linnaean nomenclature that could be applied to species recovered from other metagenomic assembly projects [94].
Given that we did not recover by culture some of the organisms that appear most abundant by metagenomics, there is clearly scope for additional culture-based investigations, using a wider range of cultural conditions-perhaps drawing on the precedent of the Human Microbiome Project to create and target a list of the "most-wanted-for-culture" organisms documented by metagenomics [95]. The fact that novel metagenomic species are still being recovered from human gut datasets that include tens of thousands of metagenomes [12]-twinned with the promise of novel long-read and proximity-capture approaches to metagenome analyses [96]-make it clear that our attempts here to analyse all currently available chicken gut metagenomes provide far from the last word on microbial diversity in this abundant and important ecosystem. Nonetheless, the availability of so many novel genes, genome and species represents a great leap forward.

Conclusions
The extensive catalogue of genes, genomes and isolates we have created here substantially improves the coverage of the chicken gut microbiome in the public databases and will make it possible to pro le sequences from the chicken gut much more rapidly, easily and comprehensively, providing a valuable resource that lays the ground work for future comparative and intervention studies. We also offer a provocative precedent-relevant not just to animal microbiomes, but to studies on all microbiomesassigning well-formed Latin binomials to hundreds of metagenomic species in a scalable alternative to the automated use of bland, unstable, user-unfriendly alphanumerical designations. We hope that others will agree that it is now time to bring Linnaeus right into the heart of microbiome studies. Authors' contributions RG, AR contributed to the study design, processing, analysis, and interpretation; and manuscript preparation. EFN, SJ, AS, MA contributed to sample collection and processing. NH, DB, KG contributed to sample processing. NFA, EMA contributed to the data analysis. FH contributed to the study design, data analysis interpretation, and manuscript preparation. MW contributed to the study design, data analysis and interpretation, and manuscript preparation. AO contributed to the naming of the candidate taxa and to the manuscript preparation. MJP conceived of the study in design, coordination and manuscript preparation. All authors read and approved the nal manuscript. RML, DLH, IP and MG contributed to the collection and processing of the samples and metadata. RC contributed to data analysis and interpretation and manuscript preparation. Tables   Table 1. Protologues for new taxa cultured from chicken faeces DESCRIPTION OF ACINETOBACTER PECORUM SP. NOV.
(pe.co'rum M.L. gen. pl. pecorum of ocks of sheep, birds etc., as this species has also been isolated from sheep) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [88] has given this species the alphanumerical designation sp001647535. The Type Strain is Sa1BUA6, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 42.9% and the genome size is 3,209,341 base pairs. Further A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [88] has applied the designation Arthrobacter_B, to this genus. The Type Strain is Sa2CUA1, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 65.5% and the genome size is 3,679,471 base pairs. Further information can A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [88] has applied the designation Arthrobacter_B to this genus. The Type Strain is Sa2BUA2, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 65.7% and the genome size is 3,726,732 base pairs. Further information can be found in the Methods and in Additional File 5.
DESCRIPTION OF BACILLUS NORWICHENSIS SP. NOV.
(nor.wich.en'sis. N.L. masc. adj. norwichensis pertaining to English city of Norwich, where the organism was isolated) A bacterial species cultured from chicken faeces and assigned to this genus and delineated as a species by analysis of its genome sequence. GTDB [88] has applied the designation Bacillus_AM to this genus. The Type Strain is Sa3CUA8, which has been submitted for deposition in NCTC and DSMZ. The GC content of the Type Strain is 40.2% and the genome size is 4,696,597 base pairs. Further information can be found in the Methods and in Additional File 5.   Sequence based analysis of four coliphage genomes recovered from the chicken faecal metagenomes a.
Synteny plots comparing four novel coliphage genomes (in red) to closest reference genomes. b.
Coverage of four coliphages and of the host bacterial species. Only samples in which at least one genome had ≥1x coverage are shown (n=29). All coverage values have been Log10 transformed with blue depicting low abundance and red high abundance. Sequence based analysis of four coliphage genomes recovered from the chicken faecal metagenomes a.
Synteny plots comparing four novel coliphage genomes (in red) to closest reference genomes. b.
Coverage of four coliphages and of the host bacterial species. Only samples in which at least one genome had ≥1x coverage are shown (n=29). All coverage values have been Log10 transformed with blue depicting low abundance and red high abundance.
Page 43/46   UpSet plots depicting presence of 820 metagenomic species across all BioProjects included within this study at a. 1x coverage and b. 10x coverage. Bars are stacked according to taxonomic species novelty, with black stacked bars depicting novel species and grey depicting species previously described in public databases or published studies. Only intersections with 5 or more species are shown.