Recovery of Human Gut Microbiota Genomes Substantially With Third-generation Sequencing

Human gut microbiota modulates normal physiological functions, such as the maintenance of barrier homeostasis and the modulation of metabolism, and various chronic diseases including type 2 diabetes and gastrointestinal cancer. Despite decades of researches, the composition of the gut microbiota remains unexplored and unidentied. Here we established an effective extraction method to obtain high-quality gut microbiota genomic DNA and detected the samples with third-generation sequencing technology. We acquired a quite big data form each sample and assembled many reliable contigs. Not only enormous unknown genes, but also several new bacteria subspecies or species were identied. This work provides a novel and to substantially, facilitating the


Background
There are growing evidences that gut microbiota-the human commensals-in uences normal physiological functions from the maintenance of barrier homeostasis to the modulation of metabolism, in ammation, immunity, and development [1][2][3]. Further, its disorder is a potential risk factor of various chronic diseases such as type 2 diabetes, gastrointestinal cancer, and brain disorders [4][5][6]. Furthermore, the gut microbiota is one of the crucial factors which in uence drug effectiveness owing to its effects on drug metabolism [7,8]. Despite decades of researches, the composition of the gut microbiota remains unidenti ed. For example, through analyzing more than 150,000 human microbial genomes, Edoardo et al recapitulated nearly 5000 species-level genome bins, in which 77% without genomes in public repositories [9]. One of the main di culties is that many microbes are uncultivable and hard to be isolated [9]. In addition, culture-dependent genomic researches are rare to large cohorts [10]. High-through sequencing that do not need to cultivate microbes, overcomes these shortage and becomes a powerful method for bacterial metagenome study.
There are two main methods for bacteria metagenome sequencing: the 16s rRNA-based sequencing and whole-metagenome shotgun sequencing (WGS). 16S rRNA-based sequencing is widely used in assessing microbial communities due to its low cost, time e ciency, and the ability to provide a full overview of the community [11,12]. However, as only a single region of 16s rRNA in the genome is detected, the information it provides about the microbial community is very limited [13,14]. Furthermore, it underestimates the diversity of the microbes. Compared with 16s rRNA sequencing, WGS technology examines the whole bacterial genome, and provides a more accurate detection both at the species and diversity levels [15]. Through WGS technology, it is feasible to assemble the entire bacteria metagenome.
However, there are several drawbacks in WGS technology owing to small sequencing reads (mostly less than 200 bp). For example, the sequence information is too little to assemble the de novo genome e ciently [16,17]. As a result, many microbiota genomes are fragmented into lots of contigs [17]. Moreover, great phenotypic differences exist even between highly related strains of the same species [18,19], and the differences between these strains are di cult to be distinguished by WGS. Therefore, there is a dire need to detect the bacteria metagenome through a more powerful sequencing method.
The third-generation sequencing (TGS) technology, also known as long-read sequencing, detects the isolated genome DNA without ampli cation, and produces surprising long reads (average 10-20 kb) [20].
Compared with the second-generation sequencing technology of 16s rRNA and WGS, it can detect much longer fragments, so TGS produces genome assemblies of unprecedented quality [20]. For example, Hui et al de novo assembled a chromosome-level reference genome of red-spotted grouper with TGS [21].
Besides, TGS has been used in bacterial genome detection yet. For instance, Johanna et al assemble a certain speci c microbial species living in the vaginal with an abundance of more than 75% [22].
Thidathip et al compared the taxonomic abundance of gut microbiota of head and neck cancer patients; while the mean length is about 1 kb, which is much smaller than that of the buffalo genome (11.5 kb) [23,24]. However, one main di cult for gut microbiota genomes detection is that it is tough to obtain highquality genome DNA since the complex fecal environment. Therefore, there are few studies on gut microbiota with TGS yet. In this study, we have tried multiple methods and successfully extracted highquality gut microbial genome DNA. Though detecting the gut microbiota metagenome via Single Molecule Real-Time (SMRT) Sequencing of Pacibio, we assembled the bacteria genomes e ciently and discovered many new contigs without genomes in public repositories and new bacteria species.

Results
Gut microbiota genome isolation and the whole characteristics of the TGS results We analyzed the data of prokaryotes genomes in the National Center for Biotechnology Information (NCBI) and found that complete genomes account for only a small portion (https://www.ncbi.nlm.nih.gov/genome/browse#! /prokaryotes/) (Fig. 1a, Supplementary Fig. 1a), which was consistent with the previous research [9]. In order to assemble de novo complete genomes of gut microbes, we tried to detect the gut microbiota metagenome through the TGS technology. To obtain highquality gut microbiota genome samples for TGS, we tried our best to improve the integrity and the quantity of the genome DNA. In the beginning, the isolated genome DNA was fragmented ( Supplementary  Fig. 1b). After trying many different kits, we acquired a complete genome DNA with the kit from the MPbio company. Then we tried to improve the yield of the genome DNA. We isolated the top band and puri ed the gut microbiota genome DNA with a DNA Puri cation Kit. Finally, we acquired two high-quality samples: a 34-year old male and a 10-month old baby.
. We then detected the whole metagenome of the two samples through TGS technology. We acquired 53G data from the baby sample, and 47G data from the adult sample. Through analyzing the genome database in NCBI, we found that most of the bacterial genomes are large than 0.5 Mb and the smallest genome is about 0.1 Mb (Supplementary Fig. 1b). Therefore, we chose 0.1 Mb and 0.5 Mb as cutoff values to analyze the assembly contigs (Fig. 1b). In the baby sample, there were 43 contigs larger than 0.5 Mb, and 248 contigs larger than 0.1 Mb. Similarly, in the adult sample. Similarly, in the adult sample, there are 33 contigs larger than 0.5 Mb, and 237 contigs larger than 0.1 Mb (Supplementary Tab. 1, Fig.  1b). To assess the reliability of our methods, we analyzed the sequences of 16s rRNA from the longest contig (Contig_511). All the six 16S rRNA sequences of contig 511 are almost the same, with only two different bases in the over 1.5 kb sequence ( Supplementary Fig. 1c). Interestingly, the 16S sequence in contig 511 has fewer differences than that in the E. coli (U00096.3) (Supplementary Fig. 1c). These results suggest that the methods we used are available and reliable ( Supplementary Fig. 1c). Moreover, the contig length was signi cantly related to the number of the coding regions (CDSs) (Fig. 1c). Through analyzing the function potential of the CDSs, we found that many CDSs were correlated with the metabolism of amino acids and carbohydrates. Interestingly, the function potential of a large number many genes were unknown, indicating that there are still numerous valuable genes worth investigating in gut microbes (Fig. 1d).
The aspects of the contigs in the two samples Since most of bacteria genomes are over 0.5 Mb, we further analyzed the contigs over 0.5 Mb in the two samples. In these contigs, ten of them are circular ((Supplementary Tab. 2, Fig. 2a); and twelve contigs are more than 99% complete (Fig. 2b). Nine contigs are both circular and complete (>99%) (Fig. 2c).
Therefore, we focus on the nine contigs and analyzed their features. The full-length 16S rRNA blast analyses showed that less than 50% (four of nine contigs) of these contigs could be completely matched with previous sequences in the NCBI databases ( Fig. 2d and 2e) (indicated by the red arrow). Five of them are not completely matched, indicating that they might be new species or new subspecies (Fig. 2e). To investigate the taxonomic location of the nine contigs, we analyzed them with a phylogenetic analysis using the full-length 16S rRNA. Comparing with some common bacteria, including Bacterioides fragilis and Escherichia coli, we found that their taxonomic location in the phylogenetic tree successfully (Fig.   2e).
The genome analysis of the ve non-match genomes.
Then we further analyzed the ve genomes that cannot be completely matched. Their genome length were from 1.5 Mb to 3.5 Mb ((Supplementary Tab. 3, Fig. 3a). Compared with the huge difference in genome length, the average length of each gene was very similar (Fig. 3a). On the contrary, the difference of GC contents between these genomes were quite large (Fig. 3b). And afterwards, we analyze some special structures of the genomes, including non-coding RNA (ncRNA) and repeat sequences. Interestingly, the number of rRNA and tRNA in each genome was correlated with each other (Fig. 3c). Each genome had all types of repeat sequences, and simple repeats were the major repeat type (Fig. 3d). Intriguingly, we found that all the ve bacteria carry streptomycin resistance, which might result from long-term antibiotic use (Fig. 3e), suggesting that the extensively use of antibiotics might have irreversibly transformed the human gut microbial. To determine the classi cation they belong to, we blasted the fulllength 16S rRNAs in NCBI databases (https://blast.ncbi.nlm.nih.gov/Blast.cgi? PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome). Four of them (contig_82, contig_242, contig_318 and contig_511) had very high identities (>99%). On the contrary, the identity of contig_638 is less than 97% (Fig. 3f).
Phylogeny of four high identity non-matched genomes.
Then we analyzed the bacteria with identities more than 99%, and detected their accurate taxonomic location by phylogenetic tree analysis. We blasted the full-length 16s rRNAs of these genomes with the NCBI databases, and then selected the most related species (rank by identity) for further analysis. Contig_82 belongs to Collinsella erofaciens and might be a new subspecies (Fig. 4a). Similarly, contig_242 is a subspecies of Megasphaera micronuciforrnis; contig_318 is a subspecies of Lactobacillus dextrinicus; and contig_511 is a subspecies of Lactobacillus smilis ( Fig. 4b-4d). The analysis of non-redundant database annotations con rmed these results ( Supplementary Fig. 3a-3d).Then we compared the full-length 16s rRNAs of these genomes with that of the related bacterial strains (indicated by the red arrow) and found that the differences between them were very small (Fig.   4e). The most considerable difference in contig_318 was only ve bases among more than 1500 bases (Fig. 4e). Interestingly, the Megasphaera micronuciforrnis (contig_242) is a conditional pathogen; therefore we sought to evaluate whether healthy people also carry other conditional pathogens or pathogens. We selected 27 common pathogens and conditional pathogens, including Escherichia coli, Staphylococcus aureus, Shigella exneri, and Helicobacter pylori. Only Shigella exneri was found in the gut of adult and baby, while other pathogenic bacteria were not found (Fig. 4f).
Genomice aspects of the new species of Candidatus Enterococcus shanghaius.
Then we analyze the contig_638 with a poor identity. Through the analysis of non-redundant database annotations, we found that it is di cult to nd a bacterium matched contig_638, since the most similar Enterococcus faecalis had only 8% similarity (Fig. 5a). Then we sought to identify its classi cation through NCBI blast and the 16s rRNA phylogenetic tree analysis. Excitingly, we discovered that this bacterium belongs to the Enterococcus (Fig. 5b). Since it was found in Shanghai, we named it as Candidatus Enterococcus shanghaius. Comparing with the most similar bacteria, Enterococcus saccharolyticus and Enterococcus faccalis, we found that there were considerable differences in the 16s rRNA sequence among them (Fig. 5c). Then we analyzed the function potential of the genome and found that many genes might be correlated to carbohydrate metabolism and lipid metabolism (Fig. 5d).
Interestingly, the function potential of many genes were not clear (Fig. 5d). Further analysis of sugar metabolism shows that the most genes were associated with hydrolysis (Fig. 5e). Finally, we analyzed the whole genome of the contig_638: it contains a total of 2.1M bases, and the GC content is relatively low, about 34.49%, which is also consistent with other species of Enterococcus (Fig. 5f). It has 2111 genes, 12 rRNA and 63 tRNA (Fig. 5f).

Discussion
Extensive studies on the human gut microbiota have been reported in the past decades. However, large numbers of species of in the gut microbiota remain unknown. Through improving the extraction method, we successfully acquired the high-quality genomic DNA of the gut microbiota, and obtained approximately 50 Gb TGS data each sample, which is much bigger than previous studies [23,25,26]. We have successfully assembled nine bacteria genomes and large numbers of contigs through these methods. Interestingly, more than 50% of the genomes (5 of 9 bacteria genomes) might be new human species or subspecies. Via large-scale metagenomic assembling, Edoardo et al and Alexandre et al uncovered thousands of new bacteria species [9,27]. These results, together with our work, indicate that it is still of great valuable to investigate the gut microbiota in the future. Furthermore, the methods used in this study help to assemble the genome of unknown bacteria, in consequence, might facilitate to the Human Microbiome Project and the Genomic Encyclopedia of Bacteria and Archaea [28,29]. Furthermore, bacteria are widely distributed in various environments, such as natural lakes, oceans, soils and some polluted environments. Therefore, it is of great interest to investigate bacteria living in different environments with our methods. Compared with metagenome, the genes in the microbial genome has traditionally been underestimated. For example, Hila et al found that there are a fairly large number of unknown small proteins in the human microbiome [30]. Intriguingly, our work had discovered more than 10,000 unknown genes with no known domain (Fig. 1d). These ndings illuminate that a vast number of unknown genes are need to explore in the human microbiome. Importantly, some special genes of bacteria are very important for human health.
For example, an anti-in ammatory protein from Faecalibacterium prausnitzii could inhibit the NF-κB pathway in intestinal epithelial cells [31]. Analyzing with the KEGG, we also found that the genes of the two samples were closely associated with our basic physiological activities (Supplementary Fig. 4a and  4b). It is reported that several proteins from bacteria are potential diagnostic disease indicators since they can pass through intestinal epithelial cell and enter into the plasma [32]. Besides, the exploring of the drug-metabolizing enzymes from gut microbiota might be very useful in drug development and personalized medicine [7,33]. Our research helps to discover more bacteria proteins linking to human health and diseases.
In the nine assemble genomes, we luckily discovered a new bacterium-Candidatus Enterococcus shanghaius. We discovered that it belongs to the Enterococcus, a ubiquitous Gram-positive genus with low-GC genomes. It is reported that many members of this genus are pathogenic bacteria or conditional pathogens for their role as primary causative agents of health care-associated infections [34,35]. Therefore, this bacterium might be a conditional pathogen. Besides this type of bacteria, we have also found other two conditional pathogens: Megasphaera micronuciforrnis and Shigella exneri. There is an important question: what matters human health: quality, quantity of microbes [36]. The results in these work suggest that quality and quantity of microbes are both important for our health.
Despite the promise that this study holds for gut microbiota, it is important to note its limitations. First, the lengths of bacteria genome we acquired were between 1.5 Mb and 3.5 Mb, which is much smaller than that of Escherichia coli str. K-12 (4.64 Mb). We found out that the contig lengths were closely correlated with the coverage (Supplementary Fig. 5). Therefore it is very important to improve the data quantity of each sample to acquire longer contigs. With the continuous development of TGS technology, it is probable to acquire more sequencing data in a single sample. Second, although we have established an effective method, it is still not easy to acquire su cient high-quality genome DNA, and we have harvested two samples. It is of urgent need to improve these methods and to detect more human feces samples.

Conclusions
In summary, we established an effective extraction method to obtain high-quality gut microbiota genomic DNA and detected the samples with TGS technology. Not only a large number of unknown genes, but also several new subspecies and species were identi ed with our methods. This work provides a novel and reliable framework for exploring gut microbiota genomes, facilitating the understanding of the mechanisms that underlie the role of the microbiome in health and disease.

Methods
Sample collection and bacteria genome DNA extraction All methods in this study were approved by the Research Medical Ethics Committee of Shanghai University of Medicine & Health Sciences a liated Zhoupu Hospital. The feces samples were collected from a 34-year old male and a 10-month old baby and then stored in the Ultra-Low Temperature Freezer (Haier, Qingdao, China). The genome DNA extraction were performed with FastDNA Spin Kit for Feces (MPbio, California, USA). To acquire su cient high-quality gut microbiota genome DNA, we improved the experimental procedure, as follows: we added 500 mg feces in a 2 ml Lysing Matrix E tube, then mixed the feces with 825 μl Sodium Phosphate Buffer and 275 μl PLS solution, then shook the mix and vibrated for 15 seconds. Afterwards we centrifuged the samples at 14,000 g for 5 minutes at room temperature and decanted supernatant. Subsequently, we added 978 μl Sodium Phosphate Buffer and shook the mix and vibrated the mixture for 15 seconds, then added 122 μl MT Buffer and shook up and down gently for 5 minutes. Then we placed the samples in the shaker at 4 centigrade for 30 minutes; centrifuged samples at 14,000 g for 5 minutes and then transferred the supernatant to a clean EP tube; added 250 μl of PPS solution, shook vigorously to mix, and incubated at 4°C for 10 minutes and centrifuged samples at 14,000 g for 2 minutes; transferred supernatant to the Binding Matrix Solution in a 15 ml conical tube and shook gently for 5 minutes. Then we centrifuged samples at 14,000 g for 2 minutes and decanted the supernatant. Afterwards, we washed the binding mixture pellet with 1 ml Wash Buffer #1 and transferred the binding mixture to a SPIN Filter tube and centrifuge at 14,000g for 1 minute. We emptied the catch tube and added 500 μl of prepared Wash Buffer #2 to the SPIN Filter tube and gently resuspended the pellet. Afterwards, we centrifuged the samples at 14,000 g for two times to to extract residual ethanol. Finally, we transfer the SPIN Filter bucket to a clean 1.9 ml Catch Tube and add 100 μl TES to resuspend the genome DNA. The DNA were detected with agarose gel electrophoresis, and the top bands were isolated and puri ed with a DNA Puri cation Kit (Finegene, Shanghai, China). The DNA concentration and integrity were assessed by a NanoDrop2000 spectrophotometer (Thermo Fisher Scienti c, Waltham, USA).

library construction
For Paci c Biosciences sequencing library preparation and SMRT sequencing, DNA was fragmented by a Covaris g-TUBE device (10 kb) and was concentrate DNA with AMPure PB beads following the manufacturer' protocol (Beckman Coulter Co., USA). The DNA damage and ends were repaired in a LoBind microcentrifuge tube. Blunt ligation reaction was performed by adding 1 μL of blunt adaptor (20 μM ) and 1 μL of ligase to the 30 μL of DNA and then incubation was performed at room temperature for 15 min. SMRTbell™ templates were puri ed with AMPure PB beads and then the concentration was measured by Qubit. Sequencing was performed on a PacBio Sequel instrument by OE Biotech Co., Ltd (Shanghai, China).

Bioinformatic analysis
Metagenome assembly was performed using ye software after getting valid reads. ORF prediction of assembled scaffolds using prodigal was performed and translated into amino acid sequences. The nonredundant gene sets were built for all predicted genes using CD-HIT. The clustering parameters were 95% identity and 90% coverage. The longest gene was selected as representative sequence of each gene set. The gene set representative sequence (amino acid sequence) was annotated with NR, KEGG, COG, SWISSPROT and GO database with an e-value of 1e-5. The taxonomy of the species was obtained as a result of the corresponding taxonomy database of the NR Library.

Blast in the NCBI
We acquired the full-length 16s rRNAs from the assembled contigs. Then the 16s rRNAs were blasted in the database of the NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi? PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), then we screened similar sequences in the order of identity, and downloaded the relative sequences. Then we identi ed their bacterial species of the assembled contigs and analyzed their evolutionary relationship.

ClustalW and phylogenetic tree analysis
To examine the differences between the full-length 16s rRNAs, we compared the 16s rRNAs and visualized the differences using bioedit software (Borland Software Corporation, Scotts Valley, USA). For phylogenetic tree analysis, we blasted the full-length 16s rRNAs in the NCBI database (https://blast.ncbi.nlm.nih.gov/Blast.cgi? PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), and selected the most related fulllength 16s rRNAs by identity. With these 16s rRNAs, we analyzed the phylogenetic tree with mega7 [37].
Statistical analysis R programming language v. 3.4.3 was used for statistical analysis. Statistical signi cance between two groups was determined using an unpaired two-tailed Student's t test. Data are presented as mean ± SD (standard deviation) or mean ± SEM (standard error of the mean) as indicated in the gure legends. P values were considered statistically signi cant at P < 0.05.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.
Availability of data and materials Datasets from metagenomic sequencing are archived at the Sequence Read Archives (SRA) at NCBI.

Competing interests
The authors declare no competing nancial interests.  The whole characteristics of the two samples. a, Pie chart comprising 4 classes of human bacteria sequence data in the NCBI databases: Chromosome, Complete, Scaffold and Contig. b, Numbers of contigs with different lengthes acquired from the two samples, relative to the adult sample (up) and the baby sample (down). c, The whole characteristics of the two samples. The corresponding contig length and CDS number were depicted in the rst outer layer and the second outer layer, respectively; the GC content was depicted in the third outer layer; the contig number and the coverage were depicted in the rst inner layer and the second inner layer, respectively. d, Function analyses of all the CDSs of the two samples.

Figure 2
The different aspects of the contigs ( 0.5 Mb ) in the two samples. a, The length and coverage of the contigs in the two samples ( 0.5 Mb). b, Complete analysis of the contigs in the two samples. c, Wenn diagram shows the number of contigs (Complete ≥99% or circle). d, Pie diagram shows the number of the nine genomes whether they match public genomes. e, phylogenetic tree of the nine assembled genomes.

Figure 3
The genome analysis of the ve non-match genomes. a, The genome length and gene number of the ve non-match genomes. b, The GC content(total and gene) of the contigs in the two samples of the ve alignment of the full-length 16s rRNAs of these genomes with that of the related bacterial strains. c, Conditional pathogens Shigella exneri lived in the gut of adult and baby. Figure 5 Genome aspects of the new species of Candidatus Enterococcus shanghaius. a, Non-redundant database annotations of contig_638. b, Phylogenetic tree of the high contig_638 through full-length 16s rRNAs. c, ClustalW multiple alignment of the full-length 16s rRNAs of Candidatus Enterococcus shanghaius with that of the related bacterial strains. d, Function analyses of the CDSs of the Candidatus Enterococcus shanghaius. e, Carbohydrate-Active enzymes analysis of the Candidatus Enterococcus shanghaius. f, Circos diagram of the whole genome of the Candidatus Enterococcus shanghaius (From the inner to the outer are GC skew, GC content, non-coding RNA (rRNA red, tRNA blue, sRNA green), lagging strand COG annotation, leading strand COG annotation).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.