Complete Genome Sequence of Sphingobacterium sp. Strain CZ-2T Isolated from Tobacco Leaves Infected with Wildfire Disease

Background: Sphingobacterium is a genus of Gram-negative, non-fermentative bacilli that is rarely involved in human infections. Its members are characterized by a high content of sphingophospholipids in their cell membranes. Because of its wide ecological distribution and its ability to degrade oil, the genus has attracted considerable attention from environmental microbiologists.
Results: A novel Gram-negative bacterium, designated CZ-2T, was isolated from a sample of tobacco leaves infected with wildfire disease in Guiyang County, Chenzhou City, Hunan Province, China, and its phylogenetic position was investigated by GC content determination, PCR amplification, sequencing and phylogenetic analysis. Growth occurred on TGY medium at 30 °C and pH 7.0. The GC content of the DNA of strain CZ-2T is 40.68 mol%. Genome relatedness, rDNA phylogeny and chemotaxonomic characteristics all indicate that strain CZ-2T represents a novel species of the genus Sphingobacterium. We propose the name Sphingobacterium tobaci sp. nov., with CZ-2T as the type strain. Third-generation sequencing (TGS) and next-generation sequencing (NGS) were combined to derive a finished genome sequence for strain CZ-2T, consisting of a circular chromosome 3,925,977 bp in size. The genome of strain CZ-2T features 3,462 protein-encoding and 50 tRNA-encoding genes. Unigenes were annotated by matching against the Clusters of Orthologous Groups of proteins (COG; 2,021 genes), Gene Ontology (GO; 1,952 genes) and Kyoto Encyclopedia of Genes and Genomes (KEGG; 1,380 genes) databases. Comparison of the predicted proteome of CZ-2T with those of other Sphingobacterium strains identified 677 species-specific proteins that may contribute to the adaptation of CZ-2T to its native environment.
Conclusions: As the first report of a Sphingobacterium genome sequenced with a combination of NGS and TGS, our work will serve as a useful reference for subsequent sequencing and mapping efforts for additional strains and species within this genus.


Introduction
Members of the genus Sphingobacterium are rod-shaped, non-spore-forming, Gram-negative bacteria with a DNA GC content ranging from 35 to 44 mol% (Lee, et al., 2013). The genus was established by Yabuuchi and colleagues in 1983 (Yabuuchi, et al., 1983) and comprises bacterial species whose membranes contain high concentrations of sphingolipids.
At present, the genus Sphingobacterium encompasses 41 validly published species names.
Sphingobacterium spp. have been reported to cause infection and sepsis in humans (Hibi and Kumano, 2017). Sphingobacterium multivorum and Sphingobacterium thalpophilum are able to degrade petroleum; of the two, S. multivorum SWH-2 has the higher degradation capacity.
Pacific Biosciences and Oxford Nanopore developed the PacBio RSII and MinION platforms, respectively, for long-read sequencing (average read length: 10-15 kb) (Chaisson, et al., 2014;Schatz, et al., 2010), an approach known as third-generation sequencing (TGS). Compared with next-generation sequencing (NGS), TGS generates long reads (more than 10,000 bp, with some reads reaching 100,000 bp or more) (Lee, et al., 2016) that are free of GC bias (Chaisson, et al., 2014;Niu, et al., 2010). TGS data have been used widely in studies of genome assembly (Gordon, et al., 2016;Jain, et al., 2018;Zheng, et al., 2016) and DNA 6mA methylation (Fang, et al., 2012;Greer, et al., 2015;Wu, et al., 2016). Importantly, longer reads span more repetitive elements and thus produce more contiguous reconstructions of the genome (Roberts, et al., 2013). With respect to structural variation analysis, long reads enable improved "split-read" analyses, so that insertions, deletions, translocations and other structural changes can be recognized more readily (Chaisson, et al., 2015). Many important TGS-based results have been published in journals such as Cell, Nature and Science. However, genome assembly from TGS data alone is time consuming because of the high sequencing error rate, whereas NGS data are characterized by short reads (50-200 bp) and a low error rate (1%, mainly substitutions) (John, et al., 2009;Langmead, et al., 2009;Li and Durbin, 2009). It is therefore useful to correct TGS reads with NGS data.
CZ-2 T (T representing the type strain) is the first strain within the genus Sphingobacterium for which TGS and NGS technology were combined to produce a fully assembled and complete genome sequence; our work will serve as a high-quality reference for any future genomic studies of strains from this genus. Furthermore, comparison of the genome sequence of CZ-2 T with those of other Sphingobacterium strains allowed the identification of species-unique genes and pathways that may be important mediators of adaptation of this species to its environment.

Bacterial strain and DNA extraction
Sphingobacterium sp. strain CZ-2 was isolated from a tobacco leaf sample from Chenzhou city in China. The sample was dispersed in TGY medium (1.0% tryptone, 0.5% yeast extract and 0.1% glucose), and the culture was incubated for 2 days at 30 °C with shaking at 200 rpm before dilution plating on TGY agar plates (30 °C, 24 h) to isolate single colonies. Genomic DNA was isolated from the cell pellets with a Bacteria DNA Kit (OMEGA) according to the manufacturer's instructions, and quality control was subsequently carried out on the purified DNA samples. Genomic DNA was quantified with a TBS-380 fluorometer (Turner BioSystems Inc., Sunnyvale, CA). High-quality DNA (OD260/280 = 1.8-2.0, >6 μg) was used to construct the fragment library.

Illumina HiSeq sequencing
For Illumina paired-end sequencing, at least 3 μg of genomic DNA was used for library construction. Paired-end libraries with insert sizes of ~400 bp were prepared following Illumina's standard genomic DNA library preparation procedure.
Purified genomic DNA was sheared into smaller fragments of the desired size by Covaris shearing, and blunt ends were generated using T4 DNA polymerase. After addition of an 'A' base to the 3' end of the blunt, phosphorylated DNA fragments, adapters were ligated to the ends of the fragments. The desired fragments were purified by gel electrophoresis and then selectively enriched and amplified by PCR. The index tag was introduced into the adapter at the PCR stage, as appropriate, and library quality was checked. Finally, the qualified Illumina paired-end library was sequenced on the Illumina HiSeq platform (PE150 mode).

PacBio sequencing
For Pacific Biosciences sequencing, whole-genome shotgun libraries with 20-kb inserts were generated and sequenced on a Pacific Biosciences RS instrument using standard methods. An 8-μg aliquot of DNA was centrifuged in a Covaris g-TUBE (Covaris, MA) at 6,000 rpm for 60 seconds in an Eppendorf 5424 centrifuge (Eppendorf, NY). DNA fragments were then purified, end-repaired and ligated with SMRTbell sequencing adapters following the manufacturer's recommendations (Pacific Biosciences, CA).
The resulting sequencing libraries were purified three times using 0.45 volumes of Agencourt AMPure XP beads (Beckman Coulter Genomics, MA) following the manufacturer's recommendations.

Genome assembly
Raw sequencing data were generated with the Illumina base-calling software CASAVA v1.8.2 (http://support.illumina.com/sequencing/sequencing_software/casava.ilmn) according to its user guide. Contaminating reads, such as those containing adaptors or primers, were identified with Trimmomatic (http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic) using default parameters. The clean data obtained from these quality-control steps were used in further analyses.
The CZ-2T genome was sequenced using a combination of the PacBio RS and Illumina platforms. The Illumina data were used to evaluate the complexity of the genome and to correct the PacBio long reads. First, ABySS (http://www.bcgsc.ca/platform/bioinfo/software/abyss) was used to assemble the genome with multiple k-mer values, and the optimal assembly was retained (Jackman, et al., 2017). Second, Canu (https://github.com/marbl/canu) was used to assemble the corrected PacBio long reads (Koren, et al., 2017). Finally, GapCloser (https://sourceforge.net/projects/soapdenovo2/files/GapCloser/) was applied to fill the remaining local inner gaps and to correct single-nucleotide discrepancies in the final assembly.

GC content, PCR amplification, sequencing and phylogenetic analysis
Genomic DNA from strain CZ-2T was prepared using the TIANamp bacterial DNA isolation kit (Tiangen). The GC content of the DNA was determined according to the procedure of Zhou et al. (Zhou, et al., 2012). PCR amplification and sequence analysis of the 16S rRNA gene were performed as described previously (Zhou, et al., 2012). Phylogenetic dendrograms, which displayed substantially identical topologies, were constructed using the neighbour-joining method (Saitou and Nei, 1987), with bootstrap values calculated from 1,000 resamplings.

Average nucleotide identity
Bacterial genome sequencing is rapidly emerging as the most important source of information for microbial taxonomy. For example, determination of the whole genome sequence of a newly isolated strain allows the calculation of average nucleotide identity (ANI) scores, providing for global comparisons of the new strain with previously isolated strains whose genome sequences are deposited in databanks. These ANI scores will probably serve as the next-generation gold standard for species delineation (Kim, et al., 2014). ANI was calculated by using the Chun Lab's online Average Nucleotide Identity calculator (Yoon, et al., 2017).

Illumina paired-end sequencing and PacBio sequencing
An Illumina paired-end library (300-500 bp inserts) and a PacBio library (15-20 kb inserts) were constructed so that Illumina HiSeq sequencing could be combined with TGS. A total of 23,648,762 sequence reads were generated on the Illumina HiSeq platform (150 bp, paired-end run), and a total of 26,473 sequence reads were generated on the PacBio platform (average length 6,674 bp) (Additional file 1: Figure S1). The whole genome of the strain was assembled by bioinformatics analysis after low-quality nucleotides had been trimmed from the sequence data (Table 1).
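The read counts above can be turned into a rough estimate of sequencing depth (total sequenced bases divided by genome size). The following is a minimal sketch, not part of the original analysis; the function name is illustrative, and it assumes that the 23,648,762 Illumina reads are individual 150-bp reads rather than read pairs and that the counts refer to reads of the stated lengths.

```python
# Rough sequencing-depth estimate: depth = reads * mean read length / genome size.
GENOME_SIZE_BP = 3_925_977  # finished chromosome length reported in the text


def mean_depth(n_reads: int, mean_read_len_bp: float, genome_size_bp: int) -> float:
    """Return the average fold coverage implied by a set of reads."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Assumption: each Illumina read is counted individually (not as a pair).
illumina_depth = mean_depth(23_648_762, 150, GENOME_SIZE_BP)
# PacBio depth uses the reported average read length of 6,674 bp.
pacbio_depth = mean_depth(26_473, 6_674, GENOME_SIZE_BP)

print(f"Illumina: ~{illumina_depth:.0f}x, PacBio: ~{pacbio_depth:.0f}x")
```

Under these assumptions the Illumina data correspond to roughly 900-fold coverage and the PacBio data to roughly 45-fold coverage; the actual usable depth after trimming may be lower.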

Genome properties
The genome is 3,925,977 bp long and comprises one circular chromosome with a G+C content of 40.68% (Table 2 and Fig. 1). The genome of strain CZ-2T contains 50 tRNA genes (Additional file 2: Table S1) and several rRNA gene clusters (Additional file 3: Table S2), including two copies of the 16S rRNA gene, three copies of the 23S rRNA gene and four copies of the 5S rRNA gene. An estimated 88.21% of the genome consists of coding sequences (CDSs), which are predicted to encode 3,462 putative proteins. At approximately 3.93 Mb, the genome of CZ-2T is much smaller than the estimated genome sizes of the sequenced S. spiritivorum strains (6.47 Mb in S. spiritivorum_HMA12T, GenBank ID: NZ_BEYR01000001; 5.33 Mb in S. spiritivorum_ML3WT, GenBank ID: NZ_CP009278; 6.36 Mb in S. spiritivorum_B29T, GenBank ID: NZ_CP019158; 6.23 Mb in S. spiritivorum_21, GenBank ID: NC_015277). One possible explanation is that genomes assembled from short-read NGS data are often of lower quality in repetitive regions, which can inflate the apparent length and copy number of repeated sequences. Searches against the NR, Swiss-Prot, KEGG, COG and GO databases annotated 3,310, 1,952, 1,380, 2,021 and 1,952 unigenes, respectively, in strain CZ-2T (Additional file 4: Table S3; Additional file 5: Figure S2).
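The G+C content reported here is simply the share of G and C bases among all unambiguous bases of the assembled sequence. The sketch below illustrates the calculation; the function name and the toy sequence are hypothetical and are not taken from the CZ-2T assembly.

```python
def gc_content(seq: str) -> float:
    """Return G+C as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


# Hypothetical 10-bp toy sequence, used only to show the calculation.
toy = "ATGCGCATTA"
print(f"GC content: {gc_content(toy):.2f}%")
```

Applied to the full chromosome sequence, the same calculation would be expected to return a value close to the 40.68% stated above.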
A genome map is a circular representation of one or several genomes that provides a rapid and simple way to identify global patterns, large features or regions of interest based on various metrics. Visualizing all features at the genomic level can aid in understanding the organization of a genome or the similarities and differences across multiple genomes.
The single genome map presents a pseudogenome comprising all assembled contigs (Fig. 1), including any plasmids found in the bacterium. The coloured bands in Section 1 represent contigs.
Section 2 represents the annotated reference genes (specifically, CDSs) found on the forward strand, and Section 3 represents the same information for the reverse strand.
Different colours indicate COG functions. Section 4 displays only the rRNA and tRNA genes found in this genome. Section 5 displays the GC skew metric, which can be used as an indicator for identifying replication loci and leading/lagging strands. The genomic mean GC-skew value is used as the baseline, relative to which higher-than-average values are displayed in red and lower-than-average values in blue. Finally, Section 6 displays the GC ratio metric, which can be used to profile the genome, identify isochores or observe co-variations with other data. The GC ratio also uses the genomic mean as its baseline, with higher-than-average values in pink and lower-than-average values in sky blue.
Fig. 1 Circular genome map of strain CZ-2T. Rings from the outermost to the centre: (1) scale marks; (2) protein-coding genes on the forward strand; (3) protein-coding genes on the reverse strand (colour coded by functional category); (4) rRNA (red) and tRNA genes (purple); (5) GC content; and (6) GC skew. Protein-coding genes are colour coded according to their COG categories.
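The GC skew track described above is conventionally computed as (G - C) / (G + C) in windows along the sequence and then compared with the genome-wide mean. The sketch below illustrates this under stated assumptions: the window and step sizes, function names and toy sequence are illustrative and are not the settings used to draw Fig. 1.

```python
def gc_skew(seq, window, step):
    """Return (G - C) / (G + C) for each window along the sequence."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C are given a skew of zero.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def relative_to_mean(values):
    """Express each window value relative to the mean of all windows."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical toy sequence; real maps use windows of hundreds to thousands of bp.
toy = "GGGGCCCCGGCC"
windows = gc_skew(toy, window=4, step=4)
for index, delta in enumerate(relative_to_mean(windows)):
    label = "above mean" if delta > 0 else "below mean" if delta < 0 else "at mean"
    print(index, windows[index], label)
```

Windows above the mean would correspond to the red segments of the skew ring and windows below the mean to the blue segments.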

Evolution of CZ-2 T
The 16S rRNA gene sequence of strain CZ-2T is 1,432 bp in length. BLAST searches of the GenBank database and the EzTaxon server (Kim, et al., 2012) (http://www.ezbiocloud.net/eztaxon) indicated that strain CZ-2T belongs to the genus Sphingobacterium. Phylogenetic analysis confirmed that strain CZ-2T forms a coherent cluster with members of the genus Sphingobacterium, and an intra-genus clade with S. lactis DSM 22361T, S. daejeonense TR6-04T and S. kyonggiense KEMC 2241-005T (Fig. 2). The average nucleotide identity (ANI) of the genome sequence of strain CZ-2T against the four other Sphingobacterium strains for which genome sequences are publicly available ranged from 68.87% (with strain S. spiritivorum_21) to 70.76% (with strain S. spiritivorum_ML3WT).
These ANI values are considerably lower than the 95% to 96% threshold used to assign isolates to the same bacterial species (Goris, et al., 2007;Richter and Rossello-Mora, 2009). Thus, rDNA phylogeny, genome relatedness and chemotaxonomic characteristics all indicate that strain CZ-2T represents a novel species within the genus Sphingobacterium. We propose the name S. tobaci sp. nov. (isolated from tobacco), with CZ-2T (isolated in Chenzhou city) as the type strain.
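The species-delineation rule applied here amounts to comparing each pairwise ANI value with the 95% to 96% threshold. A minimal sketch follows; the function name is illustrative, the lower bound of the threshold range (95%) is used as a conservative cut-off, and only the two ANI values stated in the text are included.

```python
SPECIES_ANI_THRESHOLD = 95.0  # lower bound of the 95-96% range cited in the text


def same_species(ani_percent: float, threshold: float = SPECIES_ANI_THRESHOLD) -> bool:
    """Return True if an ANI value meets the species-level threshold."""
    return ani_percent >= threshold


# Lowest and highest ANI values reported for CZ-2T against the reference strains.
reported_ani = {
    "S. spiritivorum_21": 68.87,
    "S. spiritivorum_ML3W": 70.76,
}

for strain, ani in reported_ani.items():
    verdict = "same species" if same_species(ani) else "different species"
    print(f"CZ-2T vs {strain}: ANI {ani:.2f}% -> {verdict}")
```

Both reported values fall far below the cut-off, which is consistent with the proposal of a novel species.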

Annotation
The Clusters of Orthologous Groups (COG) database is an effective tool for the annotation of functional proteins (Galperin, et al., 2015). In this investigation, COG categories were assigned using the HMM profiles of Gammaproteobacteria in the EggNOG database (Huerta-Cepas, et al., 2016;Sean, et al., 2012) with an E-value cut-off of 1e-5.
To determine the functions of the proteins encoded by the 3,462 genes, the COG results were examined, revealing 2,021 annotated unigenes in strain CZ-2T. COG assigned the functional proteins to 21 classifications, which are summarized in Table S4 (Additional file 6). Among these annotated unigenes, 45.98% were related to metabolism, 18.32% to cellular processes and signalling and 19.27% to information storage and processing. The remaining 16.43% of the unigenes were poorly characterized by COG category because their features and functions remain obscure. The classification of COG protein-encoding genes found in the genome of strain CZ-2T is plotted in Fig. 3.

Fig. 3 COG functional categories
KEGG pathway annotation helps to assign biological functions to genes by interpreting the roles of enzymes and other proteins in biochemical processes (Kanehisa and Goto, 2000). Metabolic pathway annotation was carried out using the predicted protein sequences and the KEGG Automatic Annotation Server (KAAS) (Yuki, et al., 2007). In the present study, a total of 1,380 unigene sequences of strain CZ-2T were assigned to 173 KEGG pathways (Additional file 7: Table S5). The largest groups were assigned to "metabolic pathways" (455 unigenes), "biosynthesis of secondary metabolites" (211 unigenes) and "microbial metabolism in diverse environments" (111 unigenes), which formed the dominant categories. The top 20 representative pathways from the KEGG pathway annotation are depicted in Fig. 4. KEGG pathway is a collection of manually drawn pathway maps representing current knowledge of the molecular interaction, reaction and relation networks for metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases and drug development.
In the present study, 1,952 unigenes were annotated using the GO database (Table 3; Additional file 8: Figure S3). The GO-annotated unigenes were assigned to the biological process (573 unigenes), cellular component (259 unigenes) and molecular function (344 unigenes) categories. Table 3 shows the distribution of unigenes across GO terms. In the biological process category, the predominant GO terms were "metabolic process" (181 unigenes), "cellular process" (144 unigenes) and "single-organism process" (99 unigenes). In the cellular component category, "cell" (62 unigenes), "cell part" (62 unigenes) and "membrane" (53 unigenes) were the predominant GO terms. In the molecular function category, the unigenes were assigned to nine different functions, the predominant ones being "binding" (130 unigenes) and "catalytic activity" (121 unigenes).

Evolution and comparative genomics
We compared the predicted proteome of strain CZ-2T with those of four other Sphingobacterium strains for which genome sequence data are available at NCBI (S. spiritivorum_ML3WT, S. spiritivorum_B29T, S. spiritivorum_21 and S. spiritivorum_HMA12T) (Fig. 5). The sequenced Sphingobacterium strains share 1,706 orthologous protein groups. These common orthologous protein groups encompass the enzymes of central carbon metabolism, including the pentose phosphate pathway and the tricarboxylic acid (TCA) cycle, as well as amino acid biosynthesis and the assembly of purine and pyrimidine nucleotides. Genes for this set of predicted pathways are well conserved in the genomes of the sequenced Sphingobacterium strains. Orthologous protein groups shared by CZ-2T with only one of the other Sphingobacterium strains were also identified and numbered 97 (S. spiritivorum_ML3WT), 626 (S. spiritivorum_B29T), 95 (S. spiritivorum_21) and 62 (S. spiritivorum_HMA12T). Importantly, we detected 677 species-unique predicted proteins belonging to 677 orthologous groups in strain CZ-2T. Many of these CZ-2T-specific genes are involved in transport systems, DNA repair and the biosynthesis of small molecules. These and other species-unique proteins may facilitate the adaptation of strain CZ-2T to tobacco leaf environments. Sphingobacterium strains are presumably well adapted to living in diverse environments, and transport systems are key components of bacterial tolerance to extreme conditions (Konings, et al., 2002). For example, most of the known transport systems of extremophiles are sugar uptake systems that belong to the ABC family of transporters (Konings, et al., 2002).
To explore how Sphingobacterium strains adapt to various environments, we compared the transport systems of strain CZ-2T (isolated in 2018 from tobacco leaves) with those of four other Sphingobacterium strains (strain HMA12T was isolated in 2013 from flatland farm soil; strain ML3WT was isolated in 2012 from a Myotis lucifugus wing; strain B29T was isolated in 2015 from rhizospheres). KEGG mapping showed that CZ-2T has 26 genes coding for transporters. These are mainly related to phosphate-specific transport (PstS is a phosphate-binding protein, PstA and PstC are considered to form the transmembrane channel of the Pst system, and the ATP-binding protein PstB probably interacts with PstA-PstC at the cytoplasmic side of the membrane), lipopolysaccharide transport (LptF and LptG, together with LptB, function to extract lipopolysaccharide from the inner membrane en route to the outer membrane), lipoprotein release (the lipoprotein-releasing system permease proteins LolC and LolE), iron complex transport (the periplasmic iron-binding protein FhuD and the cytoplasmic membrane proteins FhuB and FhuC), heme trafficking (the heme trafficking system membrane proteins CcmB and CcmC), molybdate transport (the molybdenum transport ATP-binding protein ModF) and the ABCB subfamily (the ATP-binding lipopolysaccharide transport protein MsbA). Importantly, the PstS, PstC, PstA, LolC, LolE, CcmC, CcmB, LptF, LptB, FtsX, FhuD, FhuB, FhuC and MsbA genes encode transporters that are present in all five Sphingobacterium strains (Table 4). We consider these transporters essential for the survival of Sphingobacterium strains. The MalK, MsmX, MsmK, SmoK, AglK, MsiK, RbsB, RbsA, MglA, IatA, BtuF and ZnuC genes encode transporters that are present in HMA12T and B29T but absent in CZ-2T, 21 and ML3WT. We hypothesize that these transporters may contribute to the adaptation of strains HMA12T and B29T to their soil environments.
Alternatively, these transporters may help strains HMA12 T and B29 T to import a wider range of nutrients from soil organic matter.
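The presence/absence comparison described above can be expressed with simple set operations. The sketch below is illustrative only: it encodes just the transporter genes named in the text (not the full content of Table 4), and the variable names are hypothetical.

```python
# Genes stated in the text to be present in all five strains.
CORE_GENES = {
    "PstS", "PstC", "PstA", "LolC", "LolE", "CcmC", "CcmB",
    "LptF", "LptB", "FtsX", "FhuD", "FhuB", "FhuC", "MsbA",
}
# Genes stated to be present in HMA12 and B29 but absent from the other strains.
SOIL_GENES = {
    "MalK", "MsmX", "MsmK", "SmoK", "AglK", "MsiK",
    "RbsB", "RbsA", "MglA", "IatA", "BtuF", "ZnuC",
}

# Per-strain transporter sets rebuilt from the statements above (a simplification).
strains = {
    "CZ-2": set(CORE_GENES),
    "21": set(CORE_GENES),
    "ML3W": set(CORE_GENES),
    "HMA12": CORE_GENES | SOIL_GENES,
    "B29": CORE_GENES | SOIL_GENES,
}

# Core set: genes found in every strain.
core = set.intersection(*strains.values())

# Group-specific set: shared by the soil strains and absent from all others.
soil_group = ["HMA12", "B29"]
others = [name for name in strains if name not in soil_group]
group_specific = (
    set.intersection(*(strains[name] for name in soil_group))
    - set.union(*(strains[name] for name in others))
)

print(f"core transporters: {len(core)}")
print(f"soil-group-specific transporters: {sorted(group_specific)}")
```

With a full presence/absence table, the same two operations would recover the candidate essential transporters and the candidate soil-adaptation transporters discussed above.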
Betaine is widely distributed in nature and has been found in many microorganisms, including bacteria, archaea and fungi. Its main function is to protect microorganisms from drought, osmotic stress and temperature stress. Betaine also plays an important role in methyl group metabolism. Strains 21, ML3WT, HMA12T and B29T feature several genes whose protein products are predicted to play a role in transport of the osmoprotectants glycine and betaine and may thus contribute to resistance to environmental stresses.

Conclusions
ANI, rDNA phylogeny and genome sequencing were used to identify bacterial strain CZ-2T, isolated from tobacco leaves infected with wildfire disease in Guiyang County, Hunan Province, China, as a novel species within the genus Sphingobacterium. We therefore propose CZ-2T as the type strain of Sphingobacterium tobaci sp. nov. Whole-genome sequencing (NGS and TGS) was used to derive a high-quality, finished genome assembly for this Gram-negative bacterial strain. Genome-wide comparisons with other sequenced Sphingobacterium spp. yielded an extensive list of highly conserved core genes that may contribute to the adaptation of Sphingobacterium to its native environment, as well as a list of species-unique genes, some of which are proposed to contribute to adaptation to the soil environment.