Sequencing of 185 Streptococcus thermophilus and identi � cation of fermentation biomarkers

Wenjun Liu Inner Mongolia Agricultural University Linjie Wu School of Mathematical Sciences and Center for Statistical Science, Peking University Jie Zhao Inner Mongolia Agricultural University Weicheng Li Inner Mongolia Agricultural University Yu Wang Inner Mongolia Agricultural University Huijuan Zheng Inner Mongolia Agricultural University Tiansong Sun Inner Mongolia Agricultural University Heping Zhang Inner Mongolia Agricultural University Ruibin Xi School of Mathematical Sciences and Center for Statistical Science, Peking University Zhihong sun (  sunzhihong78@163.com ) Inner Mongolia Agricultural University


Introduction
Streptococcus (S.) thermophilus is a predominant lactic acid bacterium (LAB) with rapid acidi cation capability and is a major dairy starter used in the milk fermentation and cheese production [1]. The starter strains of fermentation have vast economic value and produce a large amount fermented dairy products every year [2]. Traditional nomads have been making naturally fermented milk for thousands of years and these naturally fermented dairy products contain rich LAB resources [3].
However, previous researches mostly concentrate on genomics of industry strains of S. thermophilus. For example, one study showed that most of 47 industry S. thermophilus have similar genetic distance, indicated genome stability of industry strains [4]. The genetic architecture underlying natural S. thermophilus strains is largely unexplored.
In this study, we isolated and sequenced 185 S. thermophilus strains from natural dairy products in traditional pasture of different regions in China and Mongolia, to uncover the genetic evolution of wild strains and to explore the relationship between genotype and fermentation characteristics. Phylogenetic tree of S. thermophilus revealed clades with high acid production potential or signi cant genome decay. We also found 13 S. thermophilus which have a chloramphenicol-resistant gene CatB8 isolated from areas with high chloramphenicol emissions. Besides, we de ned a growth score and showed that the growth score could accurately predict acidi cation capability of S. thermophilus. With the growth score, we identi ed a missense mutation G1118698T by genome wide association study (GWAS) located at the gene AcnA that were signi cantly associated with the acidi cation. Our ndings provided novel insights in S. thermophilus genetic traits and identi ed robust biomarkers screening of culture starter with high acidi cation capability.

Phenotypic analysis and phylogenetic analysis of S. thermophilus
From natural fermented dairy products in China and Mongolia, we isolated 185 S. thermophilus and found 61 strains having potential high acid production capability (H-Acid) using a preliminary fermentation experiment. Among the 185 S. thermophilus, 65 were from China and 120 were from Mongolia (Fig. 1A). Whole genome sequencing (WGS) was performed for all 185 strains (mean coverage 374X, Table S1). After short read alignment to the reference genome CNRZ1066 of S. thermophilus, we identi ed 58,734 single nucleic acid polymorphisms (SNPs), 2,246 insertions and deletions (Indels , Table   S2) and 450 copy number variations (CNVs, Table S3). As expected, ribosomal RNA (rRNA) and transfer RNA (tRNA) had the lowest mutation rates, followed by protein coding regions, pseudogenes and intergenic regions (Fig. 1B). De novo assembly showed that the genome sizes of S. thermophilus ranged from 1.72 to 2.60 Mb and the numbers of genes ranged from 1,704 to 2,158 ( Fig. 1C-D).
Pan-genome analysis of our 185 isolates as well as 32 S. thermophilus available from the NCBI [5] totally identi ed 7,629 genes including 827 core genes and 315 soft core genes (Fig. 1E, Table S4-S6). The core and soft-core genes were mainly enriched in metabolism pathways like metabolic and biosynthesis of amino acids, which are essential for bacterial growth (Fig. S1A). Glycolysis/gluconeogenesis pathway and starch and sucrose metabolism pathways were enriched in shell genes (shared by 15%-95% samples). These two pathways are involved in carbohydrate utilization. Diversity in these pathways indicates that S. thermophilus strains might have undergone adaptive evolution in carbohydrate utilization pathways. Besides, the cloud genes (shared by less than 15% samples) were mainly enriched in quorum sensing and beta-lactam resistance pathways. 53% of cloud genes were homologous to genes in other Streptococcus strains and 8% to genes in Lactobacillales and Lactococcus, two bacterial species often found in natural fermented dairy products [6] (Fig. S1B). This indicated that a portion of cloud genes came from horizontal gene transfer (HGT).
The phylogenetic tree of the 217 S. thermophilus revealed four large clades and was consistent with multilocus sequence typing (MLST) (Fig. 2) The within-clade average nucleotide identity (ANI) was signi cantly larger than the between-clade ANI and the principle component analysis also showed that the four clades are well-separated ( Fig. S2A-B). Clade A, including strains from China, Mongolia and NCBI database, was closer to the roots of phylogenetic trees. In the other three clades, strains isolated from China and Mongolia also showed clear aggregation. Clade D was basically composed of strains isolated from yoghurts of Mongolia, the strains in Clade C were mainly from yoghurts of Mongolia and Xiniiang China. The clade B was mostly isolated from goat yoghurts in China and yoghurts in Mongolia. Genetic distance between isolates were signi cantly correlated with the geographical location of sampling sites (Fig. S2C). The dairy product type was also signi cantly correlated with the phylogenetic clades (Fisher's exact test, p-value = 5⋅10 − 3 , Table S7).
Interestingly, the H-Acid isolates and NCBI stains were signi cantly enriched in clade A and B (Fisher's exact test, p-value = 5.1⋅10 − 7 , Fig. 2). Note that strains from NCBI were extensively used in industrial production and generally had high dairy fermentation capability. Furthermore, the cell-wall protease gene PrtS, a gene known to be associated with rapid growth and acidi cation rates at bacteria in milk [7], was signi cantly enriched in clade A (Fisher's exact test, p-value = 1.3⋅10 − 28 , Table S8). These data implied that strains in clade A and B might have better fermentation potential than the strains in clade C and D.
Genome decay and antibiotic resistance of S. thermophilus S. thermophilus in clade D had signi cantly fewer number of genes (Fig. 3A), smaller genome sizes and more copy number losses than other clades, indicating that clade D might have undergone considerable genome decay (Fig. 3B-C and Fig. S3A). We found 131 genes that were prevalent in clade A-C (frequency > 0.5) but were signi cantly depleted in clade D (Fisher's test, Benjamini-Hochberg adjusted p-value < 0.05, Table S9). These genes were signi cantly enriched in the pathways including quorum sensing, betalactam resistance, and ABC transporters (Table S10). Further, clade D had signi cantly less quorum sensing genes compared with clade A and B (Fig. 3D). Many of the depleted quorum sensing genes were Blp bacteriocins related genes. By comparing with the Blp protein family in Streptococcus pneumoniae (Table S11), we found that blpB, blpM, blpH and blpR genes were signi cantly lost in strains from clade D comparing with other clades (Fig. 2, Fisher's test p-value < 0.001). BlpB protein, a transport accessory protein, is essential for the secretion of antimicrobial compounds. BlpM is a bacteriocin-encoding gene which directly in uence the production of bacteriocins. Both BlpH and BlpR are members of twocomponent regulatory system, which allow bacteria to sense and respond to changes in different environment conditions. The knockout of blpB, blpH and blpR genes in S. thermophilus reduced production of bacteriocins compared with the wild type [8]. These suggest that strains in clade D may have lower production capacity of bacteriocins than strains in other clades. We found that 104 of the 131 depleted genes in D had homologs in 10 other streptococci species and these genes were also enriched in quorum sensing pathway, especially the bacteriocins cluster, followed by ABC transporters (Table S12). This phenomenon was consistent with previous research, which found a striking level of genome decay in S. thermophilus compared with other streptococci [9]. Besides, 76 in 131 genes were identi ed in Streptococcus salivarius subspecies salivarius, implying that instead of being acquired by strains in clade A-C, the depleted genes in clade D were probably lost in strains from clade D in their adaptation to the milk niche.
Due to the widespread misuse of antibiotics, the problems caused by bacterial resistance have received wide attention. By comparing pan-genome with Comprehensive Antibiotic Resistance Database [10], we found 77 genes related with resistance of 31 different antibiotics (Table S13). Most antibiotic-related genes were associated with e ux (25 of 77, 32.47%) and target alteration (35 of 77, 45.45%). 16 strains in clade C had a gene homologous to the glycopeptides antibiotics resistance protein ARO3002945 (VanH) [11]. These 16 strains with VanH were mostly isolated from Xinjiang Autonomous Region in China. Three antibiotics resistance-related genes, including two antibiotic e ux genes (ARO3000614 and ARO3004054) and one antibiotic target alteration gene ARO3004253 (VanU) [12], are more likely being lost in clade D instead of being acquired by A-C. Besides, we also found that 13 strains in clade A had a gene homologous to the chloramphenicol resistance protein ARO3002680 (CatB8, Fig. 2, Table S14). Most strains with CatB8 were isolated from Hongyuan prairie in Sichuan and Gannan prairie in Gansu, two high chloramphenicol emission provinces of China [13]. We identi ed nearby transposon sequences around CatB8 (~ 1 kb and ~ 5 kb) for 12 out of 13 strains with the CatB8 gene (Table S15), indicating that the CatB8 gene was probably acquired by lateral gene transfer. CatB8, chloramphenicol acetyltransferase, inactivates chloramphenicol by acetylation [14]. The acquisition of CatB8 might confer S. thermophilus resistance to chloramphenicol. In fact, we applied droplet digital PCR (ddPCR) for strains with CatB8 and found a signi cantly higher expression of CatB8 in M17 cultures with 8 µg/ml chloramphenicol than cultures without chloramphenicol (Wilcoxon's test, p-value = 0.0025, Fig. S3B, Table S16), indicating that the bacteria responded to the exposure of chloramphenicol by elevating the expression of CatB8. These data suggested that the misuse of chloramphenicol might be closely related with the antibiotic resistant S. thermophilus and attention should be paid in the screening of potential starter to avoid the spread of resistance genes.
Growth score and acidi cation of S. thermophilus Acidi cation is the most important characteristic of S. thermophilus as a starter in fermentation of dairy products. Rapid acidi cation can shorten fermentation time of yoghurt production and reduce the production costs. Screening of S. thermophilus strains with rapid acidi cation is one of most concerned problems in fermentation dairy enterprises. However, currently biomarkers for rapid acidi cation is still lacking. Theoretically, acidi cation is closely related with the growth rate of S. thermophilus. As most bacteria, S. thermophilus has a single circular genome. During replication, DNA sequences passed the replication fork should have two copies and those to be replicated should have only single copy. Thus, because millions of cells at different replication stages were used in WGS of S. thermophilus, a genomic region's read depth should be negatively correlated with its distance to the replication origin and the strength of this correlation should re ect the growth rate of S. thermophilus. In our WGS data, the adjusted read depths (Methods) were indeed negatively correlated with the distance to the replication origin for all isolates. We de ned a growth score as the negative value of Spearman's correlation between them ( Fig. 4A-B, Fig S4A, Table S17). Clade A and B strains had signi cantly larger growth score than clade C and D (Fig. S4B). The growth score was signi cantly larger in H-Acid strains (Fig. 4C) and in the strains with the PrtS gene (Fig. 4D). These data implied that the growth score might provide an accurate marker for the acidi cation capability of S. thermophilus.
Using the growth score, we performed Genome-Wide Association Study (GWAS) and stability selection to screen for genomic variations that might be related with the acidi cation capability of S. thermophilus (Methods). We found that 7 SNPs were signi cantly correlated with growth score by GWAS (Wilcoxon test, p-value < 10 − 5 or Bonferroni adjusted p-value < 0.05, Table S18). Among the 7 SNPs, the missense SNP A764991G located at the gene AsnC, which promotes the growth of the S. thermophilus in milk by regulating aspartic acid metabolism [15,16], had the highest selection probability and minimum p-value (1.4⋅10 − 8 ) in GWAS analysis. The missense SNP G1118698T located at the gene AcnA, which encodes aconitate hydratase A, also had high selection probability and was very signi cant in GWAS analysis (pvalue = 4.5⋅10 − 6 , Fig. S4C-D). AcnA involves in succinate and citrate production, contribute to acidi cation. 27 S. thermophilus with this mutation were all from clade B and D and were all in the non-H-Acid group (Fig. S4E).
To con rm that the proposed growth score and the SNP G1118698T were associated with the acidi cation capability of S. thermophilus, we randomly selected 85 strains and performed fermentation experiments (Table S19). We evaluated the acidi cation capability of S. thermophilus by the acidity and the acid production speed (Methods). The acidity and acid production speed were signi cantly higher in strains with the PrtS gene, and higher in clade A and B (Fig. S4F-G). The growth score was signi cantly positively correlated with acid production speed (Pearson's correlation 0.4, p-value = 1.2⋅10 − 4 , Fig. 4E) and acidity (Pearson's correlation 0.35, p-value = 1.2⋅10 − 3 , Fig. 4F). Similarly, the SNP G1118698T was signi cantly associated with acidity and the acidi cation speed (Fig. 4G). These data suggested that the proposed growth score and the SNP G1118698T could serve as reliable biomarkers for screening S. thermophilus isolates with high acidi cation speed.

Discussion
We reported so far the largest WGS data of S. thermophilus isolated from natural fermentation dairy products. This large amount of data allowed us to systematically investigate the genomics landscape of S. thermophilus. We found that S. thermophilus had four large clades. Strains in clade A seemed to have high industry application potential, while strains in clade D might have undergone considerable genome decay through gene loss. We also identi ed novel biomarkers for the acidi cation capability of S. thermophilus. The data and novel discoveries in this paper provided valuable resources for understanding the evolution and genomics of S. thermophilus and for the industry application of S. thermophilus.
Compared with many other streptococci species, S. thermophilus lives in a rather stable environment. Previous researches [17] discovered that many virulence-related genes were lost or became pseudogenes in S. thermophilus. Here, we revealed that in clade D strains of S. thermophilus had signi cantly smaller genomes than strains in other clades, indicating that the genome decay might be an ongoing process of S. thermophilus. This large genome decay was possibly due to the adaptation of S. thermophilus to their stable niche of milk. In fact, mathematical simulation [18] showed that stable environments often lead to smaller genomes than environments with greater variability. Many quorum-sensing genes, especially blp genes, were lost in clade D strains. By analyzing genomes of three S. thermophilus strains (LMG18311, CNRZ1066, and LMD-9), previous research [8] found that, although identi ed in all three strains, the blp gene clusters were only full functional in LMD-9. It was thus plausible that these blp genes conferred little or no survival advantages to S. thermophilus, and then the genes in this pathway gradually became inactive and eventually lost clade D S. thermophilus strains.
The traditional method for evaluating the acidi cation capability of S. thermophilus strains is very laborintensive, time-consuming and costly. With the advancement of sequencing technologies, WGS becomes very e cient and cost-effective. The growth score de ned in this paper can be calculated only using WGS data and thus provided a very convenient and cost-effective surrogate to the traditional evaluation method. With this score, one can easily screen hundreds of S. thermophilus strains. In addition, since fast growth is often a desirable property for industrial bacteria, the growth score de ned here might also serve as a robust criterion for evaluating other industrial bacteria and thus has a large application potential in industry. One disadvantage of this growth score is that it currently can only be used for monoculture bacteria. In industrial applications, multiple species of bacteria are often simultaneously used. WGS of the mixture of the different bacteria would cause read mapping ambiguity to different reference genomes and thus the growth score cannot be directly applied. However, we could only consider the genomic sequences that are unique to each species and generalize the growth score using these unique sequences for the mixed sequencing data.

Variant calling, assemble and annotation
Genomic DNA was sequenced using an Illumina HiSeq 4000 platform (Illumina, San Diego, CA) generating 150-bp paired-end reads with an average insert size of 350 bps. All 185 S. thermophilus sequencing data were mapped to reference genome CNRZ1066 by BWA-mem [19] with default parameters. SNPs and Indels were called by GATK Uni edgenotyper [20] and annotated by SnpEff [21].
SNPs having Indels within its 10 bp neighborhood were ltered. CNVs were called by CNV-BAC [22]. We performed de novo assembly using SOAPdenovo2 [23] (k-mer = 71). The contigs were than annotated by Prokka [24]. Roary [25] was used for the pan-genome analysis. Core genes were de ned as genes shared by all strains, soft core genes shared by at least 95% strains, shell genes shared by 15%-95% strains and cloud genes shared by less than15% strains. The origin of gene sequences in pan-genome were identi ed by comparing with nr database using blastp [26]. We used the species with highest bitscore and longest alignment length, identity > 40% and e-value < 10 − 6 as the nal origin for each gene.

Phylogenetic Analyses
We used the Streptococcus salivarius CP013216 as an outgroup strain in the phylogenetic analysis. We aligned the outgroup strain genome, the 32 S. thermophilus genomes in the NCBI database, as well as the 185 assembled S. thermophilus sequences to the reference genome CNRZ1066 using the algorithm MumMer[27]. Neighbor-Joining tree was rst generated using MEGA7[28] with default parameters. Then, we used ClonalFrameML[29] with NJ tree and alignment sequences to reconstruct the tree to remove in uence of recombination. Among the genes that were prevalent in clades A-C (frequency > 0.5 in at least one of clades A-C) but less prevalent in clade D (frequency < 0.5 in clade D), we used Fisher's exact test to identify genes signi cantly depleted genes in clade D. The genes with Benjamini-Hochberg adjusted p-value < 0.05 and odds ratio > 1.5 were selected. This gave use 158 genes.

Proteolysis and antibiotic resistance genes
We compared annotated genes to the reference sequence of proteolysis genes from NCBI database using blastp [26]. We kept the alignments with e values less than 10 -5 and bit scores larger than 30. Antibiotic resistance genes were identi ed by comparing annotated genes with the sequences in the Comprehensive Antibiotic Resistance database [10].

Calculation of growth score and GWAS analysis
We rst normalized the read depth by considering local GC content and the mappability of short reads by BIC-seq2 [30]. The adjusted read depth was calculated in 1000 bp bins as the ratio between the observed read count in the bin and the expected read count given by BIC-seq2. The replication origin of the reference CNRZ1066 was obtained from the DoirC database [31]. For each strain, we calculated the Spearman correlation between the bin's adjusted read depth and its distance to the replication origin. For the GWAS analysis, we rst performed a principle component analysis (PCA) based on SNPs and Indels with allele frequencies within (0.05, 0.95) and genes whose occurrence frequencies were in (0.05, 0.95). We then performed a linear regression using the growth score as the response variable and the rst two PCA components as the covariates and calculated the residuals of the linear regression for each strain. This step was to remove potential confounding factors (such as hidden population structure) that might in uence the growth score. Finally, we performed Wilcoxon's rank test to identify nonsynonymous SNPs and Indels that were signi cantly correlated with the growth score residuals. For the stability selection, we rst ltered the SNPs by controlling the false discovery rate less than 0.05. This gave us 690 SNPs. We then performed stability selection[32] using the lasso regression.

Fermentation experiment
In the preliminary acidi cation experiments, the S. thermophilus were inoculated into reconstituted skimmed milk. After 12h fermentation, titratable acidity was measured. Strains with curd time less than 12h and titratable acidity above 55 °T were de ned as high acid production capability (H-Acid). The rest strains were de ned as non-H-Acid group. Thus, we distinguish the 185 S. thermophilus strains into two groups preliminary. To test the acidi cation capability of S. thermophiles, strains from frozen stock were reactivated at 37℃ in M17 Broth (Oxoid) and subcultured twice at 24 h before use. Milk was prepared by adding 6% sucrose to 11.5 % reconstituted skimmed milk, which was then sterilized at 95 °C for 10 min and cooled to 42 °C before inoculation (about 6 log10 cfus mL -1 for each strain). Fermentation allowed to proceed at 42 °C until fermentation completed. The fermentation experiment was performed in triplicate. PH and titratable acidity (TA, °T) were measured in triplicate to evaluate fermentation progress. The pH was evaluated by pH meter (Mettler Toledo, Switzerland). Titratable acidity was measured using the method described in National Standards of the People's Republic of China. Each sample (5.0 g) was mixed with 4.5 ml of distilled water and titrated with 0.1N NaOH in the presence of 0.5% phenolphthalein indicator to an end point of faint pink color.

Minimum Inhibitory Concentration of Chloramphenicol
We selected and reactivated twelve S. thermophilus isolates at 37℃ in M17 Broth (Oxoid) before use. A wide range of chloramphenicol concentration (spanning across a wide concentration range from 0.125 µg/mL to 64 µg/mL achieved by ten-fold dilution) were prepared before use. The minimum inhibitory concentration (MIC) was determined according to ISO Standard 10932:2010. Brie y, bacterial suspensions were diluted by 1000-fold (~3×10 5 cfu/mL) and tested against each chloramphenicol concentration. The MIC was recorded after incubating the bacterial cells for 48 h at 37 °C in strictly anaerobic conditions. Chloramphenicol resistance gene expression checked by droplet digital PCR The chloramphenicol resistance gene was quanti ed using QX100 droplet digital PCR (ddPCR, Bio-Rad), with the gene speci c primer (Strep-F: 5'-AATGTTTAGCAATGACGGAAGCC-3', Strep-R: 5'-TTCACCAATGTAAATCCCACCAC-3'). Quanti cation was performed using ddPCR as follows: initially, a nal volume of 20 μL reaction mixture containing 2 μL cDNA, 10uL ddPCR Supermix for EvaGreen (Bio-Rad), 0.2 μL forward primer (20mM), 0.2 μL reverse primer (20mM) and 7.6 μL ddH2O were per-mixed; Each 20 μL reaction with 70 μL of droplet generation oil (Bio-Rad) was used to generate droplets; Droplets were generated by a droplet generator (Bio-Rad). The generated droplets with foil seal were then placed on a conventional PCR Thermocycler. After PCR, the PCR plate was loaded on the droplet reader (Bio-Rad), which automatically reads the droplets from each well of the plate. Analysis of the ddPCR data was performed with QuantaSoft analysis software (Bio-Rad) that accompanied the droplet reader. territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. This map has been provided by the authors.

Figure 2
The phylogenetic tree as well as various genotypes and phenotypes of the S. thermophilus. The sampling locations, the dairy products from which the S. thermophilus was isolated and the pollution level of chloramphenicol are also shown.