Variant calling, assemble and annotation
Genomic DNA was sequenced using an Illumina HiSeq 4000 platform (Illumina, San Diego, CA) generating 150-bp paired-end reads with an average insert size of 350 bps. All 185 S. thermophilus sequencing data were mapped to reference genome CNRZ1066 by BWA-mem[19] with default parameters. SNPs and Indels were called by GATK Unifiedgenotyper[20] and annotated by SnpEff[21]. SNPs having Indels within its 10 bp neighborhood were filtered. CNVs were called by CNV-BAC[22]. We performed de novo assembly using SOAPdenovo2[23] (k-mer = 71). The contigs were than annotated by Prokka[24]. Roary[25] was used for the pan-genome analysis. Core genes were defined as genes shared by all strains, soft core genes shared by at least 95% strains, shell genes shared by 15%-95% strains and cloud genes shared by less than15% strains. The origin of gene sequences in pan-genome were identified by comparing with nr database using blastp[26]. We used the species with highest bitscore and longest alignment length, identity > 40% and e-value < 10− 6 as the final origin for each gene.
Phylogenetic Analyses
We used the Streptococcus salivarius CP013216 as an outgroup strain in the phylogenetic analysis. We aligned the outgroup strain genome, the 32 S. thermophilus genomes in the NCBI database, as well as the 185 assembled S. thermophilus sequences to the reference genome CNRZ1066 using the algorithm MumMer[27]. Neighbor-Joining tree was first generated using MEGA7[28] with default parameters. Then, we used ClonalFrameML[29] with NJ tree and alignment sequences to reconstruct the tree to remove influence of recombination. Among the genes that were prevalent in clades A-C (frequency > 0.5 in at least one of clades A-C) but less prevalent in clade D (frequency < 0.5 in clade D), we used Fisher’s exact test to identify genes significantly depleted genes in clade D. The genes with Benjamini–Hochberg adjusted p-value < 0.05 and odds ratio > 1.5 were selected. This gave use 158 genes.
Proteolysis and antibiotic resistance genes
We compared annotated genes to the reference sequence of proteolysis genes from NCBI database using blastp[26]. We kept the alignments with e values less than 10-5 and bit scores larger than 30. Antibiotic resistance genes were identified by comparing annotated genes with the sequences in the Comprehensive Antibiotic Resistance database[10].
Calculation of growth score and GWAS analysis
We first normalized the read depth by considering local GC content and the mappability of short reads by BIC-seq2[30]. The adjusted read depth was calculated in 1000 bp bins as the ratio between the observed read count in the bin and the expected read count given by BIC-seq2. The replication origin of the reference CNRZ1066 was obtained from the DoirC database[31]. For each strain, we calculated the Spearman correlation between the bin’s adjusted read depth and its distance to the replication origin. For the GWAS analysis, we first performed a principle component analysis (PCA) based on SNPs and Indels with allele frequencies within (0.05, 0.95) and genes whose occurrence frequencies were in (0.05, 0.95). We then performed a linear regression using the growth score as the response variable and the first two PCA components as the covariates and calculated the residuals of the linear regression for each strain. This step was to remove potential confounding factors (such as hidden population structure) that might influence the growth score. Finally, we performed Wilcoxon’s rank test to identify nonsynonymous SNPs and Indels that were significantly correlated with the growth score residuals. For the stability selection, we first filtered the SNPs by controlling the false discovery rate less than 0.05. This gave us 690 SNPs. We then performed stability selection[32] using the lasso regression.
Fermentation experiment
In the preliminary acidification experiments, the S. thermophilus were inoculated into reconstituted skimmed milk. After 12h fermentation, titratable acidity was measured. Strains with curd time less than 12h and titratable acidity above 55 °T were defined as high acid production capability (H-Acid). The rest strains were defined as non-H-Acid group. Thus, we distinguish the 185 S. thermophilus strains into two groups preliminary. To test the acidification capability of S. thermophiles, strains from frozen stock were reactivated at 37℃ in M17 Broth (Oxoid) and subcultured twice at 24 h before use. Milk was prepared by adding 6% sucrose to 11.5 % reconstituted skimmed milk, which was then sterilized at 95 °C for 10 min and cooled to 42 °C before inoculation (about 6 log10 cfus mL-1 for each strain). Fermentation allowed to proceed at 42 °C until fermentation completed. The fermentation experiment was performed in triplicate. PH and titratable acidity (TA, °T) were measured in triplicate to evaluate fermentation progress. The pH was evaluated by pH meter (Mettler Toledo, Switzerland). Titratable acidity was measured using the method described in National Standards of the People’s Republic of China. Each sample (5.0 g) was mixed with 4.5 ml of distilled water and titrated with 0.1N NaOH in the presence of 0.5% phenolphthalein indicator to an end point of faint pink color.
Minimum Inhibitory Concentration of Chloramphenicol
We selected and reactivated twelve S. thermophilus isolates at 37℃ in M17 Broth (Oxoid) before use. A wide range of chloramphenicol concentration (spanning across a wide concentration range from 0.125 µg/mL to 64 µg/mL achieved by ten-fold dilution) were prepared before use. The minimum inhibitory concentration (MIC) was determined according to ISO Standard 10932:2010. Briefly, bacterial suspensions were diluted by 1000-fold (~3×105cfu/mL) and tested against each chloramphenicol concentration. The MIC was recorded after incubating the bacterial cells for 48 h at 37 °C in strictly anaerobic conditions.
Chloramphenicol resistance gene expression checked by droplet digital PCR
The chloramphenicol resistance gene was quantified using QX100 droplet digital PCR (ddPCR, Bio-Rad), with the gene specific primer (Strep-F: 5’-AATGTTTAGCAATGACGGAAGCC-3’, Strep-R: 5’-TTCACCAATGTAAATCCCACCAC-3’). Quantification was performed using ddPCR as follows: initially, a final volume of 20 μL reaction mixture containing 2 μL cDNA, 10uL ddPCR Supermix for EvaGreen (Bio-Rad), 0.2 μL forward primer (20mM), 0.2 μL reverse primer (20mM) and 7.6 μL ddH2O were per-mixed; Each 20 μL reaction with 70 μL of droplet generation oil (Bio-Rad) was used to generate droplets; Droplets were generated by a droplet generator (Bio-Rad). The generated droplets with foil seal were then placed on a conventional PCR Thermocycler. After PCR, the PCR plate was loaded on the droplet reader (Bio-Rad), which automatically reads the droplets from each well of the plate. Analysis of the ddPCR data was performed with QuantaSoft analysis software (Bio-Rad) that accompanied the droplet reader.
Availability of data and materials
The data for the 185 isolated S. thermophilus has been deposited in the NCBI database under the BioProject ID: PRJNA594100. Supplementary tables see https://github.com/XiDsLab/ST185.