Information of Source and Genome Characteristics of L. acidophilus
In this study, we selected 17 genomes of L. acidophilus in our bacterial collection, which were isolated from the human gut in Shenzhen, Guangdong, China. The other 57 were publicly available L. acidophilus genomes from NCBI. Among these genomes, 2 genomes were from animal gut, 16 from commercial products, 15 from food, 30 from the adult human gut, 6 from infant gut, and 7 genomes were missing the source information. The genome size varied from 1.85 Mb to 2.09 Mb with an average size of 1.97 Mb. The GC content ranged from 34.2–35.7% with an average GC content of 34.6%, only strain P2 was exceeded 35%. The gene number rise evenly from 1,856 to 2,084. The average CDS number is 1,914 (Table 1).
Genome comparison of L. acidophilus based on ANI
ANI analysis was a genome-based relatedness method for assessing genetic distance 38,39, which 95% of ANI value being suggested as the delimitation criterion for species 17,40. The result showed that most of the ANI values of 74 genomes of L. acidophilus were higher than 99.9% (Supplementary Fig. 1), indicating these genomes shared highly similar genome sequences. Among these, strain YT1, derived from animals, share a relative farther genetic distance with other strains, which the ANI values were below 99.2%. The relationships between ANI with the source of L. acidophilus were shown in Supplementary Fig. 1, we found strains “HgS” and strains “C” tend to cluster together, suggesting that commercial strains may also have originated in the human gut. In contrast, L. acidophilus strains from other sources were more scattered.
Phylogenetic Analyses of L. acidophilus
To analyze the phylogenetic relationship of 74 L. acidophilus strains, the phylogenetic tree was constructed on the whole genomes of 74 L. acidophilus, which was rooted with Alicyclobacillus acidiphilus NBRC 100859 as an outgroup (Fig. 1). Due to the high similarities of ANI values between 74 L. acidophilus strains, it is difficult to determine the evolutionary relationship of L. acidophilus. As shown in Fig. 1, the phylogenetic tree mainly formed 4 subbranch, but genomes from varied sources were distributed in different subbranch. The biggest subbranch included the strains mainly from“Commercial”, “Food”, “Human gut”, and “Human infant”. This could indicate no significant correlation between the phylogenic relationship of strains with the different source.
Genomic Function Analysis of L. acidophilus
The distribution of functional classification of genes based on the COG and KEGG annotation of 74 L. acidophilus strains were showed in Supplementary Fig. 2 and Supplementary Fig. 4. The genes are mainly distributed in Carbohydrate transport and metabolism (G), Translation, ribosomal structure and biogenesis (J), and Amino acid transport and metabolism (E) for COG categories (Supplementary Table 1). According to the KEGG annotation, the genes are mainly distributed in genetic information processing, signaling and cellular processes, and carbohydrate metabolism (Supplementary Table 2), that were consistent with the categories of COG. In addition, vitamin B12 import ATP-binding protein BtuD possess the most numbers of the genes in the COG categories (Supplementary Fig. 3), which was a part of the ABC transporter complex BtuCDF involved in vitamin B12 import and energy coupling to the transport system. Nevertheless, the molecule with the most gene numbers in the KEGG database is sugar PTS system EIIA component (Supplementary Fig. 5), a phosphoenolpyruvate-dependent sugar phosphotransferase system (sugar PTS), involved in catalyzing the phosphorylation of incoming sugar substrates concomitantly with their translocation across the cell membrane 41. We next analysis the different function categories in 74 L. acidophilus strains according the COG and KEGG annotations. The top 30 different molecules in COG categories were consist of many tRNA and rRNA related genes. Interestingly, Vitamin B12 import ATP-binding protein BtuD was found in the top 30 different molecules (Supplementary Fig. 3). The top 30 distinct molecules in KEGG categories contained the kinases, epimerases, transposases, mutases, transport proteins, etc. The most different molecule was wecB: UDP-N-acetylglucosamine 2-epimerase (Non-hydrolyzing) (Supplementary Fig. 5). Additionally, no obvious correlation between the different function categories of strains with the source.
Pan-genome analysis of L. acidophilus
The pan-genome analysis is performed for investigating the distribution of all the genes at the species level. In this work, the boxplot, fitting curve, and accumulation curve were generated from the pangenome data and shown in Fig. 2. The boxplot and fitting curve reflected the number of genes increases with the addition of new genomes. The pan-genome of L. acidophilus remained unsaturation with the Heap’s alpha being below 1.0 (Fig. 2A). According to the previous study 23, strict core genes are found in all genomes, core genes are found in 99%-100% of genomes, soft core genes are found in 95–99% of genomes, shell genes are found in 15–95%, while cloud genes are presented in less than 15% of genomes. The obtained data showed that there were 1,573 strict core genes, 0 core genes, 221 soft core genes, 92 shell genes, and 443 cloud genes (Supplementary Table 3).
The outcomes of the distribution of the functional categories based on KEGG and COG annotation of core genome, accessory genome, and unique genome were depicted in Supplementary Fig. 6 and Supplementary Fig. 7, respectively. Metabolism, information storage, and processing genes were mainly categories. Moreover, the core genome encodes more conservative genes involved in translation, ribosomal structure and biogenesis than [J], amino acid transport and metabolism [E], and intracellular trafficking, secretion and vesicular transport [U] compared with the accessory genome and unique genome (Supplementary Fig. 6). On the contrast, defense mechanisms [E], replication, recombination and repair [L], carbohydrate transport and metabolism [G], nucleotide transport and metabolism [F], and general function prediction only [R] were more abundant in the accessory genome and unique genome. For the distribution of KEGG functional categories, metabolism was the most abundant categories both in core genome, accessory genome, and unique genome (Supplementary Fig. 7). The distribution of subcategories of KEGG were consistent with the results of COG.
The genome accumulation curve showed that less genes gain with the newly increased genomes from the most of different source, but the two animal-derived L. acidophilus strains contributed over 200 new genes (Fig. 2B), suggesting that the animal-derived strains may encode more unique genes involved in the adaptation of the animal intestinal environment. Then we annotated these unique genes of the animal-derived strains by the COG database and found that most of these were distributed in [K] Transcription and [L] Replication, recombination and repair category for information storage and processing (Fig. 3A). These unique genes were belonging to phage-associated, peptidase-associated, ABC transporter-associated, transposase-associated, CRISPR-associated, glycosyl transferases, transcriptional regulator, and glycosyl hydrolase related genes (Fig. 3B), that were considered to be involved in the adaptation and evolution to the complex animal intestinal environment of L. acidophilus.
Comparative Analysis Of Carbohydrate-active Enzyme
30 CAZY families were predicted in the genome of L. acidophilus, which include 14 GH (glycoside hydrolases), 10 GT (glycosyl transferases), 3 CBM (carbohydrate-binding modules) families, and 3 CE (carbohydrate esterases). GH and GT are the largest number of enzymes in the carbohydrate metabolism of L. acidophilus. In all CAZY families, GH1 (β-glucosidase families), GH13 (α-amylase families), and GT2 (cellulose synthase families) were the most abundant CAZY families. In contrast, GH78 (α-L-rhamnosidase families), CE1 (acetyl xylan esterase families) and CE7 (acetyl xylan esterase families) were the low abundance CAZY families. The other CAZY families are distributed similarly in different strains (Supplementary Fig. 8). Altogether, no significant differences were found in the genome of these 74 L. acidophilus strains, suggesting the carbohydrate metabolism were evolutionary conserved, even in different environmental niches.
Prediction of the Secondary Metabolites Biosynthetic Gene Clusters in L. acidophilus
Bacterial secondary metabolites are alternative sources of novel drug leads. The secondary metabolite biosynthetic gene clusters (SMBGs) predicted in the bacterial genome will provide crucial information for the efficient discovery of novel natural products 42. We used anti-SMASH 30 to predict the secondary metabolite from the genomes of 74 L. acidophilus strains. Gassericin T, a hydrophobic class II bacteriocin with high heat stability (121°C, 10 min), good pH tolerance (pH 2–11), was strong bactericidal against many gram-positive bacteria, and thus are expected to be effective food preservatives 31,32. We found Gassericin T was predicted in most of L. acidophilus genomes (73/74), and the structure of Gassericin T gene clusters of 73 strains are quite similar and no significant phylogenic relationship between the Gassericin T from L. acidophilus with the different source (Fig. 4), indicating all these 73 strains can be used as an alternative producer for Gassericin T. Class-I lanthipeptide, a ribosomal synthesized and post-translationally modified polycyclic peptide, was considered a novel potent antimicrobial compound 33,34. Interestingly, we discovered the Class-I lanthipeptide was only predicted in strain A_YT1 (animal origin), suggesting the animal derived L. acidophilus may display different antibacterial characteristics compared with the L. acidophilus from other niches.
Potential Genetic Factor Related To Adaptation Of Various Niches By Single Nucleotide Polymorphism (Snp) Analysis
SNP analysis enables descry differences among different genomes at the single-nucleotide level and uncover the genetic factor in adapting different niches of L. acidophilus. In total, 17,001 SNPs were identified with the number varied from 73 to 7,796, the SNPs number of the strains from the same source is close (Fig. 5A), and most of the SNPs (7,796) were found in strain A_YT1 (Supplementary Table 4), which the number was significantly more than other strains, indicating that more variations were occurred for the strain A_YT1 to adapt to the animal gastrointestinal tract. We subsequently analysis the involved genes of SNPs and found the most numerous SNPs were located in the btuD, a gene encoding Vitamin B12 import ATP-binding protein BtuD, a part of the ABC transporter complex BtuCDF involved in vitamin B12 import, responsible for energy coupling to the transport system (Fig. 5B). Coincidentally, according to the result of COG classification at the molecular level, Vitamin B12 import ATP-binding protein BtuD was the most numbers of the genes in the COG categories in L. acidophilus. We furtherly investigated the highest impact of SNP variant of the btuD gene and found this SNP is a nonsense variant from the animal-derived strain YT1, which the variant was possibly pertinent to the ecological niche of this L. acidophilus strains. A transversion at the 918th nucleotide from A (adenine) to C (cytosine) in the non-coding strand was found in the btuD gene, generating the transversion at the corresponding position from T (thymine) to G (Guanine) in the coding strand. Consequently, this transversion leads to a translation termination at the 306th amino acid, tyrosine. Structurally, Vitamin B12 import ATP-binding protein BtuD was composed of NBD (nucleotide-binding domain) and TMD (transmembrane binding domain). NBD for binding and hydrolyzing ATP, provides energy for importing of Vitamin B12. TMD is primarily involved in the identification and transit of substrates through the lipid bilayer 43. Subsequently, we predicted the protein structure including the pre-mutation and post-mutation DNA sequences and found that the premature stop gained can induce the loss of NBD, thereby mediating the inactivation of the transport function of the Vitamin B12 import ATP-binding protein BtuD to stop the intracellular transport and utilizing of vitamin B12 (Fig. 5C). We hypothesized this situation may happen in the exposure of low Vitamin B12 level and L. acidophilus will reduce the expression and utilization of Vitamin B12 import ATP-binding protein BtuD to save the raw material for the biosynthesis of BtuD.