Data quality
After filtering the sample data and removing adapters, metagenomic sequencing yielded 6,208,391,400 bases in 62,083,914 clean reads, with a GC content of 37.29% and a Q20 value of 98.95%, indicating high data quality.
Scaffolding
Using SPAdes, a total of 6,387 scaffolds were assembled, with a median size of 334 base pairs (Supplementary material - Table 1). The three largest scaffolds have sizes of 2.35 × 10⁵, 1.87 × 10⁵, and 1.75 × 10⁵ base pairs, respectively. These scaffolds were then aligned to reference genomes using RagTag. While only 3.2% of our scaffolds aligned with the reference genome (GenBank number: CP045033), the sum of nucleotides from our aligned scaffolds amounted to approximately 1.635 million base pairs. A query to our standalone NCBI genome database using the code CP045033 returned the reference genome Lactobacillus kefiranofaciens ASM1465658v1, CP061341 in GenBank. This genome consists of 2,149,348 base pairs and includes two plasmids of 4,472 and 19,814 base pairs, respectively.
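Summary statistics of the kind reported above (scaffold count, median size, largest scaffolds) can be recomputed from an assembly FASTA file. The following is a minimal Python sketch, not the procedure used in this study; the function names and the toy sequences are illustrative assumptions.

```python
from statistics import median


def scaffold_lengths(fasta_text):
    """Return the length of every sequence in a FASTA-formatted string."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Scaffold count, median size, three largest scaffolds, and total bases."""
    ordered = sorted(lengths, reverse=True)
    return {
        "count": len(ordered),
        "median": median(ordered),
        "largest": ordered[:3],
        "total": sum(ordered),
    }


# Toy input for illustration only; these are not scaffolds from this study.
toy_fasta = ">NODE_1\nACGTACGT\nACGT\n>NODE_2\nACG\n>NODE_3\nACGTA\n"
toy_summary = summarize(scaffold_lengths(toy_fasta))
print(toy_summary)
```

In practice the FASTA text would be read from the assembler's scaffold output file rather than from an inline string.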
Genome assembly of Lactobacillus kefiranofaciens Ueira (LkefirU)
Our analysis of the kefir DNA sample revealed that the scaffolds aligned with the CP061341 reference genome, which consists of 2,241,619 base pairs, along with two plasmids of 4,527 and 48,532 base pairs. We named the resulting assembly Lactobacillus kefiranofaciens Ueira, or LkefirU for short.
The LkefirU genome has a GC content of approximately 38% and a gene density of 1037 genes per megabase, with 88% of its bases coding. The genome contains 2326 predicted protein-coding sequences (CDSs), although this count has not been corrected for potentially spurious gene predictions. Of these CDSs, 1643 (70.6%) show greater than 95% size and protein identity similarity (SPIS) to the reference genome CP061341. Another 363 (15.6%) fall between 70% and 95%, while the remaining 320 (13.8%) have SPIS below 70%.
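The SPIS classes reported above can be checked for internal consistency with a few lines of Python. This is only an arithmetic sketch using the counts stated in the text; the variable and function names are illustrative.

```python
TOTAL_CDS = 2326

# CDS counts per SPIS class, as reported in the text.
spis_bins = {">95%": 1643, "70-95%": 363, "<70%": 320}


def bin_percentages(bins, total):
    """Percentage of the total represented by each class, to one decimal place."""
    return {label: round(100 * count / total, 1) for label, count in bins.items()}


# The three classes should partition the full set of predicted CDSs.
bins_sum = sum(spis_bins.values())
percentages = bin_percentages(spis_bins, TOTAL_CDS)

print(bins_sum == TOTAL_CDS)
for label, pct in percentages.items():
    print(label, pct)
```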
The LkefirU genome contains 396 named genes when repeated gene names are counted, or 325 when each name is counted once. Combining the CDSs with SPIS<70% (357) and the CDSs annotated as hypothetical based on the reference genome (203), LkefirU presents 560 CDSs with unknown functions, accounting for 24% of its coding sequences. Figure 1 provides an example of these sequences with unknown functions compared to the LkefirU reference genome.
Unlike the reference genome, LkefirU does not possess any plasmids, and the proteins predicted from the assembled sequences of these two plasmids have no counterparts in the leading strand of LkefirU.
Additionally, compared to its reference, the LkefirU genome does not possess the expected number of ncRNAs and tRNAs. The CP061341 genome contains 15 ncRNAs and 63 tRNAs, while LkefirU is predicted to have 41 tRNAs and no ncRNAs. The lack of a minimal number of ncRNAs indicates that the LkefirU genome is still at the draft stage. We used RNAmmer to predict ncRNAs among the 6837 scaffolds generated during the assembly with SPAdes, which identified ten ncRNAs (Table 1). However, it cannot be definitively concluded that these ncRNAs belong to LkefirU, since other organisms may be present in the metagenome. Given that L. kefiranofaciens is expected to be well represented in our assemblies, it is plausible that a significant amount of this DNA belongs to LkefirU. However, only NODE_924 exhibits a GC content consistent with both LkefirU and its reference, falling within the range of 35 to 38%. To confirm this, we used BLASTn to compare the 16S rRNA sequences with L. kefiranofaciens: NODE_811_length_2056_cov_3902.386807 showed 100% identity with an e-value of 0.0, whereas NODE_929_length_1578_cov_17.926461 showed 79.24% identity with an e-value of 3e-130.
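The GC-content screen described above, which asks whether a scaffold falls within the 35 to 38% range of LkefirU and its reference, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ambiguous bases such as N are excluded from the denominator, and the sequences are toy strings rather than scaffolds from this study.

```python
def gc_percent(seq):
    """GC content as a percentage of the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def in_gc_window(seq, low=35.0, high=38.0):
    """True if the GC content lies within the inclusive [low, high] window."""
    return low <= gc_percent(seq) <= high


# Toy sequences for illustration only.
toy_inside = "ATGCATGCATA"   # 4 of 11 bases are G or C
toy_outside = "ATGCATATTA"   # 2 of 10 bases are G or C

print(in_gc_window(toy_inside), in_gc_window(toy_outside))
```

A GC match of this kind is only supporting evidence; sequence identity, as in the BLASTn comparison, remains the stronger criterion.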
Pangenome analysis
To study the relationship between our LkefirU genome and other genomes of the Lactobacillus genus, we collected 204 Lactobacillus genomes deposited at the NCBI. Using the GENPPI tool, we examined the LkefirU pangenome and found that 63% (1474 out of 2326) of the predicted proteins had one or more counterparts with over 90% identity in at least one of the 204 Lactobacillus genomes. This includes both the core and accessory genomes of LkefirU. However, the remaining 37% of the LkefirU proteins cannot be considered unique, because GENPPI was configured to report only conserved neighborhood (CN) and phylogenetic profile (PP) relationships above a 90% protein identity threshold. The core and accessory pangenome of LkefirU exhibited protein similarities with approximately 7000 proteins in 156 Lactobacillus genomes. The five genomes with the highest number of proteins similar to LkefirU, in descending order, were Lactobacillus sp. (401), Lactobacillus paragasseri (290), Lactobacillus kefiranofaciens (267), Lactobacillus gallinarum (188) and Lactobacillus sp. UMNPBX14 (184).
We generated an interaction network for LkefirU consisting of approximately 159,000 edges: 7,645 from CN and 151,671 from PP. This network comprised 80% of the predicted open reading frames (ORFs) in LkefirU, distributed across 21 connected components. In Figure 2, the colors white, green, red, and blue indicate cytoplasmic, membrane, surface-exposed, and secreted proteins, respectively. The larger nodes represent the thirty proteins with the highest Bridging Centrality scores [20].
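Counting the connected components of an interaction network, and the share of ORFs that the network covers, can be illustrated with a small union-find routine. This is a generic sketch, not the GENPPI implementation; the edge list, protein labels, and ORF total below are toy values chosen for illustration.

```python
def connected_components(edges):
    """Group the nodes of an undirected edge list into connected components."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Toy network for illustration only: five proteins, three edges.
toy_edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
toy_total_orfs = 10  # hypothetical number of predicted ORFs

components = connected_components(toy_edges)
nodes_in_network = sum(len(c) for c in components)
coverage = 100 * nodes_in_network / toy_total_orfs

print(len(components), coverage)
```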
The LkefirU genes absent from the reference genome were confirmed using the interaction network. When GENPPI finds evidence of a conserved neighborhood or phylogenetic profile for a gene, that gene exists in other genomes. For this analysis, we focused on 319 genes from the LkefirU chromosome, excluding genes that were assigned hypothetical protein annotations by the reference genome. Interactions involving genes not annotated in the reference genome (indicated in red) suggest that these genes, which were not accounted for in the reference annotation, may exist in one or more of the 204 Lactobacillus genomes obtained from the NCBI. These genes likely share conserved gene neighborhoods, conserved phylogenetic profiles, or both. Of the 319 genes examined, 141 (44%) had interactions with other Lactobacillus genomes. On average, the proteins indicated in red displayed thirteen interactions each. Notably, the two proteins LKU01457.1 and LKU00889.1 stood out with 564 and 555 interactions, respectively, the highest numbers among the non-annotated genes. Aligning these protein sequences against the NCBI pool returned 51 alignments for LKU01457.1 and over 100 for LKU00889.1, predominantly with Lactobacillus sp. The remaining 178 genes had no such interactions; we therefore estimate that there are approximately 178 unique proteins in the LkefirU chromosome, which accounts for around 8% of the entire genome. This proportion of unique proteins is in line with expectations for an unpublished genome [21].
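The estimate of unique proteins follows from simple subtraction, which the short Python sketch below makes explicit. It uses only counts stated in the text; taking the 2326 predicted CDSs as the denominator for "the entire genome" is an assumption made here for illustration.

```python
examined_genes = 319       # LkefirU genes absent from the reference annotation
with_interactions = 141    # genes with CN or PP evidence in other genomes
total_cds = 2326           # predicted CDSs in LkefirU (assumed denominator)

# Genes with no interaction evidence are the candidates for unique proteins.
without_interactions = examined_genes - with_interactions

share_interacting = 100 * with_interactions / examined_genes
share_unique = 100 * without_interactions / total_cds

print(without_interactions)
print(round(share_interacting, 1), round(share_unique, 1))
```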
Peptidome analysis and in silico Bioactivity Prediction of LkefirU
A total of 91 peptides were identified by peptidomics analysis of an intact peptide fraction encrypted by LkefirU in the kefir sample. These peptides were used for the prediction of bioactive peptides. Among them, only one peptide was predicted to be toxic, while nine peptides passed the bioactivity filter and nine passed the blood-brain barrier peptide (BBP) filter. Remarkably, two peptides, VPGYPFLPI and KSPCVFILDQKKRL, met the criteria for both the bioactivity and BBP filters (Supplementary table 2).
The peptide VPGYPFLPI is encrypted within Amino Acid Permease, a transmembrane protein, while the KSPCVFILDQKKRL peptide is encrypted within Glycolate Oxidase, a protein associated with the cytoplasmic membrane. Molecular docking was then performed with these two peptides to assess whether they could affect important molecules involved in the pathophysiology of Alzheimer’s disease. Both peptides demonstrated a low Weight Coefficient against all predicted targets: β-amyloid monomer, β-amyloid plaque, BACE, and AChE. Specifically, VPGYPFLPI displayed its lowest Weight Coefficient against AChE, while KSPCVFILDQKKRL exhibited its lowest Weight Coefficient against the β-amyloid plaque (Table 2).
Molecular Docking
The KSPCVFILDQKKRL peptide interacted with 7 of the 42 amino acid residues (aa) in the Aβ (1–42) monomer (Fig 3a and Supplementary table 3). The VPGYPFLPI peptide interacted with 5 of the 42 aa in the Aβ (1–42) monomer (Fig 3b and Supplementary table 3).
In the molecular docking analysis of the Aβ plaque, the KSPCVFILDQKKRL peptide displayed interactions with only two aa. Specifically, LYS-32 of the L chain interacted with Arg13, and ILE-32 of the I chain interacted with Ser2 (Fig. 4a and Supplementary table 3).
On the other hand, the VPGYPFLPI peptide interacted with 10 amino acid residues of the plaque, and in some cases a single residue of the peptide interacted with several residues of the plaque. For example, Tyr4 interacted with residues of the D and E chains of the Aβ (1–42) plaque (Leu34 of the D chain, and Leu17 and His14 of the E chain), and Phe6 interacted with residues of the E and F chains (Ile32 and Gly33 of the E chain, and Leu17 and Leu34 of the F chain) (Fig. 4b and Supplementary table 3).
When docked with BACE, the KSPCVFILDQKKRL peptide interacted with six aa (Fig 5a and Supplementary table 3). Similarly, the VPGYPFLPI peptide interacted with six aa of BACE, two of which belong to the flap region, which plays a crucial role in determining whether binding to the BACE cleavage site occurs (Fig 5b and Supplementary table 3).
When docked to AChE, the KSPCVFILDQKKRL peptide interacted with nine aa of the A chain. Among these, one aa was located within the catalytic active site (CAS), and another was part of the peripheral anionic site (PAS) (Fig 6a and Supplementary table 3). The VPGYPFLPI peptide, when docked to AChE, interacted with five aa of the B chain, one of which borders the PAS binding site, located close to the CAS (Fig 6b and Supplementary table 3).