Data collection
The human gut microbiome data were collected and assembled from six online repositories. We have extensively collected, filtered, and assembled data from these six databases (please see Methods for details) to identify POP homologues.
Pipeline
The methodology is illustrated in Fig. 1 (Individual sequences taken at each stage are given in Additional File 2). All genomes accessible from the aforementioned six studies were incorporated into the analysis. The resulting non-redundant filtered dataset encompasses a comprehensive compilation exceeding 5,000 genomes. This includes diverse strains co-existing within the human gut for specific species. Subsequent to rigorous data mining and meticulous refinement processes (please see Methods for our sequence searches), a subset of 4,400 sequences was selected. By subsequently applying a filter to target sequences featuring the two-domain architecture of propeller and catalytic domains (POP), 734 sequences were identified for subsequent in-depth investigation. Ultimately, ten sequences bearing potential probiotic significance (as evidenced in the literature) were chosen for structure modelling. The protein which served as the template (PDB code 3IVM)(17) is from the bacterium Aeromonas punctata in the ligand-bound “closed” form of the protein. Further, a refined set of six sequences was selected based on the binding affinity with the ligands and the probiotic nature of the bacterial species of origin. These six proteins were further subjected to a thorough evaluation concerning their therapeutic relevance in the context of celiac disease using Molecular Dynamics simulations.
The selected sequences are either probiotics or show probiotic potential
Probiotic products typically take the form of oral supplements or food-based products containing microorganisms, typically bacteria. We hypothesize that a gut bacterium could be regarded as a potential therapeutic in this context if it is generally used as a probiotic and also retain genes which have sequence signature of POP domains. These criteria could suggest that such enzymes are capable of degrading the harmful celiac disease-containing peptides. We evaluate the effectiveness of six potential bacterial sequences (Table 1) to degrade the known celiac disease epitopes(18) (Fig.6C). The most commonly used strain for gram-negative probiotics is Escherichia coli Nissle 1917(19). Escherichia coli is also one of the common bacteria of the human gut Microbiome. We have evaluated the sequence WP_021564606 from E.coli.
Bacteria within Prevotella species are one of the most important members of the gut microbiome, especially in Indian populations(20). Studies show that the genomes of Prevotella species recovered from human gut metagenomes show that the most prevalent and relatively abundant species are primarily related to P. copri and P. stercorea. These species possess distinct gene repertoires likely reflecting adaptations to metabolic niches(21).
We could not find any genes with potential POP domain in any of the strains of Prevotella copri through our sequence searches, but genes coding for POP domains with specific two-domain architecture were found in P. stercorea and other Prevotella species.
Recent studies also suggest that Prevotella species could be used as next-generation probiotics(22).
We have evaluated the sequence WP_040596915 from Prevotella Stercorea and WP_081457517 from Prevotella salivae for our analysis.
Growing attention is being directed towards Faecalibacterium prausnitzii, an abundantly prevalent bacterial species within the gut. This heightened interest arises from its potential significance in fostering gastrointestinal well-being. This species is shown to have anti-inflammatory properties as it can induce a tolerogenic cytokine profile(23).
Research shows that Faecalibacterium prausnitzii could be used as a next generation probiotic in gut disease improvement(24). Hence, we have modelled the sequence VDR34335 of Faecalibacterium prausnitzii, corresponding to the POP domain, for our analysis. Clostridium species are a predominant cluster of commensal bacteria in our gut, and they exert considerable effects on our intestinal homeostasis. Clostridium cadaveris is usually considered non-pathogenic and does not produce toxins, unlike other species of Clostridium(25). Likewise, we have modelled the POP domain in one sequence (WP_035771429) from Clostridium cadaveris for our analysis.
The commercial Bacillus probiotic strains in use are B. cereus, B. clausii, B. coagulans, B. licheniformis, B. polyfermenticus, B. pumilus, and B. subtilis. These probiotic bacteria have demonstrated their ability to significantly improve digestive health, and they even have a role in preventive healthcare with their ability to lower lipid levels and improve immune response(4).
Almost all the POP sequences from Bacillus strains found in our analysis from gut microbiome consists of only the catalytic domain. However, we could discover POP sequence from B. subtilis, which contain the two-domain architecture and hence can be considered for protein modelling and subsequent study. This sequence belongs to S9A, and it is an oligopeptidase B, identified by the conserved sequence motif (GGSXGGLL) around the active site serine. Oligopeptidase B also has a similar domain architecture as POP but differs in the motif around the active site (as previously mentioned in the introduction). Studies show that Oligopeptidase B can also cleave proline-rich peptides(26). Considering the widespread use of Bacillus strains as probiotics, we have included the following oligopeptidase B sequence from B. subtilis for docking and molecular dynamics study. Although the strain from which the sequence is selected is not found in the above six studies, other studies show that B. subtilis is a part of the gut microbiome(27). We have included the POP domain in the sequence MCF7751424 from B. subtilis for our analysis.
Table 1: The selected species and the POP sequences. Functionally important residues have been marked.
Species
|
Sequence_ID
|
Tyr
|
Ser
|
Trp
|
Asp
|
Arg
|
His
|
Prevotella stercorea
|
WP_040596915
|
479
|
559
|
600
|
643
|
645
|
678
|
Clostridium cadavaris
|
WP_035771429
|
450
|
530
|
571
|
614
|
616
|
649
|
E.coli
|
WP_021564606
|
463
|
544
|
585
|
627
|
629
|
660
|
Alistepis Sahii
|
WP_149885823
|
479
|
559
|
599
|
642
|
644
|
677
|
Faecalibacterium prausnitzii
|
VDR34335
|
456
|
536
|
577
|
619
|
621
|
654
|
Bacilus Subtilus
|
MCF7751424
|
475
|
555
|
599(E)
|
640
|
642(Q)
|
675
|
Distribution of POP family (S9 family proteins) in gut microbiota across bacterial phyla
Fig. 2 shows the distribution of POPs that could be identified in the gut microbiome as present in our dataset. Studies have shown that gut microbiome consists of bacteria from four predominant phyla - namely Bacteroidetes, Actinobacteria, Firmicutes and Proteobacteria(28). Our study also reports similar results (Figure 2 and Additional file 3). Along with the above mentioned four phyla, bacteria from two other phyla (namely Fusobacteria and Verrucomicrobia) were also found to contain POP domain. However, no genes of the POP family (featuring the distinctive dual-domain architecture) were detected within these two phyla. Consequently, only genes coding for POP originating from the dominant four phyla were included in the analysis presented in Fig 2. We observe that nearly 50% of the identified POP homologous sequences are from phylum Bacteroidetes. Proteobacteria and Firmicutes constitute almost the same number of POP family genes (~20%). Finally, the phylum Actinobacteria has around (13%) of POP family genes (Fig.2A). Fig. 2B represents the distribution of bacterial phyla when data is segregated per species. We notice most genomes from Firmicutes have a single gene coding for a protein belonging to the POP family. In contrast, for the phyla Actinobacteria and Proteobacteria, the number varies from 1 to 3. Finally, the phyla Bacteroidetes have many genomes that encode for multiple POPs, where the number of genes varies from 1 to 10; however, most of the genomes were observed to have more than four POP genes (Fig. 2B).
Gut Microbiome genes containing POP domains show Diverse Domain Architecture
Diverse domain architectures were found in the genes of the gut microbiome that contain the sequence signature of POP domains (Fig. 3). The most common one constitutes sequences having only a catalytic domain. Sequences having domain architecture DPP_IV and catalytic domain (DPPIV_N-Peptidase_S9), as well as sequences having domain architecture as propeller domain and catalytic domain (Peptidase_S9_N-Peptidase_S9) are also common. Other frequently occurring domains are alpha-beta hydrolase 2, PD40, and esterase.
Alpha-beta hydrolase and esterase are protease sequences which share remote homology with S9 family proteins or act as intermediates. Hence, it is not uncommon that some of these domains are found in S9 family peptidases. In addition, many other domains were also present in very low frequency (present in one or two proteins) (details in Additional File 4). However, we chose candidate POPs which contain the classic two-domain architecture (N-terminal propeller domain and catalytic domain) for structural studies.
Modelled Structure of POP domain shows Stable Conformation
The structure of POP from the bacteria Aeromonas caviae is known in both open and closed conformation. This domain (closed form: PDB ID 3IVM) showed nearly 30 to 40% identity to the selected sequences and was hence considered as a template for structure modelling of the sequences. The alignment of the query and template shows the preservation of conserved residues, along with the catalytic triad, which is present towards the C-terminal of the sequence in the catalytic domain. There was good agreement between the secondary structural elements. The structures were modelled using Modeller(29,30). For each sequence, 100 different models were generated, and the model which showed the lowest DOPE score was selected for further analysis. The modelled structures were evaluated by Ramachandran plot analysis using PROCHECK(31). More than 98% of residues were in the core allowed region of the Ramachandran plot, indicating the better quality of models, while few residues were in the disallowed region (Fig. 4). A representative model from Prevotella stercorea is shown in Fig. 4B. The modelled catalytic domain is shown in blue while the modelled propeller domain is shown in red. The superimposed structure of the model with the template structure (Fig 4A) shows an RMSD value of 0.18, which indicates that the modelled structure is quite similar to the template structure. The per residue energy of the modelled structure is negative (Fig 4C), and the Ramachandran plot (Fig 4D) shows only six residues in the disallowed region. The statistics of all the other modelled structures are available in Additional File 5.
Inhibitor and substrate docking
Four of the potent celiac disease-causing epitopes (Proline containing sequence derived from Gluten subtype α and γ) which are recognised by HLA DQ2.5 were considered for peptide docking analysis.
Z-Pro-Prolinal (ZPR) (detailed structure in additional file 6) is known to be a natural inhibitor for POPs. We first analysed the crystal structure of the 3IVM (ligand (ZPR) bound form of POP). Apart from the active site residue, the closed form is shown to bind to three other stabilizing residues (Tyr 458, Trp 579 and Arg 624). It is known from the literature that Trp 579 is the critical residue that packs with the Proline ring of the substrate, and Arg 624 is known to be involved in hydrogen bonding to the substrate(3). In the docking analysis, only those docking poses which show interaction with the catalytic site and the stabilizing residues were considered. ZPR and the four most potent celiac disease-causing peptides were considered for docking analysis. Due to the presence of slight insertions and deletions within the analysed POP sequences, there is no exact correspondence between the residue numbers of sequences and the crystal structure residues. To address this disparity, equivalent residues were determined through a multiple sequence alignment encompassing all selected sequences and the template (Fig. 5 and Table 1). Notably, all active site residues and stabilizing residues exhibit 100% conservation across the POP sequences. The only exception is in an Oligopeptidase B in Bacillus subtilis, where Arginine and Tryptophan are not conserved, although the catalytic triad remains preserved.
The docking of all the selected sequences with above mentioned celiac disease-causing epitopes were performed using Schrodinger software (for details, please refer to Methods section). A representative 2D-docking diagram is shown in Fig. 6A, along with the hydrogen bonding patterns with the critical residues Arg, His, Tyr and Ser. The 3D-model of protein-peptide docking is shown in Fig. 6B. The heatmaps for docking score for all peptides and sequences are shown in Fig. 6C. Bacteria in Prevotella and Faecuilibacterium show the lowest average binding energy, considering all the peptides. All the peptides show similar average binding energy for all the sequences.
Molecular Dynamics Simulation shows Stable Binding of Peptides
Next, MD simulation was run for 100 ns on all six protein-peptide complexes in order to assess the stability of docking complexes. The peptide PQPELPYPQ, which is a part of the gluten, was selected as a representative peptide for MD analysis. A representative MD run is shown for a Prevotella POP in Fig 7. The RMSD plot shows the stability of the complex over the 100 ns simulation period (Fig 7A). It was also observed that the simulation converges, and the RMSD values stabilize around 20ns and do not fluctuate much for a duration of 100ns. Ligfit prot (Fig 7A) indicated that the RMSD of the protein-ligand complex remains stable and is comparable to the protein RMSD, and hence, it indicates a stable protein-ligand complex for the whole duration of 100 ns.
The major interacting residues which form H-bonds with the ligand and remain for more than 50% of the simulation time are Lys 150, Lys 197, Asp 273, Phe 482, Asn 699 and Trp 600 (Fig 7B, 7C). Trp 600 is one of the essential residues known to bind the ligand. Another essential residue mentioned earlier was Arg 645. A hydrogen bond with Arg 645 was observed, but not for more than 50% of the simulation time. However, Arg 645 was found to be generally interacting for almost 90% of the simulation time, including hydrophobic interactions. Few other residues interact through water bridges or hydrophobic interaction as well (Fig. 7C). At each point during simulation time, the protein remains in contact with the ligand (Fig 7D), which suggests the stability of the complex.
MD simulation for 100 ns was run for other protein-peptide sequences, and the complexes were found to be stable for the entire simulation time (Additional File 7). The equivalent residues of Tyr 479, Trp 600 and Arg 645 are found to be interacting in other selected POPs as well (details in Additional file 7). As mentioned earlier, these peptides are the actual epitopes which act as T-cell epitopes, which eventually cause CD.
In our analysis, we observed that POPs from the above-mentioned sequences are able to bind to these epitopes. Additionally, our molecular dynamics simulation study indicates the stability of these interactions. Hence, these POP sequences could be potentially used to degrade CD-causing peptides. Additionally, since all of the selected sequences are from probiotics/gut bacteria, it would be easier to design experiments and should be safe for human consumption. Consequently, we propose that these POPs from these select probiotic bacteria could potentially act as therapeutics and offer relief from the symptoms of CD.
MD Simulation of Prevotella in Acidic Environment
One requirement for a candidate for effective therapeutics for CD is that the protein should be able to show activity in both the acidic environment of the stomach and at neutral pH in the intestine. It is absolutely important that a candidate therapeutic for CD should be stable in both acidic (in the stomach) and in neutral pH (in the intestine). Therefore, we next evaluated the stability in an acidic environment and chose the POP identified from Prevotella. The MD simulation was redone at 2.5 pH, simulating the acidic environment of the stomach for 200ns. The RMSD diagram reveals that the protein-peptide complex is stable for the whole duration of 200ns (Fig 8A). In addition, the percentage conservation of secondary structure shows that most of them remain stable during the simulation time(Fig 8B). The protein-ligand interacting graph (Fig 8C and 8D) shows that several important residues, such as Arg 645, Tyr 479, and Ser 559(Active site residue), retain hydrogen bond interaction and are in contact with the ligand for a significant amount of time (40-90% of the time).