Serotype 3 pneumococci belonging to GPSC83 harbor a gadASpn gene
At present, the dataset of S. pneumoniae isolates harboring a gadASpn-like gene contains 40 strains, including two strains whose genomes were sequenced to complete assembly, i.e., A66 (= NCTC 7978) and SPNA45 (Table S1). All 40 strains were of serotype 3, in agreement with previous Southern blotting results obtained with a small number of strains (2/11) (García and López 1995), and were isolated mainly from blood or cerebrospinal fluid. Based on ST, 38 out of the 40 strains belong to CC378 and its SLVs (ST232, ST1377, and ST7369) and DLVs (ST260, ST6014, ST11931, and ST16577), with only two singletons (ST369 and ST6934). As a whole, the majority of the pneumococcal isolates with a 1428 bp gadASpn-like gene belong to GPSC83, a global pneumococcal sequence cluster of intermediate frequency (Gladstone et al. 2019). These authors assigned to GPSC83 13 isolates of CC1220, 9 of CC378, and a single isolate of ST11931. Eight of the CC1220 isolates were ST260 (a SLV of ST1220) and 5 were ST1220. Of note, all of the ST260 isolates harbored a copy of the gadASpn gene (Table S1), whereas those of ST1220 did not (WGS projects CABAFM01, FHPC02, CAANYB01, CAAPEX01, and CAAPYY01 [data not shown]). The genomic zones of these ST1220 isolates where the gadASpn gene should be located are 98% identical to the SPD_RS058905–SPD_RS05910 region of strain D39.
Four different gadASpn alleles were found (allele 3 being the most frequent) that encode a 475-amino acid protein; the most frequent GadASpn allele is included in the NCBI database under accession number WP_061632578 (Table S1). Since the four gadASpn alleles were ≥ 97.8% identical, allele 3 was chosen for further analyses, unless stated otherwise. When the genome of S. pneumoniae A66 was compared with that of strain D39 (NC_008533.2), it could be determined that the gadASpn gene (marked as a red arrow; locus tag A66_RS05660) is included in a ca. 9 kb DNA fragment inserted at the 5’ end of SPD_RS05900, a locus potentially encoding a Cof-type HAD-IIB family hydrolase. In D39, the insertion site is 5’-1,134,639 TCCATATCCGTTGCTACTAGTTTAAT 1,134,664-3’, whereas in strain A66 (and other pneumococci) the inserted DNA fragment is flanked by a near-perfect direct repeat: 5’-1,095,234 TCCATATCTGTTGCTACTAGTTTAAT 1,095,259-3’ and 5’-1,104,316 TCCGATTTCTGTTGCTACTAATTTAAT 1,104,342-3’. The same situation was found in S. pneumoniae SPNA45, although the coding sequences were inverted compared with strain A66, because this DNA region is located in one of the rearranged fragments of SPNA45 (Morales et al. 2015). In addition, the gene A66_RS05665, coding for a putative Rgg/GadR/MutR family transcriptional regulator, is similar (25% identical and 45% similar; E = 8 × 10⁻¹⁵) to the gadR activator of the Lactococcus lactis gadCB operon (Sanders et al. 1998).
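The coordinates and repeat sequences quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis: the helper names (`span_length`, `mismatches`) are hypothetical, and the comparisons are ungapped, so the two A66 repeat copies (26 and 27 nt) are compared only over their last 18 bases.

```python
# Illustrative check of the insertion-site figures quoted in the text.
# Sequences and coordinates are copied from the paragraph above; the
# helper names are hypothetical and not part of the original analysis.

D39_SITE = "TCCATATCCGTTGCTACTAGTTTAAT"     # D39, 1,134,639-1,134,664
A66_LEFT = "TCCATATCTGTTGCTACTAGTTTAAT"     # A66, 1,095,234-1,095,259
A66_RIGHT = "TCCGATTTCTGTTGCTACTAATTTAAT"   # A66, 1,104,316-1,104,342


def span_length(start: int, end: int) -> int:
    """Length of an inclusive, 1-based coordinate range."""
    return end - start + 1


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


left_len = span_length(1_095_234, 1_095_259)
right_len = span_length(1_104_316, 1_104_342)

# Bases lying strictly between the two repeat copies in A66.
between_repeats = 1_104_316 - 1_095_259 - 1

# Ungapped comparisons only: the full 26-mers of D39 and the A66 left copy,
# and the last 18 bases of the two A66 copies (which differ in total length).
site_vs_left = mismatches(D39_SITE, A66_LEFT)
left_vs_right_tail = mismatches(A66_LEFT[-18:], A66_RIGHT[-18:])

print(f"repeat lengths: {left_len} and {right_len} nt")
print(f"sequence between repeats: {between_repeats} bp")
print(f"D39 site vs A66 left repeat: {site_vs_left} mismatch(es)")
print(f"A66 repeats, 3' 18-mers: {left_vs_right_tail} mismatch(es)")
```

The distance between the two repeat copies is consistent with the "ca. 9 kb" size given for the inserted fragment. A gapped alignment would be needed to compare the full-length repeat copies.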
Sequence comparisons of gadASpn with sequences present in the databases revealed the presence of a very similar (98% identity) gene of identical length in Streptococcus constellatus (gadASco) and, specifically, in two phylogenetically closely related subspecies, i.e., subsp. pharyngis and subsp. viborgensis (Table S1) (Whiley et al. 1999; Jensen et al. 2013). In addition, all these S. constellatus strains share the same gadASco allele. Further, these genes are syntenic, that is, they share the same order of genes, and the DNA fragment where they are contained in S. constellatus is > 90% identical to that of S. pneumoniae A66, whereas the flanking genes were about 75% identical, as expected for two species of the same genus (Fig. S1A). This suggests that the gadASco gene may have been recently (in evolutionary terms) incorporated into the genome of an ancestor of the two related subspecies of S. constellatus and that this integration probably involved a pneumococcal donor. Remarkably, at the position of loci A66_RS12525 and A66_RS12530, which potentially encode proteins involved in recombination, there exists a ca. 40 kb prophage in both S. constellatus subsp. pharyngis and S. constellatus subsp. viborgensis. This prophage is identical, over a 37,901 bp overlap, to the Javan113 prophage (Acc. No. MK448670) previously reported (Rezaei Javan et al. 2019) and very similar to prophages from other streptococcal species (Fig. S1B). Further, other gadA genes also similar to that of S. pneumoniae were found in four strains of Streptococcus agalactiae (group B streptococci) and in the genomes of 24 strains of Lactobacillus delbrueckii, mainly in L. delbrueckii subsp. lactis (22 strains) (Table S1). The nucleotide identity between S. agalactiae gadA (gadASag; 1434 bp) and gadALde (1422 bp) from L. delbrueckii is ca. 65%, a value very close to the average sequence identity (68%) between these genes and those of S. pneumoniae (or S. constellatus) (data not shown).
The most divergent region of the gene is located in its 5’ part (between nucleotide positions 1 and 200, approximately). Of note, 6 out of 22 strains of L. delbrueckii subsp. lactis contain an incomplete copy of the gadALde gene (Table S1). Interestingly, L. delbrueckii subsp. lactis KCCM 34717 contains many insertion sequences (ISs) that interrupt gadALde and several other flanking genes. This finding was not completely unexpected since it has been proposed that ISs play an important role in the evolution of Lactobacillus species (Kaleta et al. 2010). A diagram of the chromosomal region containing a gadA-like gene in different bacterial species is shown in Fig. 1.
It is interesting to point out that at least four additional pneumococcal genomes (Table S2) harbor a different (but partly related) insert of about 17 kb at the same position where the 9 kb insert containing the gadASpn gene is located. These are: SA_GPS_SP505-sc-1895675 (GPSC21 ST10619 serotype 19F; NZ_LR216035) (Fig. 1), 2245STDY5775874 (GPSC90 ST8328 serotype 19F; NZ_LR216032), 2245STDY5699394 (GPSC30 ST7055 serotype 10B; NZ_LR216024), and B1900 (serotype 3; NZ_CP051650). In these cases, as in those containing gadASpn (see above), the insert is flanked by a near-identical repeated sequence, e.g., 5’-1,107,742 TCCATATCCGTTGCTACTAGTTTAAT 1,107,767-3’ and 5’-1,124,932 TCCAATTTCTGTCGCTACTAGTTTAAT 1,124,958-3’ in the particular case of strain SA_GPS_SP505-sc-1895675. It should be underlined that the pneumococcal smc gene (corresponding to SPD_RS05905 in D39 and represented by cross-hatched arrows in Fig. 1), which encodes the condensin protein SMC and is not essential in S. pneumoniae, albeit important for timely localization of the division site (van Raaphorst et al. 2017), has been identified as a hotspot for both recent and ancestral recombination events (Mostowy et al. 2017). A similar role for smc has recently been demonstrated in S. agalactiae (Lee and Andam 2022).
Another interesting feature of the gadASpn-containing fragment is the low mol % G + C content (≈ 30%), much lower than that of Streptococcus species (35–40%) or L. delbrueckii (≈ 50%) (Fig. 2). Foreign DNA incorporated into a genome may have a different G + C composition. Over time, such DNA is subjected to a process of amelioration where directional mutation pressures act to alter the base composition of the incoming DNA to match that of the whole genome (Bentley and Parkhill 2004).
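For readers who wish to reproduce this kind of comparison, mol % G + C can be computed directly from a nucleotide sequence, either globally or in sliding windows along a region. The sketch below is illustrative only: the function names and the toy sequence are hypothetical, and it is not the procedure used to generate Fig. 2.

```python
# Minimal sketch of a mol % G + C calculation. Function names and the toy
# sequence are hypothetical; this is not the method behind Fig. 2.

def gc_percent(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(b in "GC" for b in bases)
    return 100.0 * gc / len(bases)


def sliding_gc(seq: str, window: int, step: int) -> list[float]:
    """G + C percentage of consecutive windows along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        gc_percent(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "AATTAATTGGCC"  # hypothetical sequence, for illustration only
    print(f"overall G + C: {gc_percent(toy):.1f}%")
    print("windowed G + C:", sliding_gc(toy, window=4, step=4))
```

Applied to a full genome with a window of a few kilobases, a profile of this kind makes low G + C islands such as the gadASpn-containing fragment stand out against the chromosomal average.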
A novel gene (gadBSpn) is present in a diversity of S. pneumoniae strains
In addition to identifying pneumococcal isolates harboring a gadASpn gene, sequence comparisons revealed the existence of another gene (designated as gadBSpn hereafter) in 1591 strains whose genomes were available from the NCBI database (Table S3). Of these, 1182 isolates could be assigned to 16 different GPSCs and represent 123 STs and 20 serotypes (Table 1). The vast majority (92.4%) of these isolates belong to GPSC6 (424), GPSC8 (363), GPSC38 (110), GPSC53 (100), and GPSC43 (95). Notably, only one member (BioSample: SAMEA3233202) of GPSC9, a dominant cluster, harbored a gadBSpn gene. Pairwise sequence alignments showed that GadASpn (475 amino acid residues) and GadBSpn (501 amino acid residues) share 37% identical and 58% conserved amino acid residues (Table 2). As already mentioned for gadASpn, the gadBSpn gene was found to reside in a DNA fragment together with a variety of other genes (Fig. 3).
Table 1
Distribution of STs and serotypes among GPSCs harboring gadBSpn.
GPSC | Prevalenceᵃ | nᵇ | Different STs | Most frequent STs (n) | Serotypesᶜ | Most frequent serotypes (n) |
6 | Dominant | 424 | 49 | 156 (276); 162 (47); 143 (19); 1269 (9) | 10 | 14 (242); 9V (139); 19A (22); 15B/C (10); 11A (4) |
8 | Dominant | 363 | 10 | 289 (232); 5659 (47); 3404 (33); 7050 (28) | 1 | 5 (363) |
9 | Dominant | 1 | 1 | 861 (1) | 1 | 14 (1) |
38 | Intermediate | 110 | 10 | 393 (78); 310 (16); 9325 (6); 5475 (4) | 1 | 38 (110) |
43 | Dominant | 95 | 17 | 280 (31); 3214 (21); 239 (18); 11758 (10) | 7 | 9V (65); 35A (23) |
53 | Intermediate | 100 | 5 | 847 (90); 5262 (5); 11714 (3) | 1 | 19A (100) |
54 | Intermediate | 32 | 7 | 9473 (13); 5778 (7); 6317 (5); 706 (3) | 2 | 9V (30) |
155 | Intermediate | 13 | 5 | 105 (5); 5604 (5) | 1 | 25F (13) |
172 | Rare | 16 | 4 | 6693 (13) | 2 | 20 (15) |
208 | Intermediate | 10 | 2 | 4908 (8) | 1 | 9V (10) |
234 | Rare | 7 | 4 | 10606 (3); 1116 (2) | 1 | 3 (7) |
247 | Rare | 3 | 3 | 613 (1); 7616 (1); 12796 (1) | – | NT (3) |
257 | Rare | 5 | 3 | 5407 (3) | 2 | 25F (3); 38 (2) |
419 | Rare | 1 | 1 | 6346 (1) | 1 | 18C (1) |
536 | Rare | 1 | 1 | 5359 (1) | 1 | 38 (1) |
566 | Rare | 1 | 1 | 4651 (1) | 1 | 18F (1) |
Total | | 1182 | 123 | | 20 | |
a The data of prevalence of the indicated GPSC lineages were taken from Gladstone et al. (2019).
b Number of isolates.
c Due to repetitions of some serotypes in various lineages, the total number of different serotypes is 20.
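The totals in Table 1 and the 92.4% figure quoted in the text can be verified with a few lines of arithmetic. The sketch below transcribes the n and "Different STs" columns of the table; the variable names are hypothetical. Summing the ST column assumes that no ST is shared between GPSCs, which is consistent with the stated total of 123.

```python
# Consistency check for Table 1. Values are transcribed from the table:
# GPSC -> (number of isolates, number of different STs).
# Variable names are illustrative and not part of the original analysis.

TABLE1 = {
    6: (424, 49), 8: (363, 10), 9: (1, 1), 38: (110, 10),
    43: (95, 17), 53: (100, 5), 54: (32, 7), 155: (13, 5),
    172: (16, 4), 208: (10, 2), 234: (7, 4), 247: (3, 3),
    257: (5, 3), 419: (1, 1), 536: (1, 1), 566: (1, 1),
}

total_isolates = sum(n for n, _ in TABLE1.values())
total_sts = sum(sts for _, sts in TABLE1.values())

# The five GPSCs contributing the most isolates, largest first.
top_five = sorted(TABLE1, key=lambda gpsc: TABLE1[gpsc][0], reverse=True)[:5]
top_five_isolates = sum(TABLE1[gpsc][0] for gpsc in top_five)
top_five_share = round(100 * top_five_isolates / total_isolates, 1)

print(f"GPSCs: {len(TABLE1)}, isolates: {total_isolates}, STs: {total_sts}")
print(f"top five GPSCs {top_five}: {top_five_isolates} isolates "
      f"({top_five_share}% of the total)")
```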
Table 2
Pairwise comparisons between different GadA and GadB proteinsᵃ.
| GadASco | GadASag | GadALde | GadBSpn | GadBSsu |
GadASpn | 97/98 | 64/80 | 67/82 | 37/58 | 37/57 |
GadASco | | 64/80 | 68/83 | 37/58 | 37/58 |
GadASag | | | 62/80 | 38/57 | 38/57 |
GadALde | | | | 35/55 | 35/55 |
GadBSpn | | | | | 99/100 |
a Figures indicate the percentage of identical/conserved amino acid residues.
Sequence alignments indicated that the insertion of foreign genes (including gadBSpn) may result in the insertion and/or deletion of a variable number of the genes of the recipient strain. In most cases, the acquisition of gadBSpn involves a region that, in the D39 strain, is located between the termination codon of SPD_RS04910 (rlmD = rumA) and the initiation codon of SPD_RS05130 (37.5 kb) (Fig. 3A). A recent study has identified this region as ICESpnD39-1 (Liu et al. 2019). Integrative and conjugative elements (ICEs) are mobile genetic elements (MGEs) integrated into bacterial genomes, which encode their own excision, conjugative transfer and integration (Haudiquet et al. 2022). Previously, this area had been named PPI1 (for Pneumococcal Pathogenicity Island 1); it includes four genes (piuBCDA) coding for proteins involved in iron transport by S. pneumoniae (Brown et al. 2001, 2002), and at least one of the other genes in this region (SPD_RS05200 in D39 or SP_1051 in TIGR4; NC_003028.3) has been reported to contribute to virulence in a mouse model of infection (Brown et al. 2004). More recently, it has been demonstrated that SPD_RS05195 (coding for PezA) and SPD_RS05200 (encoding PezT) actually correspond to the pneumococcal epsilon-zeta homolog (PezAT), a class II, functional toxin–antitoxin (TA) system, with PezA as the cognate antitoxin to the PezT toxin (Khoo et al. 2007). PPI1 is a putative mobile variable region (Wyres et al. 2013) present in highly virulent isolates but not in non-invasive or intermediate-virulent strains (Mutschler et al. 2011). It has also been reported that pezAT mutants exhibit higher resistance to β-lactam antibiotics and enhanced genetic competence (Chan and Espinosa 2016). As already mentioned, this zone varies greatly among different S. pneumoniae isolates, not only in size (reaching up to ≈ 58 kb in strain 70585), but also in gene composition.
Interestingly, similarity searches showed that the proposed ICE of strain 70585 contains genes very similar (77–93% nucleotide identity) to 27 ORFs (ORF1–16, ORF69, ORF75–80, and ORF86) out of the 86 predicted genes of the 94-kb ICESluvan element of Streptococcus lutetiensis, which also inserts at the 3’ end of rlmD (Bjørkeng et al. 2013). The finding that different pneumococcal strains contain dissimilar gene cassettes (Fig. 3A) is in agreement with a role as a recombination hotspot. Indeed, the degree of variability of this zone is not strain-specific and some strains share a near-identical syntenic organization. The great diversity of genes accompanying gadBSpn is depicted in Fig. 3B. This gene (indicated by a red arrow) is surrounded by genes that may vary in number, orientation, and function depending on the pneumococcal strain analyzed, although many similarities exist in this region; for example, between strains 224STDY6178826 (GPSC53 ST947; NZ_LR216061) and FDAARGOS_1508 (GPSC6 ST156; NZ_CP083627) synteny is evident, except for a group of genes that are arranged in opposite directions (indicated in Fig. 3B as black arrows crosshatched with yellow lines).
In addition to the insertion site indicated above (between SPD_RS04910 and the initiation codon of SPD_RS05130, taking the D39 genome as a model), a second insertion site exists between SPD_RS10915 and SPD_RS10920. This region is located far away from the previous one (about 1 Mb apart) (Fig. 3A). This appears to be the case of strain 2245STDY5605669 (GPSC38 ST310; NZ_LR216017), although the genes E0F14_RS11240 (homolog of SPD_RS10915) and E0F14_RS00005 (corresponding to SPD_RS10920 in D39) were separated by ≈ 80 kb, instead of being contiguous, as is the case in D39. On the other hand, in strain 2245STDY6835400 (GPSC8; CAAVMP010000004) the gadBSpn gene is located ≈ 11 kb upstream of the cluster of genes located 3’ of SPD_RS05055 (in strain D39). When compared with strain 70585, gadBSpn in strain 2245STDY6835401 was found to be flanked by SAMEA104035315_00895 (encoding a frameshifted IS5-like element and corresponding to SP70585_RS13175 in strain 70585) and SAMEA104035315_00903, which matches SP70585_RS05435 and encodes a putative chloramphenicol acetyltransferase (CAT) (Fig. 3B). Immediately downstream of the CAT-coding gene, two more genes putatively encoding, respectively, a methionine–tRNA ligase (SAMEA104035315_00904) and a tyrosine-protein phosphatase (SAMEA104035315_00905), were found in every pneumococcal strain harboring a gadB-like gene. A detailed analysis of the genes shown in Fig. 3B revealed that the minimum cassette containing gadBSpn appears to be composed of seven genes encoding, respectively: 1) a hypothetical protein (HP), 2) an acyl carrier protein, 3) another HP, 4) the glutamate decarboxylase GadBSpn itself, 5) an acyl–CoA ligase, 6) an aminotransferase class I/II-fold pyridoxal phosphate-dependent enzyme, and 7) another HP. These genes are indicated by light green and pink arrows in Fig. 3B and included in an orange rectangle, and their mol % G + C content ranges between 27.0 and 31.9.
In most strains, however, this minimum cassette is accompanied by three more genes in position 3’ of the last of the seven genes and those putatively encoding CAT, methionine–tRNA ligase, and the tyrosine-protein phosphatase already mentioned (also indicated by pink arrows). All these genes have a mol % G + C content lower than that of the whole chromosome (Fig. S2). In four out of six strains the gene cassette including gadBSpn is located ≈ 8 kb upstream of a group of genes designated as SPD_RS05125 to SPD_RS05170 in strain D39 (Fig. 3B).
As mentioned above, gadASpn homologs are present in several Gram-positive bacteria, namely, S. constellatus subsp. pharyngis, S. constellatus subsp. viborgensis, S. agalactiae, and three subspecies of L. delbrueckii. Remarkably, a gadB-like gene was found in two isolates of Streptococcus suis (Fig. 3C), which presumably correspond to a single strain: the two isolates (2018WUSS147 and 2018WUSS150) were sampled on the same day (August 27, 2018) and in the same location (Hunan, China), and identified at the same laboratory (OIE Reference Laboratory for Swine Streptococcus). Moreover, there is nearly 100% nucleotide identity between the contigs of the corresponding S. suis isolates (unpublished observations). Interestingly, the pneumococcal allele 1 and the swine alleles of gadB are closely related: they are of the same length (1506 bp), they differ at only two nucleotide positions, and their encoded proteins differ only by a single, conservative amino acid substitution (Met in GadBSpn → Ile486 in GadBSsu).
In addition to the case of S. suis, gadBSpn-like genes appear to exist in some other Gram-positive strains, mainly in some members of the Bacillus genus. The most similar homolog of GadBSpn (56% identity, 75% similarity) is encoded by the WR52_RS29730 locus of the Bacillus cereus (strain HN001) plasmid pRML02 (Fig. S3). The GadBBce decarboxylase has 504 amino acid residues, similar to the 501 amino acid residues of GadBSpn. Notably, several of the genes located around gadBSpn are preserved around gadBBce, although the synteny is somewhat different (Fig. S3). Another protein very similar (> 98% identity) to GadBBce is WP_097888410 (504 amino acid residues; GadBBth), which is encoded by at least six different strains of Bacillus thuringiensis (Table S3).
GadA and GadB have putative epitopes similar to those of GAD65 presumably involved in T1DM development
Although many vertebrates harbor 3 different GAD-coding genes (Grone and Maruska 2016), GAD exists in humans as two isoforms, GAD67 and GAD65, which are encoded by different genes (GAD1 and GAD2, located on chromosomes 2 and 10, respectively) and differ in size, charge, localization, and antigenicity (Erlander et al. 1991; Kassa et al. 2014). GAD67 exists as the active holoenzyme (bound to PLP) that provides a steady production of neuronal cytosolic GABA, whereas GAD65 predominantly exists as a PLP-dissociated apoenzyme that mediates activity-dependent GABA synthesis when fast postsynaptic inhibition is needed, switching from the inactive to the active form. GAD65 AAbs (but not GAD67 AAbs) were detected in 80–90% of newly diagnosed T1DM patients and were an early marker of β cell destruction in individuals who later developed the disease (Atkinson et al. 1990). GAD67 isoform AAbs have been detected in the serum and the cerebrospinal fluid of patients with various neurological syndromes, although those AAbs are barely detected in the absence of GAD65 AAbs and thus are not considered clinically relevant.
According to its linear sequence, GAD65 is divided into three functional domains: the N-terminal domain comprising residues 1–188, the PLP domain (residues 189–464), and the C-terminal domain comprising residues 465–585 (Fenalti and Buckle 2010). The major epitopes in T1DM have been mapped to the PLP and C-terminal domains (Schwartz et al. 1999), and elimination of the first 100 amino acid residues altered neither enzyme activity nor reactivity with sera from diabetic patients (Fenalti et al. 2007). In addition, AAbs to N-terminally truncated GAD65 (lacking the 95 N-terminal amino acids) have been reported to identify more specifically at-risk relatives of patients with T1DM than AAbs to full length GAD65 (Pöllänen et al. 2022). In contrast, isolated positivity for AAbs to the N-terminal epitope of GAD65 confers no increased risk for T1DM. Pairwise sequence alignments of GAD65, GadASpn and GadBSpn were performed (not shown) and the CD4+ and CD8+ epitopes compiled in previous publications (James et al. 2020; Amdare et al. 2021; Ivanov et al. 2022) were localized in the alignments (Table 3). Seven putative epitopes in the pneumococcal Gads were located at regions corresponding to positions 202–266 in GAD65 and three at its C-terminal domain. This fits with the observation that the PLP domain of GAD65 is the most immunodominant region both at diagnosis and thereafter (Ronkainen et al. 2004).
Table 3
Sequence similarities between various GAD65 epitopes known to be relevant in T1DM and GadASpn/GadBSpnᵃ.
Protein | Position and sequence | IEDB epitope identifierᵇ | Evidenceᶜ |
GAD65 | 202 TNMFTYEI-APVFVLLEYVTL 221 | 105004 | D+ |
GadASpn | 104 QNLINKDICSPMGSEIEAEVI 124 | | |
GadBSpn | 110 QNLINASFCAPVATIMEINVI 130 | | |
GAD65 | 206 TYEI-APVFVLLEYVT 220 | 67328 | D+ |
GadASpn | 108 NKDICSPMGSEIEAEV 123 | | |
GadBSpn | 114 NASFCAPVATIMEINV 129 | | |
GAD65 | 217 EYVTLKKMREIIGWPGGSGD 236 | 104481 | C− |
GadASpn | 120 EAEVIIWLRQILGYSFDDKI 139 | | |
GadBSpn | 126 EINVIQWLRKVLGYSTSD-V 144 | | |
GAD65 | 232 GGSGDGIFSPGGAISNMYAM 251 | 105216 | D+ |
GadASpn | 142 VTKLGGAVTTGGVMSNTYAL 161 | | |
GadBSpn | 147 IMEVGGIVTYGGTGSNSTAM 166 | | |
GAD65 | 247 NMYAMMIARFKMFPEVKEKG 266 | 45043 | C− |
GadASpn | Deletion | | |
GadBSpn | 162 NSTAMLLARENKDGNTLELG 181 | | |
GAD65 | 248 MYAMMIARFK 257 | 104140 | A− |
GadASpn | 158 TYALMAAKRK 167 | | |
GadBSpn | 163 STAMLLAREN 172 | | |
GAD65 | 248 MYAMMIARFKMF 259 | 104141 | A− |
GadASpn | 158 TYALMAAKRK-Y 168 | | |
GadBSpn | 163 STAMLLARENKD 174 | | |
GAD65 | 473 KCLELAEYLYNIIKNREGYE 492 | 142807 | A− |
GadASpn | 364 SRIENAKKFYNILSENNAFI 383 | | |
GadBSpn | 380 KRIELTNYLQDLI--LKSSK 397 | | |
GAD65 | 476 ELAEYLYNI 484 | 104767 | B+ (CD8+) |
GadASpn | 367 ENAKKFYNI 375 | | |
GadBSpn | 383 ELTNYLQDL 391 | | |
GAD65 | 556 FFRMVISNPAATHQDIDFLI 575 | 103167 | A− |
GadASpn | 441 VLRYNSGNINITEVELEDAV 460 | | |
GadBSpn | 468 PLRFMSGNPNLTIEELQKMV 487 | | |
a Residues in black or grey boxes indicate identical amino acids or conserved substitutions, respectively, between GAD65 and either of the two pneumococcal Gads. The sequences of T-cell epitopes were taken from James et al. (2020) and Amdare et al. (2021).
b IEDB: the Immune Epitope Database (https://www.iedb.org/home_v3.php).
c With the exception of 476 ELAEYLYNI 484 (a CD8+ T-cell epitope), all are CD4+ T-cell epitopes. The level of evidence that defines a given epitope as such was taken from James et al. (2020). Evidence of natural processing and presentation is not available for epitopes marked as C or D.
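The degree of similarity between a GAD65 epitope and the corresponding pneumococcal peptides in Table 3 can be quantified by counting identical residues column by column. The sketch below does this for two of the ungapped entries (IEDB 104140 and 104767); the function name is hypothetical, and only identities are counted, so the conserved substitutions highlighted in the table are not scored.

```python
# Minimal sketch: per-column identity between aligned peptides of Table 3.
# Sequences are copied from the table; the function name is hypothetical.
# Conserved substitutions are NOT scored, only identical residues.

def identical_residues(ref: str, other: str) -> tuple[int, int]:
    """Return (identical columns, compared columns) for two aligned peptides.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(ref, other) if a != "-" and b != "-"]
    return sum(a == b for a, b in pairs), len(pairs)


EPITOPES = {
    # IEDB identifier: (GAD65, GadASpn, GadBSpn)
    104140: ("MYAMMIARFK", "TYALMAAKRK", "STAMLLAREN"),
    104767: ("ELAEYLYNI", "ENAKKFYNI", "ELTNYLQDL"),
}

for iedb_id, (gad65, gad_a, gad_b) in EPITOPES.items():
    a_ident, a_cols = identical_residues(gad65, gad_a)
    b_ident, b_cols = identical_residues(gad65, gad_b)
    print(f"IEDB {iedb_id}: GadASpn {a_ident}/{a_cols}, "
          f"GadBSpn {b_ident}/{b_cols} identical residues")
```

Scoring conserved substitutions as well would require a substitution matrix or an explicit list of conservative residue groups, which is left out here to keep the example free of assumptions about the scheme used in the table.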
Folding predictions of GadASpn and GadBSpn using AlphaFold, together with an evolutionary analysis of these proteins, were carried out for a subset of the potential epitopes shown in Table 3, namely, peptides 158–167 and 367–375 of GadASpn (Fig. S4), and 163–172 and 383–391 of GadBSpn (Fig. S5). The models indicated that, with the possible partial exception of peptide 163–172 (GadBSpn), which is somewhat buried, the putative epitopes analyzed are located on the surface of the proteins, in well-conserved regions, and may be favored due to an increased antibody accessibility.