Collection of emm89 S. pyogenes clinical isolates in Japan and construction of cohorts
We collected clinical S. pyogenes strains isolated between 2016 and 2021 from patients with non-invasive infections and STSS in Japan. STSS was diagnosed according to the clinical criteria of the Ministry of Health, Labour and Welfare of Japan (Extended Data Table 1)15. Phenotypes of strains from STSS patients were determined as ‘severe invasive’.
Concerning the emm89 clinical isolates, we collected T serotype TB3264 and untypable strains in addition to emm genotype-identified strains. The T serotype TB3264 corresponds to the genotype emm89 or emm9416. A total of 207 clinical isolates were collected with the cooperation of the National Institute of Infectious Diseases and ten public health institutes nationwide (Extended Data Tables 2, 3). We performed draft genome sequencing of the strains and identified their emm types. In total, 150 of these were determined as emm89. To focus on the pathogenic mechanisms underlying severe invasive infections in the emm89 cohort, 150 emm89 strains were used for subsequent analyses (Fig. 1a and Extended Data Table 2). We previously determined the draft genome sequences of 161 emm89 strains isolated in Japan between 2011 and 2019 and determined their phenotypes using the same criteria (Extended Data Table 2)10. We combined these two sets and finally considered a total of 311 emm89 strains, including 135 severe invasive and 176 non-invasive isolates, as the Japanese cohort.
We also collected public genome sequences of emm89 S. pyogenes strains isolated from nine countries to further characterize the genetic properties of the Japanese cohort (Extended Data Table 2)17-19. In this study, the phenotypes of these strains were considered invasive if the diagnoses included severe infections, STSS, invasive infections, necrotizing fasciitis, bacteremia, or sepsis, and isolation sites were described as normally sterile sites. Consequently, we identified 666 strains in the global cohort, including 420 isolates from invasive cases and 246 from non-invasive ones (Extended Data Table 2).
Pan-genome and phylogenetic analyses revealed both shared and distinct features in the Japanese and global cohorts
To determine the core genes and gene distribution in both cohorts, we performed pan-genome analyses. In the Japanese cohort, 1,417 core genes possessed by >99% of all isolates were determined out of the 3,334 different genes detected within the 311 strains. In contrast, the global cohort was more diverse, with 4,743 different genes, out of which 1,327 were core genes (Fig. 1b).
Next, the phylogenetic relationships of the core gene sequences were calculated (Fig. 1c and Extended Data Fig. 1). The tree for the global cohort branched into four clusters, with clusters A, B1, B2, and B3. Cluster B3 includes 640 genetically similar strains isolated mainly from Europe, North America, and Japan (Fig. 1c). The phylogenetic tree for the Japanese cohort could also be clustered as in the case of the global cohort, and there was no significant difference in the proportions of strains classified into each cluster (chi-square test, p=0.13; Supplementary Table 4); thus, we concluded that the overall phylogenetic features of emm89 strains were distributed quite similarly in Japan and other areas. Within cluster B3, we discovered a non-invasive strain from Japan that had no identical pattern to the reported nga promoter variations20 (Fig. 1c and Extended Data Fig. 1). This pattern is supposed to be a subtype of clade 3 because it shares the haplotype A–27G–22T–18, which is distinctive of clade 3, but has a mutation in the –10 box (Extended Data Fig. 2)20. Thus, we named this novel nga promoter variation type 3.420. Further details about the phylogenetic analysis are given in Supplementary Note 1. Overall, using phylogenetic approaches, we found that most strains from Japan and countries in Europe and North America share genetically close relationships.
GWASes detected SNPs/indels associated with invasiveness that were both common and specific to Japan and other countries
To discover all types of genetic variants within emm89 S. pyogenes that were associated with (severe) invasiveness, we applied pan-genome analysis and performed three types of independent GWASes targeting SNPs in core genes, the presence or absence of all genes, and other variants located in intergenic regions spanning several nucleotides.
We extracted and detected 24,627 and 47,060 SNPs/indels from core gene alignments in the Japanese and global cohorts, respectively. Subsequent GWASes identified SNPs/indels associated with severe invasiveness in a Japanese cohort and invasiveness in a global cohort. To control for population bias, we calculated pairwise distance matrices and chose seven and three dimensions for the analyses of the Japanese and global cohorts, respectively (Extended Data Fig. 3a, b). The GWAS of the Japanese cohort detected 17 SNPs/indels in 13 core genes (Fig. 2a and Extended Data Table 2). Of the 17 significant variants, there were seven single-nucleotide deletions (SNDs), seven SNPs causing non-synonymous amino acid substitutions, and three SNPs causing synonymous substitutions. The covS gene, encoding a sensor kinase of the two-component system CovR/S, contains four SNDs with the lowest p-values (p=1.16×10–7 for the 39th, 40th, and 46th nucleotides, and p=1.15×10–6 for the 125th nucleotide). These four deletions were associated with severe invasive infections.
We also performed a GWAS for the global cohort and detected 1,075 SNPs/indels significantly related to invasive infections among the 360 core genes (Fig. 2b, c and Supplementary Table 3). Among the significant SNPs/indels, 725 caused synonymous substitutions and 319 caused non-synonymous substitutions or frameshift mutations. Nineteen SNPs induced nonsense mutations, whereas the effects of 12 SNPs/indels were unpredictable because of a lack of reference sequences (Supplementary Table 3). Notably, 96 SNPs/indels accumulated in a single gene, murJ, which is involved in peptidoglycan biosynthesis (Extended Data Fig. 3c). The SNP with the lowest p-value (p=1.35×10–14) was lacE, which encodes the EIICB component of the lactose-specific phosphotransferase system (Fig. 2c). The mutation is associated with an invasive phenotype and is mainly observed in strains isolated in the US. Compared with the significant 17 SNPs/indels in the Japanese cohort, 10 SNPs/indels were also detected in the global cohort, including four SNDs in covS and one SNP each in six loci (Fig. 2c). Deletions at the covS locus are common among strains from several countries, including Japan. In contrast, SNPs found at six loci, gatA, group_1102, group_647, iscS_1, recU, and fhuB existed exclusively in Japan (Fig. 2c and Supplementary Table 3). These results suggest that several bacterial mechanisms cause severe invasive S. pyogenes infections, and some prevail worldwide, whereas others are specific to Japan.
The GWAS on COGs detected 2 and 109 genes associated with severe invasiveness in the Japanese cohort and invasiveness in the global cohort, respectively
Next, we examined the associations of accessory COGs with severe invasiveness and global invasiveness in Japanese patients. Two significant genes were detected in the GWAS for the Japanese cohort: group_184, which encodes a hypothetical protein, and divIC, which encodes a septum formation initiator protein (p=8.81×10–6 and p=6.72×10–6, respectively; Fig. 3a and Supplementary Table 4). Although analysis of the global cohort revealed the presence of 169 genes that were significantly related to invasiveness, no genes were identical or homologous to the two genes detected in the Japanese cohort (Fig. 3b, c and Supplementary Table 5). Among the 169 genes, 25 encoded phage-related genes and 14 encoded mobile genetic elements (MGEs) such as transposase, integrase, and recombinase. Overall, genes associated with invasiveness were found to encode MGEs and transporters, whereas major virulence factors were not significantly associated with invasiveness.
The k-mers-based GWAS detected both distinctive and identical variants compared to the SNPs- and COGs-based GWASes
To detect SNPs/indels and multiple mutations in the entire genome, we extracted 31-nt-length k-mers from whole genomes and performed a GWAS. The k-mers-based GWAS can handle polymorphisms spanning more than one base, such as indels, inversions, and translocations, in both the coding and non-coding regions.
The set of connected nodes and edges comprising each de-Bruijn graph is called a complex (Fig. 4a–f). In the Japanese cohort, the k-mers-based GWAS detected two complexes containing causative variants associated with severe invasiveness (Extended Data Table 4). The complex comprising the nodes with the lowest q-value (p=1.49×10–2) was Comp_11 in the covS locus (Extended Data Table 4). The causative mutation in covS was an SND (Fig. 4a). This deletion resulted in a frameshift mutation that shortened the length of CovS from 500 to 35 amino acids, leading to increased invasiveness, as previously reported in other emm types13. Another complex significantly associated with severe invasiveness is Comp_2 (p=4.22×10–2; Extended Data Table 4). Comp_2 is a highly variable region containing eight hypothetical protein-coding genes, with high similarity within the first 75 bp. Significant k-mers were also mapped to the first 26 bp of group_184 and 20 bp upstream (Fig. 4b). Together with this finding, group_184 possibly contributes to severe invasiveness through not only its presence itself but also that of the upstream region.
Next, we analyzed the global cohort. We identified mutations that were significantly associated with invasiveness in five regions (Extended Data Table 5). The mutation with the lowest q-value (p=1.90×10-2) was identical to the SND in covS found in the k-mers-based GWAS of the Japanese cohort (Fig. 4a).
Two significant k-mers were present in Comp_6, which were found to be an intergenic region of 270 bp (p=1.90×10–2; Extended Data Table 5 and Fig. 4c). In Comp_24, with a high sequence variation containing genes encoding transposases, the presence of a 281-bp sequence consisting of several k-mers was significantly associated with the invasive phenotype (p=1.90×10–2; Extended Data Table 5 and Fig. 4d). n27458, in Comp_10, the sagG locus encoding the ATP-binding protein of the efflux transporter of SLS, was significantly correlated with the non-invasive phenotype and causes a synonymous mutation (p=2.40×10-2; Extended Data Table 5 and Fig. 4e). The other significant mutation existed in the fhuB locus, encoding a putative ferrichrome transport system permease (p=2.40×10-2; Extended Data Table 5). The mutation was identical to SNP T218C detected in the SNP/indels-based GWAS (Fig. 4f).
The k-mer approach identified multiple variants, including the mutation identified in the SNPs/indels-based GWAS, fhuB SNP T218C. Additionally, the mutation detected in covS was different from that detected in the SNPs/indels-based GWAS, but both caused frameshift mutations.
AlphaFold-based prediction of the impact of the identified mutations on function
To assess the impact of mutations on protein function, we predicted the protein structure using AlphaFold21. Here, we present structural predictions for three representative proteins: CovS, whose invasion-related deletions prevailed worldwide (Fig. 5a, b); FhuB, which carries a prominent mutation in the Japanese cohort and is associated with non-invasiveness (Fig. 5c–e and Supplementary Note 2); and LacE, whose mutation was observed mainly in invasive strains from the US (Supplementary Note 3).
We predicted a homodimerized CovS model using AlphaFold because the CovS of S. pyogenes forms homodimers22. SOSUI predicted that CovS has two transmembrane regions (Fig. 5a). Mutations detected in the SNP/indels- and k-mers-based GWASes were predicted to shorten the CovS protein to 35 and 45 amino acids, respectively. Since the intracellular domain of CovS is in the C-terminal region and is involved in the phosphorylation of the transcriptional regulator CovR, frameshift mutations leading to CovS truncation would inactivate the protein, and thus, CovR function (Fig. 5b).
The SNP fhuB T218C substitutes the 73rd valine of FhuB with alanine. FhuB is a component of an ATP-binding cassette transporter system that utilizes ferrichrome, which is one of the carriers of Fe3+. FhuB is predicted to localize to the cell membrane and form a channel with FhuG (Fig. 5c). The FhuBG complex can bind to one molecule of the extracellular ferrichrome-binding lipoprotein, FhuD, and two molecules of the intracellular ATP-binding protein, FhuC. Therefore, we constructed a structural model of the FhuBCCDG complex (Fig. 5d). This implied that the 73rd residue of FhuB exists in a region adjacent to FhuD. The hydrophobicity of the side chain was attenuated by the mutation, which potentially affected ferrichrome transport (Fig. 5e).
The fhuB T218C mutation represses the growth of a severe invasive strain in human blood
Based on the GWAS results and predicted protein structures, we focused on the SNP fhuB T218C. We constructed a mutant strain in which the SNP fhuB T218C was introduced, to further investigate its potential virulence. We selected the strain TK02, which carries the wild-type (WT) allele T218 in fhuB and was originally isolated from a sample obtained from a patient with severe invasive infection in Japan. We used a several times-passaged TK02 strain, TK02', as a WT strain and introduced the SNP fhuB T218C into it via allelic exchange mutagenesis with a thermo-sensitive shuttle vector. We then confirmed that there were no differences between the WT and fhuB T218C strains using whole-genome resequencing.
To reveal the effects of the SNP on invasiveness, we performed transcriptomic analysis of the WT and fhuB T218C strains in THY broth and human blood. Principal component analysis revealed that the differences in the overall transcriptional profiles between the strains were more remarkable in blood than in THY (Fig. 6a, b and Supplementary Table 6–9). We found that the expression of CovR-regulating genes, including speB, nga-ifs-slo operon, and sag operon, was upregulated in the blood, compared with that in THY, in both the strains (Fig. 6c). In human blood, the mutant strain downregulated mga and emm and upregulated the expression of genes encoding virulence factors, such as speC, scpA/B, endoS, ska, and sfbX, as compared with those observed in the WT strain (Supplementary Table 12). Although Mga regulates the expression of surface and secreted molecules important for colonization and immune evasion23, no strong expression changes were observed in the Mga regulon, except for in the emm gene. Notably, fhuB, fhuC, and fhuD were upregulated in the fhuB mutant in both environments (Supplementary Table 9 and 12).
To determine whether fhuB T218C mutation affects iron transport, we measured the intracellular free ferric ion concentration in each strain and observed no differences among the strains in human blood (Fig. 7A). The upregulation of fhuBCD may compensate for the impaired function mediated by the SNP.
Next, to investigate the effects of SNP on bacterial survival in human blood, we performed a bactericidal assay using human blood. At 3 h after incubation, the fhuB T218C mutant strain showed a significantly decreased survival rate than the WT strain (Fig. 7B). To further determine the blood components that the attenuated survival of the mutant can be attributed to, we compared bacterial survival rates in erythrocytes, polymorphonuclear cells (PMNs), plasma, heat-inactivated plasma, and brain heart infusion broth (Fig. 7C–7G). Notably, after incubation with erythrocytes, PMNs, or plasma, the mutant strain showed a significantly lower survival index than the WT strain, as observed in whole blood (Fig. 7C–7E and 7G). However, there were no significant differences between the survival rates of the WT and mutant strains in heat-inactivated plasma, suggesting that the mutant strain is susceptible to complement (Fig. 7F). Overall, the polymorphism T218C in fhuB impaired the survival of severe invasive strains in human blood through interactions with erythrocytes, PMNs, and complements.