The purpose of our study was to investigate association between genetic determinants of infecting Shigella spp. and EIEC isolates and the symptoms and disease severity of the patients. If such associating genetic determinants were found, diagnostics could be developed that predict the severity of the resulting disease. Additionally, it could guide prioritization and optimization of infectious disease control measures regarding shigellosis. In the Netherlands, the severity predicting capabilities of genes of other pathogens were used before in prioritization of control measures. In 2016, case definitions for Shiga producing E. coli (STEC), another pathotype of E. coli, were extended from culture confirmation alone to the detection of STEC by Polymerase Chain Reaction (PCR) targeting the stx1 and stx2 genes and particular virulence genes. These combination of genes within STEC bacteria are known to have associations with a higher risk for severe disease and clinical complications (22).
First, the association of the presence or absence of single genes resulted in no statistically significant association between genes with specific symptoms or severity scores with high sensitivity and specificity. Second, the association of multiple genes resulted again in no statistically significant association with specific symptoms and severity scores of patients, indicating that no complex genetic interactions that may explain disease severity could be found. Third, the association of k-mers resulted in a consensus sequence consisting of multiple aligned k-mers that was associated with a high severity score of de Wit. The sequence of 100 bp, containing multiple associated k-mers, was located between two genes with a putative promoter region with an optimal inter-base distance of 16 bases but an unclear TATAAT box. When blasting the consensus k-mer against all assemblies, three difference bit scores were observed, suggesting there are three different genetic variants of this locus. Performing a Kruskal-Wallis test on these three different bit score groups, showed that the k-mer was not valid (p = 0.6), and presumably was a false positive.
In our study, the genes that were associated with specific symptoms in earlier described studies (15, 16), were not confirmed. In another study that was conducted in Brazil among children with shigellosis, sepA was associated with abdominal pain, and the combination of sepA, sigA and ial genes with bloody diarrhea (16). However, it was not clear if univariate or multivariate testing for virulence genes was performed. In another study from Brazil, a case-control study was conducted. They found that the sen (shET-2) gene was associated with diarrhea in children in general, but not with specific symptoms of shigellosis patients. They associated the virA gene with fever in children with shigellosis, however virA was also found in 44% of controls (15). In our study, we have used a larger sample size consisting of patients with other demographics in another setting, analyzed all harbored genes instead of a predefined selection, used other methods with higher resolution as it was based on whole genomes, and included correction for multiple testing.
Because all used algorithms in our study generated negative results for association, the “genus” was also tested as a benchmark. The used algorithms performed adequate, as they resulted in relevant genetic variants. Furthermore, a sensitivity analysis indicated that the group distribution of the characteristic “genus” was suitable for significant detection of associated single genes. This characteristic had an adverse unequal group distribution of 10% versus 90%, indicating that the number of isolates and the distribution over the groups was suitable for associating genetic content with all symptoms and severity, except for “diarrhea”, which was the only characteristic with a more unequal group distribution than “genus”. Moreover, other studies found genetic variants significantly associated with their tested traits using the microbial GWAS methods that were used in our study (23-27).
Using Scoary, single genes that had association with the characteristic “genus” were found, with low p-values and high sensitivity and specificity. Further, with Pyseer, over 3,000,000 potentially associated k-mers were found. This is in concordance with another study that demonstrated the suitability of k-mers for identification of Shigella spp. and E. coli isolates based on whole genome sequences (28). Moreover, using Random Forest, OOB estimate error rate for the characteristic “genus” was 15.9%. This indicated that the model that predicts the genus of unknown isolates performed better than random, however does not accurately predicts the genus of some isolates. Notably, six out of ten discrepant isolates also had an uncertain assignment with traditional laboratory tests. If we exclude these isolates, the OOB estimate error rate is 1.9%, indicating that not the used method, but the nature of these isolates that are possessing characteristics of both Shigella spp. and E. coli causes the uncertain assignments. The Random Forest method performed almost equally as good as the traditional laboratory tests and could be used for identification of the genus if whole genome data is available, although more isolates should be tested to validate this. Additionally, it would be useful to test the applicability of Random Forest for identification to species and serotype level. Furthermore, in a future study, the results of the traditional laboratory tests specifically can be associated with genetic variants. Consequently, if associated variants could be found, traditional tests could be omitted. This will save costs in workflows that already consists of draft genome sequencing of isolates for other purposes, for instance surveillance.
In addition to the methods using gene presence/absence and k-mers that were used in our study, other types of genetic variants can be used as input for microbial GWAS (29). However, for highly plastic genomes like E. coli and Shigella spp. (30), the use of SNP-based methods is not appropriate. Moreover, the large variation of k-mers found in our study also indicated that an SNP-based method will result in an even larger variation between groups. Another genetic variant that can be used in GWAS is based on De Bruijn Graphs. However, it is mainly based on the creation of overlaps of k-mers, therefore, it probably does not generate association with symptoms or disease severity using the data from our study (31).
One of the strengths of our study was the availability of isolates representative for the population structure as encountered in Western European countries, as well as the clinical data of the patients that they were infecting. Second, results of the performed traditional laboratory tests to determine the species of the bacteria were available for all isolates. Finally, another strength of our study is that several potential genetic variants were associated with the trait “genus”, and a sensitivity analysis was performed, both proving the suitability of the used algorithms.
Some considerations with regard to our study should be taken into account. First, the symptoms and severity of disease were characteristics of the patients and not directly of the bacterial isolates as for instance antibiotic resistance. Although the need for correction of the effects of underlying disease was investigated, the immune status of the patients was not taken into account. Second, the clinical characteristics used in our study were self-reported and not objectively measured, therefore subjective to the judgment and memory of the patients. Third, another consideration was that genus level was associated as characteristic, while other GWAS studies have concentrated on bacterial isolates of the same species (32, 33). However, according to multiple research groups (3, 34, 35) Shigella spp. and E. coli should be considered as one species based on their genetic relatedness, if present, their differences are more phenotypical. Fourth, the number of isolates for S. boydii and S. dysenteriae in our study were inadequate with two and no isolates, respectively. However, we believe the number of isolates to be adequate, as studies with similar sample sizes have been performed in the past in which genetic variation in pathogens was identified that had predictive value for the course of disease (27, 36). Finally, the used dataset only contained isolates encountered in the Netherlands, resulting in a geographical biased set (37, 38). Therefore, to avoid missing serotypes in future studies, the current dataset should be supplemented with isolates from other geographic areas.