Data preparation and exploration
To assess whether other pathogens present in the fecal samples caused the symptoms and severity of patients, presence of symptoms and severity scores of patients with coinfection were compared to those of patients without coinfection. In 15.5% of the patients, a coinfection was detected. The symptom blood in stool, known as a typical symptom of shigellosis (18), was significantly less present in patients with a coinfection (chi-square, p = 0.019), while the presence of other symptoms was not statistically different (chi-square, p > 0.05). The lower fraction of patients with coinfection that experienced blood in stool was also reflected in the de Wit severity score, in which blood in stool is a criterion with double weighing, as it was significantly lower for patients with coinfection (T-test, p = 0.017). The Modified Vesikari Score (MVS), in which blood in stool is not a considered factor, showed no significant difference between patients with and patients without coinfection (T-test, p = 0.076).
The assemblies of 277 isolates were used to construct a gene presence/absence table and k-mers of variable length. This resulted in a gene presence/absence table consisting of 2,890 core genes (i.e. present in all 277 isolates) and 9,869 genes in total. K-mer counting yielded 28,551,795 genetic variants.
A phylogenetic tree was created based on the core genome SNPs, and the distribution of the severity scores, coinfection and the effects of underlying diseases were visualized (Figure 1). The core SNP analysis resulted in some species-specific clusters. However, clusters that contain multiple species were also present (Figure 1). In addition, severity scores, effects of underlying diseases and coinfection were randomly distributed over the isolates in the tree (Figure 1). For the GWAS analysis, only isolates sequenced during this study and displayed in Figure 1 were used. However, for contextualization of the position of the isolates in this study compared to the global population structure of Shigella spp. and EIEC, an additional tree was inferred including genomes from each of the main lineages and phylogenetic groups (Additional File 1). It showed that the population structure of our EIEC isolates was mainly concentrated in three clusters containing ST270, ST6 and ST99 based on isolates from the United Kingdom (UK) (19). The UK ST270 cluster corresponded with cluster 8, the large EIEC cluster from Pettengill et al.(3). In our analysis, EIEC isolates belonging to cluster 4, EIEC small or cluster 7, the EIEC/EHEC/EAEC cluster were not included (3). For S. flexneri, a few isolates related to travel to Asia belonged to PG6 and PG2 (Figure 1 and Additional File 1). However, the majority of isolates were PG3, consisting solely of isolates with serotype 2a or Y, and PG1, consisting of isolates of serotypes 1a, 1b, 1c, Yv and 4av. For S. sonnei, almost all isolates were of lineage III, only a few isolates within lineage II were detected (Figure 1 and Additional File 1). The presence of large clusters of EIEC isolates, the presence and distribution of serotypes over the PGs for S. flexneri and the predominance of S. sonnei lineage III were described before, and are representative of population structures found in other western European countries (19-22).
GWAS using gene presence/absence of single genes
None of the tested symptoms and severity scales resulted in significantly associated genes with a sensitivity and specificity above 85%. However, eight significantly associated genes were found with sensitivity above 92% and a specificity of 87% for the characteristic “genus”, that was used as a benchmark to evaluate algorithm performance. The gene with the highest association, produces a hypothetical protein and had a Benjamini Hochberg corrected p-value of 7.01E-27 and a sensitivity and specificity of 99% and 87%, respectively.
Additionally, the p-values of all characteristics were compared to random permutation datasets by plotting the log transformed expected and observed p-values against each other (Figure 2). The gene associations with the tested severity scales (Figure 2A and 2B) and symptoms (Figure 2C) displayed similar plots as the random permutation datasets, indicating a performance as random cases. This did not apply to the benchmark characteristic “genus”, that plot showed a clear difference between expected and observed p-values, which was supported by the low Benjamini Hochberg corrected p-values (Figure 2D).
It followed from the sensitivity analysis based on the benchmark characteristic “genus” that genes present in 0.7% of total isolates within the smallest group (Escherichia, n=30), corresponding to two isolates of the total number of isolates, resulted in significant p-values. This indicated that a gene presence in a minimum of two isolates from the smallest group was enough to detect significance, if these genes were not present in the other larger group (Additional File 2).
GWAS using gene presence/absence of multiple genes
The generated random forest model, created using isolates from the training set resulted in an out-of-bag (OOB) estimate of error rates when testing the isolates from the test set. A random error rate of 66.7% for the severity scores and 50% for the symptoms and genus was expected, as respectively three and two classes were predicted. OOB error rates in the created random forest models using 5000 trees for the prediction of symptoms and severity scales of patients were as expected for random datasets when applied to the test set. Error rates ranged from 40.8% to 53.1% for all symptoms and 65.1% to 70.1% for the two severity scales (Table 1). The construction of additional trees did not lead to better predicting models.
In contrast, the OOB error rate of the model that predicted the benchmark characteristic genus was 15.9%, much lower than the random expected error rate of 50% (Table 1). The created model for genus prediction was further explored by examining the location of the misclassified isolates in the phylogenetic tree (Figure 1). Comparing them with the traditional laboratory results that were obtained during the IBESS-study showed that six out of ten discrepant isolates were so-called hybrid isolates and also had an uncertain assignment using the traditional laboratory tests (Table 2).
GWAS using k-mers
Associating k-mers with different characteristics using Pyseer did not lead to any significant k-mers for abdominal pain, abdominal cramps, blood in stool, fever, headache, mucus in stool, nausea, vomiting, and the severity score of MVS (Table 1). In contrast, 156 k-mers were associated with diarrhea, however, all k-mers had an invalid chi squared test and likelihood-ratio test (LRT) p-values higher than 0.313. The de Wit severity score resulted in 17 associated k- mers, whereof 15 k-mers with an LRT p-value lower than 0.05. An assembly of these 15 k-mers resulted in a single consensus sequence of 100 bp, based on overlapping k-mers. A BLASTn search of the consensus sequence against the database of the National Center for Biotechnology Information (NCBI, Bethesda, USA) revealed that the significant k-mers are located between two genes (Additional Figure 3), including a type II toxin-antitoxin gene (AYE47152.1) and a gene coding for DUF1391 (AYE48123.1), a protein of unknown function. A potential promoter region in the k-mer was found with a -10 box (CATTATTTT) at position 58 and a -35 box (TTGACG) at position 36 of the sequence (Additional Figure 3).
To validate the potential of the k-mer to predict the severity score of de Wit scale, the k-mer was queried by BLAST against a database with all isolate assemblies from our study. For every sample, the bit-score of the best scoring hit was plotted against the corresponding severity score (Figure 3A). Roughly, three groups resulted, one with a bit-score of >175 corresponding with a full-length match with the k-mer, one with a bit-score of 50-175 corresponding to a partial match and <50 corresponding to no match. Subsequently, the Kruskal-Wallis test was performed to investigate the difference in the de Wit severity score between the groups (Figure 3B). No statistically significant difference between the groups was found, with a p-value of 0.6.
To check the suitability of the Pyseer method for the association of k-mers with characteristics in our data-set, the benchmark characteristic “genus” was used and resulted in 3,036,507 potential associated k-mers.