Epidemic population structure of clinical Klebsiella aerogenes inferred by multilocus sequence typing

Background: Multilocus sequence typing (MLST) act as an accurate approach to characterize bacterial population genetics, phylogeny and epidemiology, and has not yet been applied to Klebsiella aerogenes. Results: A MLST scheme was established for a collection of 213 isolates of K. aerogenes. These strains exhibited considerable sequence diversity under purifying selection, and could be assigned into 135 sequence types, which were further divided into 8 clonal complexes and a lot of doubletons and singletons scatterred in the population snapshot. Five separately clustering lineages were presented in the population, which displayed evident homologous recombination occurred within and across lineages, with a tendency of linkage disequilibrium. Conclusions: K. aerogenes shows an epidemic population structure displaying high levels of recombination occurring more frequently than point mutation.


Sequence diversity analyses
The GC content was calculated as Count(G + C)/Count(A + T + G + C) ×100. The number of polymorphic sites, the average pairwise differences per-site (π), and the p-values of Tajima's D test were calculated by DnaSP Version 5.10 (11). The average non-synonymous/synonymous rate ratio (K a /K s ) was calculated with KaKs Calculator Version 2.0 (12) to estimate whether the group was under purifying selection or not. Distinct allelic profiles were assigned into different sequence types (STs), which were further clustered into clonal complexes (CCs), using eBURST Version 3 (13). Every two different STs who share six of the seven loci were connected with each other, and a connected graph with more than 2 STs was called a CC. If only two STs were connected together, the two were called doubleton. All other unconnected STs were named singletons. The founders (ancestry types) of CCs were predicted by eBURST with 1,000 re-samplings for bootstrap.

Population structure analyses
The sequences of the 7 MLST loci of each ST were concatenated, and the neighbor-joining tree was built by MEGA7 (14) with the concatenated sequences of all the detected STs. The percentages of replicate trees, in which the associated taxa clustered together in the bootstrap test (1000 replicates), were shown next to the branches. Linkage model was implemented by STRUCTURE Version 2.3 (15) to infer ancestries for each ST, which assumed that each ST was derived from fixed number of assuming ancestral subpopulations. The posterior probability P(X|K) was calculated to determine the probability of ST X derived from ancestral subpopulation. Individual runs (200,000 burn-in iterations and 1,200,000 iterations sampling iterations) per value of K (ranged from 2 to 10) were performed. The splits network was generated by SplitsTree4 (16) using the neighbor-net method.

Recombination analyses
The pairwise homoplasy test (Phi test) was performed with SplitsTree4, which was used to detect evidence of recombination events in a set of aligned sequences. When the evidence for recombination was found, the p value yielded from the Phi test was less than 0.05. The LDhat method unique STs in total from the 213 strains tested, indicating that these 7 MLST loci exhibited sufficient ability for strain discrimination. The 135 STs could be assigned into 8 CCs (CC1 to CC8; 36 STs, 96 strains), 12 doubletons (24 STs, 28 strains) and 75 singletons (75 STs, 89 strains) (Fig. 1). The Chinese and non-Chinese isolates were clustered together, and no significant pattern could be found when inspecting the relationship between STs/CCs and their years and geographic locations of isolation.
The largest CC1 contained 7 STs, and the predicted founder of CC1 was ST4, containing 20 (68.97%) of the total 29 strains in CC1, which represented as the most predominant ST in this CC as commonly observed (13). However, ST50, the predicted founder of the second largest CC3, contained only one strain (4.55% of the total 22 strains in CC3). The fact that the CC3 founder ST50 was not the predominant ST in the complex might be resulted from the sampling bias.

Clustering STs into several lineages
A neighbor-joining phylogenetic tree was constructed from the concatenated sequences of the 135 STs with 1000 bootstraps (Fig. 2). Although these STs were clustered into several lineages, most of the bootstrap supporting values were lower than 50%. This suggested conflicting phylogentic signals, due to frequent horizontal genetic transfer promoted by homologous recombination, leading to severe incongruence in phylogeny.
The Bayesian clustering approach implemented in STRUCTURE was used to infer the number of population units (denoted by K) within the 135 STs tested, yielding a maximal posterior value of K=4.
This suggested that these 135 STs comprised 4 ancestral subpopulations. Accordingly, the 135 STs could be separated into 5 lineages (Fig. 2), designated L1 to L5, which contained 81, 29, 14, 3 and 8 STs respectively. STs assigned to each of the 8 CCs were interspersed throughout the five lineages, suggesting that there was no correlation between MLST and Bayesian analysis. The split network (Fig.   3) of the 135 STs contained rich parallel links, which supported the inference of 5 lineages with high frequency of recombination. As shown in Fig. 2 and 3, L4 to L5 were distantly separated from not only each other but also L1 to L3, while relatively L1 to L3 came close together. In summary, assignment of 135 STs into 5 lineages, as inferred by STRUCTURE-based Bayesian clustering, was supported by the neighbor-joining phylogenetic tree and the split network.

High levels of homologous recombination
L3 to L5 were not included in estimation of recombination and linkage disequilibrium (see below) due to their limited numbers of STs contained. The p-values of Phi test for lineage L1 and L2 and the whole population (all the 135 STs) were less than 0.0001 (Table 3), indicating that homologous recombination occurred within and across lineages. The per-site ρ/θ values detected for L1, L2 and the whole population were 2.34, 3.44 and 1.189, respectively (Table 3): i) the recombination frequency was at least 2 times more likely to occur than recombination within L1 or L2; ii) the recombination rate was almost equal to the point mutation rate across the whole population; and iii) the recombination frequency within lineages was around 2 times higher than that across different lineages.

Tendency to linkage disequilibrium
Linkage disequilibrium was estimated with the I A parameter from STs, a statistic that was expected to be zero when the alleles were in the linkage equilibrium (free recombination). The I A values were 0.0146 (p-value=0.103), 0.0477 (p-value=0.021) and 0.0433 (p-value < 0.001) for L1, L2 and the whole population, respectively (Table 3), suggesting a tendency of linkage disequilibrium (nonrandom association of alleles at different loci) within and across lineages. Especially, the I A value detected for the whole population was significantly different from zero, indicating the credible linkage disequilibrium at the whole population level. Nevertheless, the above detecting I A values were yet relatively low, which was consistent with observation that evident recombination is occurring within the K. aerogenes population.

Discussion
K. aerogenes is widely distributed in the human gastrointestinal tract and in the environment, where physiochemical properties greatly change. This requires frequent adaptation of K. aerogenes to its niches. K. aerogenes strains are genetically diverse and can be assigned into multiple CCs and lineages as detected by MLST and Bayesian analysis. Clinical K. aerogenes shows an epidemic population structure (displaying high levels of recombination occurred more frequently than point mutation), which was distinct from not only a clonal population (composed of strains derived from a common ancestor that diversified predominantly by mutation) but a panmictic population (in which recombination occur freely and polymorphic sites are all at linkage equilibrium) (19-21). This is the first report of MLST-based inference of genetic diversity of K. aerogenes. Further genome sequencing study on a large collection of K. aerogenes isolates will gain a deeper insight into evolution and epidemiology of this pathogen.  Tables Table 1 MLST loci and PCR parameters

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.