Considerable sequence diversity under purifying selection
Sequences of each of the 7 MLST loci had no insertions and deletions. The length of each of the 7 loci ranged from 263 bp (pryG) to 646 bp (fusA) (Table 2). The GC content of each of 7 loci ranged from 50.7 (dnaA) to 57.4 (gyrB), with a mean value of 54.5. Polymorphism was detected at 196 (5.58%) sites in total, of which only 7 had tri-allelic single nucleotide polymorphisms (SNPs) (Table 2). The diversity index π was 7.13e-3 for the concatenated sequences and ranged from 2.97e-3 (rplB) to 2.29e-2 (leuS) at different loci (Table 2).
Ka/Ks >1 or <1 indicated positive or negative selection on the sequence tested, respectively. The Ka/Ks ratio of the concatenated sequences was 0.019. fusA displayed a Ka/Ks ratio of 0.104, while the Ka/Ks ratios of all other loci were less than 0.043 (Table 2). These denoted that all these loci were under a strong purifying selection. The Tajima’s D test of the concatenated sequences gave a value of -0.84, indicating that the sampled population was evolving randomly without evidence of positive selection (Table 2). However, rplB had a Tajima’s D test value of -1.9, which was less than 0 with a statistical significance at the 0.05 level (Table 2). This suggested that rplB suffered a recent selective sweep or population expansion after a recent bottleneck.
Abundant STs and CCs
The allele number of each of the 7 loci varied from 10 (rplB) to 37 (leuS) (Table 2), generating 135 unique STs in total from the 213 strains tested, indicating that these 7 MLST loci exhibited sufficient ability for strain discrimination. The 135 STs could be assigned into 8 CCs (CC1 to CC8; 36 STs, 96 strains), 12 doubletons (24 STs, 28 strains) and 75 singletons (75 STs, 89 strains) (Fig. 1). The Chinese and non-Chinese isolates were clustered together, and no significant pattern could be found when inspecting the relationship between STs/CCs and their years and geographic locations of isolation.
The largest CC1 contained 7 STs, and the predicted founder of CC1 was ST4, containing 20 (68.97%) of the total 29 strains in CC1, which represented as the most predominant ST in this CC as commonly observed (13). However, ST50, the predicted founder of the second largest CC3, contained only one strain (4.55% of the total 22 strains in CC3). The fact that the CC3 founder ST50 was not the predominant ST in the complex might be resulted from the sampling bias.
Clustering STs into several lineages
A neighbor-joining phylogenetic tree was constructed from the concatenated sequences of the 135 STs with 1000 bootstraps (Fig. 2). Although these STs were clustered into several lineages, most of the bootstrap supporting values were lower than 50%. This suggested conflicting phylogentic signals, due to frequent horizontal genetic transfer promoted by homologous recombination, leading to severe incongruence in phylogeny.
The Bayesian clustering approach implemented in STRUCTURE was used to infer the number of population units (denoted by K) within the 135 STs tested, yielding a maximal posterior value of K=4. This suggested that these 135 STs comprised 4 ancestral subpopulations. Accordingly, the 135 STs could be separated into 5 lineages (Fig. 2), designated L1 to L5, which contained 81, 29, 14, 3 and 8 STs respectively. STs assigned to each of the 8 CCs were interspersed throughout the five lineages, suggesting that there was no correlation between MLST and Bayesian analysis. The split network (Fig. 3) of the 135 STs contained rich parallel links, which supported the inference of 5 lineages with high frequency of recombination. As shown in Fig. 2 and 3, L4 to L5 were distantly separated from not only each other but also L1 to L3, while relatively L1 to L3 came close together. In summary, assignment of 135 STs into 5 lineages, as inferred by STRUCTURE-based Bayesian clustering, was supported by the neighbor-joining phylogenetic tree and the split network.
High levels of homologous recombination
L3 to L5 were not included in estimation of recombination and linkage disequilibrium (see below) due to their limited numbers of STs contained. The p-values of Phi test for lineage L1 and L2 and the whole population (all the 135 STs) were less than 0.0001 (Table 3), indicating that homologous recombination occurred within and across lineages. The per-site ρ/θ values detected for L1, L2 and the whole population were 2.34, 3.44 and 1.189, respectively (Table 3): i) the recombination frequency was at least 2 times more likely to occur than recombination within L1 or L2; ii) the recombination rate was almost equal to the point mutation rate across the whole population; and iii) the recombination frequency within lineages was around 2 times higher than that across different lineages.
Tendency to linkage disequilibrium
Linkage disequilibrium was estimated with the IA parameter from STs, a statistic that was expected to be zero when the alleles were in the linkage equilibrium (free recombination). The IA values were 0.0146 (p-value=0.103), 0.0477 (p-value=0.021) and 0.0433 (p-value < 0.001) for L1, L2 and the whole population, respectively (Table 3), suggesting a tendency of linkage disequilibrium (non-random association of alleles at different loci) within and across lineages. Especially, the IA value detected for the whole population was significantly different from zero, indicating the credible linkage disequilibrium at the whole population level. Nevertheless, the above detecting IA values were yet relatively low, which was consistent with observation that evident recombination is occurring within the K. aerogenes population.