Population genomics provides insights in diversity, epidemiology, evolution and 1 pathogenicity of the waterborne pathogen Mycobacterium kansasii 2

47 Mycobacterium kansasii is a nontuberculous mycobacterium that can cause serious pulmonary 48 disease. Genotyping suggested that the species is composed of at least six subtypes that vary in 49 clinical significance, with subtype I being clinically dominant but less commonly isolated from 50 environmental sources. Here we report a population genomics study of 358 M. kansasii isolates 51 obtained from global water and clinical sources. Phylogenomic analyses revealed that the six 52 subtypes are more accurately designated as closely related subspecies. These subspecies show 53 ample evidence of recombination mediated by distributive conjugative transfer that has 54 contributed to subspeciation and on-going diversification. Water was confirmed as a source of 55 clinical infections by showing that genomes of clinical strains from an Australian outbreak were 56 almost indistinguishable from strains contaminating the drinking water supply. Most clinical 57 infections (nearly 80%) were due to a recently emerged group of strains designated the M. 58 kansasii main complex (MKMC), which appears to have originated in Europe in 1900s and 59 expanded globally over the past century. Comparative genomic analyses revealed that the 60 MKMC has maintained the methylcitrate cycle and expanded ESX-I secretion-associated 61 proteins, perhaps facilitating metabolic adaptation and pathogenicity for human hosts. Evidence 62 of on-going positive selection in isolates of the MKMC was found in genes involved in carbon 63 and secondary metabolism, metal ion homeostasis and cell surface remodeling that could 64 represent adaptation to human hosts. These results further our understanding of the epidemiology 65 and pathogenicity of M. kansasii and emphasize the importance of monitoring its potential 66 transition to a more human-adapted pathogen.


Introduction
Nontuberculous mycobacteria (NTM) are environmental bacteria and some species can cause opportunistic infections in humans.While they are not as pathogenic as M. tuberculosis, diseases due to NTM have been an increasing concern in global health [1][2][3][4] , and in some developed countries NTM are now responsible for more disease than M. tuberculosis 2,4 .M. kansasii is among the most pathogenic NTM and has the highest clinical relevance 5 .It is one of the last species to have diverged from a common ancestor before the appearance of the M. tuberculosis complex 6 , and is capable of causing aggressive and destructive pulmonary disease resembling tuberculosis 7 .In the mid-20 th century, before the emergence of the HIV pandemic, M. kansasii was dominant among NTM diseases in several regions of United States, Europe and Japan 3 .It is currently one of the most frequent causes of NTM pulmonary disease throughout the world (Fig. 1A), with a relatively high incidence in regions of Europe, South America, Africa and Asia 3,8 .In China, M. kansasii has been isolated from pulmonary infections in many areas, but the incidence is highest in the highly urbanized eastern and southern coastal regions [9][10][11][12] .From 2008 to 2012 in Shanghai, it was responsible for nearly half of all NTM infections 13 .
As with other NTM, M. kansasii infections are generally assumed to be acquired from environmental sources rather than by human-to-human transmission.Although municipal water distribution systems are believed to be the major reservoir for human M. kansasii infections 3,5,14 , water isolates are usually genetically distinct from clinical strains.Molecular typing has revealed that M. kansasii comprises at least six distinct subtypes that vary in prevalence and clinical relevance [15][16][17][18] .M. kansasii subtype I is responsible for the vast majority of infections worldwide but is not often isolated from water sources 15,17,18 .Until now, no definitive epidemiological link has been established between water reservoirs and clinical infections 19,20 .Instead, genotyping has shown that clinical strains of M. kansasii subtype I isolated from diverse geographic locations constitute a homogenous population 15,18 , suggesting potential human-to-human transmission of a successfully clone.Actually, potential transmission of M. kansasii between family members has been reported in several cases 21,22 .In addition, transmission has been recently revealed as a major route for the dissemination of dominant clones of another NTM, the M. abscessus 23 .
Consistent with the clinical dominance, M. kansasii subtype I showed the highest clinical relevance, as it has been associated with severe and even fatal disease in both immune-competent and immune-compromised patients, while the other subtypes are isolated only from immunecompromised patients or environmental sources 17 .Although subtype I causes more disease than the other subtypes, the genetic determinants of its pathogenic adaptation have not been addressed.
In addition, clinical M. kansasii isolates can vary phenotypically, with strains showing either a smooth or rough colony morphology due to difference in cell wall hydrophobicity 24,25 .Similar to M. abscessus 26 , M. kansasii strains with the rough colony appear to be more virulent which could establish chronic systemic infections in mouse 25,27 , but the genetic basis for the phenotypic differences has not been explained.
In the current study, we analyzed the genomes of a worldwide collection of isolates to better define the global population structure of M. kansasii.The genomic analyses provided insights into its speciation, diversification, the sources of clinical infections and possible genetic determinants associated with its ability to proliferation and cause disease in human.

Global diversity of M. kansasii
We performed whole genome sequencing on 271 M. kansasii isolates, including 155 isolates from China, 74 isolates from Australia, 34 clinical isolates from European and North American countries, 5 from South Africa and 3 from Japan.These genomes, together with an additional 84 M. kansasii genomes available from public databases (Supplementary Table 1), were analyzed for global diversity.Given the close genetic relationship with M. kansasii, three M. gastri genomes (one sequenced in the current study) were also included.In total, we included the genomes of 358 isolates obtained from 18 countries with varying burdens of disease caused by M. kansasii (Fig. 1A).On average, the M. kansasii genomes are 6.29 Mb in length and contain 5,757 protein coding genes.A genomic alignment of 2,280 core genes with at least 90% amino acid identity between strains and covering 2.12 M nucleotides was used to generate a maximum likelihood phylogeny (Fig. 1B).The phylogeny consisted of seven distinct lineages corresponding to the six M. kansasii subtypes and M. gastri.The pairwise genome-wide average nucleotide identities (gANI) within each lineage were all over 98%, while the gANIs between lineages were all below 95% (Fig. 1C), suggesting that the distinct lineages should be regarded as different subspecies rather than subtypes 28,29 .The M. gastri strains were closely related to M. kansasii ssp.IV, and were therefore designated as a seventh subspecies of M. kansasii.Four subspecies (I, II, III and VI), each containing more than ten strains, were designated as the major subspecies in the current study.Pairwise comparison of single nucleotide variants (SNV) amongst the strains of each of the four major subspecies revealed a median difference of 129 to 9,137 SNVs along the 2.12Mbp core genome (Fig. S1).To determine the general evolutionary forces that shaped the diversity of M. kansasii subspecies, we calculated Tajima's D statistics for the major subspecies along the core genome (Fig. S2).The median values for ssp.II, III and VI (0.17, 0.72 and 0.16 respectively) indicated balancing selection, while the median value for the dominant ssp.I (-2.68) suggested recent population expansion and a potential selective sweep.

Recombination driving subspeciation and diversification of M. kansasii subspecies
Alignment of the 16S, 23S rRNA and spacer region revealed several sequence mosaics that infer a reticulate network for the M. kansasii subspecies (Fig. S3), consistent with evolutionary processes in the presence of recombination.A more complex network was obtained based on 378,876 SNVs along the core genome alignment (Fig. 2A), suggesting that recombination has occurred across the whole genome.Analysis of the core-genome alignment with the fastGEAR algorithm identified seven population clusters corresponding to the seven subspecies, with extensive ancestral recombination (occurring during the subspeciation) and recent recombination (occurring after the subspeciation) between subspecies that resulted in highly mosaic genomes (Fig. 2B and Fig. S4).An average of 411 Kb (18.4%) of the core genome was involved in ancestral recombination that contributed to the origin of the subspecies.
Recent recombination was detected in all subspecies, with the total genomic fraction of recombinant fragments varying from 0 to 12.2% amongst strains of the different subspecies.For the recent recombination events, most recombinant fragments share high identity (≥ 99%) with genomic regions of other subspecies, suggesting they represent recombination between subspecies.For the few remaining fragments showing relatively low sequence identity to all of the genomes in our collection, nucleotide BLAST analysis revealed that they were more similar to M. kansasii sequences than to any other mycobacteria, suggesting that the recombination had occurred with unknown species or subspecies closely related to M. kansasii.Removing the SNVs present in the recent recombinant regions significantly decreased the genetic distance between strains, demonstrating the importance of recombination in the diversification of the subspecies (Fig. S1).
The core genome alignment consists of concatenated sequences of discontinuous sequence fragments in each strain, which does not fully represent the features of recombination, i.e. the genomic distribution and length of recombinant fragments.Therefore, recent recombination events were further explored by analysis of whole genome alignments for each of the four major subspecies.Evidence of recombination was found evenly distributed across the genomes of all four subspecies (Fig. S5, Fig. 3A), with fragment lengths ranging from a few base pairs to a maximum of 212.9kb (Fig. 2C), reminiscent of recombination by distributive conjugative transfer (DCT), a form of horizontal gene transfer (HGT) in mycobacteria 30 .
In each subspecies there were both unique recombinant sequences seen in only one isolate and shared recombinant sequences seen in multiple isolates, demonstrating that recombination events occurred at different stages during the diversification of subspecies.For the clinically associated ssp.I, the recombination donors were mainly from ssp.II and III (Fig. 2B, Fig. 3A).Fragments derived from ssp.III were more common in ssp.I strains from Europe, while ssp.I strains from East Asia had recombined more frequently with ssp.II (Chi-square test, P<0.01).

Genetic evidence for independent environmental acquisition of clinical infections
After excluding the SNVs in the recombinant regions, we inferred a phylogeny for each of the major subspecies to investigate the genomic differences between isolates based solely on de novo mutations (Fig. 2D, Fig. 3A).Within each subspecies there were deep branches, where the clinical isolates were separated by thousands of SNVs, consistent with independent infections caused by unrelated environmental isolates.Besides, there were also 14 complexes, covering strains in ssp.I, II, III and VI, which contained closely related isolates with an average pair-wise difference less than 100 SNVs, suggesting potential dissemination of successful clones.Seven of these complexes contained both isolates obtained from water and human patients, consistent with the environmental strains as the source for the clinical infections.Nine of the complexes contained isolates from a single geographic region, while the remaining five complexes contained strains isolated on different continents.
The largest complex, with 268 isolates of ssp.I, was named the M. kansasii main complex (MKMC).It contained 79.2% (244/308) of all the clinical isolates included in the current study.
The MKMC contained 20 strains isolated from water sources, including one strain from Czech that clustered with clinical strains from neighboring Poland.The remaining 19 strains were isolated from an exposed cooling tower linked to a geothermal water source in Portland (a small town in southeast Australia) during an outbreak of M. kansasii infections in 1990s.In the ML phylogeny, these water isolates clustered with eight strains isolated from patients in the town, including six patients who were part of the outbreak (Fig. 3A).The phylogenetic (medianjoining) network of these 27 isolates from Portland formed a star-burst structure with most descendant strains surrounding the central ancestral genotype (Fig. 3B).Isolates with the ancestral genotype were consistently cultured from water samples between 1990 and 1996, strongly suggesting the cooling tower was the source for this outbreak.The clinical isolates all had different genotypes, and all but one were closest to the ancestral type with genomic differences of 0 to 7 SNVs.The genotype of the remaining isolate was identical to a water isolate and was 3 SNVs different from the ancestral type.This strongly suggests that the human infections were each acquired independently from the contaminated water supply rather than by human-to-human transmission.Although the outbreak had ended by 1996, after the cooling tower was bypassed and the use of geothermal water discontinued, two strains belonging to the same "Portland" clone were isolated from patients in 2005 and 2007 (Fig. 3B).

The origin and global dissemination of the M. kansasii main complex
Phylogeographic analysis of the MKMC revealed two basal branches, designated Lineage 1 and Lineage 2. The strains from China and Australia (121/131 and 57/58 respectively) predominantly belong to Lineage 2, while nearly all strains from USA and Canada were mostly (20/21) belong to Lineage 1. Strains isolated in Europe were the most diverse, constituting several branches in both lineages (Fig. 3A), and Bayesian phylogeographic analysis suggested Europe as the most likely origin of the entire complex (Fig. S6).Several local branches contained isolates exclusively from China (CHN-1, 2 and 3) or Australia (AUS-1), suggesting early introductions and subsequent local expansions.
The isolation time of the M. kansasii strains ranged from the 1990s to 2010s, which made it possible to estimate of the date of origin of the MKMC and its substitution rate using Bayesian evolutionary analyses calibrated by the sampling dates.This analysis employed a subset of 121 ssp.I strain with short sequence reads, unambiguous collection dates and a genomic recombination proportion less than 1.0%, together with the reference strain ATCC 12478 isolated in 1953 31,32 .The existence of a significant temporal signal was confirmed by both the root-to-tip regression and the date randomization test (Fig. S7).Bayesian phylogenetic analysis estimated the date of the MRCA of the MKMC to be around 1917 (1897-1955) (Fig. 3A, Fig. S8A) with an evolutionary rate of 1.12e-7 (95% CI, 7.93e-8, 1.62e-7) nucleotide changes per site per year (Fig.

Potential gene contents contributing to the success of M. kansasii subspecies I
The recent expansion of M. kansasii ssp.I and its association with clinical infections suggest that it may have evolved greater pathogenicity for human hosts than the other subspecies.
By comparative genomic analysis, we identified 146 genes specific to ssp.I, and several of which have been associated with metabolic adaptation or virulence within human hosts (Table S2).
Among these are three clustered genes encoding enzymes PrpC and PrpD and regulator PrpR, which are components of the methylcitrate cycle (MCC) that eliminates the toxic propionyl-CoA produced during in vivo catabolism of cholesterol and fatty acids (Fig. 4A) 33,34 .The MCC genes are located in a highly variable genomic region, and these three, along with a few flanking genes, are completely or partially deleted in the other subspecies.Eighteen of the other ssp.I specific genes encode potential secretory proteins of the ESX-1 system (ESP), a type VII secretion system associated with virulence in M. tuberculosis 35 .The genes are distributed in three genomic loci, one of which is comprised almost entirely of ESPs, the WhiB6 regulator and a PPE protein associated with ESX-1 (Fig. 4B, Fig. S9A).Among the 18 ESPs, 10 are paralogs encoding EspElike proteins and all contain the WxG motif, a characteristic of ESX-1 substrates (Fig. 4C) 36 .A BLAST search revealed that these espE-like genes are not present in the other M. kansasii subspecies, nor in most other mycobacteria except M. marinum and closely related species such as M. ulcerans and M. liflandii (Fig. S9B).In M. marinum, the espE-like genes are arranged in tandem immediately upstream of the ESX-1 locus, while in M. kansasii ssp.I the three espE-like containing loci are each separated from the ESX-1.Evolutionary analysis suggested that the espE-like genes were independently acquired by M. marinum and M. kansasii ssp.I (Fig. S9C), and then expanded in parallel (Fig. S10).

Genes under positive selection in the M. kansasii main complex
The fixed de novo mutations (allele frequency ≥95%) of all the MKMC isolates were used to identify convergent mutations or genes containing an unusually high number of mutations, which could be evidence of positive selection 37,38 .By calculating the density of nonsynonymous mutations along the genome, we found that the distribution of mutations per gene followed a Poisson distribution in the range of 0-3 mutations per gene (Fig. S11).However, a significant deviation from this distribution was detected in 19 genes that each harbored four or more mutations.Additionally, four genes were found to harbor convergent mutations that has independently evolved at least three times.These 23 genes encode proteins associated with diverse functions, including secondary metabolism (seven genes), DNA replication, recombination and repair (four genes), metal ion transport and metabolism (three genes) and carbon metabolism (three genes) (Table S3, Fig. 5).
The most polymorphic locus encodes Zur, the regulator of zinc uptake.A total of 36 fixed zur mutations were identified in 38 clinical isolates, all of which were nonsynonymous or frameshifts (Fig. 5B, Fig. S12), implying loss of function.All zur mutations could be mapped to terminal nodes in the phylogeny (Fig. S12), and none was found in isolates from aquatic sources, suggesting they could be adaptive mutations that emerged within the human host.Besides the fixed mutations, we also identified 41 unfixed zur mutations (allele frequency < 95%) in an additional 35 clinical isolates, with several isolates carrying multiple un-linked mutations (Fig. S13), strongly supporting their emergence within the host.The second most polymorphic locus encodes a pair of TetR family transcriptional regulators (TetR1/2) that potentially involved in Gamma-butyrolactone (GBL) signaling 39 .A total of 15 mutations in these two genes were found in 63 isolates.This locus also exhibited highest recombination density (Fig. 5B), with evidence of 42 independent recombinant events affecting 70 isolates.As opposed to the zur mutations, many of the mutations and recombination events in the TetR1/2 locus could be mapped to inner nodes in the phylogeny (Fig. S14), suggesting that they represent potential early adaptations prior infection.
The third most polymorphic locus contained two genes encoding a putative polyketide synthase (Pks5) and a glucosyltransferase (WbbL2), enzymes that are involved in lipooligosaccharide (LOS) synthesis [40][41][42] .A total of 14 fixed mutations distributed in16 clinical strains were detected in these two genes.Additional nonsynonymous mutations, including nonsense and frameshift mutations, were detected in neighboring genes encoding other enzymes associated with LOS synthesis: a fatty acyl-CoA synthetase (fadD25), an acyltransferase (papA3) and two glucosyltransferases (gtf1 and gtf2) (Fig. 5C).All the fixed mutations were exclusively found in clinical isolates and all but one of them could be mapped to terminal nodes in the phylogeny (Fig. S12).Furthermore, unfixed mutations in those genes were identified in 18 clinical isolates (Table S3), suggesting they likely represent adaptive mutations emerging within the host.Loss-of-function mutations in LOS biosynthesis genes have been associated with the transformation from smooth to rough colony morphology in other mycobacteria 42,43 .Colony morphology was described for 110 of the clinical strains from Shanghai (Table S1), of which 12 were recorded as having rough colonies.Of these 12 rough colony strains, all harbored mutations in the putative LOS synthesis genes (Table S4), while none were found in the 98 smooth colony strains.This high correlation between the mutations and morphology suggests that these mutations affect LOS synthesis.
Convergent mutations that evolved at least three times were identified in the upstream of three genes encoding two L-lactate dehydrogenases (lldD1, -12 C>T in three isolates; lldD2, -44 C>T in six isolates) and a lipase (MKAN_RS10545, -157 G>A in four isolates), which involved in lactate and lipid metabolisms respectively.The mutations were exclusively identified in clinical isolates, and an additional of 14 clinical isolates were found harbor fixed/unfixed mutations in these loci and loci nearby (Table S3).Convergent mutations were also found in the coding region of mce1D, a subunit of the Mce family transporter that is putatively responsible for lipid/sterol transportation 44 .Two different nucleotide substitutions were identified in codon 324 of mce1D, and the mutations resulted in the same amino acid change, suggesting they were likely gain-of-function mutations.

Discussion
The population genomic analyses of global M. kansasii yielded several insights into the population diversity, epidemiology, evolution and pathogenicity of this important pathogen.The phylogenomic analysis revealed that previously defined M. kansasii subtypes represent species level lineages, as has been recently proposed 45 .We also found ample evidence of on-going homogenous recombination between these lineages including M. gastri, but no trace of recombination with any other mycobacterium species.We therefore suggest that the seven lineages should be classified as M. kansasii subspecies, one of which consists of strains designated as M. gastri.Extensive recent and ancestral recombination events, likely driven by DCT, resulted in the mosaic genomes observed in the M. kansasii subspecies 30 , thereby demonstrating the importance of DCT in both the subspeciation and diversification of M. kansasii.Notably, there was evidence of recent recombination in the 16S-23S rRNA locus (Fig. S3), emphasizing that species/subspecies identification should include multiple genes.
The comparative genomic analysis revealed genes associated with metabolism and virulence that were only found in the predominant ssp.I strains, and perhaps contribute to their success in colonizing and causing disease in humans.When mycobacteria infect humans, they use host cholesterol and fatty acids as carbon sources, but the beta-oxidation of cholesterol and odd-chain fatty acids generate propionyl-CoA that is toxic to the bacilli 33,34,46,47 .Pathogenic mycobacteria alleviate the toxicity by consuming propionyl-CoA through the MCC cycle, which is important for the growth of mycobacteria within macrophages 34,47 .The maintenance of the MCC genes in ssp.I, and their absence in the other subspecies, may represent an adaptation to the host that partially explains why this subspecies is most often associated with clinical infections.
The analysis also found that ssp.I contains a special cluster of espE-like genes encoding potential ESX-1 substrates that resemble the EspE, which is a highly abundant mycobacterium cell surface protein secreted through the ESX-1 system 48 .A similar espE-like gene cluster is present in the genomes of M. marinum and closely related species, and their inactivation led to defective granuloma formation in zebrafish embryo 49 .It is therefore possible that the espE-like genes in M. kansasii ssp.I also encode similar secretory proteins involved in modulating the host immune response and causing disease.A strain of M. kansasii ssp.I was recently isolated from a river fish with granulomatous nodules 50 , and ssp.I strain ATCC 12478 can cause chronic infection and granulomas in zebrafish embryo 51 , suggesting that fish may be a host for ssp.I. Since M. marinum is also known to cause infections in fish, a common host might explain the parallel evolution of the espE-like gene clusters in the two species.
Isolates of ssp.I predominantly belong to a homogenous complex designated MKMC, and its clinical predominance but rare isolation from city water raises the possibility of human-tohuman transmission.However, by investigating an outbreak of M. kansasii infections in Australia, we found genetic evidence supporting the patients were more likely to have acquired their infections independently from M. kansasii strains present in the city water supply system.This is consistent with previous suggestions that city water distribution systems constitute the principal reservoir for M. kansasii 5,52 .Our evolutionary analysis estimated that the MKMC originated in the early 1900s, possibly in Europe.Considering the association of M. kansasii infections with urban areas, we speculate that the initial expansion of the MKMC was associated with the rapid urbanization in Europe since the 1900s 53 , and that it then spread and expanded with the urbanization of other global regions.Although it is unclear how the water-born MKMC could have achieved global dissemination during the past century, systems for storing potable water during long voyages could have played a role.
Several genes involved in metabolism and the stringent response appear to be under positive selections in the MKMC.Host cell lipids is a major carbon source for mycobacteria during infection and critical for the survival of bacteria within the host 54 .We observed convergent mutations in the subunit (Mce1D) of a putative lipid/sterol transporter and the upstream region of a putative lipase involved in the lipid hydrolysis 44 , which may represent adaptations to lipid-abundant environment within the host.Besides lipids, host cell lactate was recently revealed as a carbon source that is important for bacteria growth within human macrophages 55 .More recently, convergent mutations in promoter and coding regions of lactate dehydrogenase gene lldD2 were extensively identified in M. tuberculosis, and the mutations were supposed to represent adaptation to changes of host ecology and were associated with higher transmissibility 56 .We identified mutations in the upstream of lldD1/2 in 14 MKMC isolates, emphasizing the importance of lactate metabolism in host adaptation of mycobacteria.The convergent mutations in regions upstream of lldD1/2 and the lipase gene likely resulted in upregulation of corresponding enzymes, which may enhance the metabolic capabilities of M. kansasii within the host and facilitate its survival and replication.
Potential positive selection was also identified in genes involved in metal ion acquisition, which may represent adaptations to the limitation of metal ions (i.e., nutritional immunity) from the host 57,58 .The highest polymorphism was found in zur, which in M. tuberculosis encodes a transcriptional repressor that regulates the expression of genes involved in zinc and iron uptake and zinc mobilization 59,60 .Inactivation of Zur could increase the expression of genes that improve the ability of M. kansasii to compete with the host for the acquisition of zinc and iron (Fig. S15).Two additional genes encode the iron dependent repressor (IdeR) and the subunit B of the SUF system (SufB) for Fe-S cluster biosynthesis were also found under potential positive selection.Mutations in these genes may augment the ability of the bacilli to maintain the iron homeostasis in the iron-limited environment encountered during infection 58,61 (Fig. S15).
Seven genes associated with secondary metabolism were found under potential positive selections (Table S3), and mutations in these genes likely cause loss-of-function as the existence of nonsense and/or frameshift mutations.Among these are two TetR family regulators of GBL signaling and two genes involved in LOS biosynthesis.GBL signaling molecules are involved in the regulation of secondary metabolism and morphological development in actinomycetes 39 .The most-studied GBL signaling is the A-factor system associated with secondary metabolism and sporulation in Streptomyces 62,63 .Many of the mutation and recombination events in these two genes mapped to inner nodes of the ML phylogeny, suggesting that they could have been acquired before infecting humans, probably in city water distribution systems.While there is no solid evidence of mycobacterium sporulation, inactivation of these regulators could nevertheless modulate bacterial metabolism to facilitate the survival of M. kansasii in the urban water systems, where they would encounter low levels of nutrients and disinfectant residuals.LOS are polar glycolipids associated with cell wall hydrophilicity in several mycobacteria 25,42 .We found a high correlation between loss-of-function mutations in these genes and rough colony morphology in clinical isolates, consistent with previous findings in other mycobacteria 42 .The rough phenotype resulting from the absence of LOS has been associated with an enhanced within-host survival and increased virulence in several mycobacteria including M. kansasii and M. marinum 27,64 .In M.
tuberculosis, a recent study demonstrated that the loss of LOS occurred in its M. canettii-like ancestor, which may have played a vital role in its evolution from an environmental mycobacterium to a professional pathogen 43 .Similarly, the mutations in the LOS synthesis genes of M. kansasii ssp.I could also represent in-host selection for increased virulence and the ability to establish a persistent infection within the host.
The extensive selection of mutations in clinical MKMC isolates represent a feature of opportunistic infections, where adaptive mutations be rapidly selected out due to the shift from environment to the host niches 65 .Several mutations including the ones in LOS synthesis genes and lldD2 promoter of the MKMC strains mimic the ancestral and/or ongoing host adaptations of M. tuberculosis, which suggests a potentiality of M. kansasii to evolve to be a professional pathogen.The putative adaptive mutations in the MKMC isolates were all mapped to the terminal nodes in the phylogeny in this study, suggesting no transmission of the more human-adapted strains.However, considering the highly similar clinical syndromes of pulmonary infections due to M. kansasii and M. tuberculosis, the possibility of aerosol transmission of M. kansasii among susceptible individuals cannot be excluded, which could provide multiple rounds of host adaptations for the bacteria.This should raise concern that some members of this species, particularly in the MKMC, may be able to evolve into highly adapted pathogens that could persist in human populations.

Strain collection and whole genome sequencing
In China, patients with suspected pulmonary tuberculosis are referred to local designated hospitals, and the positive cultures of putative NTM are subjected to species identification by 16s rRNA sequencing in the hospitals or by a local or national branch of Center for Disease Control and Prevention (CDC).Water strains of M. kansasii were isolated from public tap water across Shanghai using a filtration method as previously described 66 .In Canada, clinical isolates were obtained from the McGill University Health Centre mycobacteriology laboratory and identified as M. kansasii by the Laboratoire de Sante Publique du Quebec by 16S rRNA PCR and DNA sequencing.In Australia, clinical isolates were referred to the mycobacterial reference laboratory at the Victorian Infectious Diseases Reference Laboratory and identified by 16S rRNA PCR DNA sequencing.Genomic DNA was extracted from mycobacterial cultures and sequenced on either an Illumina Hiseq 2000 or an Illumina NextSeq 500 platform in single-or paired-end mode with an expected depth of 100.All sequence data associated with this study was deposited in the Sequence Read Archive (SRA) of NCBI under project accession PRJNA323639.Publicly available genomic sequences and short read data (PRJNA317047, PRJNA374853, PRJNA431546, PRJNA401515) were downloaded from the Assembly and SRA databases of NCBI respectively (Supplementary Table 1).

Genome assembly and genomic nucleotide identity analysis
Sequencing reads were trimmed and filtered using Trimmomatic (v0.30) and draft genomes were assembled using SPAdes (v3.13.1) in the carful mode with read correction, auto sized k-mers and mismatch corrections.The quality of assembly was evaluated using Quast (v5.02) and contigs less than 200bp were filtered out.Pairwise genomic ANIs were calculated using fastANI (v1.1) with default parameters based on the assemblies.

Core genome analysis
Core genes for all M. kansasii and M. gastri isolates included in this study were analyzed by Roary (v3.11.2), through which draft genomes were first annotated using Prokka (v1.14.6) and then homologous genes were clustered using the CD-Hit and MCL algorithms.To generate the core-genome alignment, a minimum 90% blastp identity, a 100% coverage (i.e., the gene must be present in all isolates) and no paralog splitting (i.e., cluster containing paralogous genes gets filtered out) were set.Sequences of individual core genes were aligned with MAFFT and were then concatenated into a core genome alignment according to their genomic coordinates in the reference genome (NC_022663.1).RaxML (v8.2.12) was used to construct the maximum likelihood phylogeny based on the core genome alignment with a GTR model and 1000 rapid bootstrap replications.SplitsTree (v4.14.5) was used to construct the phylogenetic network by the NeighborNet algorithm based on the core genome alignment.The Tajima D statistic was calculated for nonoverlapping 10-kb sliding window of the core genome alignment in each subspecies using PopGenome (v2.7.5) in the R package.For identification of subspecies I specific genes, a minimum 80% blastp identity with paralog splitting was set in Roary.Genes present in all isolates of ssp.I but absent in all isolates of any other subspecies were selected.

Population structure and recombination analysis
Population structure was inferred using hierBASP (http://www.helsinki.fi/bsg/software/BAPS/).The core genome alignment was subjected to hierBAPS analysis with a uniform prior on the number of clusters.Genomic recombination was inferred using fastGEAR (https://mostowylab.com/news/fastgear) based on the core genome alignment with an integration number of 15 (default value).The fastGEAR used BAPS to define the "best" number of clusters and then detect 'lineages' that are genetically distinct in at least 50% of the alignment.fastGEAR detects both ancestral recombination that affects all isolates in a lineage as well as recent recombination that affects a subset of isolates in a lineage.For each ancestral recombination, the larger lineage was assumed to be the donor (by default).
The recent recombinations in isolates of the major subspecies (ssp.I, II, III and VI) were further explored by using Gubbins (v2.3.4).Whole-genome alignment of each subspecies were generated by an in-house pipeline based on Minimap2 (v2.17).Briefly, the contigs of individual strains were aligned to the reference genome by Minimap2 with the preset parameter 'asm20' for the alignment.By filtering secondary and short alignments (< 200 bp), the nucleotide corresponding to the reference genome at each site is determined to generate the 'pseudogenome' for each isolate.Pseudogenomes of each subspecies were concatenated to make a whole genome alignment, which were subjected to Gubbins for recombination identification with default parameters.The donors of the recombination fragments were determined by a BLAST search against a local database containing all of the M. kansasii and M. gastri genomes included in current study and the representative reference genomes for other mycobacteria collected in the NCBI database.A cutoff value of identity was set to 99% to identify the probable donor strains.
The outputs from Gubbins were viewed with Phandango (v1.1.0),or by an in-house python script that plots the recombinations with information including genomic coordinates and donor subspecies.

Mapping based analysis
We applied mapping-based analysis to study the genetic variants among the MKMC strains.The trimmed reads were mapped to the reference genome ATCC 12478 by Bowtie2 (2.3.5) and variants including SNVs and short-indels were called by a SAMtools (v1.9)/VarScan (v1.4.3) pipeline.Variants were called at loci where the alternative basecalls were supported by at least five reads that aligned both forwardly and reversely to the reference.Variants in repeat regions, putative PE/PPE family genes and transposable elements were excluded.Variants supported by > 95% of the mapped reads were defined as fixed/homozygous mutations, otherwise variants were defined as unfixed/heterozygous mutations.The homozygous SNVs in non-recombinant regions detected by both mapping-and assembly-based analysis were used to construct the ML phylogeny of the MKMC by RaxML based on the GTR model.We found several extraordinarily long terminal branches in the ML phylogeny for isolates from PRJ374853, which likely represent assembly errors, and the branches were thus truncated to 0. Network (v5.0) was used to generate median-joining networks for the outbreak strains from Australia based on the concatenated SNV sequences.

Bayesian evolutionary analysis
The geographic origin of the MKMC was analyzed using the Bayesian Binary MCMC (BBM) method integrated in RASP (v4.0).The BBM method inputs posterior distribution of Bayesian inference to reconstruct the possible ancestral distributions of given nodes via a hierarchical Bayesian approach.The ML phylogeny of the MKMC constructed in the above section was used for the analysis.The strains were classified into six geographic regions based on where they had been isolated: East Asia, Australia, Europe, North America, South America or South Africa.The Bayesian analysis was run with a fixed JC model for 5,000,000 cycles, 10 chains, a temperature parameter of 0.1, with sampling every 100 generations.Bayesian dating of the phylogeny was based on a subset of 121 strains of the MKMC with short read data (to exclude assembly errors in publicly available genomes), clear dates of isolation and genomes with proportions of recombinations <1.0%.The reference strain ATCC 12478, which was isolated in 1953, was also included.The temporal signal in the sequence alignments was investigated using TempEst (v1.5.3).As a complementary assessment of the temporal signal in the data, a date randomization test was performed on our datasets with the R package TipDatingBeast (v1.1-0).Sampling dates of the strains were randomly shuffled 20 times, and the randomized datasets were analyzed with BEAST using the same parameters as for the original datasets.If the 95% HPD intervals of rootheight obtained from the original data does not overlap with the estimates obtained from the randomized datasets, a statistically significant temporal structure could be confirmed.BEAST (v1.8.0) software was used to determine the timescale and the evolutionary rate of the MKMC using the tip-date calibration.We used an uncorrelated lognormal distribution for the substitution rate, and a constant population size for the tree priors.The analysis was done in three chains of 5×10 7 generations sampled every 1,000 generations to assure independent convergence of the chains.Convergence was assessed using Tracer (v1.6) to ensure that all relevant parameters reached an effective population size of >100.

Detection of genes under positive selection in the MKMC
Genes with convergent mutations or high numbers of de novo mutations could have been subject to positive selection.A subset of 247 MKMC isolates with short reads data were used for the analysis.To identify genes with multidiverse signatures, the de novo homozygous mutations of all 247 MKMC isolates were used to calculate the mutation density (number of mutations per gene).Under the neutral evolution model, the number of substitutions per gene is expected to follow a Poisson process.The 95% confidence interval of mean predicted values of Poissonity were estimated based on the Wald interval for the mean, and a significant deviation from the interval was taken as signal of potential positive selection 37,38 .To identify convergent mutation, de novo homozygous mutations were analyzed against the maximum-likelihood phylogeny by TimeTree (v0.6.4).Homoplastic mutations independently evolved at least three times were identified as under potential positive selection.Pairwise genomic average nucleotide identity (gANI) within and between M. kansasii subspecies.707         ); outer red lines, location of highly polymorphic genes; outer blue line, location of genes with convergent mutations.B. Schematic diagrams depicting the distribution and frequency of non-synonymous mutations in zur and tetR1/2, and the recombinations around tetR1/2.Hom., homozygous mutation; Het., heterozygous mutation; Rec., recombination (strip colors represent donor subspecies corresponding to Figure 2).C. Schematic representation of mutations in LOS biosynthesis genes and corresponding morphology of the mutant strains.Genes were colored according to their functions.Green, genes involved in polyketide synthesis; orange, acyltransferase; cyan, glucosyltransferase.

Figure 1 .
Figure 1.Global diversity of M. kansasii.A. Geographical distribution of the 358 isolates in the 703 study.The gradient blue colors indicate the prevalence of M. kansasii among NTM disease.B. 704 Core genome based maximum-likelihood phylogeny of the 358 isolates.The colors of the 705 terminal node correspond to the geographical origin of individual isolates, as denoted in A. C. 706 Pairwise genomic average nucleotide identity (gANI) within and between M. kansasii subspecies.707

Figure 2 .Figure 2 .
Figure 2. Genomic recombination and its contribution in subspeciation and diversification of M. kansasii.A. Phylogenetic network of M. kansasii subspecies based on the core genome alignment of 358 isolates.I-VI, M. kansasii subspecies; G, M. gastri.B. Population structure and genomic recombinations inferred by fastGEAR.Each line represents the genomic constitution (exhibited as color strips) of individual isolates according to ancestral (upper panel) or recent (lower panel) recombinations.Colors represent the different subspecies corresponding to panel A. Strips in white color (lower section) represent recent recombinations from unknown sources.C. Length distribution of recombinant fragments in the four major subspecies.The box plots indicate the 5-95% percentiles and quartiles.D. Maximum likelihood phylogeny of the four major subspecies based on de novo SNVs in non-recombinant regions.Branches highlighted in bold indicate homogenous complexes containing isolates with an averaged pair-wise genomic difference less than 100 SNVs.Colors of terminal nodes indicate geographical origins corresponding to Figure 1.Filed circles indicate a human source; empty circles, an environmental source.

Figure 3 .Figure 3 . 1 BFigure 4 .
Figure 3. Phylogenomic analyses of the M. kansasii main complex (MKMC).A. Left panel, the maximum-likelihood phylogeny of the MKMC based on de novo mutations.Isolation from non-human or unknown sources are indicated by triangles in the terminal nodes.Right panel, genomic pattern and proportion of recent recombinations for individual isolates.Donor subspecies are colored corresponding to Figure 2. B. Median-Joining network for the Australia outbreak strain cluster.Node size, number of isolates; node color, source and year of isolation.

Figure 4 .Figure 5 .
Figure 4. Genomic loci specific to M. kansasii ssp.I. A. Left, synteny map for the genomic region flanking the MCC genes in M. kansasii subspecies.Full-length and truncated genes are represented by arrows and rectangles respectively.The full MCC genes in ssp.I and their orthologous sequences in the other subspecies are indicated with different colors.The flanking genes are in gray or white to represent homologous or orphan genes, respectively.Right, a scheme of the methylcitrate cycle of mycobacteria and its relation with the betaoxidation, TCA and glyoxylate cycles.B. Synteny map of the three ESP loci specific to M. kansasii ssp.I. C. Sequence similarity between the EspE and the EspE-like proteins of M. kansasii ssp.I.

Figure 5 .
Figure 5. Genes under positive selections in the MKMC. A. Circular plot of genes under potential positive selection.Innermost inward bars, number of recombination per gene; innermost outward bars, number of mutations per gene (red bars, n≥4); outer red lines, location of highly polymorphic genes; outer blue line, location of genes with convergent mutations.B. Schematic diagrams depicting the distribution and frequency of non-synonymous mutations in zur and tetR1/2, and the recombinations around tetR1/2.Hom., homozygous mutation; Het., heterozygous mutation; Rec., recombination (strip colors represent donor subspecies corresponding to Figure2).C. Schematic representation of mutations in LOS biosynthesis genes and corresponding morphology of the mutant strains.Genes were colored according to their functions.Green, genes involved in polyketide synthesis; orange, acyltransferase; cyan, glucosyltransferase.