Genomic features of M. luteus
All (106) Micrococcus genome sequences available at the NCBI (National Center for Biotechnology Information) assembly database (April 2020) were downloaded and subject to strict quality control to reduce potential bias in the subsequent analyses due to different assembly qualities (see Methods for details). As a result, 76 high-quality (completeness > 95% and contamination < 5%) and non-redundant Micrococcus genomes were remained (Table S1). Whole-genome average nucleotide identity (ANI) analysis classified all these genomes into six clades (species) based on an ANI cutoff of 95% (Figure 1A), three of which (including 72 strains) corresponded to three published species with inner ANI > 97.4%: M. luteus (n = 66), M. lylae (n = 5), and M. terreus (n = 1), respectively. As the other three clades did not contain type strain genomes, the remaining four strains (‘Micrococcus luteus’ AS2, ‘Micrococcus luteus’ DE0384, ‘Micrococcus flavus’ BCRC 80069, and ‘Micrococcus luteus’ DE0113) in these three clades were then assigned to known species by comparing their near full-length 16S rRNA gene sequences to those of type strains. Strains AS2 and DE0384 were reclassified into M. endophyticus (type strain: YIM 56238T, identity > 99.5%, 1432 nt) [23], strain DE0113 was reclassified into M. cohnii (type strain: WS4601T, identity = 100%, 1109 nt) [24], and strain BCRC 80069 was reclassified into M. flavus (type strain: LW4T, identity = 99.9%, 1402 nt) [25]. Further phylogenomic analysis supported the species classification (Figure 1B).
Our dataset contained genomes for 66 M. luteus strains. These strains were from diverse habitats globally, including marine, soil, food, plant- and human-associated, and other common sources (Table S1). Compared to high intra-species ANI values (> 96.8%), the ANI values with other species were much lower (< 89.4%), suggesting a clear inter-species genomic difference. The M. luteus genome size ranged from 2.41 Mb (strain OG2) to 2.74 Mb (strain DE0230) (Figure S1A), with an average genome size (2.52 Mb, n = 66) smaller than M. lylae (2.68 Mb, n = 5; p = 0.001, Wilcoxon test), M. endophyticus (2.78 Mb, n = 2; p = 0.02, Wilcoxon test), and M. terreus (3.09 Mb, n = 1), but larger than M. cohnii (2.24 Mb, n = 1). In addition, the GC content of M. luteus (mean 72.9%, n = 66; ranging from 72.3% to 73.3%) was higher than M. lylae (mean 71.3%, n = 5; p = 0.0002, Wilcoxon test), M. cohnii (70.8%, n = 1) and M. terreus (68.9%, n = 1), but lower than M. endophyticus (mean 73.4%, n = 2; p = 0.02, Wilcoxon test) and M. flavus (73.5%, n = 1) (Figure S1B). Taken together, the differences in ANI values, genome size, and GC content, suggested an apparent divergence between M. luteus and the other species of Micrococcus.
The M. luteus genomes each had 2,148 (strain OG2) to 2,501 (strain SGAir0127) coding sequences (CDSs) (mean 2,287). On average, 94.6% of them could be annotated and functionally categorized using the eggNOG database [26]. Each genome contained 159 to 315 identifiable insertion sequences (ISs), accounting for 7.22% to 13.4% of the genome (mean 8.51%). Each genome also harbored 4 to 15 genomic islands (GIs; mean 8.3) that were 3.1 to 96.2 kbp in size, which together accounted for an average of 6.6% of the whole genome (ranging from 2.3% to 11.4%). These results suggested that strains of M. luteus might have undergone frequent genomic exchanges [27]. In addition, 57 strains (86.4%) encoded Rpf, a secreted protein with an N-terminal transglycosylase-like domain (PF06737) and a C-terminal LysM domain (PF01476), which could stimulate the growth and resuscitation of dormant bacteria [7, 8]. This dormancy-resuscitation mechanism may help M. luteus to survive over extended periods when conditions are not conducive for growth, and to rapidly respond to environmental changes.
Potential virulence factors and antibiotic resistance genes of M. luteus
All M. luteus genomes were locally compared against the Virulence Factors Database (VFDB) [28] to detect virulence genes (Figure S2). We found 30 different putative virulence factors (VFs), 12 of which (40%) were related to GIs. Each genome contained 19 (strains RIT608 and 1058_MLUT) to 31 (strain SK58) VF genes (mean 22). Twelve VFs were shared by all strains, of which three (clpC, clpP, katA) were involved in stress response, three (ideR, phoP, relA) in regulation, two (lirB, CBU_1566) in secretion system, and the remaining four (htpB, gnd, bauE, icl) in adherence, immune evasion, iron uptake and metabolic adaptation, respectively. These core VFs might play important roles in the pathogenicity of M. luteus. Copy number variation of the VF genes was also found. For example, genes ureABG were enriched (three to six copies) in strain SK58 (isolated from human skin). These genes are involved in bacteria urea hydrolysis and have been reported to be crucial to pathogenic bacteria virulence and defense against host immunity [29, 30]. Genes msrA/B(pilB) were expanded (three copies) in five strains and their gene products can promote the successful infection of humans and also respond to adverse conditions [31, 32]. The expansion of VF genes may promote the pathogenicity of the harboring strains. However, these results may be biased due to the incomplete genome sequences for most strains, though only high-quality genome sequences (completeness > 95% and contamination < 5%) were included in our dataset.
We also detected the presence of antibiotic resistance genes (ARGs) in M. luteus. A total of 22 distinctive putative ARGs were identified, half of which were related to GIs (Figure S3). Overall, three ARGs were identified in all strains, including genes associated with macrolide- and penam-resistance (mtrA, with two copies), fosfomycin-resistance (murA) and rifamycin-resistance (rbpA). Additionally, an aminoglycoside phosphotransferase encoded by strA, which confers aminoglycoside-resistance, was found in 59.1% (n = 39) of the strains. The remaining genes (n = 18, 81.8%) exhibited sporadic distribution patterns (including eight strain-specific genes). These genes included 13 antibiotic efflux pump-encoding genes, three antibiotic inactivation enzymes encoding genes (aac(3)-IIb, blaCTX-M-141, blmS), and two genes for antibiotic target alteration (rmtF) and protection (msrE). Different ARG repertoire suggested that different strains may have different antibiotic resistance.
It has been reported that M. luteus could cause severe infections in immunocompromised populations [3-5], and most of the previously reported cases could be successfully treated with a combination of vancomycin and rifampin [33]. There was also a report that a treatment regimen consisting of vancomycin, gentamicin, and rifampicin for four weeks was not successful in the case of native aortic valve endocarditis secondary to M. luteus [34], suggesting the antibiotic resistance mechanism of M. luteus is still very complicated. Consistently, our analyses showed that all of the total 66 genomes contained the rbpA gene, product of which has been reported to confer basal levels of rifampicin resistance on Streptomyces coelicolor [35]. In addition, it was noted that about half the ARGs and VFs were related to GIs, suggesting that horizontal gene transfer (HGT) might be an important driving force for antibiotic resistance and pathogenicity acquisition in M. luteus. Therefore, the VF and ARG profiles may provide guidance for the future treatment of M. luteus infections.
Core- and pan-genome of M. luteus
To explore the entire genomic repertoire of the M. luteus population, estimates of the cloud (genes present in only one or two genomes), shell (genes present in 3-61 genomes), softcore (genes present in at least 62 genomes) and core (genes present in all 66 genomes) genomes were generated using GET_HOMOLOGUES [36]. The 66 M. luteus strains had a pan-genome of 8,077 genes and a core genome of 1,134 (14.0%) genes, including 991 single-copy core genes (Figure 2). The core genome only represented 45.3% to 52.8% of the gene content of each strain, illustrating a relatively high degree of genomic diversity. This core gene ratio was much lower than that of the actinobacterial species Streptomyces albidoflavus (65.3% to 73.0%, recalculated with the same clustering algorithm and parameters, intraspecies ANI > 98.1%) [18]. Furthermore, the cloud genes contained more than half of the pan-genome (4,210 genes, 52.1%; 3.36% for each strain, on average), of which 3,301 were strain-specific genes (singletons, 40.9%; 2.14% for each strain, on average), indicating an exceptionally flexible genome of M. luteus. Correspondingly, the pan-genome for M. luteus still increased with approximately 50 new genes after addition of a 66th genome (Figure 2C). Analysis of the pan-genome curve using a power-law regression model confirmed that the pan-genome was open (Bpan = 0.5), as the curve did not approach a constant as more genomes were selected. This ‘open’ pan-genome is considered an indicator of high HGT rates, especially for strains living in multiple niches [37, 38], suggesting that M. luteus had a high capacity to absorb and integrate external genetic elements from environments. All above results showed a considerable intraspecies heterogeneity of M. luteus, which has been proposed based on macro-restriction analysis using pulsed-field gel electrophoresis [39]. This high genome variability may contribute to the functional diversity of M. luteus strains thriving in various habitats.
We next performed a Clusters of Orthologous Groups (COG) functional classification for each orthologous group (OG) to define possible differences in the functions encoded by the soft-core, cloud and shell genomes of M. luteus (Figure S4). As a result, the soft-core genome had a higher proportion of genes classified in COG categories C (energy production and conversion), E (amino acid transport and metabolism), F (nucleotide transport and metabolism), H (coenzyme transport and metabolism) and J (translation, ribosomal structure and biogenesis) (p < 0.01, Fisher’s exact test), all associated with basic biological functions. The cloud and shell genes were biased toward COG categories L (replication, recombination and repair) and V (defense mechanisms) (p < 0.01, Fisher’s exact test), which may contribute to the intraspecies heterogeneity of the species, as these genes have been proved to play important roles in the acquisition of foreign DNA [40].
Phylogenomic relationship and population structure of M. luteus
To uncover the evolutionary process that led to the genomic diversity, we constructed a highly robust phylogenomic maximum-likelihood (ML) tree of the 66 M. luteus strains based on concatenated 922 single-copy core genes, using the two M. endophyticus strains (‘Micrococcus luteus’ AS2 and ‘Micrococcus luteus’ DE0384) as outgroups (Figure 3). Based on the phylogenomic tree, M. luteus could be divided into three basal clades: Clade I, II, and III. Further analyses revealed clear inter-clade genomic boundaries. Firstly, the ANI heatmap and clustering results (Figure 1A) suggested that the species could be divided into three groups, with high intra-group ANI and relatively low inter-group ANI. Comparison of this clustering result with the phylogenomic tree indicated that the three clades corresponded nicely with the three ANI groups, indicating the existence of inter-clade ANI boundaries. Secondly, in order to investigate whether there were gene content differences between the three clades, a hierarchical clustering tree based on the content of dispensable genes was constructed (Figure S5). This tree clearly showed three clades, which were the same as those in the core genome tree, indicating the existence of inter-clade gene content boundaries. These results suggested that the separation between the three clades has emerged, both in the core and accessory genes.
The intraspecies difference was also studied using methods which are widely used in population structure analyses. By using Fastbaps [41], the entire M. luteus population could be partitioned into three subpopulations at Level 2 (Figure S6). These three subpopulations corresponded to the three major clades on the phylogenomic tree. At Level 3 or higher, one additional subpopulation emerged, which corresponded to only strain NCTC 7563 in Clade II. Furthermore, an admixture model implemented in STRUCTURE [42] software was applied, which showed maximal posterior probability at K = 4, indicating the existence of four ancestral subpopulations (Figure 3 and Figure S6). Among them, three ancestral subpopulations (POP1, POP2, and POP3) were mainly represented by the three major clades. It was noted that, the fourth ancestral subpopulation (POP4, yellow) existed in high proportions in all the three clades, with the highest proportion in the fourth subpopulation uncovered by the above Fastbaps analysis (i.e., strain NCTC 7563 in Clade II). However, since POP4 accounted for only 47.5% of the genome of NCTC 7563, it is unclear whether POP4 can be represented by the branch of strain NCTC 7563. It is possible that POP4 corresponded to an unsampled clade. Besides, since isolation source information for strain NCTC 7563 was missing in the current dataset, other information and more sampling are needed to clarify what the predicted POP4 really is. In summary, while the population structure analysis supported the inter-clade differences, it suggested the existence of an unknown ancestor that contributed to the diversity of the species or the existence of the fourth, yet unsampled, clade.
Gain and loss of genes during the evolution of M. luteus
To unravel the evolutionary events that contributed to the intraspecies gene content differences, gene contents of all ancestral nodes on the phylogenomic tree were predicted and the numbers of gene gain and loss events occurred on all branches were calculated using a parsimony method (Figure S7). As a result, the last common ancestor of M. luteus was inferred to possess 2,028 gene families. The gene family numbers had only slightly increased during evolution, as the gene gain events were a bit more numerous than the gene loss events in most stages during the evolutionary process. Also, a relatively high number of gene content variances occurred at the divergence of the three major clades (77, 64 and 55 for Clades I to III, respectively), consistent with inter-clade gene content differences mentioned above (Figure S5). Interestingly, 88.3% of total gene gain events and 82.8% of total gene loss events occurred at terminal nodes, suggesting that the pan-genome diversity of M. luteus is largely due to frequent but recent strain-specific gene gain/loss events.
High level of homologous recombination in M. luteus species
Recombination, especially homologous recombination, is one of the main forces shaping bacterial evolution [43, 44]. By recombination, bacteria can receive DNA fragments from both the same and other species and integrate them into their chromosomes. Here, a series of analyses were used to evaluate the extent of homologous recombination in core genome of M. luteus. Firstly, NeighborNet network [45] based on the concatenated single-copy core genes showed a reticulate structure, indicating a high level of non-vertical inheritance in phylogenies (Figure 4A). Meanwhile, the pairwise homoplasy index (PHI) [46] statistic provided highly significant evidence for recombination within M. luteus species (p = 0.00).
We also used mcorr [47] to calculate the probability that a pair of genomes differs at one locus conditional on having differences at the other locus. The resulting correlation profile exhibited a monotonic decay (Figure 4B), indicating the presence of recombination. Similar decaying correlation profiles have also been shown in other recombining bacteria, such as Helicobacter pylori, Pseudomonas aeruginosa, Salmonella enterica and Cronobacter sakazakii [47-49]. Meanwhile, mcorr showed the mean fragment size (f̅) of a recombination event in M. luteus species was 874.45 bp; the recombination coverage (c), which indicates the proportion of sites in the genome whose diversity has come from recombination events since its last common ancestor, was 0.46; and the ratio γ/μ, the relative rate of recombination to mutation, was 10.87 (Figure 4C). All these recombination parameters above further confirmed a high recombination rate within M. luteus. A summary of the parameters of M. luteus and other typical species reported was shown in Table 1.
Table 1. Recombination parameters of different species inferred by mcorr.
Species
|
f̅
|
c
|
γ/μ
|
References
|
Acinetobacter baumannii
|
860
|
0.40
|
1.30
|
[47]
|
Campylobacter jejuni
|
1000
|
0.32
|
3.40
|
[47]
|
Cronobacter sakazakii
|
816
|
0.53
|
1.61
|
[49]
|
Klebsiella pneumoniae
|
5800
|
0.27
|
4.20
|
[47]
|
Micrococcus luteus
|
874
|
0.46
|
10.87
|
This study
|
Micrococcus luteus (NHA only)
|
903
|
0.45
|
10.41
|
This study
|
Micrococcus luteus (HA only)
|
849
|
0.49
|
10.93
|
This study
|
Mycobacterium abscessus
|
1200
|
0.21
|
13.00
|
[47]
|
Pseudomonas aeruginosa
|
590
|
0.52
|
11.00
|
[47]
|
Salmonella enterica
|
-
|
0.46
|
6.16
|
[48]
|
Staphylococcus aureus
|
550
|
0.36
|
1.00
|
[47]
|
Staphylococcus pseudintermedius
|
407
|
0.24
|
3.97
|
[50]
|
Yersinia pestis
|
530
|
0.03
|
3.00
|
[47]
|
NHA, non-host-associated;
HA, host-associated;
-, information unavailable.
We next sought to identify the frequently recombining genes. Among the 991 single-copy core genes, 628 genes (63.37%) showed significant evidence of genetic recombination, with P-values (computed from 1,000 permutations) lower than 0.05 in at least two of the three methods implemented in PhiPack [46]. This result was further confirmed by fastGEAR [51], by which a total of 708 genes (71.44%) were detected to have been affected by recent or ancestral recombination events (Figure 4D). The most frequently recombining genes included pabB, cstA, betT, and comEC: pabB encodes an enzyme that catalyzes the two-step biosynthesis of anthranilate and has experienced trans-kingdom gene fusions [52]; cstA is involved in peptide utilization during carbon starvation [53]; betT encodes a choline-glycine betaine transporter, which could help to overcome osmotic stress by the accumulation of compatible solutes [54, 55]; and comEC encodes a transformation protein and has been shown to be absolutely required for DNA uptake and recognition [56].
Recombination has been proved as an important driver of the evolution of most prokaryotes, and acquisition of novel alleles of existing genes will also accelerate ecological adaptation of bacterial populations [57, 58]. The above analyses revealed a high level of homologous recombination in M. luteus, probably allowing frequent gene exchange between M. luteus strains and with other organisms. These results were also consistent with the high proportion of admixture inferred from population assignment and numerous insertion sequences and transposases detected in the genomes.
Habitat-associated accessory OGs underlying putative differential adaptation
M. luteus is widely distributed in various environments and a few strains were isolated as human pathogens or from the mammalian skin or other tissues. In order to investigate whether the genomic differences between clades are relevant to the habitat, the isolation resources of all strains were mapped to the phylogenomic tree (Figure 3 and Figure S6). Strains whose isolation sources were known could be classified into two groups: host-associated (HA), including 18 strains isolated from mammal hosts, and non-host-associated (NHA), including 47 strains isolated from other sources or free living. HA strains were not clustered on the tree but rather widely distributed in all clades, indicating the inter-clade genetic difference was irrelevant to the habitat transition. It was noted that HA strains tended to locate on the lately diverged branches, and that strains on the early diverged branches as well as the closest characterized relatives of M. luteus (M. endophyticus and M. flavus) were all isolated from NHA habitats. These results suggested that the last common ancestor of M. luteus was more likely to be genuine dwellers in NHA niches, and a few descendants transited to a mammal-associated lifestyle recently. This proposition is different from an early speculation that M. luteus was primarily adapted to (mammalian) skin, and its presence in water or soil might come from contamination by skin flakes [12].
Lifestyle changes are drastic events followed by gene gain and loss. We sought to study whether strains from different habitats were associated with different sets of genes. A pan-genome-wide association study (pan-GWAS) revealed 101 accessory genes that were present in at least half of the strains from one habitat but no more than half from the other. Among the 13 genes whose associations were statistically significant (Figure 5 and Table S2, Benjamini-Hochberg P-value < 0.1 and Empirical P-value < 0.05), eleven were enriched in NHA strains and two were enriched in HA strains. It was noted that the two HA genes (OG_2452 and OG_2453) and one of the NHA genes (OG_1888) were located at the same genomic region (Figure 6A). Genes of OG_2452 and OG_2453 encoded a sortase and an excalibur calcium-binding domain-containing protein, respectively. Sortases are enzymes responsible for covalent anchoring of specific proteins to the peptidoglycan of the cell wall of Gram-positive bacteria, performing critical biological functions that are required for the colonization and invasion of host tissues [59, 60]. Sortases have also been considered as important virulence factors, as they play key roles in the infection process [61]. Thus, consistent with their association with HA habitats, genes of OG_2452 and OG_2453 probably enabled HA strains better colonization on mammalian hosts. The NHA gene OG_1888 encoded a protein that was a member of COG0739 of membrane proteins related to metallo-endopeptidases. However, the function of OG_1888 in the adaption to the NHA habitats is unknown. Considering the conserved flanking genes and the presence of a transposase gene inside the region, it is possible that the integration of this region was a result of HGT. Actually, the NHA gene OG_1888 was located within a putative GI, further supporting the hypothesis. Additionally, the OG_2452-OG_2453 gene cluster existed in HA strains located on different clades, indicating that different HA lineages obtained this gene cluster separately and that there may be HGT between different HA lineages.
Most NHA genes were clustered within two regions. One such region contained six NHA genes (OG_1971 to OG_1976) (Figure 6B). In this region, three genes, fadA, fadB, and fadE, responsible for the degradation and recycling of fatty acids via β-oxidation, were identified. Fatty acids are essential components of membranes and are important sources of metabolic energy in all organisms [62]. The other region contained genes encoding a type IV toxin-antitoxin 'innate immunity' bacterial abortive infection (Abi) system, consisted of a bicistronic operon encoding an AbiEi antitoxin and an AbiEii toxin (OG_1985 and OG_1986, Figure 6C). The Abi system, abbreviated for the phage abortive infection system, provides a post-infection resistance mechanism that could block phage multiplication and result in the death of the infected bacterial cell upon phage infection [63-65], and may also act as stress-response elements that help cells survive unfavorable growth conditions [66]. Beside these two genomic regions, there were also two genes (OG_1915 and OG_2057) showing a significant association with NHA habitat. One of these (OG_2057) encoded a glyoxalase (COG0346 / PF18029), which has been reported to play a role in the detoxification of methylglyoxal in bacterial cell metabolism when suffering environmental stresses [67]. Thus, these NHA genes probably enhanced the adaptation to complex environments.
The above habitat-associated genes suggested that the nascent ecological differentiation has already been initiated and pathogenic strains and lineages in this species are merging. However, it is unclear whether the differentiation has been completed. It has been proposed that microbial speciation is usually driven by natural selection for adaptation to distinct ecological niches, and the distinctness is further maintained by barriers to gene flow [68]. This process is not necessarily continuous and complete, and may be terminated at any time [14]. In fact, we did not detect any clues to the existence of recombination barriers between habitats, as there was no obvious difference in recombination parameters between datasets of HA, NHA, and all strains (Table 1). Here, using M. luteus, we have shown another case to depict the process of bacterial ecological differentiation, whether or not it proceeds to completion. During this process, niche-specifying (adaptive) variation has already emerged. However, the separation process seems to be suppressed probably because of the high-level recombination and the strong diffusibility between niches.