Integrated analysis of genomic islands
The n-TASE cluster sequence from B. seminalis TC3.4.2R3 was not identified by IslandViewer as a potential genomic island (Fig. 1). The prediction based on IslandPath-DIMOB uses six ORFs as a cluster because single-ORF dinucleotide bias is highly variable. Previous codon-based analysis showed that a minimum cluster of genes of approximately 4.5 kb (corresponding to approximately 6–8 ORFs) is required for reliable estimation of nucleotide composition. Moreover, IslandPath cannot detect HGT of individual genes or islands obtained from organisms with similar DNA signals (Hsiao et al. 2003). In the SIGI-HMM approach, GI prediction is performed on the gene level (Waack et al. 2006). GIs usually range in size from 5–500 kb in the IslandPick prediction. GIs smaller than 8 kb were not targeted, and only those 8–31 kb in size can be predicted as GIs using this method (Langille et al. 2008).
Comparative analysis of the genetic structure and PCR detection of n-TASE cluster
The common 23,720 bp encoding regions of TC3.4.2R3 (comprising genes from locus tag Bsem_02847 to Bsem_02869) were used as a reference for the development of a comparative syntenic gene map (Fig. 2 and Supplementary material Table S1) of Burkholderia strains (with or without the n-TASE cluster sequence). The upstream (Bsem_02847 to Bsem_02856) and downstream (Bsem_02861 to Bsem_02869) homologous genes flanking the n-TASE cluster were considered for comparative analysis. The genetic organization was developed based on the gene map available at the IMG website (details such as gene and intergenic spaces were represented based on the IMG gene map).
The structures of the n-TASE cluster and flanking genes were conserved between B. seminalis TC3.4.2R3 and B. cenocepacia DWS37E-2. Except for the absence of the n-TASE cluster, the structure of this region was similar to B. cenocepacia J2315, H111, and B. ambifaria AMMD. Aquamicrobium sp. SK-2 and B. contaminans LMG23361 presented a similar structure, including the n-TASE cluster. However, there is an extra insertion with six genes between Bsem_02862 and Bsem_02863 (co-chaperonin GroES gene and carbohydrate-selective porin OprB gene, respectively in TC3.4.2R3) in Aquamicrobium sp. SK-2 and B. contaminans LMG23361 strains (Fig. 2), responsible for encoding proteins homologous to the ABC or the iron transport system. The strain B. seminalis FL-5-4-10-S1-D7 also presents this extra six-gene insertion, which is also related to transport; however, the n-TASE was absent. The structure of this 23,720 bp common region with n-TASE was more similar between B.seminalis TC3.4.2R3 and B. cenocepacia DWS37E-2 than between B.seminalis TC3.4.2R3 and B. seminalis FL-5-4-10-S1-D7. PCR analysis confirmed a ~ 4,378 bp cluster in the TC3.4.2R3 strain absent in B. cenocepacia H111 and other B. seminalis strains (from the BCCM/LMG Bacteria Collection), which generated a 500-bp amplified DNA fragment (Fig. 3).
Sequence similarity searches and genome composition
The TC3.4.2R3 strain n-TASE cluster presented ≥80% identity and e-value of 9e-178 with four B. contaminans strains (Xl73, SK875, ZCC, and CH-1) (Supplementarymaterial Table S2). No homologous sequences of other Burkholderia species or other bacteria were returned in this sequence similarity search.
We conducted further homolog searches using the Gene Ortholog Neighborhoods tool in the IMG Gene ID website. The search started with the Bamb_0734 (locus tag for putative pyrimidine kinase gene in the B. ambifaria AMMD genome), homologs of pyrimidine kinase, chaperonin GROEL, and 16S rRNA gene sequences from representative Burkholderia genomes that were downloaded from the IMG website. For the genomes containing the n-TASE cluster region, the same methodology was applied; however, in the course of a further search among the genomes returned as orthologs for the Bamb_0734 locus tag, some showed the four-gene cluster between the same pyrimidine kinase and chaperonin GROEL genes. The strains also had their gene sequences of interest (16S rRNA, pyrimidine kinase, glycosyltransferase, methyltransferase, hypothetical protein, hypothetical protein, and chaperonin GROEL) downloaded from the IMG website for further analyses.
We selected nine bacteria and 14 Burkholderia strains with or without the n-TASE cluster (bacterial names, strain identification, and NCBI taxon ID are listed in the materials and methods section) to perform the 16S rRNA gene phylogeny, GC content, and nucleotide identity with B. seminalis TC3.4.2R3 (Fig. 4). The phylogenetic analysis based on the 16s rRNA gene revealed that species that presents the n-TASE cluster homologous sequence are not evolutionary related. This result suggest that the origin of the n-TASE cluster can not be explained by this phylogeny, which shows that B. seminalis is closely related to B. cenocepacia. B. cenocepacia DWS 37E-2 strain presented higher synteny in this n-TASE region with B. seminalis TC3.4.2R3 grouped in a distinct clade of B. cenocepacia species. These results suggested that the n-TASE cluster has a extensive history of independent evolution, although it undoubtedly shares a common ancestor with the Aquamicrobium sp. SK-2 and B. contaminans LMG23361 strains.
Almost all compared sequences were from the Bcc group, belonging to opportunistic pathogenic strains, except for Paraburkholderia sacchari, B. gladioli, and the non-Burkholderia strains. The differences in the GC content of n-TASE cluster genes and their flanking genes, pyrimidine kinase, and chaperonin GROEL genes were slight for all compared sequences between the Burkholderia strains (Fig. 4, numbers showed in red). The GC content of the pyrimidine kinase gene for most of the Burkholderia ranged from 69.4% to 70% for B. cenocepacia DWS 37E-2 and B. latens AU17928 was 67.9% (slightly lower). For the chaperonin GROEL gene, the GC content ranged from 60.7% to 62.7% for all compared sequences. The GC content for the four genes of the n-TASE cluster showed a pattern value around 60% in the TC3.4.2R3 strain. This value was also observed in strains containing this sequence.
Aquamicrobium sp. SK-2 showed the highest nucleotide identity in comparison with the n-TASE cluster of TC3.4.2R3. There was 88% of identity for the glycosyltransferase gene, 86% for the methyltransferase gene, and 80% and 81% identity for the hypothetical proteins (Bsem_02859 and Bsem_02860, respectively). Some sequences showed “no possible alignment” with the TC3.4.2R3 nucleotide sequences, as in Burkholderia sp. HI2500, AU6039 and AU31652 strains, where the methyltransferase gene showed no correspondence alignment (ns in Fig. 4) with the methyltransferase (Bsem_02858) of TC3.4.2R3. For the B. cenocepacia DWS 37E-2, the alignment was possible only for the latter hypothetical protein (61%); for B. latens AU17928, none of the four genes showed possible alignment with the TC3.4.2R3 n-TASE cluster. Most of the compared strains showed 89%–99% nucleotide identity with the pyrimidine kinase gene; the same range was observed for the chaperonin GROEL gene, except for B. contaminans LMG233361 (63% identity) and B. cenocepacia DWS 37E-2 (63% identity) (Fig. 4).
Comparing the amino acid sequences of the strains that presented a result of “no significant similarity found” in at least one of the genes composing the n-TASE cluster in the search for nucleotide identity (Burkholderia sp. AU6039; B. cenocepacia DW 37E-2; B. latens AU17928; Burkholderia sp. AU31652; Burkholderia sp. HI2500 and B. territori MSMB1919WGS), we found that the four genes that compose the n-TASE cluster had a range of 76–87% amino acid identity with the n-TASE cluster of TC3.4.2R3 (Fig. 5); however, all chaperonin GROEL gene sequences from these strains resulted in “no significant similarity” for amino acid comparison (Fig. 5).
CAI analysis was performed with genes located at the upstream position of the n-TASE cluster (locus tag ranging from Bsem_02850) and genes located downstream of the n-TASE cluster (locus tag ranging to Bsem_02870) to determine if the host genome composition of TC3.4.2R3 differs from the n-TASE cluster sequence. The CAI failed to show a significant difference between genes around the cluster and the n-TASE cluster. The CAI value for genes Bsem_02850 to Bsem_02856 and Bsem_02861 to Bsem_02870 was 0.714 ± 0.028; for n-TASE genes, the CAI value was 0.7± 0.009 (Supplementary material Table S3).
The same calculation was performed for GC content. For the n-TASE upstream and downstream genes, the surrounding genes' GC content was 69.5% ± 4.59, and for the n-TASE genes, GC content was 60.65% ± 2.24. According to the t-test, these values (GC content and CAI) did not differ statistically.
Phylogenetic analysis
Genes from several strains of Burkholderia and non-Burkholderia species (i.e., Aquamicrobium sp. and Mumia flava) were used to perform gene-by-gene phylogenetic analyses. The topology of the phylogenetic trees was used to understantd the evolutionary history of the n-TASE cluster (Fig. 6). Genes from the n-TASE cluster region in other strains are distant from those of the TC3.4.2R3 strain. Interestingly, the n-TASE cluster genes from Aquamicrobium sp. SK-2 are more closely related to genes in the B. contaminans LMG23361 than other Burkholderia strains.
The pyrimidine kinase and chaperonin GROEL genes (Bsem_02856 and Bsem_02861 in TC3.4.2R3) were concatenated to build a phylogenetic tree (Fig. 6a). Sequences from strains without n-TASE cluster clustered closely; however, strains containing the n-TASE cluster region clustered in two major groups, in which B. cenocepacia DWS 37-2, B. latens AU17928, and B. territori MSMB1919WGS formed a cluster, and all other strains formed another clade (Fig. 6a).
The phylogenetic tree made only with the pyrimidine kinase gene concatenated with the chaperonin GROEL from strains containing the n-TASE region (Fig. 6b, also showing the kinase/GROEL sequence from TC3.4.2R3 as a distinct group. The flanking genes of the n-TASE region in TC3.2.4R3, although coding the same gene product and showing the same genetic organization, probably have a completely different evolutionary history.
The phylogenetic tree for all four genes from the n-TASE cluster presented a five-clade tree (Fig. 6c), in which the n-TASE cluster from TC3.4.2R3 once more did not cluster with any other compared sequence; however, it formed a sister group to the clade composed by Aquamicrobium sp. SK-2; B. contaminans LMG23361 and Burkholderia sp. strains (AU3165, HI2500, and AU6039). Comparing the 16S rRNA gene phylogeny (Fig. 4) and the genetic composition for the n-TASE cluster and flanking genes (Fig. 6c) showed no agreement between these trees.
The analysis of GC content failed to support the notion of HGT of the n-TASE cluster because there is little variation between the compared values. Furthermore, genome composition analysis based on the CAI comparison did not support the HGT event. On the other hand, the 16S rRNA revealed an incongruency in phylogeny in which the groups evolutionarily closest to TC3.4.2R3 did not show the n-TASE cluster region in their genomes, leading us to hypothesize that the n-TASE cluster has a long horizontally acquisition history.
To perform a glycosyltransferase comparative analysis, 48 putative glycosyltransferase gene sequences were selected and downloaded from the IMG website. There were five glycosyltransferase sequences from B. cepacia AMMD and B. cenocepacia DWS37E-2, seven sequences from B. seminalis FL-5-4-10-S1-D7, and B. cenocepacia H111, two sequences from B. cenocepacia J2315, four sequences from B. contaminans, and three sequences from Aquamicrobium sp. SK-2 (Supplementary material Table S4). In the TC3.4.2R3 genome, we selected 15 glycosyltransferases with diverse family annotation, including the Bsem_02857 glycosyltransferase. For the strains containing the n-TASE cluster, the Bsem_02857 homologous gene was also selected for this analysis [Aquamicrobium sp. SK-2 (Ga0098313_1081193); B. contaminans LMG23361 (WR31_RS18360) and B. cenocepacia DWS37Ee-2 (DM40_RS21930)].
The Bsem_02857 glycosyltransferase gene (TC3.4.2R3_GTF11) clustered (node G1) with gene sequences from Aquamicrobium sp. (SK2_GTF48) and B. contaminans LMG23361 (LMG 23361_GTF42), forming a sister group (node G2) with the sequence DWS37E-2_GTF32 from B. cenocepacia DWS37E-2. Sequences from this node form a sister group (node G3) with another clade (node G4 – Fig. 7) formed by three B. seminalis sequences, FL-5-4-10-S1-D7 (FL-5-4-10-S1-D7_GTF19, FL-5-4-10-S1-D7_21), FL-5-4-10-S1-D7_22, and a sequence from B. cenocepacia H111 (H111 _GTF41).
Despite the distant phylogeny according to the 16S rRNA gene tree, the homologous genes of Bsem_02857 showed the maximum evolutionary history with glycosyltransferase in question; 42% of selected glycosyltransferases from B. seminalis FL-5-4-10-S1-D7 clustered with a sister group in the clade containing Bsem_02857. None of the 14 selected glycosyltransferase sequences selected from TC3.4.2R3 strain grouped with Bsem_02857. In this comparison, there were no homologous Bsem_02857 glycosyltransferase genes in the genomes (Fig. 7). The Bsem_02857 probably shares a common ancestor with the homologous glycosyltransferase from Aquamicrobium sp. SK-2 and B. contaminans LMG23361.
We selected 70 putative methyltransferase gene sequences to perform a methyltransferase gene comparison to identify possible homologous sequences in the TC3.4.2R3 genome or the other seven bacterial genomes. There was a diverse function annotation of these selected genes. Of these, there were 30 methyltransferase gene sequences (including the Bsem_02858) from TC3.4.2R3 strain, 11 sequences from B. seminalis FL-5-4-10-S1-D7, five sequences from B. cepacia AMMD; B.cenocepacia H111 and B. cenocepacia J2315, 11 sequences from B. cenocepacia DWS37Ee-2, four sequences from B. contaminans LMG23361, and nine sequences from Aquamicrobium sp. SK-2 (Supplementary material Table S4).
The frequency of gene sequences annotated as methyltransferases varied from strain to strain, B. cepacia AMMD, B. cenocepacia H111, B. cenocepacia J2315, and B. contaminans LMG23361 were strains with fewer available methyltransferase sequences. The Bsem_02858 homologous genes from the selected n-TASE containing strains were included in this comparison [Aquamicrobium sp. SK-2 (Ga0098313_1081194); B. contaminans LMG23361 (WR31_RS18355) and B. cenocepacia DWS37Ee-2 (DM40_RS21925)]. The 70 collected putative methyltransferase gene sequences were encoded in chromosomes. Genes encoded in plasmids were not taken into account for this comparison.
A methyltransferase maximum likelihood tree was constructed, and Bsem_02858 (TC3.4.2R3_MTF20) grouped with the methyltransferase sequence from B. cenocepacia DWS 37E-2 (DWS 37E-2_MTF49) (node M1 - Fig. 8). This consists of a sister group of node M3 composed of methyltransferases from Aquamicrobium sp. SK-2 (SK2_MTF69) and B. contaminans LMG23361 (LMG23361_58), which form the clade M2. Sequences were saved from the IMG website, and the locus tag used in this website were not always the same as those used in the “Burkholderia Genome DB” (https://www.burkholderia.com/). The methyltransferase genes that clustered (node M3) with the Bsem_02858 sequence (TC3.4.2R3_MTF20) were homologous in the respective genomes of Aquamicrobium sp. SK-2 (Ga0098313_1081194), B. contaminans LMG23361 (WR31_RS18355), and B. cenocepacia DWS37Ee-2 (DM40_RS21925). None of the 19 compared methyltransferases selected from TC3.4.2R3 grouped with the Bsem_02858 sequence. It is clear that Bsem_02858 shares no common ancestor with any other compared methyltransferase genes, even those that were classified as its homologous gene. There are no methyltransferases in the TC3.4.2R3 genome homologous to the Bsem_02858 methyltransferase gene (Fig. 8). All other methyltransferase sequences from TC3.4.2R3 clustered in two groups (the respective nodes are marked with red circles in figure 8). Neither transferase enzyme from the n-TASE cluster showed any similarity nor shared evolutionary background with any other enzymes in the genome of TC3.4.2R3.