General features of Impatiens
As a result, the genomic libraries yielded a total of 28.6 GB of data. The 12 complete Balsaminaceae cp genomes ranged in size from 151,538 bp (I. fanjingshanica) to 154,189 bp (H. triflora) (Table 1). Maps of the newly sequenced Impatiens cp genomes (I. chlorosepala, I. fanjingshanica, I. guizhouensis, I. linearisepala, I. loulanensis, and I. stenosepala) are provided in Fig. 1 and Supplementary Fig. S1-S6. As in other typical angiosperm chloroplast genomes, each complete cp genome consisted of four conjoined regions forming a circular molecule: a pair of inverted repeats (IRs) separated by a large single-copy (LSC) region and a small single-copy (SSC) region. Across the family Balsaminaceae, the LSC accounted for 54.47% (82,542 bp, I. fanjingshanica) to 55.04% (84,865 bp, H. triflora) of the total chloroplast genome size; the SSC accounted for 11.37% (17,309 bp, I. linearisepala) to 11.73% (18,080 bp, H. triflora); and each IR accounted for 16.62% (25,622 bp, H. triflora) to 16.98% (25,726 bp, I. fanjingshanica). Among the six newly sequenced Impatiens species, the LSC accounted for 54.47% (82,542 bp, I. fanjingshanica) to 54.86% (83,508 bp, I. linearisepala) of the total chloroplast genome size; the SSC accounted for 11.37% (17,309 bp, I. linearisepala) to 11.61% (17,739 bp, I. stenosepala); and each IR accounted for 16.83% (25,720 bp, I. stenosepala) to 16.98% (25,726 bp, I. fanjingshanica) (Table 1).
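The share of each region in the total genome size can be recomputed directly from the lengths listed in Table 1. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative only, and the IR share refers to a single IR copy.

```python
# Region lengths (bp) copied from Table 1: (total, LSC, SSC, one IR copy).
TABLE1 = {
    "I. chlorosepala":   (152763, 83740, 17477, 25773),
    "I. fanjingshanica": (151538, 82542, 17547, 25726),
    "I. guizhouensis":   (152774, 83572, 17662, 25772),
    "I. linearisepala":  (152212, 83508, 17309, 25699),
    "I. loulanensis":    (152472, 83460, 17541, 25737),
    "I. stenosepala":    (152802, 83626, 17739, 25720),
}


def region_shares(species):
    """Return the percentage of the total genome taken by LSC, SSC and one IR."""
    total, lsc, ssc, ir = TABLE1[species]
    regions = (("LSC", lsc), ("SSC", ssc), ("IR", ir))
    return {name: round(100 * length / total, 2) for name, length in regions}


if __name__ == "__main__":
    for species in TABLE1:
        print(species, region_shares(species))
```

Taking the minimum and maximum of each share across species gives the percentage ranges for the six newly sequenced genomes.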
Like other typical angiosperms, the chloroplast genomes of the Balsaminaceae species, except for I. glandulifera and H. triflora, encoded 114 distinct genes, including 81 protein-coding genes, 29 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes (Supplementary Table S2). In H. triflora, which has a total of 115 genes, the trnG-UCC gene was annotated as a pseudogene, unlike in the Impatiens species. In I. glandulifera, the names of the ycf15 and trnfM-CAU genes are interchanged because of incorrect annotation. The genes were classified into three groups according to function: (1) photosynthesis-related genes (Rubisco, ATP synthase, photosystem I, cytochrome b/f complex, photosystem II, cytochrome c synthesis, and NADPH dehydrogenase); (2) transcription and RNA genes, including 4 transcription genes (rpoA, rpoB, rpoC1, and rpoC2), 20 ribosomal protein genes, 4 ribosomal RNA genes (rrn4.5, rrn5, rrn16, and rrn23), and 30 transfer RNA genes; and (3) other genes, including four genes with known function (matK, cemA, accD, and clpP) and conserved reading frames (ycf1, ycf2, and ycf15) (Table 2 and Supplementary Table S1).
Sixteen unique intron-containing genes were annotated in the Impatiens species, whereas the intron was missing from one of these genes in I. piufanensis and H. triflora, namely the rps16 gene and the trnG-GCC tRNA gene. Among the 16 genes, two (ycf3 and clpP) contained two introns and 14 contained a single intron. Moreover, 11 genes (clpP, ycf3, trnV-UAC, rps12, trnK-UUU, rpoC1, petB, trnL-UAA, atpF, trnG-GCC, and rps16) were located in the LSC region, 4 genes (trnI-GAU, trnA-UGC, ndhB, and rpl2) in the IR regions, and only one gene (ndhA) in the SSC region. The longest intron was that of trnK-UUU, ranging from 2,488 bp (I. loulanensis) to 2,548 bp (I. guizhouensis), and the longest exon was that of rpoC1. The rps12 gene is trans-spliced, with the 5'-rps12 end in the LSC region and the 3'-rps12 end in the IR region (Table 2 and Supplementary Table S3).
Table 1. Complete chloroplast genomes of the six newly sequenced Balsaminaceae species

| Feature | I. chlorosepala | I. fanjingshanica | I. guizhouensis | I. linearisepala | I. loulanensis | I. stenosepala |
|---|---|---|---|---|---|---|
| Reference | this study | this study | this study | this study | this study | this study |
| Family | Balsaminaceae | Balsaminaceae | Balsaminaceae | Balsaminaceae | Balsaminaceae | Balsaminaceae |
| Genus | Impatiens | Impatiens | Impatiens | Impatiens | Impatiens | Impatiens |
| Total length (bp) | 152,763 | 151,538 | 152,774 | 152,212 | 152,472 | 152,802 |
| Total GC (%) | 36.7 | 36.9 | 37 | 37 | 36.7 | 36.9 |
| LSC length (bp) | 83,740 | 82,542 | 83,572 | 83,508 | 83,460 | 83,626 |
| LSC GC (%) | 34.3 | 34.6 | 34.8 | 34.8 | 34.4 | 34.5 |
| SSC length (bp) | 17,477 | 17,547 | 17,662 | 17,309 | 17,541 | 17,739 |
| SSC GC (%) | 29.5 | 29.4 | 29.9 | 30 | 29.6 | 29.8 |
| IR length (bp) | 25,773 | 25,726 | 25,772 | 25,699 | 25,737 | 25,720 |
| IR GC (%) | 43.1 | 43.1 | 43 | 43 | 43 | 43.2 |
| CDS length (bp) | 79,562 | 79,689 | 79,941 | 79,533 | 79,650 | 79,581 |
| CDS GC (%) | 37.2 | 37.2 | 37.4 | 37.3 | 37.1 | 37.2 |
| rRNA length (bp) | 9,048 | 9,048 | 9,046 | 9,048 | 9,048 | 9,048 |
| rRNA GC (%) | 55.1 | 55.1 | 55.1 | 55.2 | 55.1 | 55 |
| tRNA length (bp) | 2,876 | 2,872 | 2,872 | 2,872 | 2,872 | 2,884 |
| tRNA GC (%) | 52.4 | 52.6 | 52.7 | 52.5 | 52.6 | 52.6 |
| Total genes | 114 | 114 | 114 | 114 | 114 | 114 |
| CDS | 81 | 81 | 81 | 81 | 81 | 81 |
| tRNA | 29 | 29 | 29 | 29 | 29 | 29 |
| rRNA | 4 | 4 | 4 | 4 | 4 | 4 |
Differences in Genome Size
Among the 12 Balsaminaceae species, I. fanjingshanica had the shortest genome (151,538 bp) and H. triflora the longest (154,189 bp). Among the six newly sequenced species, I. stenosepala had the longest genome (152,802 bp) and I. fanjingshanica the shortest (151,538 bp). Except for I. stenosepala and I. fanjingshanica, the genome sizes of the newly sequenced Impatiens species were between 152,212 bp and 152,774 bp (Table 1). Except for I. fanjingshanica, the genomes of all Balsaminaceae species were longer than 152,000 bp (Supplementary Table S1). In the 12 Balsaminaceae species, the total length of the protein-coding genes ranged from 79,533 bp (I. linearisepala) to 80,952 bp (H. triflora). The total length of the rRNA genes was 9,048 bp, except in I. guizhouensis, I. glandulifera, and H. triflora, in which it was 9,046 bp, 9,050 bp, and 9,046 bp, respectively. The total length of the tRNA genes was 2,872 bp, except in I. chlorosepala, I. stenosepala, I. glandulifera, and H. triflora, in which it was 2,876 bp, 2,884 bp, 2,419 bp, and 2,815 bp, respectively (Supplementary Table S1). The guanine-cytosine (GC) contents of the species were very similar, both across the whole cp genomes and within the corresponding LSC, SSC, and IR regions. The overall GC content in the Balsaminaceae species ranged from 36.7% to 37%, with I. chlorosepala and I. loulanensis having the lowest and I. guizhouensis and I. linearisepala the highest GC content (Table 1). The average GC contents of the LSC, SSC, and IR regions were 34.56%, 29.7%, and 43.0%, respectively (Table 1 and Supplementary Table S1).
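For readers who wish to reproduce GC percentages such as those in Table 1, the calculation is simply the proportion of G and C among the unambiguous bases of a sequence. The sketch below is illustrative; the function name is hypothetical, and ignoring ambiguous bases such as N is an assumption rather than the setting of any particular annotation tool.

```python
def gc_content(seq):
    """Return GC content in percent, counting only unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    # Toy sequence, not taken from any of the genomes discussed here.
    print(round(gc_content("ATGCGCATTA"), 1))
```

Applying the same function to the LSC, SSC, and IR slices of a genome gives the per-region values.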
Table 2. List of genes in the chloroplast genomes of Impatiens species

| Function of genes | Group of genes | Gene names |
|---|---|---|
| Photosynthesis-related genes | Rubisco | rbcL |
| | Photosystem I | psaA, psaB, psaC, psaI, psaJ |
| | Assembly and stability of Photosystem I | ycf3**, ycf4 |
| | Photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ |
| | ATP synthase | atpA, atpB, atpE, atpF*, atpH, atpI |
| | Cytochrome b/f complex | petA, petB*, petD, petG, petL, petN |
| | Cytochrome c synthesis | ccsA |
| | NADPH dehydrogenase | ndhA*, ndhB*(2), ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK |
| Transcription and translation-related genes | Transcription | rpoA, rpoB, rpoC1*, rpoC2 |
| | Ribosomal proteins | rpl2*(2), rpl14, rpl16, rpl20, rpl22, rpl23(2), rpl33, rpl36, rps2, rps3, rps4, rps7(2), rps8, rps11, rps12*(2), rps14, rps15, rps16*, rps18, rps19(2) |
| RNA genes | Ribosomal RNA | rrn4.5, rrn5, rrn16, rrn23 |
| | Transfer RNA | trnA-UGC(2), trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnfM-CAU, trnG-GCC*, trnG-UCC, trnH-GUG, trnI-CAU*(2), trnI-GAU(2), trnK-UUU*, trnL-CAA(2), trnL-UAG, trnL-UAA*, trnM-CAU, trnN-GUU(2), trnP-UGG, trnQ-UUG, trnR-ACG(2), trnR-UCU, trnS-GCU, trnS-GGA, trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC(2), trnV-UAC*, trnW-CCA, trnY-GUA |
| Other genes | RNA processing | matK |
| | Carbon metabolism | cemA |
| | Fatty acid synthesis | accD |
| | Proteolysis | clpP** |
| Genes of unknown function | Conserved reading frames | ycf1, ycf2(2), ycf15(2) |

(2) indicates that the number of repeat units (gene copies) is 2; *gene contains one intron; **gene contains two introns.
Codon Usage
To explore the genetic information of Impatiens and its relationship to evolution and phylogeny, we analyzed codon usage in the coding regions. The protein-coding genes were encoded by 50,512 (I. fanjingshanica) to 51,396 (H. triflora) codons. UGA, UAG, and UAA were treated as termination codons. In these Balsaminaceae species (Supplementary Table S4), the most abundant amino acid was leucine, with its codon UUA showing the highest RSCU (relative synonymous codon usage) value, approximately 1.92. Tryptophan was the least frequent amino acid. All amino acids except methionine and tryptophan have more than one synonymous codon; among them, leucine, arginine, and serine each have 6 codons. The RSCU results showed that the third codon position was biased toward A or T rather than G or C in the 12 Balsaminaceae species. I. glandulifera had 30 codons that were used less frequently than expected at equilibrium (RSCU < 1). In H. triflora, 36 codons showed usage bias, more than in the remaining Impatiens species, which showed codon usage bias in 34 codons.
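RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so values above 1 indicate preferred codons and values below 1 indicate avoided ones. The following is a minimal sketch of that calculation, assuming the standard genetic code and an in-frame coding sequence; the function name is illustrative and this is not the software used in the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for an in-frame coding sequence (DNA or RNA)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families (stop codons form their own family).
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequence
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    toy = "ATGTTATTATTACTG"  # toy sequence: Met, Leu (UUA) x3, Leu (CUG)
    print(rscu(toy)["TTA"], rscu(toy)["CTG"])
```

Codons are reported in DNA notation, so the UUA codon discussed above appears as TTA. Single-codon amino acids (methionine and tryptophan) always have an RSCU of 1 under this definition.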
Repeat Structure Analysis
In the 12 Balsaminaceae species, 246 long repeats of four types (forward, complement, reverse, and palindromic) were identified using REPuter (Supplementary Table S5). Forward and palindromic repeats were the most common types. Complement repeats were identified only in I. guizhouensis, and reverse repeats were found only in I. chlorosepala, I. fanjingshanica, I. linearisepala, I. balsamina, and I. hawkeri. Most repeats were 30-40 bp long (Figure 2B). The accession with the greatest number of repeats was I. chlorosepala, with 25 repeats comprising 14 forward, 9 palindromic, and 2 reverse repeats. I. linearisepala, which had the smallest number of repeats, had 5 forward, 7 palindromic, and 3 reverse repeats (Figure 2A). The greatest numbers of forward, palindromic, and reverse repeats were found in I. chlorosepala (14), I. balsamina (34), and I. linearisepala (3), respectively.
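The four repeat types differ only in how the second copy relates to the first: identical (forward), reversed (reverse), base-complemented (complement), or reverse-complemented (palindromic). The sketch below illustrates these definitions for exact matches only; it is not REPuter itself, which also handles mismatches and length thresholds, and the function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_type(first, second):
    """Classify the relationship between two equal-length repeat copies."""
    first, second = first.upper(), second.upper()
    complemented = first.translate(COMPLEMENT)
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == complemented:
        return "complement"
    if second == complemented[::-1]:
        return "palindromic"
    return None  # not an exact repeat of any of the four types


if __name__ == "__main__":
    # Toy copies, not taken from the genomes discussed here.
    for other in ("AAGCT", "TCGAA", "TTCGA", "AGCTT"):
        print(other, repeat_type("AAGCT", other))
```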
Simple Sequence Repeat Analysis
Simple sequence repeats (SSRs), also called microsatellites, are widely used as molecular markers and play a significant role in plant identification and classification. In the 12 Balsaminaceae species, 51-109 SSRs ranging in size from 10 to 20 bp were detected per species, and six kinds of SSRs were discovered (Figure 3A and Supplementary Table S6). Only H. triflora had hexanucleotide repeats, and only I. loulanensis, I. stenosepala, I. balsamina, I. walleriana, and H. triflora had pentanucleotide repeats. The number of mononucleotide repeats ranged from 59 (I. linearisepala) to 82 (I. chlorosepala), followed by dinucleotide repeats, which ranged from 5 (I. hawkeri) to 13 (I. chlorosepala, I. fanjingshanica, and I. glandulifera) (Figure 3B-G). Therefore, mononucleotide and dinucleotide repeats may play a more significant role in genetic variation.
In the six newly sequenced species, mononucleotide repeats were the most abundant, with A/T repeats the most highly represented, whereas poly C/G repeats were rather rare and were found only in I. chlorosepala, I. fanjingshanica, I. guizhouensis, and I. loulanensis. Moreover, the number of A mononucleotide repeats ranged from 24 (I. fanjingshanica and I. linearisepala) to 37 (I. loulanensis), and the number of T mononucleotide repeats ranged from 35 (I. linearisepala) to 48 (I. fanjingshanica) (Figure 3B-G).
Among the dinucleotide repeat motifs, AT/AT was the most abundant. In the newly sequenced species, the SSR analysis showed that I. chlorosepala had the highest number of SSRs (109) and I. linearisepala the lowest (74). Trinucleotide motifs (ATT, GAA, TAA, TTA, TAT, ATA, and TTG) and tetranucleotide motifs (AAAT, AATA, AATT, ATAA, TAAA, TATT, TTCA, TTTA, GTTT, and TTCT) were identified. Among these cp genomes, however, pentanucleotide repeats (AAAAG and CAAAA) were found only in I. loulanensis and I. stenosepala.
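SSR detection of the kind summarized above amounts to scanning the sequence for a short motif repeated in tandem at least a minimum number of times. The sketch below shows one way to do this with regular expressions. The minimum repeat numbers in `MIN_REPEATS` are assumptions chosen for illustration, since the thresholds used in the study are not stated in this section, and the function name is hypothetical.

```python
import re

# Assumed minimum tandem copies per motif length (mono- to hexanucleotide).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_copies in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_copies - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT".
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG"  # toy sequence
    print(find_ssrs(toy))
```

Start positions are 0-based. Only perfect repeats are reported, so compound or interrupted SSRs would need additional handling.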
Comparison of the Genome Structure
The structure and size of the chloroplast genome can vary with evolutionary and genetic background. A collinearity analysis was used to compare the chloroplast genomes. Mauve alignment of the plastomes showed that the Impatiens plastome structure is similar to that of the dicot Rosa (MK947051.1) (Figure 4A). Comparison with the monocots Triticum aestivum (NC_002762.1) and Oryza sativa (NC_008155.1), however, showed that the structural differences between monocots and dicots derive from intermolecular recombination events (Figure 4A). No interspecific or intraspecific rearrangements were detected within the six species, indicating that all genes (including ribosomal RNA, tRNA, and protein-coding genes) in the Balsaminaceae are comparatively conserved and present in the same order (Figure 4B); collinearity was likewise high within subgenus Impatiens, with no gene rearrangement. Moreover, compared with H. triflora, the linear relationships in genome structure and gene sequence indicate high chloroplast genome homology.
Comparative Genomic Divergence and Genome Rearrangement
A comparative analysis of the whole cp genomes of H. triflora and the other Impatiens species, notably I. chlorosepala, I. fanjingshanica, I. guizhouensis, I. linearisepala, I. loulanensis, and I. stenosepala, was conducted using mVISTA and DnaSP to detect hypervariable regions and to plot sequence identity across the entire cp genome (Figure 5A). The comparison showed that the number and order of genes in the IR regions were relatively conserved and that the IR regions were less divergent than the LSC and SSC regions; the IR regions were the most conserved (Figure 5B and 5C).
Among the coding genes, the highly divergent regions included matK, psbK, petN, psbM, atpE, rbcL, accD, psaL, rpl16, rpoB, ndhB, ndhF, ycf1, and ndhH (Figure 5A). Among the intergenic regions, atpH-atpI, trnC-trnT, rps3-rps19, and ndhG-ndhA were the most variable. In the LSC region, the psbK-psbI, atpI, and rps4-trnF regions showed some sequence divergence in I. piufanensis, I. glandulifera, and H. triflora. Three divergent genes, ndhF, ycf1, and ndhH, were detected in the SSC region. The rpl32-trnN region showed the highest variation among the hypervariable regions, and ycf1 was the most divergent gene. In comparison with H. triflora, the large copy of the trnl-trnN and trnA-trnL loci has been deleted in the cp genomes of I. fanjingshanica, I. guizhouensis, and I. loulanensis.
Sequence Divergence and Mutational Hotspots
We compared Pi values calculated with DnaSP 5.1 to identify divergence hotspots in the 12 Balsaminaceae species. The analysis indicated that variation in the LSC and SSC regions was much higher than in the IR regions (Figure 6). The highest nucleotide diversity (Pi) values were found for ycf1 (0.17356) and trnG-GCC (0.12911). Six mutational hotspots with remarkably higher Pi values (>0.06) were found in the LSC region (trnK-UUU-rps16, trnG-GCC, atpH-atpI, rpoB-petN, rps4-ndhJ, and accD-psaI), and three hotspots with Pi values above 0.06 were found in the SSC region (ndhF, rpl32-ccsA, and ycf1). Similarly, we determined the average pairwise sequence divergence among the newly sequenced Impatiens species. The nucleotide variability (Pi) of these 140 regions ranged from 0.0% (rrn16) to 9.3% (rps12). The rps12 gene showed the highest average sequence divergence (0.093), followed by rpl32 (0.091) and rps4-ndhJ (0.090) (Figure 6 and Supplementary Table S7). By contrast, the Pi values of the six newly sequenced species were higher than those of the 12 Balsaminaceae species. Therefore, these coding and non-coding regions may provide strong molecular evidence for resolving low-level phylogeny and phylogeography in Balsaminaceae in the future.
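Nucleotide diversity (Pi) is the average number of nucleotide differences per site between all pairs of aligned sequences. The sketch below computes it for a whole alignment and in sliding windows. It is an illustration rather than a reimplementation of DnaSP: skipping alignment columns that contain gaps or ambiguous bases is an assumption, and the default window and step sizes are hypothetical values, not settings reported in this section.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site for equal-length aligned sequences."""
    n = len(aligned)
    # Keep only columns where every sequence has an unambiguous base.
    columns = [col for col in zip(*(s.upper() for s in aligned))
               if all(base in "ACGT" for base in col)]
    if n < 2 or not columns:
        return 0.0
    pairs = list(combinations(range(n), 2))
    differences = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return differences / (len(pairs) * len(columns))


def sliding_pi(aligned, window=600, step=200):
    """Yield (start, Pi) for successive windows along the alignment."""
    length = len(aligned[0])
    for start in range(0, max(length - window, 0) + 1, step):
        yield start, nucleotide_diversity([s[start:start + window] for s in aligned])


if __name__ == "__main__":
    toy = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]  # toy alignment
    print(nucleotide_diversity(toy))
    print(list(sliding_pi(toy, window=4, step=4)))
```

Windows whose Pi exceeds a chosen cutoff, such as the 0.06 threshold used above, would be flagged as candidate hotspots.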
Contraction and Expansion of Inverted Repeats (IRs)
The genomic structure and the number and order of genes were highly conserved in the 12 Balsaminaceae species, but the contraction and expansion of the IR boundaries altered their structure and size. In the 12 Balsaminaceae species, ycf1 was found at the IRa/LSC boundary; the IRs of I. chlorosepala and I. balsamina involucrate were the longest (25,773 bp) and those of H. triflora were the shortest (25,622 bp). The LSC/IRb junctions were embedded in the rps19 gene. The length of rps19 in the LSC region varied from 0 bp to 246 bp, whereas the portion of rps19 extending into the IRb region varied from 0 bp to 200 bp. The IRb/SSC junction was located in or adjacent to ycf1 and ndhF; in all species except I. linearisepala, the junction was located within or adjoined the end of ycf1 (0 bp to 1,256 bp), whereas in I. linearisepala the distance between ycf1 and the IRb/SSC junction was 204 bp. An overlap between ndhF and ycf1 was detected in I. guizhouensis, I. linearisepala, and I. hawkeri, in which ndhF expanded into the IRb region by 18 bp, 176 bp, and 98 bp, respectively (Figure 7).
In the other species, the distance between ndhF and the IRb/SSC junction varied from 1 bp to 2,000 bp. The SSC/IRa junction was located in the pseudogene ycf1, which spanned the IRa and SSC regions. The length of pseudogene ycf1 in the SSC region varied from 4,356 bp to 4,891 bp, whereas its length in the IRa region varied from 810 bp to 1,254 bp. The IRb/SSC and SSC/IRa regions were variable. The rps19-psbA coding region lay at the LSC/IRa boundary except in I. piufanensis, I. glandulifera, and H. triflora, in which the rps19 gene was missing at the LSC/IRb junction. The length of rps19 in the LSC region varied from 0 bp to 136 bp; in contrast, the length of rps19 in the IRb region of I. guizhouensis and I. stenosepala was 31 bp and 137 bp, respectively. Among the six newly sequenced species (I. chlorosepala, I. fanjingshanica, I. guizhouensis, I. linearisepala, I. loulanensis, and I. stenosepala), the IRs of I. chlorosepala were the longest (25,773 bp) and those of I. linearisepala were the shortest (25,699 bp).
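The boundary figures reported above (how many base pairs of a gene lie on each side of a junction) follow from simple coordinate arithmetic. The sketch below illustrates it, assuming 1-based inclusive gene coordinates and defining the junction as the last position of the upstream region; the function name and the example coordinates are hypothetical and are not taken from the annotations.

```python
def split_at_junction(gene_start, gene_end, junction):
    """Return (bp upstream of the junction, bp downstream) for one gene.

    Coordinates are 1-based and inclusive; `junction` is the last base of the
    upstream region (for example, the last base of the LSC).
    """
    length = gene_end - gene_start + 1
    # Clamp so genes lying entirely on one side give (length, 0) or (0, length).
    upstream = min(max(junction - gene_start + 1, 0), length)
    return upstream, length - upstream


if __name__ == "__main__":
    # Hypothetical 279 bp gene straddling a junction at position 83,740.
    print(split_at_junction(83_500, 83_778, 83_740))
```

A result of `(length, 0)` or `(0, length)` means the gene does not cross the junction, which corresponds to the 0 bp cases reported above.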
Phylogenetic Analyses Within Balsaminaceae Species
We used ML and BI phylogenetic trees to explore the phylogenetic positions and evolutionary relationships of the Balsaminaceae species based on complete chloroplast genomes (Supplementary Table S8). The dataset comprised chloroplast genomes from eight families: four Ebenaceae species, four Styracaceae species, five Actinidiaceae species, five Theaceae species, five Primulaceae species, and 12 Balsaminaceae species, with two Saxifragaceae species and three Rosaceae species as outgroups. The 12 Balsaminaceae species included three cultivated species (I. balsamina, I. hawkeri, and I. walleriana), three published plastid genomes (I. piufanensis, I. glandulifera, and H. triflora), and six newly sequenced species (I. chlorosepala, I. fanjingshanica, I. guizhouensis, I. linearisepala, I. loulanensis, and I. stenosepala).
The topologies of the two trees (ML and BI) were highly supported, and the five selected families (Primulaceae, Actinidiaceae, Theaceae, Ebenaceae, and Styracaceae) were clustered into five monophyletic branches. The genus Hartia Dunn and Stewartia of the family Ebenaceae were clustered into a clade, while the Theoideae also included Actinidia and Rhododendron. The Saxifragaceae species were clustered into a monophyletic branch close to the outgroup Rosaceae species. In the ML tree, only three nodes (Primulaceae, Actinidiaceae, and Ebenaceae) had bootstrap values under 90%, and the remaining nodes had support values of 100% (Fig. 8A). In the BI tree, only two nodes (Primulaceae and Actinidiaceae) had support values under 90%, and the remaining nodes had support values of 100% (Fig. 8B).
All Balsaminaceae species formed a monophyletic subclade in the ML and BI trees, which was sister to the two Saxifragaceae species (Hydrangea serrata and Hydrangea heteromalla). Moreover, H. triflora clustered together with the other 11 Impatiens species, and similar clustering results were reported in previous studies. I. guizhouensis was the earliest-diverging branch and was sister to the other Impatiens species. Support values in the ML and BI trees showed that the cultivated species (I. balsamina, I. hawkeri, and I. walleriana) formed a clade with I. chlorosepala, indicating their close relationship. In addition, I. fanjingshanica and I. piufanensis were closely related, while I. glandulifera and I. loulanensis formed a clade with I. linearisepala and I. stenosepala, so that the species with the most similar morphological characteristics were clustered together. As a whole, the BI and ML phylogenetic trees clearly revealed the internal relationships among the Balsaminaceae species.