Characteristics of plastomes
Newly sequenced plastomes of the five species had a total length of 152,435 (S. albifolia) – 173,114 bp (S. diamantica) (Fig. 1 and Table 1). Because some parts of mitochondrial DNA sequences including two tRNAs were inserted in the small single copy (SSC) region (ndhF-pseudo ycf1) of plastome of S. diamantica, the total length of S. diamantica was longer than that of other Saussurea species by 20,550 bp (Fig. 1b and Table 1). BLAST searches were performed to determine the characteristics of the insertion. The result demonstrated that the inserted sequences were highly matched to Chrysanthemum, Diplostephium, Lactuca, Helianthus, and Paraprenanthes mitochondrial DNA sequences, ranging from 5,159 bp (Paraprenanthes diversifolia, MN661146) to 7,090 bp (Diplostephium hartwegii, KX063855), but this does not mean the continuous consistency of the whole 20,550 bp on the mitochondrial genome. The transfer of mitochondrial DNA sequences to plastoms has been reported in the families Apiaceae, Apocynaceae, and Poaceae [23–25]. The previous studies indicated that genes of less than 3 kb of mitochondrial DNA are inserted into the IR or LSC regions. Given that a plastome is highly conserved, the large insertion of mitochondrial DNA sequences is an unusual event. Thus, confirmation is needed that the insertion was not merely a product of assembling error. By comparing sequencing depth of before and after the insertion, Ma et al. [25] confirmed that it is not a product of misassembly. Because the plastome occupies the smallest portion of the genomic DNA, it can be easily distinguished from mitochondrial and nuclear DNA sequences by sequencing depth. Ma et al. [25] also inferred that the nuclear and mitochondrial genomes are larger than the plastome and would have a lower sequencing depth. The sequencing coverage of S. diamantica was evaluated instead of the sequencing depth suggested by Ma et al. [25]. The average sequencing coverage was 322.3 and that of the inserted regions was 356.8. The average sequencing coverage of the surrounding regions, which was 349.1 and 346.9 before and after the insertion with 200 bp, is similar to that of the inserted region. These results indicated that misassembly is not a cause of mitochondrial DNA insertion into the plastome of S. diamantica. In addition, PCR amplification and Sanger sequencing were conducted to confirm the insertion using designed two primer sets (SD1f: GTAGGGGGTGGGCGTATTTC, SD1r: GATGTCGAGTGCCGCTTTTC and SD2f: AGGGTGATGCTTGGCTTCT, SD2r: TTTTCGTGGTTAGAGCGGCT), and amplifications were successful but not for other species (data not shown). It also supported that there was no error in the assembly process.
Seven Saussurea species have a typical quadripartite structure that comprised a pair of inverted repeats (IR: IRA and IRB) of 25,185–25,893 bp, which were separated by SSC of 18,671–37,846 bp and large single copy (LSC) of 83,387–83,482 bp. The largest insertion (62 bp) was identified in the trnE–rpoB intergenic region of LSC of S. grandicapitula, S. polylepis, and S. seoulensis. Other than S. diamantica, length of IR, SSC, and LSC of each species was similar. The overall GC content was 37.7% and the GC content in the IR, SSC, and LSC was respectively 43.1%, 31.4%, and 35.8%, except for S. diamantica (overall GC: 38.6%, IR: 42.8%, SSC: 39.1%, LSC: 35.8%) (Table 1).
Seven Saussurea plastomes possessed the two inversions in LSC region like other Asteraceae. The seven plastomes depicted high similarity at the LSC, IR, and SSC boundaries (Fig. S1). Rps19 was across the LSC–IRB boundary without any change in sequence length, and trnH–GUG was located three base pairs away from the LSC–IRA boundary in all species. The ycf1 gene crossed SSC–IRB, with 4,022–4,740 bp within the SSC region and 561–1,240 bp within the IRB region. Ycf1 was a pseudogene, located at the SSC–IRA boundary, with 6–752 bp within the SSC region and with 561–1,240 bp within the IRA. In particular, S. diamantica had longer ycf1 (pseudogene) than others.
The seven plastomes contained 114 identical genes, including 80 protein-coding genes, 30 tRNA, and four rRNA genes. Eighteen genes including 12 protein–coding genes (atpF, clpP, ndhA, ndhB, petB, petD, rpl2, rpl16, rpoC1, rps12, rps16, and ycf3) and 6 tRNA genes (trnA–UGC, trnG–UCC, trnI–GAU, trnK–UUU, trnL–UAA, and trnV–UAC) had intron, and 17 genes (ndhB, rrn4.5, rrn5, rrn16, rrn23, trnA–UGC, trnI–CAU, trnI–GAU, trnL–CAA, trnN–GUU, trnR–ACG, trnV–GAC, rpl2, rpl23, rps7, ycf2, and ycf15) were duplicated in IR regions (Table S1). However, S. diamantica had two additional tRNA genes (trnC–GCA and trnM–CAU) from the mitochondrial genome in the SSC region. The formation of tertiary structure of two tRNAs was confirmed by simulation through the tRNAscan-SE 2.0.
Table 1
Summary of seven Saussurea plastid genomes
|
S. albifolia
|
S. calcicola
|
S. chabyoungsanica
|
S. diamantica
|
S. grandicapitula
|
S. polylepis
|
S. seoulensis
|
Reference
|
This study
|
This study
|
Cheon et al. [18]
|
This study
|
This study
|
Yun et al. [19]
|
This study
|
Total length (bp)
|
152,435
|
152,453
|
152,446
|
173,114
|
152,484
|
152,488
|
152,578
|
LSC length (bp)
|
83,387
|
83,399
|
83,397
|
83,482
|
83,431
|
83,417
|
83,441
|
SSC length (bp)
|
18,678
|
18,672
|
18,679
|
37,846
|
18,671
|
18,689
|
18,727
|
IRa length (bp)
|
25,185
|
25,191
|
25,185
|
25,893
|
25,191
|
25,191
|
25,205
|
Total GC content (%)
|
37.7
|
37.7
|
37.7
|
38.6
|
37.7
|
37.7
|
37.7
|
GC content of LSC
|
35.8
|
35.8
|
35.8
|
35.8
|
35.8
|
35.8
|
35.8
|
GC content of SSC
|
31.4
|
31.4
|
31.4
|
39.1
|
31.4
|
31.4
|
31.4
|
GC content of IR
|
43.1
|
43.1
|
43.1
|
42.8
|
43.1
|
43.1
|
43.1
|
Protein-coding gene number
|
80
|
80
|
80
|
80
|
80
|
80
|
80
|
rRNA gene number
|
4
|
4
|
4
|
4
|
4
|
4
|
4
|
tRNA gene number
|
30
|
30
|
30
|
32*
|
30
|
30
|
30
|
Total gene number
|
114
|
114
|
114
|
116
|
114
|
114
|
114
|
Identification of variable regions
SNP patterns that can be divided into 2 groups were found in 42 regions (Table S2). Of them, 34 regions were in LSC, followed by SSC and IR. Based on S. involucrata as a reference, there were no large differences among the seven Korean endemics. The LSC and SSC regions were more divergent than IR regions, and the coding regions were more conserved than the non-coding regions (Fig. S2). However, coding regions such as rbcL in the LSC region, ycf1 in the SSC region, and ycf2 in the IR regions showed variability.
The nucleotide diversity (Pi) ranged from 0 to 0.0053 (Fig. 2). The IR region had a relatively low nucleotide diversity value, ranging from 0 to 0.00286. We detected nine divergence hotspots with Pi values over 0.0033. Among them, seven were located in the LSC region, and two were located in the SSC region. Other than ycf1, the variable regions were concentrated in intergenic spaces. The hotspot with the highest Pi was ycf4–cemA (0.0051), followed by seven intergenic regions (psbC–trnS, rbcL–accD–psaI, rpl32–ndhF, trnT–trnD, psbE–petL, rps4–trnT–trnL, and rpl16–trnQ–psbK) and one gene region (ycf1).
Simple sequence repeats (SSRs) analysis
Five categories (mononucleotide, dinucleotide, trinucleotide, pentanucleotide, and hexanucleotide) of SSRs were detected, and the types and numbers of SSRs were similar across the seven Saussurea (Fig. 3). The total number of SSRs was 72 in S. albifolia, 75 in S. calcicola, 74 in S. chabyoungsanica, 76 in S. diamantica, 77 in S. grandicapitula, 78 in S. polylepis, and 82 in S. seoulensis. The detected SSRs were mainly located in the LSC region (67.1–77%) and distributed in the IR and SSC regions ranging from 9.8–11.1% and from 12.2–22.4%, respectively. Twenty-three of the SSRs detected from the seven Saussurea were located in 15 genes (cemA, ndhB, petA, psaA, psbC, rbcL, rpoA, rpoB, rpoC1, rpoC2, rps15, rrn23, trnS–UGA, ycf1, and ycf2) with 3–10 repeat numbers (Table S3). The most abundant type was mononucleotides A/T and species–specific SSR was identified from S. seoulensis as hexanucleotides TACAAA/TTTGTA.
Synonymous and non-synonymous substitution rate analysis
The non-synonymous (Ka) to synonymous (Ks) substitution rate ratio (Ka/Ks) has been used to determine whether protein-coding genes are subjected to selective pressure. If Ka/Ks is greater than 1, it could indicate that it is under positive selection [26]. We calculated synonymous and nonsynonymous substitution rates between S. involucrata and Korean endemic Saussurea (Fig. S3). Approximately 90% of protein coding genes were below 1 in Ka/Ks values. In seven Korean species, the Ka/Ks value was close to zero at 12 protein coding genes (clpP, ndhB, ndhH, petD, psaC, psbA, psbB, psbC, psbD, rpoB, rps11, and rps15) while that of five protein coding genes (ndhI, psaJ, psbL, rpl33, and ycf2) were 50, indicating positive selection influenced the differentiation of Saussurea.
Codon usage analysis
We detected similar patterns in the frequency of codon usage of seven Korean endemics. The 80 annotated protein-coding genes are encoded by 22,739 codons in S. albifolia, 22,831 in S. calcicola and S. polylepis, 22,826 in S. chabyoungsanica, 22,821 in S. diamantica, and 22,835 in S. grandicapitula, and 22,834 in S. seoulensis (Table S4). Leucine was the most abundant amino acid (10.6%), whereas cysteine was the least (1.1%). The most used synonymous codon was ATT, encoding isoleucine, and the least used was TGC, encoding cysteine. Usage of the start codon methionine (ATG) and tryptophan (TGG) had no biases (relative synonymous codon usage, RSCU = 1). All preferred relative synonymous codons (RSCU > 1) ended with an A or a T, other than TTG (leucine) (Fig. 4). The tendency for codon preference was similar among species. Of 61 codons (except for stop codon), 14 (Ala–GCT, Arg–AGA, Asn–AAT, Asp–GAT, Gln–CAA, Gly–GGA, His–CAT, Leu–TTA, Lys–AAA, Pro–CCT, Ser–TCT, Thr–ACT, Tyr–TAT, and Val–GTA) were highly preferred (RSCU > 1.5).
Phylogenetic analysis and molecular age estimation
In this study, 32 plastomes were used to determine the phylogenetic relationships among Korean endemic Saussurea. As a result, the higher resolution phylogenetic tree showed that Saussurea based on the current sampling is not monophyletic (Fig. 5). Of the seven endemic species, S. diamantica diverged first. The morphologically similar S. albifolia and S. seoulensis did not form a sister relationship; S. albifolia formed a sister with the group including S. odontolepis, S. bullockii, S. tianshuiensis, and S. chabyoungsanica. Limestone endemic, S. calcicola, shared its common ancestor with the group consisting of S. brachycephala, S. amurensis, S. polylepis, S. grandicapitula, S. seoulensis. S. kuschakewiczii, S. leucophylla, S. tomentosa, S. komaroviana, and S. subtriangulata. However, relatively low bootstrap values hindered us to determine precisely their phylogenetic relationships. Also, S. chabyoungsanica, which is narrow limestone endemic to central Korea, is sister to S. tianshuiensis, which occurs narrowly in high montane meadows (1800–2500 m) in three provinces of northwestern China (i.e., SE Gansu, Shaanxi, and Ningxia).
The molecular age estimation suggested that endemic Korean Saussurea originated in the late Miocene (Tortonian), with the estimated crown age of approximately 9 million years ago (95% HPD, 3.03–18.8 million years ago, MYA) (Fig. 6). The clade containing all but S. diamantica, which has unusual mitochondrial DNA sequences insertion, was estimated to be 6.18 MYA (95% HPD, 2.14–13.28 MYA). Two major lineages of the Korean endemics, i.e., S. albifolia–S. chabyoungsanica and S. calcicola–S. seoulensis–S. polylepis–S. grandicapitula, appear to be speciated even more recently, during the Pleistocene.