Organization and features of the chloroplast genomes
The chloroplast genomes of 15 species and one unidentified sample of Sorbus s.s. share a similar structure and organization (Table S1, Fig. 1). The cp genomes of the 16 Sorbus s.s. samples range in size from 159,646 bp in S. wilsoniana C.K.Schneid. to 160,178 bp in S. hypoglauca. All 16 cp genomes consist of a large single-copy (LSC) region of 87,612 bp (S. sargentiana Koehne) to 88,125 bp (S. hypoglauca); a small single-copy (SSC) region of 19,219 bp (Sorbus sp.) to 19,359 bp (S. tianschanica Rupr.); and a pair of inverted repeats (IRs) of 26,378 bp (S. aestivalis Koehne and nine other taxa) to 26,405 bp (S. amabilis Cheng ex T.T.Yu; Table S1). The total GC content is nearly identical across samples: 36.5% in five samples and 36.6% in the other 11 (Table S1).
All 16 cp genomes assembled here encode 113 unique genes (79 protein-coding genes, 30 tRNA genes and four rRNA genes), 19 of which are duplicated in the IRs, giving a total of 132 genes (Tables S1, S2 and Fig. 1). Eighteen genes contain one (atpF, ndhA, ndhB, petB, petD, rpl2, rpl16, rpoC1, rps12, rps16, trnA-UGC, trnG-UCC, trnI-GAU, trnK-UUU, trnL-UAA and trnV-UAC) or two (clpP and ycf3) introns, and six of these are tRNA genes (Table S2, Fig. 1). Coding regions make up 56.5 or 56.6% of each cp genome (49.1 or 49.2% protein-coding genes and 7.4% RNA genes), and non-coding regions, comprising both intergenic spacers and introns, make up the remaining 43.4 or 43.5% (Table S1).
The boundaries between the IR and LSC/SSC regions were compared among the 16 Sorbus s.s. cp genomes and eight species from other genera of Rosaceae (Fig. 2). The IRb/LSC boundary is located within the rps19 gene (the 5′ end of rps19 lies in the IRb region and the 3′ end in the LSC), so a truncated copy of the 5′ end of this gene is present as a pseudogene (rps19Ψ) in the IRa region of all cp genomes compared. The length of rps19Ψ is 116 bp in Micromeles thibetica (Cardot) Mezhenskyj (Fig. 2 C), 182 bp in Prunus persica (L.) Batsch (Fig. 2 F), and 120 bp in the other 22 species (Fig. 2 A–B, D, E). The IRa/LSC border is adjacent to rps19Ψ in all species except Micromeles thibetica (Fig. 2 C), in which it lies within rps19Ψ. The IRa/SSC boundary is located in the ycf1 gene (the 5′ end of ycf1 lies in the IRa region and the 3′ end in the SSC), so a truncated copy of the 5′ end of this gene is present as a pseudogene (ycf1Ψ) in the IRb region. The size of ycf1Ψ ranges from 1003 bp (Prunus persica; Fig. 2 F) to 1092 bp (Torminalis glaberrima (Gand.) Sennikov & Kurtto; Fig. 2 D), and is 1083 bp in all the Sorbus s.s. species (Fig. 2 A–C, E). The position of the IRb/SSC boundary varies slightly: in 21 species it is located within the region where ycf1Ψ and ndhF overlap, whereas in the other three species (Malus hupehensis (Pamp.) Rehder, Prunus persica, Pyrus pashia Buch.-Ham. ex D.Don) it is located within the ndhF gene (Fig. 2 E, F).
Codon preference analysis
According to the codon usage analysis, the protein-coding genes total 78,570–78,588 bp in the 16 Sorbus s.s. genomes and comprise 26,190–26,196 codons (Table S3). Leucine is encoded by the largest number of codons (2,753–2,757), followed by isoleucine (2,255–2,260); cysteine is the least frequent (297 or 298). The relative synonymous codon usage (RSCU) values vary only slightly among the 16 Sorbus s.s. sequences. Thirty codons are used frequently (RSCU > 1) and 32 codons are used less frequently (RSCU < 1). UUA is a preferred codon in all 16 cp genomes. AUG and UGG, the sole codons for methionine and tryptophan respectively, show no bias (RSCU = 1). Codons with A (32.1%) or U (38.2%) in the third position together account for 70.3% of all codons, so codon usage is biased towards A or U at the third codon position.
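The RSCU statistic used above can be illustrated with a minimal Python sketch. RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, which is why single-codon amino acids such as methionine and tryptophan always give RSCU = 1. The function name `rscu` and the toy input are hypothetical, codons are handled in the DNA alphabet (T in place of U), and the standard codon assignments are assumed; this is not the software used in the study.

```python
from collections import Counter
from itertools import product

# Standard codon assignments in TCAG order (assumption: no RNA editing
# or non-standard start codons are considered in this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} pooled over in-frame coding sequences."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group synonymous codons by the amino acid (or stop) they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    toy = ["ATGTTATTATTGTGG"]  # Met, Leu, Leu, Leu, Trp
    for codon, value in sorted(rscu(toy).items()):
        if value > 0:
            print(codon, round(value, 2))
```

In the toy input, leucine is encoded three times using only two of its six synonymous codons, so both used codons receive RSCU values well above 1 while the four unused ones receive 0.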
Repeated sequences analysis
The total number of SSRs in the 16 Sorbus s.s. genomes ranges from 44 to 53 (Fig. 3 A–C; Table S4). The most abundant SSRs are A or T nucleotide repeats, which account for 88.2–96.3% of the total (Table S4). The most common SSRs are mononucleotides (29–38 per genome), followed by tetranucleotides (5–7) and pentanucleotides (2–5). Every sample has four dinucleotide SSRs except S. tianschanica, which has five. Trinucleotides were found in only three species: S. filipes Hand.-Mazz., S. hypoglauca and S. rutilans McAll. There are three hexanucleotides in S. cibagouensis H.Peng & Z.J.Yin, two in S. helenae Koehne, one each in S. aestivalis, S. albopilosa T.T.Yu & L.T.Lu, S. amabilis and S. rehderiana, and none in the remaining samples. SSRs are mainly distributed in the intergenic regions (76.6–89.4%), with far fewer in the intron regions (10.6–21.3%) and exon regions (0–2.1%; Fig. 3 B). Furthermore, SSRs are found mainly in the LSC region (78.4–89.4%) and are markedly less frequent in the SSC (6.4–17.6%) and IR (3.8–8.0%) regions (Fig. 3 C).
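How SSRs of different motif lengths can be located and counted is sketched below in Python using regular expressions. The minimum repeat numbers in `MIN_REPEATS` (10, 5, 4, 3, 3 and 3 for mono- to hexanucleotides) are an assumption chosen for illustration; the thresholds and software actually used are not stated in this section. The function name `find_ssrs` is hypothetical.

```python
import re

# Assumed minimum number of repeat units per motif length (illustrative only).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, n_units) tuples; end is exclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are just a shorter unit repeated (e.g. "AA",
            # "ATAT"), so that each tract is reported at its shortest period.
            if any(unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len)):
                continue
            n_units = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), motif, n_units))
    return sorted(hits)


if __name__ == "__main__":
    toy = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CC" + "AGAT" * 3 + "G"
    for start, end, motif, n in find_ssrs(toy):
        print(f"{motif} x{n} at {start}-{end}")
```

Assigning each hit to an intergenic, intron or exon region, or to the LSC, SSC or IR, would additionally require the genome annotation coordinates, which this sketch does not handle.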
The REPuter screening detected 58 to 130 dispersed repeats of 20 bp or longer in the 16 Sorbus s.s. cp genomes examined (Fig. 3 D–E). Sorbus tianschanica has the most repeats (130) and S. sargentiana the fewest (58). The majority of the repeats (69.4–87.7%) in all cp genomes are between 20 and 25 bp long. The longest repeat is 123 bp and is found only in S. foliolosa Spach. Six taxa, S. albopilosa, S. cibagouensis, S. helenae, S. pteridophylla Hand.-Mazz., S. tianschanica and Sorbus sp., have a maximum repeat size of 44 bp. Only four taxa, S. foliolosa, S. hypoglauca, S. rehderiana and S. ursina S.Schauer, have repeats longer than 60 bp (Table S5, Fig. 3 E). Among the four repeat types, forward repeats (25–47) are the most common, followed by palindromic repeats (19–35), reverse repeats (12–38) and complement repeats (1–12; Fig. 3 D).
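The four repeat types named above differ only in how the second copy relates to the first. A minimal Python sketch of that relationship follows; it assumes exact matches between equal-length copies, whereas repeat-finding tools typically also allow a small number of mismatches. The function name `classify_repeat` is hypothetical and is not part of REPuter.

```python
# Map each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first, second):
    """Classify an exact repeat pair by the orientation of the second copy.

    Returns "forward", "reverse", "complement", "palindromic", or None
    when the two sequences are not related in any of these ways.
    """
    first, second = first.upper(), second.upper()
    if first == second:
        return "forward"            # identical copy, same strand and direction
    if first == second[::-1]:
        return "reverse"            # same strand, read backwards
    if first == second.translate(COMPLEMENT):
        return "complement"         # complemented but not reversed
    if first == second.translate(COMPLEMENT)[::-1]:
        return "palindromic"        # reverse complement (opposite strand)
    return None


if __name__ == "__main__":
    unit = "AAGCT"
    for other in ("AAGCT", "TCGAA", "TTCGA", "AGCTT", "GGGGG"):
        print(unit, other, classify_repeat(unit, other))
```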
Comparative analysis of chloroplast genomes
Comparative cp genome analysis reveals that noncoding regions are generally more divergent than coding regions, and that the LSC and SSC regions are more divergent than the IR regions (Fig. 4). The highest levels of divergence are found in 17 intergenic regions: 15 in the LSC region, namely trnH-psbA, trnK-rps16, trnG-trnR, trnR-atpA, atpF-atpH, atpH-atpI, trnC-petN, petN-psbM, trnT-psbD, psbZ-trnG, trnT-trnL, ndhC-trnV, trnM-atpE, accD-psaI and rps8-rpl14; and two in the SSC region, namely ndhF-rpl32 and rpl32-trnL. In addition, the introns of clpP and rpl16 also show high sequence variation.
To quantify diversity at the sequence level, nucleotide diversity (Pi) values were calculated. The Pi values range from 0 to 0.00975, with a mean of 0.00098 (Fig. 5, Table S6). The SSC region shows the highest nucleotide diversity (Pi = 0.00173), while the lowest is in the IR boundary regions (Pi = 0.00016). Six hypervariable regions with Pi between 0.005 and 0.01 were identified: trnR-atpA (Pi = 0.00975), petN-psbM (Pi = 0.00932), rpl32-trnL (Pi = 0.00753), trnT-trnL (Pi = 0.00642), trnH-psbA (Pi = 0.00636) and ndhC-trnV (Pi = 0.00616).
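For reference, Pi is the average number of nucleotide differences per site between all pairs of aligned sequences. The Python sketch below computes it for a whole alignment and in sliding windows. The window length and step size (600 bp and 200 bp) are assumptions for illustration, since the values used in the analysis are not given in this section; the function names are hypothetical, and columns containing gaps or ambiguous bases are simply excluded.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Mean pairwise differences per site over gap-free alignment columns."""
    seqs = [s.upper() for s in aligned]
    n = len(seqs)
    if n < 2:
        return 0.0
    # Keep only columns where every sequence has an unambiguous base.
    valid = [i for i in range(len(seqs[0]))
             if all(s[i] in "ACGT" for s in seqs)]
    if not valid:
        return 0.0
    diffs = sum(sum(a[i] != b[i] for i in valid)
                for a, b in combinations(seqs, 2))
    n_pairs = n * (n - 1) / 2
    return diffs / (n_pairs * len(valid))


def sliding_pi(aligned, window=600, step=200):
    """Return (window_start, Pi) pairs along the alignment.

    window and step are assumed defaults, not values taken from the study.
    """
    length = len(aligned[0])
    last_start = max(length - window, 0)
    return [(start,
             nucleotide_diversity([s[start:start + window] for s in aligned]))
            for start in range(0, last_start + 1, step)]


if __name__ == "__main__":
    toy = ["ACGTACGT", "ACGTACGA", "ACGTACGA"]
    print(nucleotide_diversity(toy))
    print(sliding_pi(toy, window=4, step=4))
```

In the toy alignment, two of the three sequence pairs differ at one of eight sites, giving Pi = 2 / (3 × 8); the variable site falls entirely in the second window, which therefore has a higher Pi than the first.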
Phylogenetic analysis
The ML and BI analyses of cp genomes result in highly congruent topologies. There are only slight differences in support values among the phylogenetic trees. Therefore, only the ML topology is shown here with the ML/BI support values added at each node (Fig. 6).
Our analyses confirm that Sorbus s.l. is polyphyletic and that six segregate genera, i.e., Aria, Chamaemespilus, Cormus, Micromeles, Sorbus s.s. and Torminalis, are each monophyletic. Aria, Chamaemespilus and Torminalis are resolved in one branch near the base of the cp genome phylogeny, together with Malus trilobata C.K.Schneid., Aronia arbutifolia (L.) Pers. and Cydonia oblonga Mill. Micromeles is sister to Sorbus s.s., and the two are nested in one branch with Cormus and Pyrus L.
Within the monophyletic genus Sorbus s.s., two major clades are resolved. Clade I comprises two fully supported subclades (A and B). Subclade A is consistent with subg. Albocarmesinae McAll. Nevertheless, three sections within subg. Albocarmesinae, namely sect. Hypoglaucae McAll., sect. Insignes (T.T.Yu) McAll. and sect. Multijugae (T.T.Yu) McAll., are not monophyletic. Subclade B contains two samples of S. tianschanica, which belongs to subg. Sorbus sect. Tianshanicae (Kom. ex T.T.Yu) McAll.; however, it is resolved in a branch with subg. Albocarmesinae with full support. Clade Ⅱ contains two fully supported subclades (C and D) and is sister to the rest of the genus. Subclade C includes five taxa belonging to three different sections: S. aucuparia (sect. Sorbus McAll.) and S. hupehensis var. paucijuga (sect. Discolores (T.T.Yu) McAll.) are nested within sect. Commixtae McAll. Of these sections, sect. Sorbus and sect. Commixtae were classified in subg. Sorbus, whereas sect. Discolores was placed in subg. Albocarmesinae. Subclade D contains two species of subg. Sorbus sect. Wilsonianae McAll.