Structural characteristics of the Santalum chloroplast genome
The complete chloroplast genomes of the Santalum species were assembled into circular molecules with the typical quadripartite structure (Fig. 1 and Supplemental Table S1). The genomes ranged in length from 143,291 bp (S. acuminatum) to 144,263 bp (S. boninense) (Table 1), with a large single-copy (LSC) region of 82,944 bp (S. acuminatum) to 83,942 bp (S. paniculatum), inverted repeats (IRs) of 24,477 bp (S. paniculatum) to 24,511 bp (S. album), and a small single-copy (SSC) region of 11,237 bp (S. leptocladum) to 11,379 bp (S. acuminatum). The overall GC content was 38.0%.
Table 1
Characteristics of newly sequenced Santalum chloroplast genomes.
 | Nucleotide length (bp) | | | | Number of genes | | | | |
Species | Total | LSC | IR | SSC | Protein | tRNA | rRNA | Total | GenBank accession number |
S. acuminatum | 143291 | 82944 | 24484 | 11379 | 67 | 30 | 4 | 101 | MW464925 |
S. album | 144034 | 83793 | 24488 | 11265 | 67 | 30 | 4 | 101 | MW464915 |
S. album | 144101 | 83802 | 24511 | 11277 | 67 | 30 | 4 | 101 | MW464922 |
S. boninense | 144263 | 83912 | 24501 | 11349 | 67 | 30 | 4 | 101 | MW464916 |
S. ellipticum | 144250 | 83911 | 24495 | 11349 | 67 | 30 | 4 | 101 | MW464917 |
S. ellipticum var. littorale | 144255 | 83911 | 24498 | 11348 | 67 | 30 | 4 | 101 | MW464920 |
S. freycinetianum var. pyrularium | 143895 | 83582 | 24481 | 11351 | 67 | 30 | 4 | 101 | MW464921 |
S. leptocladum | 143801 | 83576 | 24494 | 11237 | 67 | 30 | 4 | 101 | MW464918 |
S. sp. | 143923 | 83603 | 24489 | 11342 | 67 | 30 | 4 | 101 | MW464919 |
S. paniculatum | 144239 | 83942 | 24477 | 11343 | 67 | 30 | 4 | 101 | MW464914 |
S. spicatum | 143638 | 83314 | 24495 | 11334 | 67 | 30 | 4 | 101 | MW464924 |
S. yasi | 144019 | 83736 | 24497 | 11289 | 67 | 30 | 4 | 101 | MW464923 |
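Because a quadripartite plastome consists of one LSC, one SSC, and two IR copies, the total length in Table 1 should equal LSC + 2 × IR + SSC. The following is a minimal sketch of that consistency check, using the values from Table 1; the function and variable names are illustrative and not part of the original analysis.

```python
# Region lengths (bp) copied from Table 1, keyed by GenBank accession:
# (total, LSC, IR, SSC).
TABLE1 = {
    "MW464925": (143291, 82944, 24484, 11379),  # S. acuminatum
    "MW464915": (144034, 83793, 24488, 11265),  # S. album
    "MW464922": (144101, 83802, 24511, 11277),  # S. album
    "MW464916": (144263, 83912, 24501, 11349),  # S. boninense
    "MW464917": (144250, 83911, 24495, 11349),  # S. ellipticum
    "MW464920": (144255, 83911, 24498, 11348),  # S. ellipticum var. littorale
    "MW464921": (143895, 83582, 24481, 11351),  # S. freycinetianum var. pyrularium
    "MW464918": (143801, 83576, 24494, 11237),  # S. leptocladum
    "MW464919": (143923, 83603, 24489, 11342),  # S. sp.
    "MW464914": (144239, 83942, 24477, 11343),  # S. paniculatum
    "MW464924": (143638, 83314, 24495, 11334),  # S. spicatum
    "MW464923": (144019, 83736, 24497, 11289),  # S. yasi
}


def quadripartite_total(lsc, ir, ssc):
    """Length of a circular plastome with one LSC, one SSC, and two IR copies."""
    return lsc + 2 * ir + ssc


def check_table(rows):
    """Return accessions whose stated total differs from LSC + 2*IR + SSC."""
    return [
        accession
        for accession, (total, lsc, ir, ssc) in rows.items()
        if quadripartite_total(lsc, ir, ssc) != total
    ]


mismatches = check_table(TABLE1)
print(mismatches)  # an empty list means every row is internally consistent
```

All twelve rows of Table 1 satisfy this identity.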
The Santalum chloroplast genomes had 72 protein-coding genes, 35 tRNA genes, eight rRNA genes, and nine pseudogenes. The entire set of ndh genes and the infA gene had lost their function: ndhA showed complete loss of function, whereas the other ndh genes and infA were pseudogenized. Sixteen genes contained introns, two of which (ycf3 and clpP) contained two introns each.
Comparative analyses of the chloroplast genome
Between 17 and 31 SSRs were found in each Santalum chloroplast genome. Mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs were all discovered (Fig. 2 and Supplemental Table S2). The majority of SSRs were mononucleotide repeats in all Santalum species, followed by tetranucleotide repeats. Tri- and tetranucleotide repeats were not found, and dinucleotide repeats were limited to one occurrence in S. acuminatum and S. album. Most of the mononucleotide repeats were composed of A/T, with very few G/C repeats. The LSC region contained the largest number of SSRs (184), with 39 identified in the SSC region and 66 in the IR region.
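To illustrate how SSRs of different motif lengths can be located, the sketch below scans a sequence with regular expressions. The minimum repeat numbers in `MIN_REPEATS` are assumptions chosen for illustration only; the thresholds and software actually used for the Santalum genomes are not stated in this section.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of repeat units.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if the motif is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip e.g. "AA" runs, which are already reported as "A" repeats.
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Toy sequence: a 12-bp poly-A run and a six-unit AT repeat.
toy = "GC" + "A" * 12 + "GGC" + "AT" * 6 + "CCG"
print(find_ssrs(toy))
```

This sketch reports only perfect repeats and does not merge compound SSRs.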
The chloroplast genomes were plotted with mVISTA, using S. leptocladum as the reference. The results revealed collinearity, no rearrangements, and high sequence similarity across the chloroplast genomes (Fig. 3). There were 2,352 variable sites in the 145,671-bp Santalum chloroplast genome alignment (Table 3). The overall nucleotide diversity (π) was 0.00366. The SSC region exhibited the highest π value (0.00926), compared with the LSC (0.00457) and IR (0.00087) regions. The genetic p-distances and numbers of nucleotide substitutions among the ten Santalum species are given in Supplemental Table S3. The mean genetic distance was 0.00401; the lowest divergence (p-distance: 0.0003) was between S. ellipticum and S. ellipticum var. littorale, and the highest (p-distance: 0.00828) was between S. spicatum and S. yasi.
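The p-distance reported above is the proportion of aligned sites at which two sequences differ. A minimal sketch follows, assuming pairwise deletion of gapped or ambiguous columns (the gap treatment used in the original analysis is not stated); the toy sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among columns where both bases are A/C/G/T."""
    valid = set("ACGT")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_p_distance(alignment):
    """Mean p-distance over all unordered pairs of aligned sequences."""
    pairs = list(combinations(alignment.values(), 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical sequences, not Santalum data).
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACG-ACGTTT"}
print(p_distance(toy["s1"], toy["s2"]))
print(mean_p_distance(toy))
```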
Table 3
Sequence divergence of Santalum chloroplast genomes.
Regions | Alignment length (bp) | Number of variable sites | | | Nucleotide polymorphism | |
 | | Polymorphic | Singleton | Parsimony informative | Nucleotide diversity | Haplotypes |
LSC | 84,949 | 1,704 | 1,149 | 549 | 0.00457 | 12 |
SSC | 11,527 | 455 | 298 | 157 | 0.00926 | 12 |
IR | 24,617 | 96 | 68 | 28 | 0.00087 | 11 |
Whole plastomes | 145,671 | 2,352 | 1,582 | 764 | 0.00366 | 12 |
To identify mutation hotspots in the chloroplast genome, nucleotide diversity values were plotted across the alignment (Fig. 4). Within an 800-bp sliding window, the number of single-nucleotide substitutions ranged from 0 to 46 and the π value ranged from 0 to 0.01485. We defined mutation hotspots as windows with π > 0.012. Three regions met this threshold (ccsA–trnL, ΨndhE–ΨndhG-rps15, and ycf1), all of which were located within the SSC region. Among these, ccsA–trnL had the highest nucleotide diversity.
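The sliding-window scan can be sketched as follows. The 800-bp window and the 0.012 threshold come from the text; the 200-bp step size and the per-column handling of gaps are assumptions made for this illustration, and the function names are hypothetical.

```python
from collections import Counter


def site_pi(column):
    """Probability that two sequences drawn without replacement differ at a site."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    n = sum(counts.values())
    if n < 2:
        return 0.0
    identical_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - identical_pairs / (n * (n - 1))


def sliding_pi(seqs, window=800, step=200):
    """Return (window_start, mean per-site pi) for each window of the alignment."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    per_site = [site_pi("".join(s[i] for s in seqs)) for i in range(length)]
    return [
        (start, sum(per_site[start:start + window]) / window)
        for start in range(0, length - window + 1, step)
    ]


def hotspots(windows, threshold=0.012):
    """Keep windows whose pi exceeds the hotspot threshold."""
    return [(start, pi) for start, pi in windows if pi > threshold]


# Toy alignment with variation confined to the last two columns.
toy = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"]
windows = sliding_pi(toy, window=4, step=4)
print(hotspots(windows))
```

With real data, `seqs` would hold the aligned plastome sequences and the default window of 800 bp would apply.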
The loci most commonly used in plant phylogeny and DNA barcoding (e.g., rbcL, matK, trnH-psbA) were not among the hotspots identified in our study. We therefore compared the sequence divergence of the highly variable regions with that of the three conventional candidate chloroplast DNA barcodes (matK, rbcL, and trnH-psbA). Sequence variation values, including genetic distance, nucleotide diversity, and the number of variable sites, are given in Supplemental Table S4. The three newly identified markers had higher genetic divergence and more informative sites than the three conventional markers. The primers designed for the three variable markers are given in Supplemental Table S5.
Microstructural mutations
Among the chloroplast genomes of the Santalum species, there were 460 indels in total, including 269 normal indels, 104 repeat-related indels, and 87 SSR-related indels. Most of the indels (355; 77.17%) were in spacer regions, 57 (12.39%) were in intron regions, 26 were in pseudogene regions, and 22 were in exon regions. All SSR-related indels were located in non-coding regions. Normal indels ranged from 1 to 331 bp in length (Fig. 5), and 1-bp indels were the most common type (37.92%). The longest normal indel occurred in the ycf4-cemA region and was a deletion in S. spicatum. Repeat-related indels ranged from 2 to 28 bp; the longest was located in atpH-atpI and was an insertion in S. boninense, S. paniculatum, and S. ellipticum var. littorale. Most of the repeat-related indels (71.15%) were 4 to 6 bp long. A total of 109 regions contained indels: ycf3-trnS had the most (17 indels), followed by trnL-rpl32 (15), rps16-trnQ (14), atpH-atpI (13), petA-psbJ (12), and matK-rps16 (12). Among the coding regions, the ycf1 gene had the most indels (9).
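The indel proportions above follow directly from the reported counts. A short sketch of the arithmetic is given below, using only the counts stated in the text; the dictionary and function names are illustrative.

```python
# Indel counts as reported in the text.
INDEL_TYPES = {"normal": 269, "repeat-related": 104, "SSR-related": 87}
INDEL_LOCATIONS = {"spacer": 355, "intron": 57, "pseudogene": 26, "exon": 22}


def shares(counts):
    """Percentage of the total for each category, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Both breakdowns should account for the same 460 indels.
print(sum(INDEL_TYPES.values()), sum(INDEL_LOCATIONS.values()))
print(shares(INDEL_LOCATIONS))
```

The same calculation gives 5.65% for pseudogene regions and 4.78% for exon regions, which the text reports only as counts.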
Fourteen small inversions were identified in the Santalum chloroplast genomes. All of the inversions, together with their inverted-repeat flanking sequences, formed stem-loop structures. The inversions were 2 to 33 bp long, and the flanking repeats were 7 to 25 bp long. There was no correlation between the length of an inversion and the length of its flanking repeats. Seven inversions occurred in the LSC region, four in the SSC region, and three in the IR regions. All the inversions were located in non-coding regions. Five inversions (in ndhB, rpl33-rps18, rps15-ycf1, trnH-psbA, and trnM-atpE) were specific to S. acuminatum. The inversion in ndhD-psaC occurred in S. spicatum, while the inversions in trnL-rpl32 and petN-psbM were specific to S. album. One S. album sample also had inversions at ycf2-trnL and psaJ-rpl33.
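A small inversion of this kind can be recognized when two aligned sequences share flanks that are reverse complements of each other (the stem) while the enclosed segment (the loop) is reverse-complemented in one sequence. The sketch below tests for that pattern on a toy example; the function names and sequences are hypothetical and do not reproduce the detection procedure used in the study.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_small_inversion(ref, alt, flank_len):
    """True if alt is ref with its loop reverse-complemented between stem-forming flanks."""
    ref, alt = ref.upper(), alt.upper()
    if flank_len < 1 or len(ref) != len(alt) or len(ref) <= 2 * flank_len:
        return False
    left, loop, right = ref[:flank_len], ref[flank_len:-flank_len], ref[-flank_len:]
    stem_forms = right == revcomp(left)  # flanks can pair with each other
    flanks_shared = alt[:flank_len] == left and alt[-flank_len:] == right
    # A palindromic loop would look identical after inversion, so exclude it.
    loop_inverted = alt[flank_len:-flank_len] == revcomp(loop) != loop
    return stem_forms and flanks_shared and loop_inverted


# Toy example: 8-bp flanks around a 4-bp loop.
left_flank = "GATTACAG"
loop = "AACG"
ref = left_flank + loop + revcomp(left_flank)
alt = left_flank + revcomp(loop) + revcomp(left_flank)
print(is_small_inversion(ref, alt, flank_len=8))
```

In this configuration the inverted form equals the reverse complement of the whole stem-loop segment, so the two orientations differ only in the loop.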
Phylogenetic inference
The 13CPG dataset matrix included 150,415 nucleotide sites, of which 6,259 were variable sites. The second data matrix, 70g50s, contained 66 protein-coding genes and four rRNA genes from 50 Santalales species. After excluding ambiguous regions and sites, this dataset contained 56,789 nucleotide sites, of which 13,458 (23.70%) were parsimony-informative sites.
The optimal partitioning scheme for the 70g50s dataset, identified under the corrected Akaike Information Criterion (AICc) using strict hierarchical clustering analysis in PartitionFinder (lnL = -263607.113888; AICc = 528592.787194), contained 57 partitions (Supplemental Table S6). ML analyses under the unpartitioned and the three partitioned schemes produced identical topologies (Fig. 6 and Supplemental Figures S1–S3). The ML trees inferred from the 13CPG and 70g50s datasets recovered similar phylogenetic relationships among Santalum species (Fig. 6).
We inferred the phylogeny of Santalales from the 70g50s dataset. In the ML tree, the families were generally well resolved and each was supported as monophyletic. Erythropalum scandens (Erythropalaceae) was selected as the outgroup, following the results of Chen et al. and Guo et al. [19, 20]. Ximeniaceae was supported as an early-diverging lineage. Loranthaceae and Schoepfiaceae formed a clade (BS = 100/PP = 1). Opiliaceae followed by Cervantesiaceae were successive sisters to a clade comprising the remaining Santalales. Santalaceae was sister to Viscaceae plus Amphorogynaceae (BS = 100/PP = 1).
All Santalum species formed a monophyletic clade (BS = 100/PP = 1) that was sister to Osyris wightiana within Santalaceae. Santalum had short branches on the phylogenetic tree, indicating low divergence among Santalum species. S. spicatum was the first diverging branch. S. acuminatum was sister to the remaining species, which formed two lineages. The first lineage included three species (S. leptocladum, S. freycinetianum var. pyrularium, and S. sp.), and the second lineage included three branches. The two samples of S. album were sister to the remaining species, and the relationships among the three branches were not clear (70g50s: BS = 48/PP = 0.53; 13CPG: BS = 49/PP = 0.72).
The estimated divergence time
Bayesian relaxed molecular clock analyses suggested that the crown age of Santalales was 113.91 Mya (Fig. 7). The split between Santalaceae and its closest relatives, Viscaceae and Amphorogynaceae, occurred 81.07 Mya (95% HPD: 71.71–96.27 Mya). The mean crown ages of Santalaceae, Viscaceae, and Amphorogynaceae were 38.44, 47.87, and 6.18 Mya, respectively. The crown age of Santalum was 8.46 Mya (95% HPD: 3.8–14.06 Mya), in the late Miocene. The first divergence occurred around 6.97 Mya (95% HPD: 3.03–12 Mya), followed by independent branch-splitting events within the two lineages at 3.02 Mya (95% HPD: 1.41–4.95 Mya). Diversification within the two lineages occurred over a short period of approximately 1 million years.