Chloroplast phylogenomics and the dynamic evolution of Santalum (Santalaceae)


 Background

 Santalum (Santalaceae, sandalwood) is a hemiparasitic genus including approximately 15 extant species. It is known for its aromatic heartwood oil, which is used in incense and perfume. Demand for sandalwood-based products has led to drastic over-harvesting, and wild Santalum populations are now threatened. Knowledge of the phylogenetic relationships and genetic diversity will be critical for the conservation and proper management of this genus. Here, we sequenced the chloroplast genome of 11 Santalum species. The data were then used to investigate the chloroplast genome evolutionary dynamics and relationships and divergence time within Santalum and related species.
Results

The Santalum chloroplast genome contains the typical quadripartite structures, ranging from 143,291 to 144,263 bp. The chloroplast genome contains 124 genes. The whole set of ndh genes and the infA gene were found to lose their function. Between 17 and 31 SSRs were found in the Santalum chloroplast genome, and mononucleotide simple sequence repeats (SSRs) were the major type. The P-distance among the Santalum species was 0.0003 to 0.00828. Three mutation hotspot regions, 14 small inversions, and 460 indels events were discovered in the Santalum chloroplast genome. Our phylogenomic assessment provides improved resolution compared to past analyses. Our divergence time analysis shows that the crown age of Santalum was 8.46 Mya, the first divergence occurred around 6.97 Mya, and diversification was complete within approximately 1 Mya.
Conclusions

By sequencing the 12 chloroplast genomes of Santalum, we gain insight into the evolution of its chloroplast genomes. The chloroplast genome sequences had sufficient polymorphic information to elucidate the evolutionary history of Santalum.

Chloroplast genomes are usually inherited uniparentally, without recombination and, thus, effective expansion of the genetic information. Phylogenetic analyses based on whole chloroplast genome sequences have been widely used at different taxonomic levels [14][15][16]. Chloroplast genome data also provide effective genetic markers to resolve complex evolutionary histories [17,18].
Also, a comparison of chloroplast sequences can help understand the evolutionary patterns of plants.
To better resolve the relationships within Santalum and to gain insight into the pattern of Santalum chloroplast genome evolution, we sequenced the chloroplast genome of 11 species of Santalum. Speci cally, we attempted to (1) investigate the relationships within Santalum, (2) estimate the divergence time of Santalum, and (3) elucidate chloroplast genome evolution within Santalum.

Results
Structural characteristics of the Santalum chloroplast genome The complete chloroplast genomes of Santalum species were assembled into circular molecules and contained the typical quadripartite structures ( Fig. 1 and Supplemental Table S1). The Santalum chloroplast genomes ranged from 143,291 bp (S. acuminatum) to 144,263 bp (S. boninense) in length (Table 1), with LSCs (large single copies) of 82,944 bp (S. acuminatum) to 83,942 bp (S. paniculatum), IRs (inverted repeat) of 24,477 bp (S. paniculatum) to 24,511 bp (S. album), and SSCs (small single copy) of 11,237 (S. leptocladum) to 11,379 bp (S. acuminatum). The overall GC content was 38.0%. The Santalum chloroplast genomes had 72 protein-coding genes, 35 tRNA genes, eight rRNA genes, and nine pseudogenes. The whole set of ndh genes and the infA gene were found to have lost their function. The ndhA gene had complete loss of function, and the other ndh genes and infA were pseudogenizations. Sixteen genes have introns, with two (ycf3 and clpP) harbor two introns.
Comparative analyses of the chloroplast genome A total of 17-31 SSRs were found in the Santalum chloroplast genomes. Mono-, di-, tri -, tetra-, penta-, and hexanucleotide SSRs were all discovered ( Fig. 2 and Supplemental Table S2). The majority of SSRs were mononucleotide repeats in all Santalum species, followed by tetranucleotide repeats. Tri-and tetranucleotide repeats were not found, and dinucleotide repeats were limited to one occurrence in S. acuminatum and S. album. Most of the mononucleotide repeats were composed of A/T, with very little G/C. The LSC region contained the largest number of SSRs (184), with 39 identi ed in the SSC region and 66 in the IR region.
The chloroplast genomes were plotted using mVISTA and with S. leptocladum as the reference. The results revealed collineation, no rearrangement, and high sequence similarity across the chloroplast genome (Fig. 3). There were 2,352 variable sites in the 145,671-bp Santalum chloroplast genome alignment (Table 3). The overall nucleotide diversity (π) was 0.0036. The SSC exhibited the highest π value (0.00926), compared with the IR (0.00087) and LSC (0.00457) regions. The genetic p-distance and number of nucleotide substitutions among these ten Santalum species are given in Supplemental Table S3. The mean genetic distance was 0.00401, the lowest divergence (p-distance: 0.0003) was between S. ellipticum and S. ellipticum var. littorale, and the largest sequence divergence (p-distance: 0.00828) was between S. spicatum and S. yasi. To identify the mutation hotspots in the chloroplast genome, the nucleotide diversity values are displayed in Fig. 4. The number of single nucleotide substitutions ranged from 0 to 46, and the π value ranged from 0 to 0.01485 within a 800-bp sliding window size.
We de ned the mutation hotspots with pi values > 0.012. There were three regions (ccsA-trnL, ΨndhE-ΨndhG-rps15, and ycf1), and those three regions were all located within the SSC region. Among these three regions, the ccsA-trnL had the highest nucleotide diversity values.
The most commonly employed loci used in plant phylogeny and DNA barcoding (e.g., rbcL, matK, trnH-psbA) were not selected in our study. We compared the sequence divergence of highly variable regions and the three conventional candidate chloroplast DNA barcodes (matK, rbcL, and trnH-psbA). Sequence variation values, such as genetic distance, nucleotide diversity, and the number of variable sites, are given in Supplemental Table S4. The three newly identi ed markers had higher genetic divergence and had more information sites than the three conventional candidate chloroplast DNA markers. The primers designed for the three variable markers are given in Supplemental Table S5.

Microstructural mutation variable
Among the chloroplast genomes of Santalum species, there were 460 indels in total, including 269 normal indels, 104 repeat-related indels, and 87 SSR-related indels. Most of the indels (77.17%, 355 times) were in the spacer regions, 57 indels were found in the intron regions (12.39%), 26 indels occurred in the pseudogene regions, and 22 indels in the exon regions. All SSR-related indels were located in non-coding regions. The length of normal indels ranged from 1 to 331 bp (Fig. 5), and 1-bp indels were the major type (37.92%).
Fourteen small inversions were identi ed in the Santalum chloroplast genome. All of the inversions and their inverted repeating franking sequences formed stem-loop structures. The inversions length was 2 to 33 bp, and the franking repeats were from 7 bp to 25 bp. There was no correlation between the length of inversion and the franking repeats sequences. Seven inversions occurred in the LSC regions; four were located in the SSC region, and three in the IR regions. All the inversions were located in the non-coding regions.

Phylogenetic inference
The 13CPG dataset matrix included 150,415 nucleotide sites, of which 6,259 were variable sites. The second data matrix, 70g50s, contained 66 protein-coding genes and four rRNA genes from 50 Santalales species. After excluding ambiguous regions and sites, this dataset contained 56,789 nucleotide sites, of which 13,458 (23.70%) were parsimony-informative sites.
The optimal partitioning scheme using the 70g50s dataset identi ed under the Akaike Information Corrected Criterion (AICc) and using strict hierarchical clustering analysis in PartitionFinder (lnL= -263607.113888; AICc = 528592.787194) contained 57 partitions (Supplemental Table S6). The ML tree under the unpartitioned and the three partitioned schemes produced identical topologies ( Fig. 6 and Supplemental Figures S1-S3). The ML tree inferred from the 13CPG and 70g50s datasets were similar to the phylogenetic relationships of Santalum species (Fig. 6).
According to the 70g50s datasets, we inferred the phylogeny of Santalales. The ML tree showed that all the family was generally resolved and supported a monophyletic clade. Erythropalum scandens (Erythropalaceae) were selected as the outgroup, according to the results of Chen et al. and Guo et al. [19,20]. Ximeniaceae was supported position as early diverging lineages. Loranthaceae and Schoep aceae formed a clade (BS = 100/PP = 1). Opiliaceae followed by Cervantesiacea were successive sisters to a clade comprising the remaining Santalales. Santalaceae was sister to Viscaceae plus Amphorogynaceae (BS = 100/PP = 1).
All Santalum species formed a monophyletic clade (BS = 100/PP = 1) and were sister to Osyris wightiana within Santalaceae. Santalum had a shortened branch on the phylogenetic tree, indicating low divergence among Santalum species. S. spicatum was the rst diverging branch. S. acuminatum was sister to the remaining species, which formed two lineages. The rst lineage included three species (S. leptocladum, S. freycinetianum var. pyrularium, and S. sp.) and the second lineage include three branches. Two samples of S. album were sister to the remaining species, and the relationships of the three branches were not clear

Santalum chloroplast genome evolution and variation
Our ndings reveal that the Santalum chloroplast genomes have highly similar genome structures, genome sizes, and gene contents (Figs. 3 and 4). Our ndings are similar to other chloroplast genome studies reporting that the single-copy regions and non-coding regions are more variable than IRs and coding regions [21,22]. The variation in size relative to other angiosperm species was mainly due to some missing genes (Fig. 1).
All encoded NAD(P)H dehydrogenase complex (Ndh) genes in the Santalum chloroplast genome had functional or physical losses. The ndh genes are the earliest functional losses in the chloroplast genome of hemiparasites [23,24]. All the ndh genes have losses in the Santalales hemiparasites [19,20], suggesting that the chloroplast NDH pathway is not essential in these lineages, or this function has been transferred to the nuclear genome [24,25]. Another degraded chloroplast gene in the Santalum chloroplast genome was infA; this mutation was detected in most Santalales hemiparasite [20]. The infA gene is a translation initiation factor of the translation initiation complex [26]. Loss and pseudogenization of infA has occurred in many hemiparasites and holoparasitic plants [24,27]. This gene has also been independently lost multiple times among photoautotrophic plant lineages [28].
SSRs are important markers for population genetics and germplasm management [29], and chloroplast genome SSRs have been used to analyze the material from introgression [30]. A total of 17 to 31 SSRs were found in the Santalum chloroplast genome, and mononucleotide SSRs were the most frequent (Fig. 2). The number of SSR in the Santalum chloroplast genome was low compared to photoautotrophic plant lineages [31][32][33][34].
Microstructural mutation events are ubiquitous in chloroplast genome evolution, but have been little studied [31]. Indels and small inversions were analyzed in this study ( Fig. 5 and Table 2). Based on Dong et al. [31], we classi ed the indels mutations into three categories: SSR-related indels, repeat-related indels, and normal indels. The normal indels were the most frequent in the Santalum chloroplast genome, and the size was also variable, ranging from 1 to 331 bp (Fig. 5). Slipped strand mispairing (SSM) has been suggested as the mechanism leading to most SSR-related indels [35,36]. DNA recombination has also been proposed to cause repeat related indels [35,36]. These different mechanisms might be responsible for the observed differences in indel length.  [39], Diospyros [22], and Quercus [40]. Studies have shown that ycf1 is phylogenetically useful [41] and is associated with a high success rate for DNA barcoding [42]. ccsA-trnL and ΨndhE-ΨndhG-rps15 have been less widely used in the phylogeny and DNA barcoding.

Phylogenetic relationships of Santalum
Based on a few morphological characters, such as the oral tube color, placement of the ovary, the Santalum has been classi ed into three or four sections [8,11]. However, the molecular data was not supported by the sectional classi cation of Santalum. For example, the S. boninense from Section Santalum was sister to S. paniculatum and S. ellipticum from Section Hawaiiensia (Fig. 6). S. acuminatum and S. spicatum were in the formerly recognized Australian genus Eucarya. The ITS, ETS, and GBSSI sequences (nuclear data) support the notion that these two species form a monophyletic group [2,3], which con icts with the chloroplast genome results (Fig. 6). Incomplete lineage sorting and chloroplast genome capture might account for this discordance [43,44].
There were several auto-and allopolyploid species according to an estimate by measuring the C value [3]. Chloroplast genome data showed the maternal line of the allopolyploid species and were used to identify the maternal progenitor. The allopolyploid species of S. ellipticum, S. paniculatum, and S. boninense formed a clade, and there were no diploids species in this clade, although the maternal parents were unresolved. The progenitors might be extinct or not yet discovered. The biogeography suggests that Santalum island colonists tend to be polyploids [3].

Conclusions
The analyzed Santalum chloroplast genomes have a similar structure, gene number, and gene order. Diversi cation of the Santalum chloroplast genome is explained by the presence of mutation hotspots regions, small inversions, and the co-occurrence of different types of indel or single nucleotide polymorphisms. The phylogeny and divergence time analysis based on the complete chloroplast genome discovered that the Santalum species likely originated by radiation evolution, and most speciation events occurred less than 1 Mya. Owing to the incomplete sampling and the existing polyploids species in Santalum, further studies with extended sampling and additional nuclear markers will be necessary.

Plant materials and sequencing the chloroplast genomes
We collected 12 individual samples representing ten currently described Santalum species. Details of the 12 samples collected in this study are given in Supplemental Table S1. Specimens of these samples were preserved in the herbarium of Research Institute of Tropical Forestry, Chinese Academy of Forestry. Xiaojin Liu identi ed all samples. Leaf tissues were dried using silica gel for subsequent DNA extraction. These materials are from cultivated plants, and permission is not required to collect them.
DNA was extracted using the modi ed CTAB DNA extraction protocol [45]. We used the low-coverage whole-genome sequencing method to obtain the whole chloroplast genome. Paired-end libraries with 350-bp inserts were prepared and then sequenced in Illumina iSeq X-ten platform at Novogene. Each sample yielded about 4 Gb of data.

Chloroplast genome assembly and annotation
The raw reads were ltered using Trimmomatic v0.36 [46] to remove the adaptors, low-quality reads, and sites. The clean data were used to assemble the chloroplast genome using GetOrganelle [47]. The newly sequenced chloroplast genomes were annotated using Plann [48] using Santalum album (GenBank Accession number: MK675809) as the reference. The chloroplast genome maps were visualized using OGDRAW [49]. The complete chloroplast genomes were deposited in GenBank (MW464914 to MW464925).

Genome comparison
The mVISTA program was used to analyze the variation in the Santalum chloroplast genomes [50], using the chloroplast genome of S. leptocladum as the reference for sequence annotation. SSRs in the chloroplast genome were identi ed using the MISA software. The parameters implemented in MISA are as follows: repeat units ≥ 10 for mononucleotide, repeat units ≥ 6 for dinucleotides, repeat units ≥ 5 for trinucleotides, repeat units ≥ 4 for tetranucleotides, and repeat units ≥ 3 for pentanucleotides and hexanucleotides.
The whole Santalum chloroplast genome alignments were performed with MAFFT [51] and adjusted manually. We used the genetic Pdistances and the number of SNPs (single nucleotide substitutions) to assess the divergence among the Santalum species. The Pdistances and the number of SNPs were calculated using MEGA X [52]. To explore the mutation hotspots in the chloroplast genome, nucleotide diversity (π) was calculated using the software DnaSP v6 [53] by sliding window analysis with a window size of 800 bp and a step size of 100 bp.

Microstructural mutation events
Indels and small inversions were identi ed based on the aligned chloroplast genome sequence matrix, according to Dong et al. [31].
The indel types were divided into three categories: repeat-related indels, normal indels, and SSR-related indels. Inversions were rst identi ed using the REPtuter program and then checked and con rmed by reexamining the sequence matrix. Inversions form a stemloop structure, including the inversion sequences and inverted repeat at the opposite franking end [31].

Phylogenetic analyses
Phylogenetic analysis was conducted to elucidate the interspeci c phylogenetic relationships within Santalum and the phylogenetic positions within Santalales. To make good use of the chloroplast genomes in phylogenetic analyses, we generated two datasets for phylogenetic inference. The rst dataset (13CPG) was the 12 Santalum complete chloroplast genome sequences with Osyris wightiana used as an outgroup. The second dataset (70g50s) was 70 coding genes, including 12 Santalum samples and 38 Santalales species, including nine families.
Maximum likelihood (ML) and Bayesian inference (BI) methods were used to infer phylogenetic relationships. For the 13CPG dataset, the best-t model was found with ModelFinder [54]. For the 70g50s dataset, we used the following partitioning schemes: (1) unpartitioned, (2) partitioned by genes, (3) partitioned with the rcluster algorithm (data pre-partitioned by locus), and (4) partitioned with the hcluster algorithm (data pre-partitioned by locus). All partitioning analyses were run in PartitionFinder 2 [55]. RAxML-NG [56] was run for the ML tree with 500 bootstrap replicates.
Mrbayes v3.2 [57] was used to perform the BI tree. For the 70g50s dataset, we used the hcluster partitioning scheme for BI analyses because this scheme resulted in the highest log-likelihood in the ML analyses. The 13CPG dataset used the GTR + G model (the bestt model from ModelFinder) for BI analyses. The BI analysis was run with two independent chains and prior for 20 million generations with sampling every 1000 generations. The initial 25% of the sampled trees were discarded as burn-in. The stationarity was regarded as having been reached when the average standard deviation of split frequencies remained below 0.01.

Molecular clock dating
The 70g50s dataset was used to estimate the divergence times of Santalales using ve priors.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.
Availability of data and material The 12 newly assembled chloroplast genomes were deposited in GenBank under the accession numbers of MW464914 to MW464925.

Competing Interests
The authors declare no con icts of interest.