Plastid phylogenomic insights into the evolution of Distylium (Hamamelidaceae)

Background: Most Distylium species are endangered. Distylium species mostly display homoplasy in their owers and fruits, and are classied primarily based on leaf morphology. However, leaf size, shape, and serration vary tremendously making it dicult to use those characters to identify most species and a signicant challenge to address the taxonomy of Distylium. To infer robust relationships and identify variable markers to identify Distylium species, we sequenced most of the Distylium species chloroplast genome. Results: The Distylium chloroplast genome size was 159,041–159,127 bp and encoded 80 protein-coding, 30 transfer RNAs, and 4 ribosomal RNA genes. There was a conserved gene order displayed and a typical quadripartite structure. Phylogenomic analysis based on whole chloroplast genome sequences yielded a highly resolved phylogenetic tree and formed a monophyletic group containing four Distylium clades. A dating analysis suggested that Distylium originated in the Oligocene (34.39 Ma) and diversied within approximately 1 Ma. The evidence shows that Distylium is a rapidly radiating group. Four highly variable markers, such as matK-trnK, ndhC-trnV, ycf1, and trnT-trnL, and 74 polymorphic simple sequence repeats were discovered in the Distylium plastomes. Conclusions: The plastome sequences had sucient polymorphic information to resolve phylogenetic relationships and identify species accurately.

In this study, we speci cally aimed to (1) develop and screen appropriate intrageneric markers in the chloroplast genome to establish DNA barcodes for Distylium; (2) estimate the effectiveness of a whole chloroplast genome data set in resolving the relationships within this radiating lineage; (3) estimate the divergence time of Distylium.

Basic characteristics of the Distylium plastomes
The complete chloroplast genomes of the 12 newly sequenced Distylium species ranged in length from 159,041 bp (D. lepidoium) to 159,127 bp (D. gracile) ( Table 1). The Distylium chloroplast genomes had a quadripartite structure typical of most angiosperm species, including large single copy (LSC) and small single copy (SSC) regions separated by two inverted repeat (IRa and IRb) regions (Fig. 1).
The LSC regions ranged from 87,825 bp (D. pingpienense) to 87,863 bp (D. racemosum), the SSC regions varied between 18,770 bp (D. dunnianum) and 18,796 bp (D. lepidoium), and the IR regions ranged from 26,225 bp (D. elaeagnoides) to 26,241 bp (D. dunnianum). The GC content of the chloroplast genome sequences was 38.0%. A total of 114 unique genes was detected in the chloroplast genomes of the 11 Distylium species, including 80 protein coding genes, 30 tRNA genes, and 4 rRNA genes, and the gene order was highly conserved ( Fig. 1 and Table 1).

Repetitive sequences
A total of 801 SSRs were identi ed across the chloroplast genomes of the 11 Distylium species (Fig. 2 and Table S2). The number of SSRs per species ranged from 70 (D. dunnianum) to 78 (D. gracile). The majority of the SSRs were mononucleotide repeats (78.65%), followed by dinucleotide (8.61%) and tetranucleotide (5.87%) repeats. There were no hexanucleotide repeats in the Distylium plastomes. The SSR A and T motifs were the most frequent. SSRs were particularly rich in AT in the Distylium plastomes. Among those SSRs, most were located in the LSC/SSC regions (94.01%).
A total of 96 unique SSRs and 74 SSRs were polymorphic across the 11 Distylium species. All polymorphic SSRs were located in the single copy regions, except two SSRs ( Table 2). The mononucleotide repeat units A and T were also the most frequent polymorphic SSRs. Indel variations A total of 76 indels were discovered in the Distylium plastomes, including 59 normal indels and 17 repeat indels. Most of the indels (72.37%, 55 times) were located in the spacer regions, 15.79% (12 times) of indels occurred in the exons, and 11.84% (nine times) were found in the introns (Fig. 3). The TrnT-trnL spacer had ve indels, followed by ndhC-trnV (3 indels). The size of the normal indels ranged from 1 to 13 bp, with 8 bp and 9 bp length indels being the most common. The largest indel (13 bp) was located in the trnC-petN spacer and was a deletion in D. macrophyllum. The second largest indel was in the ycf1 exon of 12 bp length and was an insert in the two D. lepidoium samples. The length of the repeat indels ranged from 2 to 16 bp. The largest repeat indel occurred in the rpl20-rps12 spacer and the second largest repeat indel was located in the rps7-trnV spacer.
Variation in the plastomes and molecular markers for Distylium species The entire chloroplast genome of the 11 Distylium species was 159,360 bp in length, including 298 polymorphic sites and 115 parsimony informative sites ( Table 3). The overall nucleotide diversity (π) was 0.00045; however, each region of the chloroplast genome revealed different nucleotide diversity; the SSC exhibited the highest π value (0.00089) and the IR had the lowest π value (0.00006). All species had a unique chloroplast haplotype, except the IR regions. The number of nucleotide substitutions among the 11 species varied from 7 to 109, and the p-distance varied from 0.0004 to 0.0069. The lowest divergence was between D. buxifolium and D. chinese, and the largest sequence divergence was observed between D. chinese and D. lepidoium. The π value ranged from 0 to 0.0027 in an 800-bp sliding window size. In total, four peaks with π values > 0.002 were identi ed in the chloroplast genome (Fig. 4). Those regions included matK-trnK, ndhC-trnV, ycf1, and trnT-trnL. Three intergenic regions (matK-trnK, ndhC-trnV, and trnT-trnL) were located in the LSC region, and the ycf1 coding region was in the SSC region.
We tested the variability in the hypervariable markers by comparing the chloroplast genome and the three universal DNA barcodes (matK, rbcL, and trnH-psbA). The variable information is shown in Table 4. The intergenic spacer marker trnH-psbA was 367 bp, including two variable sites and no parsimony informative sites. The rbcL and matK genes were 1,428 bp with three variable and three informative sites, and 1,515 bp with only one variable and no informative sites. Combined with the three universal markers, the aligned length was 3,310 bp, with six variable sites and three informative sites. The mean distance was 0.00045. The species identi cation analyses showed that the universal DNA barcodes had less discriminatory power; they had only four haplotypes when combined with the three markers, and the NJ tree had lower resolution and most of the samples were not distinguished (Table 4 and Fig. 5). The four hypervariable markers ranged from 827 bp (matK-trnK) to 2,306 bp (ycf1) in length. The ycf1 gene had the greatest number of variable sites (20 sites) followed by trnT-trnL (9 sites), matK-trnK (8 sites), and ndhC-trrnV had the fewest (6 sites). Combining the four hypervariable markers, there were 43 variable sites and 16 parsimony informative sites that produced the most current identi cation ( Table 4). The identi ed hypervariable markers had higher resolution compared with the tree universal markers, based on the NJ tree (Fig. 5).

Phylogenetic inference
Using the complete chloroplast genome sequences, we inferred the phylogenetic relationships among the 24 Hamamelidaceae samples. The topology of the ML and BI trees was nearly identical (Fig. 6

Discussion
The genera Distyliopsis, Distylium, Fothergilla, Parrotia, Parrotiopsis, Shaniodendron, and Sycopsis occur in in the tribe Fothergilleae of the subfamily Hamamedoideae [9]. According to the phylogenetic relationships based on the several chloroplast and nuclear ITS genes [6,8], Distylium is sister to Distyliopsis [9]. This is the rst use of molecular data to infer the Distylium phylogeny. The Distylium genus formed a well-de ned monophyletic group according to the chloroplast genome data (Fig. 6). Moreover, the phylogenetic tree possessed a series of short internodes within Distylium and most species diversi ed < 1 Ma (Fig. 6), suggesting that this genus has undergone rapid radiation. D. lepidoium was at the base of the genus. This species was rst described in 1918 and is endemic to the Ogasawara Islands [4]. D. myricoides formed a monotypic clade and is distributed in eastern and southeastern China. According to the morphological characteristics, D. myricoides resembles D. buxijolium most closely, from which it may be distinguished by its larger leaves [5]. However, this relationship was not supported by the present study. D. buxijolium and D. chinense were sister species and formed a group supported by morphological characteristics [5]. In this study, the chloroplast genome data provided information to infer the phylogeny of Distylium. However, due to rapid radiation, sampling of additional individuals from each species and extending more nuclear genes would provide additional evidence of the evolutionary history of Distylium.
Most Distylium species are rare and endangered; thus, the development of rapid and easily accessible species identi cation methods is essential. The variations in the morphological characteristics between species were continuous and uninterrupted. Therefore, it was di cult to distinguish species using morphological characteristics. DNA barcoding offers an opportunity to identify Distylium species. rbcL and matK are the two core DNA barcodes in plants. However, many studies have shown that these two markers have lower species identi cation power [16,17]. Our study also showed that rbcL and matK or a combination of the two markers failed to discriminate Distylium species (Fig. 5), explaining the low resolution in previous studies and highlighting the importance of developing highly divergent markers.
Some studies have indicated that mutations are not random and are clustered as "mutation hotspots" or "highly variable regions" [10,16,18]. In this study, we compared the whole chloroplast genomes and identi ed the mutation hotspots in Distylium (Fig. 3). Four variable loci (matK-trnK, ndhC-trnV, ycf1, and trnT-trnL) were discovered. trnT-trnL has been frequently used in plant phylogeny [19]. NdhC-trnV and ycf1 are considered divergence hotspots in angiosperms based on our previous research [16]. NdhC-trnV has been less used in plant phylogeny and species identi cation and is prone to large indels [20]. The coding region of the ycf1 locus is the most divergent marker in most groups, and has been suggested as the main plant DNA barcode [17]. matK-trnK is located in the LSC region, and this locus is used less frequently in evolutionary biology. Some lineages have the Ploy T structure [21]. Therefore, the lineagespeci c, highly variable markers developed in this study will facilitate further phylogenetic reconstruction and DNA barcoding of rare and endangered Distylium species.

Conclusions
In this study, we report 10 newly sequenced chloroplast genomes of Distylium species. The overall genomic structure, including the gene number and gene order, was well-conserved. The phylogeny and divergence time analyses based on the plastome sequences showed that Distylium was a rapidly radiating group and most speciation events occurred < 1 Ma. A comparison of sequence divergence across the Distylium plastomes revealed that matK-trnK, ndhC-trnV, ycf1, and trnT-trnL were mutation hotspot regions. Overall, our study demonstrated that plastome sequences can be used to improve phylogenetic resolution and species discrimination. Extended sampling and additional nuclear markers are absolutely necessary in further studies.  Table S1. Total genomic DNA was extracted from the leaf tissues of herbarium specimens of this genus following the modi ed CTAB DNA extraction protocol [22].

Methods
Sequence, chloroplast genome assembly, and annotation The total DNA was constructed using 350-bp insert libraries according to the manufacturer's instructions, which was then used for sequencing. Paired-end sequencing was performed on an Illumina HiSeq X-ten at Novogene (Tianjin, China), yielding approximately 4 Gb of high-quality 150-bp paired-end reads per sample.
The raw reads obtained from Novogene were ltered using Trimmomatic 0.39 [23] with the following parameters: LEADING = 20, TRAILING = 20, SLIDING WINDOW = 4:15, MIN LEN = 36, and AVG QUAL = 20. High-quality reads were assembled de novo using the SPAdes 3.6.1 program [24]. The chloroplast genome sequence contigs were selected from the initial assembled reads in SPAdes by performing a BLAST search using several related Hamamelidaceae chloroplast genome sequences as references. The chloroplast genome sequence contigs were further assembled using Sequencher 5.4.5. All plastid assemblies were annotated in Plann [25] using D. macrophyllum (GenBank Accession number: MN729500) as the reference, and missing or incorrect genes were checked in Sequin. A circular diagram for the chloroplast genome was generated using OGDRAW [26]. All chloroplast genomes assembled in this study have been deposited in GenBank under Accession numbers MW248109 -MW248120.
The chloroplast genomes sequences were aligned using MAFFT [27] manually examined, and adjusted. Based on the aligned sequence matrix, the indels were manually checked and divided into categories of repeat indels and normal indels, according to Dong et al. [15]. D. dunnianum was used as the reference to determine the size and position of the indel events.

Sequence divergence analysis
Single nucleotide substitutions and the genetic p-distances were calculated using MEGA 7.0 [28] based on the aligned chloroplast genome sequences. To assess sequence divergence and to explore highly variable chloroplast markers, nucleotide diversity (π) was calculated by sliding window analysis using DnaSP v6 [29] with a widow size of 600 bp and a step size of 100 bp.
Nucleotide diversity and the number of haplotypes were used to assess marker variable for all barcodes (hype-variable markers and the universal plant DNA barcodes, such as rbcL, matK, and trnH-psbA). The tree-based method was utilized to calculate discrimination power. A neighbor-joining (NJ) tree was prepared in PAUP using the K2p distance.

Phylogenetic analyses
To elucidate the phylogenetic positions of Distylium within Hamamelidaceae and the interspeci c phylogenetic relationships within Distylium, multiple alignments were performed using the whole chloroplast genome of 24 Hamamelidaceae samples representing 11 genera, including Cercidiphyllum japonicum, Daphniphyllum oldhamii, and Liquidambar formosana as outgroups. The Hamamelidaceae chloroplast genomes were aligned using MAFFT, and ambiguous alignment regions were trimmed with Gblocks 0.91b [30]. The maximum-likelihood (ML) analysis was run with RAxML-NG [31] with the best-t model from ModelFinder [32]. Branch support was assessed by fast bootstrap methodology using non-parametric bootstrapping and 500 ML pseudo-replicates.
Mrbayes v3.2 [33] was used to infer the Bayesian inference (BI) tree. The BI analysis was run for 20 million generations, in which a tree was sampled every 1,000 generations. Two independent Markov Chain Monte Carlo (MCMC) analyses were performed and each chain started with a random tree. The rst 25% of the sampled trees was discarded as burn-in, while the remaining trees were constructed in a majority-rule consensus tree to estimate posterior probabilities.

Molecular clock dating
We used BEAST v2.5.1 [34] to estimate the divergence times of Hamamelidaceae using three priors based on the complete plastome sequences. Based on the average value obtained by Xiang et al. [9] in a calibrated analysis, three priors were used: (i) the average age of the most recent common ancestor (TMRCA) of Hamamelidaceae (the root of the tree) was 108 Ma; (ii) the crown age of Hamamelideae/Fothergilleae was 89 Ma; and (iii) the crown age of Mytilarioideae was 58.3 Ma. Each secondary prior was placed under a normal distribution with a standard deviation of 1.
The GTR nucleotide substitution model and the prior tree Yule model were selected with the uncorrelated lognormal distribution relaxed molecular clock model. The MCMC run had a chain length of 400,000,000 generations with sampling every 10,000 generations. The stationary phase was examined through Tracer 1.6 [35] to evaluate convergence and to ensure su cient and effective sample size for all parameters surpassing 200. A burn-in of 10% generations was discarded, and TreeAnnotator v2.4.7 was used to produce a maximum clade credibility tree.       Nucleotide diversity (π) in the Distylium plastomes using sliding window method. The four mutation hotspot regions (π > 0.002) were annotated. π values were calculated in 800 bp sliding windows with 50 bp steps. Nucleotide diversity (π) in the Distylium plastomes using sliding window method. The four mutation hotspot regions (π > 0.002) were annotated. π values were calculated in 800 bp sliding windows with 50 bp steps. Neighbor joining tree for Distylium using combine three universal plant DNA barcodes and four highly variable regions combinations. Neighbor joining tree for Distylium using combine three universal plant DNA barcodes and four highly variable regions combinations.   Divergence times of Hamamelidaceae obtained from BEAST analysis based on the complete plastome sequences. Mean divergence time of the nodes were shown next to the nodes while the blue bars correspond to the 95% highest posterior density (HPD). Black circles indicate the three calibration points.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.