Comparative Analyses of Chloroplast Genomes from Six Rhodiola Species: Variable DNA Markers Identi cation and Phylogenetic Inferences

Kaihui Zhao Tibet Agriculture and Animal Husbandry University Lianqiang Li Tibet Agriculture and Animal Husbandry University Hong Quan Tibet Agriculture and Animal Husbandry University Junbo Yang Chinese Academy of Sciences Zhirong Zhang Chinese Academy of Sciences Zhihua Liao Tibet Agriculture and Animal Husbandry University Xiaozhong Lan (  lanxiaozhong@163.com ) Tibet Agriculture and Animal Husbandry University


Introduction
As traditional natural plant pharmaceuticals and health food, Rhodiola is belonging to the family of Crassulaceae and mainly distributed in alpine regions of Asia and Europe [1][2][3]. 73 species of Rhodiola plants are distributed in China, and the Qinghai-Tibet Plateau has the most species [4]. The extract of Rhodiola plants, especially R. crenulata and R. rosea, has various pharmacological effects such as antihypoxia, fatigue, tumors, radiation, aging, and improvement of mental and physical functions [5]. With the in-depth research on the pharmacological effects of Rhodiola plants, the demand for Rhodiola resources is increasing. Due to the fragile ecological environment of the Qinghai-Tibet Plateau and the lack of arti cial cultivation techniques, relying solely on the digging of wild resources can easily lead to the reduction of Rhodiola plant resources and the loss of genetic diversity resources [6].
Due to the variety of Rhodiola plants, the source of commercial medicine of Rhodiola is very complicated, but the pharmacodynamics of different species of Rhodiola have a signi cant difference in clinical e cacy. The traits of the medicinal plants of Rhodiola and the characteristics of the microstructure are similar [7]. At present, R. crenulata is the only primitive plant of the Rhodiola medicinal herbs contained in the Chinese Pharmacopoeia (2020 Edition). The development and research of its alternative varieties is imminent. In recent years, the mixed use of plant roots and rhizomes of distinct species of Rhodiola has occurred frequently. Researchers have studied the Rhodiola Herbal Slices on the market using DNA barcoding technology, only 40% of the samples are R. crenulata collected in the Chinese Pharmacopoeia [8]. The chaotic use of Rhodiola medicinal materials directly affects the safety and e cacy of clinical medications, and coupled with unrestricted collection, the number of wild resources has decreased dramatically. Therefore, in order to realize the protection and sustainable development of Rhodiola plants, it is necessary to conduct in-depth research on the identi cation and genetic diversity of their species.
To solve the problem of Rhodiola plant identi cation, Wang et al. developed random ampli cation polymorphic DNA (RAPD) and inter-simple sequence repeat (ISSR) primers to identify R. angusta, R. crenulata, R. bupleuroides, and R. sachalinensis [9]. Li et al established a method for classifying and identifying R. quadri da and R. crenulata based on nuclear magnetic resonance 1H-NMR ngerprintschemical pattern recognition technique [10]. Zhu et al. found that the internal transcribed spacer 2 (ITS2) sequence can effectively distinguish R. crenulata and R. rosea [11]. Booker et al. used nuclear magnetic resonance spectroscopy coupled with high performance thin layer chromatography techniques to comprehensively analyze R. crenulata and R. rosea collected in markets around the world [12]. These advanced identi cation methods currently used can solve the identi cation problems of some Rhodiola plants, but they also have the disadvantages of having a narrow application range and high identi cation cost.
DNA barcoding is a new species identi cation technology developed in recent years [13]. It eliminates the obstacles of traditional morphological recognition methods that rely on long-term experience. The plant chloroplast (cp) genome, as a research hotspot for screening DNA barcoding sequences, can also be used as a super-barcoding for phylogenetic and species identi cation studies [14]. The use of the cp genome to solve the problem of di cult classi cation of related species is of great signi cance for species identi cation in herbal medicine and even the entire plant community. Chen et al. proposed that using the whole genome as a super-barcode can effectively identify Ligularia plants [15]. Zhong et al. found that 41 Dendrobium species can be effectively identi ed based on the whole cp genome, and Dendrobium o cinale from 3 different places of production can also be distinguished [16] In this study, we sequenced the cp genomes of R. tangutica, R. wallichiana, R. quadri da, R. bupleuroides, R. gelida, and R. henryi using Illumina technology followed by reference-guided assembly of de novo contigs. Our aims were: 1) to detect the variations of long repeats and SSRs in 6 Rhodiola cp genomes; 2) to identify divergence hotspots as potential genetic markers for Rhodiola DNA barcoding; and 3) to construct a phylogeny for Rhodiola species using protein coding sequences of the cp genome and infer their phylogenetic location within Crassulaceae.

Long repeats and SSRs
Repeat sequences have been applied extensively for phylogeny, population genetics, genetic mapping, and forensic studies [17]. We used REPuter to identify repeat sequences in the cp genomes of 6 Rhodiola species. A total of 144 pairs of repeats were detected in the 6 Rhodiola cp genomes, with the repeat length range from 30 to 62 bp ( Fig. 2A). The cp genomes of 6 Rhodiola have 5, 15, 8, 11, 12 and 9 forward repeats and 10, 14, 13, 16, 13 and 14 palindromic repeats (Fig. 2B). Reverse repeats and complementary repeats only exist in the cp genome of R. gelida (Fig. 2B). The long repeat lengths of 30, 31, 32, 40, and 41bp existed in all 6 Rhodiola cp genomes ( Fig. 2A). Long repeat lengths of 33, 34, and 36 bp were found the least often and only existed in the R. gelida and R. bupleuroides cp genomes, respectively ( Fig. 2A).
Simple sequence repeats (SSRs) are usually 1-6 bp tandem repeat DNA sequences and are widely used as molecular markers with their polymorphic to identify closely related species [18]. In our study, SSRs in 6 Rhodiola cp genomes were identi ed using MISA software. The number of SSRs in the 6 Rhodiola species ranged from 186 (R. wallichiana) to 200 (R. gelida) (Fig. 3A). The numbers and distribution of all SSR types were similar and conserved in the 6 Rhodiola cp genomes, except for pentanucleotide, which only existed in R. bupleuroides (Fig. 3B). Mononucleotide repeat motifs were occupied the largest proportion in these SSRs, ranging from 63% (R. bupleuroides) to 65% (R. henryi). No hexanucleotide repeat motifs were found in the 6 Rhodiola cp genomes (Fig. 3B). Among these mononucleotide repeats, A and T nucleotides were the most common, while tandem G or C repeats were quite rare (Fig. 3A), which was in concordance with the other research results [19,20]. These SSR markers could be used to examine the genetic structure, differentiation, diversity, and maternity in the 6 Rhodiola species and their relative species in future studies.

Divergence Hotspots
We selected eight Rhodiola cp genome sequences to be compared and plotted using the mVISTA software with the annotated cp genome of R. rosea as a reference to elucidate the level of sequence divergence (Fig. 4). Based on the overall sequence identity, the results indicated that the coding regions exhibit lower divergence levels than the non-coding regions and two IR regions exhibit higher conservation than the remaining sequences across the whole chloroplast genome, as can be seen in other plants [21,22]. Furthermore, the results showed that the ycf1 and trnH-GUG-psbA sequences of Rhodiola were highly divergent regions.
In addition, the nucleotide diversity (Pi) values were calculated to evaluate the sequence divergence among 22 Rhodiola cp genomes (Table S3). The genetic distance of all 75 protein-coding genes varied from 0 to 0.01143 (ycf1) with an average value of 0.00444 ( Figure 5A). Based on a considerably higher Pi value of > 0.009, we found seven highly variable regions (ycf1, rps15, ndhF, rpoC1, rps8, rpl20, rps18, and matK) (Fig. 5A). These values of the non-coding regions ranged from 0 to 0.04122 (trnH-GUG-psbA) with an average value of 0.01057 (Fig. 5B). A total of ve mutational hotspots that showed high values of PI (≥0.02) were identi ed, including trnH-GUG-psbA, rps15-ycf1, trnG-GCC-trnR-UCU, trnC-GCA-petN, and ndhF-rpl32. The analysis revealed that the protein-coding regions exhibit lower divergence levels than non-coding regions. These hotspot regions could be utilized as potential molecular markers for phylogenetic studies and the identi cation of Rhodiola species.

Phylogenetic Analysis Within Rhodiola
Complete cp genomes comprise abundant phylogenetic information, which could be applied to evolution and phylogenetic studies of angiosperms because of several advantages, such as high accuracy and resolution [23]. In order to clarify the phylogenetic position of six newly assembled Rhodiola within the Saxifragales, 38 protein-coding genes from 55 available complete cp genome sequences in Saxifragales were used for phylogenetic analyses. Maximum likelihood (MI) and Bayesian inference (BI) were used to construct phylogenetic tree with Rosa rugosa as an outgroup. The phylogenetic tree showed that all Rhodiola species formed a monophyletic clade and then were classi ed into two separate branches (Fig.  6), which is inconsistent with the record in Flora of China [1]. The dioecious clade composed of nine dioecious Rhodiola species, and the hermaphrodite clade included all the hermaphrodite species.

Plant Materials and DNA Extraction
Plant materials of Rhodiola were collected in Nyingchi (Tibet, China). The specimens of Rhodiola have been kept at the Tibet Agriculture & Animal Husbandry University. Total genomic DNA was extracted from silica-gel-dried leaves using the modi ed CTAB method [24]. The quantity and quality of extracted genomic DNA were determined by gel electrophoresis and NanoDrop 2000 Spectrophotometer (Thermo Scienti c, Carlsbad, CA, USA).

Chloroplast Genome Sequencing, Assembly and Annotation
Genomic DNA was randomly fragmented using physical methods. Paired-end sequencing libraries were constructed according to the Illumina standard protocol (Illumina, San Diego, CA, USA). Sample sequencing was carried out on an Illumina Hiseq X-Ten platform. Total genomic DNAs were also sent to BGI (Shenzhen, China) for library (400 bp) preparation for genome skimming sequencing. Paired-end (150 bp) sequencing was conducted on the Illumina HiSeq X-10 platform, generating ∼2 Gb data per sample. Raw reads were ltered by quality control software NGS QC Toolkit v2.3.333 to obtain high quality Illumina data [25].
Next, ltered reads were de novo assembled using NOVOPlasty [26] with parameters of K-mer (33). These assembled chloroplast genomes were annotated in GeSeq [27], coupled with manually edited start and stop codons in Geneious 11.1.4 [28] (Biomatters Ltd., Auckland, New Zealand) with a reference R. rosea chloroplast genome (Genbank accession number MH410216). In addition, all tRNA genes were further veri ed using tRNAscan-SE v1.21 [29]. The border region between the inverted repeat (IR) and the large single copy (LSC), also between inverted repeats and small single copy (SSC) junction were determined through local BLAST software. Finally, the circular gene maps of Rhodiola plastomes were drawn utilizing the Organellar Genome DRAW tool (OGDRAW) [30].

Comparative Genomic Analysis and Molecular Marker Identi cation
To detect variations within Rhodiola cp genomes, we compared the cp genomes of R. crenulate, R. rosea, and the six newly assembled Rhodiola cp genomes by mVISTA [31]. The nucleotide diversity of the Rhodiola cp genomes was detected by DNA Sequence Polymorphism (DnaSP) software [32].

Characterization of Repeat Sequence and SSRs
The long repetitive sequences were detected using REPuter with a 30 bp minimum repeat size and a Hamming distance of 3 [33]. Simple sequence repeats (SSRs) in the cp genomes were identi ed via the MISA perl script [34] with the minimum number of repeats set to 8, 5, 3, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides, respectively.

Phylogenetic Analysis
The phylogenetic analysis was performed on six newly assembled Rhodiola cp genomes, another 55 Saxifragales species, and one outgroup Rosa rugosa, all of which were down loaded from the NCBI (Table S4) except those of six newly assembled Rhodiola cp genomes. Molecular phylogenetic trees, using aligned sequences of 38 protein-coding genes (

Discussion
Six Rhodiola cp genomes were sequenced and assembled in our study, and this information was used to identify candidate DNA markers and infer Rhodiola phylogeny. The size of the cp genome, the length of the SSC, LSC, and IR regions, the content of the GC, and gene content demonstrated a high degree of similarity among the genomes, implying that Rhodiola species shared low diversity [4]. The IR region, however, has a larger GC content than the LSC and SSC regions. In most angiosperm chloroplasts, there are 74 protein-coding genes, with an additional ve in a few species [37]. Six newly assembled Rhodiola cp genomes contain 85 protein-coding genes, 37 tRNA genes, and 8 rRNA genes, which is consistent with previous studies [4].
The whole cp genome contains abundant mutation sites, which can be used directly as a super barcode for species identi cation. As with hypervariable regions of the genome, they can also be screened out as potential molecular markers [38,39]. At present, many species have been successfully identi ed based on the chloroplast genome, especially the species with frequent hybridization and apomixis [39][40][41]. Ycf1, rps15, ndhF, rpoC1, rps8, rpl20, rps18 and matK genes in CDS showed signi cant variation and high sequence variations were found in intergenic regions as follows: trnH-GUG-psbA, rps15-ycf1, trnG-GCC-trnR-UCU, trnC-GCA-petN, and ndhF-rpl32 ( Figure 6). These regions can also be used as candidate markers for elucidating the phylogenetic relationship among Rhodiola species. In many species, TrnH-GUG-psbA and matK are the most mutated hotspots for species identi cation, such as Kengyilia [42], Apocynaceae [43] and Orchidinae [44]. Ycf1 marker has good species identi cation resolution in Pinus at the within-genus level relationships [45]. Ycf1 has a better effect on the identi cation of Rhodiola due to its longer sequence (~5800bp). We recommend that the ycf1 gene be used to reconstruct phylogenetic relationships of Rhodiola where there is a lack of genomic information.
Our phylogenetic analysis strongly supported the monophyly of Rhodiola species, which is consistent with previous studies [4,46]. All Rhodiola species are classi ed into two separate branches (dioecious and hermaphrodite), which supports the view of Zhao et al [4]. What interests us is that R. wallichiana was gathered in the hermaphrodite clade, and we think that we may have selected a rare unisexualis among R. wallichiana. So, we speculate that unisexual R. wallichiana and bisexual R. wallichiana may have huge genetic differences. In addition, the dioecious clade also contains R. integrifolia, which has been temporarily classi ed as a hybrid between R. rosea and R. rhodantha [47]. Our phylogenetic analyses also revealed that there are close relationships between Crassulaceae and Saxifragaceae, supporting the view that there is a common origin between them. In some traditional angiosperm classi cation systems, Saxifragaceae is the largest group of garbage bins in angiosperms, and many branches that are unrelated in evolution are forcibly pieced together into a highly multi-line group. Our phylogeny suggested that the Penthoraceae and Haloragidaceae were clustered into one clade, indicating their close relationship. The APGII research considers that Penthoraceae can be selectively combined with Haloragidaceae [48], our research seems to support this view. However, since there is only one chloroplast genome data in two families and lack of data on the cp genome of more species, we believe that when more species in the two families have been sequenced to accurately determine their evolutionary relationship.

Conclusions
In this study, we determined and characterized six cp genome sequences of Rhodiola, which are commonly used as Tibetan medicinal materials. The size of the genome, the structure and organization of genes were shown to be conservative, which is similar to those reported cp genomes of Rhodiola species. To develop molecular markers for future phylogeographic and population genetic studies, thirteen mutational hotspots were identi ed, including ycf1, rps15, ndhF, rpoC1, rps8, rpl20, rps18, matK, trnH-GUG-psbA, rps15-ycf1, trnG-GCC-trnR-UCU, trnC-GCA-petN, and ndhF-rpl32. The results of phylogenetic analysis showed that Rhodiola species were clustered into two clades: dioecious and hermaphrodite, with strong support values. The complete cp genome sequences that were newly assembled facilitate medicinal resource conservation, phylogenetic reconstruction, and biogeographical research of Rhodiola.  Figure 1 Gene maps of the complete cp genome of six species of Rhodiola. Genes drawn outside of the map circle are transcribed clockwise, while those drawn inside are transcribed counter-clockwise. The darker gray in the inner circle corresponds to GC content while the lighter gray corresponds to AT content.

Figure 2
The number of long repeats in the whole cp genome sequence of the 6 Rhodiola species. (A) Frequency of repeats more than 30 bp long. (B) Frequency of repeat type.  Phylogenetic tree based on 38 protein-coding genes of 55 available complete cp genome sequences in Saxifragales. The tree was generated using a ML method with 1000 bootstrap replicates. Numbers on the nodes indicate bootstrap values.