The complete plastomes of red eshed pitaya (Selenicereus monacanthus) and three related Selenicereus species: insights into gene losses, IR expansions and phylogenomic implications

Background: Selenicereus is a genus of perennial shrub from the family Cactaceae, and some of them play an important role in the food industry, pharmaceuticals, cosmetics and medicine. To date, there are few reports on Selenicereus plastomes, which limits our understanding of this genus. Here, we reported the complete plastomes of four Selenicereus species (S. monacanthus, S. annthonyanus, S. grandiorus and S. validus, and carried out a comprehensive comparative analysis. Results: The four Selenicereus plastomes all have a typical quartile structure. The plastome size ranged from 133,146 bp to 134,450 bp, and contained 104 unique genes, including 30 tRNA genes, 4 rRNA genes and 70 protein-coding genes. Comparative analysis showed that there were massive losses of ndh genes in Selenicereus. Besides, we observed the IR regions had undergone a dramatic expansion and formed a previously unreported SC/IR border in the intron region of the atpF gene. Furthermore, we identied 6 hypervariable regions (trnF-GAA-rbcL, ycf1, accD, clpP-trnS-GCU, clpP-trnT-CGU and rpl22-rps19) that could be used as potential DNA barcodes for the identication of Selenicereus species. Phylogenetic analysis indicated that Hylocereus was nested in Selenicereus. Conclusion: Our study enriches the plastomic resources in the family Cactaceae, and provides the basis for the reconstruction of phylogenetic relationships. potential DNA barcodes for the identication of Selenicereus species. Our study enriches the plastomic resources in the family Cactaceae, and provides the basis for the reconstruction of relationships.


Background
Hylocereus species are perennial herbs from the family Cactaceae. The species in this genus are native to Central America, and nearly 20 species of Hylocereus are recognized by most and they can be found naturally occurring from Southern Mexico down throughout Central America and into Northern South America [1]. Also, for harvest their large fruits, which are known as dragon fruit, those species are grown on farms in other parts of the world, especially tropical Asia. All Hylocereus species have varying edible fruits, and are commercially developed in different ways. Although the white pitaya (H. undatus) is the primary species found in grocery stores and street markets, red-eshed dragon fruit has gained more popularity. The red eshed pitaya (Selenicereus monacanthus (Lem.) D.R.Hunt), formerly known as H. lemairei, not only has an attractive red-purple appearance and unique taste, due to its rich content of highvalue functional compounds [2], it is also widely used in pharmaceutical, cosmetic and medical applications. For example, the pulp of red-eshed pitaya is rich in β-carotene and anthocyanin, which can effectively prevent and treat some chronic diseases (especially cancer) [3][4][5].
The speci c de nitions of Hylocereus and Selenicereus have always been controversial [6]. Britton & Rose [7] gave early de nitions of the two based on morphology, but Bauer [8] believed that "the transfer of Selenicereus to Hylocereus" made Britton & Rose's classi cation concept no longer applicable. Finally, based on a large number of plastid and nuclear DNA sequences, morphology and anatomical data, it was proved that the two genera were not separated, and Hylocereus was nested in Selenicereus [9][10][11][12]. The increasing quality and widespread cultivation of pitaya, as well as different viewpoints and limited genomic information, have further complicated the taxonomic de nition of this genus. Therefore, it is very important to explore the phylogenetic relationship of the Selenicereus species based on genomics. Unfortunately, there are few studies on the phylogenetic relationship between Hylocereus and Selenicereus based on the complete plastid genomes [13].
Organelle genome sequencing can effectively solve the closely related phylogenetic relationships among species [14]. Chloroplasts are an important organelle in plants, which had its semi-autonomous genetic system, known as chloroplast genome or plastid genome (plastome) [15], most plastomes in angiosperms are a typical quadripartite structure [16], consisting of two inverted repeats (IRa and IRb) and two copy regions (LSC and SSC) [17], and the size of the plastomes ranges from 72 to 220 kb [18], including about 110-130 unique genes, of many are involved in photosynthesis [19]. Plastid genomes have been widely used in taxonomic and evolutionary studies [20] due to their small size, simple structure and maternal inheritance [21,22]. Entire plastid genomes and nuclear DNA clusters are important in distinguishing between closely related or recessive species [23][24][25]. Besides, although the plastid genomes are generally conserved in terms of sequence differences and structural organization, some non-coding regions may experience an unexpectedly high frequency of nucleotide substitutions, and these hypervariable regions could be used as DNA barcodes for species identi cation.
In this study, we sequenced, assembled and analyzed the plastid genomes of four Selenicereus species, including the red-eshed pitaya (S. monacanthus, formerly classi ed as Hylocereus) and three traditional Selenicereus species (S. annthonyanus, S. grandi oras and S. validus). Our main tasks were as follows: (1) we provide four high-quality references Selenicereus plastomes; (2) we analyzed the structural characteristics and sequence divergence of the plastomes in Selenicereus; (3) we identi ed simple sequence repeats (SSRs) loci and repeat sequences for further studies on population genetic structure; (4) we inferred the phylogenetic relationships of Selenicereus and Hylocereus in Cactaceae based on the complete plastome sequence; and (5) we identi ed the hypervariable regions that could be used as DNA barcodes for commercial identi cation of pitaya varieties.

Results
Overall organization and features of the four plastomes The plastomes size of these four taxa ranged from 133,146 bp (S. monacanthus) to 134,450 bp (S. validus). They were typical quadripartite structure, consisting of a large single-copy region, (LSC, 68,076 − 68,877 bp), a small single-copy region, (SSC, 21,716 − 22,023 bp), and a pair of inverted repeat region (IRs, 21,674 − 21,775 bp). Figure 1 showed the plastid genome map. In addition to the differences in length, the GC content of these conserved plastomes also showed slight changes. According to the analysis, the GC content in the four plastomes ranged from 36.29% to 36. 43   Similar to previous reports in cacti plastomes, the 11 ndh genes in the analyzed plastomes were partially lost, including ndhA, ndhC, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ and ndhK. The second exon of ndhB gene was also lost, and only the rst exon remained. By contrast, only the ndhD gene was intact. On the whole, the four plastomes were all composed of 104 unique genes, including 30 unique tRNA genes, 4 unique rRNA genes and 70 unique protein-coding genes. Moreover, we observed the loss of the rst exon of the clpP gene, which might be pseudogenes with the same to gene ndhB ( Table 2). Note. (x2) indicates that the gene located in the IRs and thus had two complete copies, * and ** indicate that genes containing one/two introns. ' ψ ' indicates that it is a pseudogene.

Category of Genes
Group of Genes Name of Genes Unknown Conserves open reading frames ycf1, ycf1 ψ , ycf3**, ycf2 (x2), ycf4 Note. (x2) indicates that the gene located in the IRs and thus had two complete copies, * and ** indicate that genes containing one/two introns. ' ψ ' indicates that it is a pseudogene.
Furthermore, we also clearly observed the loss of intron in two genes: rpl2 and rpoC1. Due to the loss of some genes, exons and introns, the number of intron-containing genes in the plastomes of Selenicereus species were signi cantly reduced compared to most other non-cactus species. Except for the transsplicing gene, rps12, there were only 5 protein-coding genes (petB, petD, rpl16, rps16 and atpF,) containing one intron, and only one gene (ycf3) containing two introns. Moreover, 5 tRNA genes containing one intron (trnL-UAA, trnT-CGU, trnK-UUU, trnA-UGC and trnE-UUC).

Repeat And Ssrs Analysis
Microsatellites (simple repeat sequences, SSRs) are usually 6 bp tandem sequences in eukaryotic genomes [26]. Their high polymorphism and codominant inheritance make them popular molecular markers [27,28], which play an important role in the identi cation of species and the evaluation of evolutionary relationships [29]. Among the four plastomes, S. monacanthus had the largest number of SSRs (74), followed by S. anthonyanus and S. validus (both were 61), and S. grandi orus (55). Most of these SSRs were homopolymers of A/T mononucleotide, and on average, they accounted for 64.34% of the total SSRs. Dinucleotides (18.85%), tetranucleotides (8.61%), and trinucleotides (3.69%) were followed. Pentanucleotide and hexanucleotide repeats were rare in Selenicereus plastomes, accounting for 1.23% and 1.64% of all SSRs, respectively ( Fig. 2 and Additional le 1: Table S1).
We detected a large number of dispersed repeats in the four plastomes. A total of 807 dispersed repeats were identi ed, including 618 forward repeats (with length ranged from 30 to 415 bp), 146 palindromic repeats (30 to 415 bp), 39 reverse repeats (30 to 41 bp), and four complementary repeats of 30 bp in length (Additional le 1: Table S2). The detailed distribution of these dispersed repeats in each plastome was shown in Fig. 2. Notably, the number of forward repeats in S. grandi orus and S. validus was signi cantly greater than that in the other two taxa. The dispersed repeats not only serve as potential markers for rearrangement, but were also crucial for inducing mutations [30].

Genomic Divergence
Sequence identity analysis based on mVISTA [31]was performed among the 4 plastomes, with the reference being the plastome of S. validus. We found that the plastome sequences of the four species were quite conservative, in general, IR regions were more conserved than LSC and SSC regions, and the hypervariable regions were mainly found in non-coding sequences. Nevertheless, several coding-regions showed signi cant differences in the sequences (Fig. 3), such as accD, clpP, ycf1 and ccsA. The accD gene, in particular showed unusual sequence divergence. In addition, there were signi cant differences among several non-coding regions: trnF-rbcL, trnM-accD and trnN-trnR.

Contraction And Expansion Of Irs
We analyzed the IR/SC boundaries and their adjacent genes in the four plastomes, and compared them to previously published related plastomes. It is not di cult to nd that the IR/SC border and the adjacent genes of Selenicereus plastomes were very similar in structural characteristics except for small differences in gene position. However, we found that the IR lengths and IR boundaries of the four plastomes newly reported here were varied greatly from those previously reported in cacti and related species. The length of IR regions was observed more than 20,000 bp in Opuntia quimilo and all other reported non-cactus species in the order Caryophyllales. However, it was only 8,530 bp in Rhipsalis baccifera, and less than 2,000 bp in most cacti genera, such as Mammillaria, Carnegiea, Lophocereus et al. Here, in our four sequenced Selenicereus plastomes, the IR lengths were ranged from 21,674 to 21,775 bp, indicating that the cacti had undergone a drastic expansion/contraction event in the IR region.
Furthermore, we also analyzed the IR boundaries of plastomes in the IR region that over 2,000 bp. As shown in Fig. 6, in two non-cactus species, the rps19 gene span the LSC/IRb border, and an rps19 pseudogene was duplicated in the IRa. The same, the ycf1 gene span the SSC/IRa border, and an ycf1 pseudogene was duplicated in the IRb.
In O. quimilo, the two LSC/IR boundaries were ycf15-trnV and trnV-trnH, and the two SSC/IR boundaries were ndhG-trnL and ndhG-ndhE, respectively. By contrast, in R. baccifera, the two LSC/IR boundaries were rpl23-trnI and trnI-trnH, and the two SSC/IR boundaries were inside of ycf1. Due to the dynamic changes of IRs, the IR boundaries were also changed in the four Selenicereus plastomes. Although the two SSC/IR boundaries were similar to R. baccifera, the second exon of atpF gained access to the IR regions, the rst exon of atpF was not. Thus, a previously unreported LSC/IR boundaries at the intron region of atpF was formed. This result suggested that the IR boundaries in cacti plastomes were extremely unstable compared with other Caryophyllales plastomes.
Phylogenetic Analysis Based On Conserved Protein-coding Genes In this study, we constructed phylogenetic trees by using the 56 shared plastidial genes as datasets. The tree reconstruction based on Maximum Likelihood (ML) method and Bayesian Inference (BI) method had a highly consistent topology. The stable topological structure and high bootstrap/posterior probability support values of each node indicated the reliability of phylogenetic tree (Fig. 7).
The phylogenetic analysis involved 15 species of subfamily Cereoideae and two outgroups (Pereskia sacharosa and O. quimilo). In our tree, the four Selenicereus species form a monophyletic clade supported by strong support values. The red-eshed pitaya (S. monacanthus) was most closely related to S. anthonyanus compared to other two Selenicereus species. This result also suggested that Hylocereus and Selenicereus were paraphyletic groups, the genus Hylocereus belonged to Selenicereus based on the 56 conserved plastidial genes.

Discussion
Changes in the content of plastomes: gene gain/loss and intron loss In this study, we reported the complete plastid genomes of four Selenicereus species. According to the assembly results, the plastomes of these four taxa were typical quartile structure, with a pair of inverted repeats separated by a large-copy region and a small-copy region [33]. The ndh genes in plastids play an important role in Circulating Electron Flow (CEF) in the photosystem of most land plants, CEF is attributed to plant maintenance of effective photosynthesis, water stress and light protection [34,35]. Exciting, we found two interesting phenomena in this study. Firstly, the phenomenon of massive losses of ndh genes in the plastome was observed, which was similar to the report by Sanderson et al. [36], only the ndhD gene was relatively complete; Secondly, we found that the number of intron-containing genes in the plastomes of Selenicereus species was signi cantly reduced. The main reason for this phenomenon is the losses of exons (ndhB and clpP) and introns (rpl2 and rpoC1). Introns can effectively improve the expression level of genes under certain conditions, and play an indispensable role in regulating gene expression [37]. This loss has also been observed in other plastomes of subfamily Cereoideae [36, 38, 39], and it probably is a feature unique to this clade.
SSRs and the repeats are crucial for the plastome rearrangement, and are widely used to detect population genetic diversity [40], as well as being considered as markers for DNA ngerprinting [41]. We analyzed the SSRs and repeat sequences in the four plastomes. First, the number of SSRs ranged from 55 to 74, of many were mononucleotide (A/T) polymer, and accounting for 65.57% of all SSRs. This is one of the reasons for the low GC content in the plastome. Second, compared with SSRs there were a lot of dispersed repeats in the four analyzed plastomes, and the length of forward/palindromic repeats even more than 400 bp. The repeated sequences have previously been reported to have the potential to form secondary structures, they can be used to identify the recombination process [42]. In our study, these large numbers of short dispersed repeats most likely facilitated the plastome rearrangement. Unfortunately, our Illumina short reads have not been able to con rm this, and the long reads will be needed to con rm the presence of genomic recombination in the future.

The Expansion Of Irs Resulted In A Rare Boundary
The contraction and expansion of IRs are common in angiosperms [43], which is also one of the factors affecting the length of plastid genome [44]. According to the comparative analysis results, we found that the length of IR regions in the four Selenicereus plastomes were exceeded 20 kb. Although this phenomenon also exists in O. quimilo, the IR length of most reported genera in cacti such as Mammillaria, Carnegiea and Lophoereus were usually less than 2 kb. Other studies have found that the IR length of R. baccifera was only 8530 bp. Apparently, cacti species have been undergone dramatic expansion/contraction events in IR region. Besides, through the analysis of the IR boundaries, we noticed that the positions of each gene in the IR/SC border of the four Selenicereus plastomes were not signi cantly different. However, due to the expansion of IRs, some genes originally located in the LSC region were access to IR region and formed a new IR boundary in the intron region of gene atpF that had not been reported before. In general, the IR boundary in cactus family is extremely instability compared to that of other plastomes in Caryophyllales order [45].

Hypervariable Regions Were Identi ed Based On Plastome Sequences
According to the results of sequence similarity analysis by mVISTA, the four Selenicereus plastomes were highly conserved, and there were few regions of difference. The hypervariable regions in plastomes were mainly identi ed in non-coding regions, which is consistent with the other plastomes in angiosperm [46,47]. Although there is little difference in plastomes as a whole, some hypervariable regions deserve our attention. Signi cant differences were observed in some protein-coding genes, such as clpP, ycf1, ccsA and accD, particular in gene accD, the mutation rates were even higher than that of the non-coding region. While in contrast, the gene with the greatest difference among the other plastomes usually was observed in gene ycf1. The differences in accD genes might be due to the presence of a large number of forward repeats in this region, which tend to mediate genome rearrangement. A large number of repeats in this region have been previously observed in passion fruit [48], leading to the rearrangement of plastid genomes. Our results suggested that this region is also highly variable in cactus, and that they probably also contribute to genomic recombination in the genus Selenicereus. The gene accD and ycf1 both are indispensable for plant adaptation and leaf development [49,50], and the high variability of nucleotide sequences of these two genes might be the result of environmental adaptation during evolution [51][52][53]. However, whether they cause physiological differences between Selenicereus and other cactus plants remains to be seen. On the whole, these "hot spots" of mutations could be used as resources for system biology analysis and identi cation of DNA barcodes in plants. Our results provide a wealth of genetic information for the identi cation of species for the development of new DNA barcodes in Selenicereus [54].
Phylogenomic analysis revealed a close relationship among Selenicereus species In this study, we have constructed the high resolution phylogenetic tree by using the 56 shared plastidial genes as datasets. The results show undisputed monophyly of the 4 Selenicereus species. However, it is worth noting that S. monacanthus, once classi ed as Hylocereus (synonym: H. lemairei), is more closely related to S. anthonyanus, the traditional Selenicereus species. Our results support the previous studies, namely the two genera were not separated, and Hylocereus was nested in Selenicereus [9][10][11][12]. However, considering the existence of interspeci c or even intergeneric hybridization for Selenicereus plants [55], it is one-side to perform phylogenetic inferences about species with hybridization origin based on organelle genomes, as organelles are matrilineal inheritance [56]. The combination of nuclear and organelle genes should be considered and used for phylogenetic inference in the future.

Conclusion
In this study, we reported the complete plastid genomes of four Selenicereus species. The plastid genomes of these four species were similar to those of other angiosperms with typical quadripartite structure. In general, the structural changes of in the four plastomes were interesting: 1) Large losses of ndh genes; 2) The losses of introns and exons leads to a signi cant decrease in the number of intron genes; 3) The IR region underwent a dramatic expansion and formed a previously unreported SC/IR border in the intron region of the atpF gene. Besides, we identi ed 6 hypervariable regions that could be used as potential DNA barcodes for the identi cation of Selenicereus species. Our study enriches the plastomic resources in the family Cactaceae, and provides the basis for the reconstruction of phylogenetic relationships.

Sampling, DNA extraction and sequencing
Fresh stems of the red-eshed pitaya (S. monacanthus) were collected from Yulin, Guangxi, China (22°94'N, 110°49'E). The fresh stems of the other three analyzed Selenicereus species were collected from the local ower market of Beibei, Chongqing, China (29°81'N, 106°40'E). They were identi ed by Professor Jie Yu. These species were cultivated for edible use or ornamental plants, and no permission is required to collect these samples. Our experimental research, including the collection of plant materials, complies with institutional, national or international guidelines. All the samples were deposited in the Herbarium of Southwest University, Chongqing, China (voucher code: YJ-swu002, YJ-swu027 ~ YJ-swu029). Total genomic DNA was extracted by using the CTAB method [57]. The DNA library with an insert size of 350 bp was constructed using a NEBNext® library construction kit and sequenced by using the HiSeq Xten PE150 sequencing platform. Sequencing produced a total of 6.04-6.85 Gb of raw data per species. Clean data were obtained by using Trimmomatic [58]: we removing the low-quality sequences with more than 5% bases being "N", and a quality value of Q < 19 accounted for more than 50% of the total base. The detailed sequencing data were showed in Additional le 1: Table S4. Genome Assembling And Annotation The de novo genome assembly from the clean data was accomplished utilizing GetOrganelle (v1.7.3) with a default setting, and NOVOPlasty (v3.8.1) [59] was continue used for those without assembled circular genome. The correctness of the assembly was con rmed by using Bowtie2 (v2. 0.1) [60] to manually edit and map all the raw reads to the assembled genome sequence under the default settings.
Detailed assembly information was shown in Additional le 1:    Sequence similarity of 4 Selenicereus species by using S.validus as a reference sequence and visualized in mVISTA. Different color markers represent different areas, the pink regions are conserved noncoding sequences, the purple regions are protein-coding sequences, the light blue regions are tRNA or rRNA and the gray arrows are the gene and its direction. The percentage of identity ranges from 50 to 100%, shown on the Y-axis.

Figure 4
The nucleotide diversity (Pi) of four Selenicereus plastomes. It was analyzed by using DnaSP with a sliding window analysis (window length: 500 bp, step size: 500 bp). The horizontal and vertical axes respectively represents the midpoint position of the window and the Pi value of each window. Pi values in one intergenic (trnF-GAA-rbcL, 0.05567) and two protein-coding genes (accD, 0.00667-0.167; ycf1, 0.004-0.059) were greater than 0.05.