Exploration of the Species-specic Dna Markers Based on the Complete Chloroplast Genome for Discriminating Curcuma Comosa Roxb. From Curcuma Latifolia Roscoe and Other Related Species

Members of the Curcuma genus are among the most commonly used rhizomatous herbs worldwide. There are two species of Curcuma referred to as “Wan Chak Motluk” in Thai, C. comosa Roxb. and C. latifolia Roscoe, and their herbal materials are often confused. C. comosa is widely used as a traditional herbal remedy for its phytoestrogenic activity, but its morphology is highly similar to that of C. latifolia, which contains a compound that causes hepatotoxicity. In this study, the complete chloroplast (cp) genomes of these species were determined for the rst time using Illumina sequencing. Our results showed that their cp genomes were 162,272 bp (C. comosa) and 162,289 bp (C. latifolia) in length. A total of 133 unique genes were identied, including 87 protein-coding genes, 38 tRNA genes and 8 rRNA genes. Comparative analyses with other species of Curcuma indicated high similarity in gene content and structural organization. The analyses also reveal variable hotspots in the genomes at ndhA, trnT-trnL, and ndhC-trnV that can serve as species-specic nucleotide barcodes. Indeed, mislabeling of these two species among samples sold at market was detected using these species-specic markers, indicating that cp genomes can provide more information for better elucidating and improving discriminatory power for species authentication.


Introduction
Because of the lack of adequate variations in DNA barcode sequences, a new method is needed to identify closely related plant species. Chloroplast genome sequencing is one such tool and has recently been shown to successfully discriminate closely related species in the genera Boesenbergia, Curcuma, Kaempferia, and Pyrgophyllum 25,26 . Therefore, this study has identi ed species-speci c DNA markers to allow the discrimination of C. comosa from other related species, including C. latifolia, based on the cp genome. The developed species-speci c DNA markers were then utilized for testing crude drugs sold in herbal markets. Additionally, this study provides reliable and high-quality chloroplast genome resources for future research in Zingiberaceae.
Codon usage. The relative synonymous codon usage (RSCU) of ten species belonging to ve different genera in the family Zingiberaceae (Table S1), including C. comosa and C. latifolia, was calculated. There was no bias in the usage of the start codons methionine (AUG) and tryptophan (UGG) (RSCU = 1). The 87 protein-coding genes contained approximately 28,393 codons. UUAencoded leucine had the highest RSCU, at approximately 1.94, and GCG-encoded alanine had the lowest RSCU, at approximately 0.39. All preferred synonymous codons with A or U at the third position showed higher bias (RSCU >1) than those with G or C (Table S2).
Highly variable sequences in noncoding regions of the cp genome of C. comosa and C. latifolia. To compare the sequence divergences of C. comosa and C. latifolia, the cp genome sequences of the 10 species in Zingiberaceae (Table S1) were included for comparison using the mVISTA program, and C. comosa was used as the reference. Overall, the coding regions were more conserved than the noncoding regions among the 10 species of Zingiberaceae; however, rpoC2, rpoB, ycf1, ycf2, and ndhF exhibited some degree of variation. The two IR regions were less divergent than the LSC and SSC regions. In contrast, high levels of divergence were found in the intergenic regions of trnK-rps16, rpoB-trnC, rps4-trnT, trnT-trnL, ndhC-trnV, and ndhF-rpl32 (Fig.  5). In addition to nucleotide divergence, the expansion and contraction of the border regions were also analyzed for the 10 species of Zingiberaceae ( Fig. 6; Table S1). The four junctions, LSC/IRa, LSC/IRb, SSC/IRa and SSC/IRb, were found to be almost the same (29,642 bp to 29,797 bp). The rpl22-rps19 genes were located at the boundary of the LSC/IRb region in each cp genome. The rpl22 gene was located on the left side of the LSC/IRb boundary, at a distance of 21 bp to 48 bp. The rps19 gene was located on the right side of the LSC/IRb boundary, at a distance of 129 bp to 148 bp. The ycf1-ndhF genes were located at the IRb/SSC boundary. The IRb/SSC junction was located in the ycf1 region and extended a length of 7 bp to 205 bp into the SSC region. The ndhF gene was located on the right side of the IRb/SSC boundary, at a distance of 8 bp to 218 bp. The SSC/IRa junctions in the cp genomes were embedded in the ycf1 genes, with the distance of 3,705 bp to 3,899 bp in the IRa region. The rps19-psbA genes were located at the boundary of the IRa/LSC region. The rps19 gene was located on the left side of the IRa/LSC boundary, at a distance of 129 bp to 148 bp, while psbA was located on the right side of the IRa/LSC boundary, at a distance of 109 bp to 125 bp.
ndhA, TrnT-trnL, and ndhC-trnV are DNA signature sites for the development of species-speci c markers. The cp genome sequences belonging to 33 species of Zingiberaceae (Table S5) were analyzed for species-speci c DNA markers. As expected, the sliding window analysis showed the most variation in the LSC and SSC regions but lower variation in the IR regions (Fig. 7). The average value of nucleotide diversity (Pi) was 0.0096 among the 33 Zingiberaceae species (Table S6). Mutational hotspots were found in 6 genes, rps16-trnQ, ycf1, ndhA, ndhI, ndhD, and RF19; these sites exhibited remarkable Pi values, higher than 0.03 (Fig.  7A). In addition, the average value of nucleotide diversity (Pi) among 20 species in Curcuma was 0.0018 (Table S6), and there were 3 mutational hotspots in rps16-trnQ, petN-psbM, and ndhA that exhibited Pi values higher than 0.01 (Fig. 7B). Additionally, there were 6 SNPs and 41 indels in the cp genomes of C. comosa and C. latifolia (Table S7). When comparing the cp genomes among the 10 selected species of Zingiberaceae (Table S1), SNP/indel variation sites were found in the ndhA, trnT-trnL, and ndhC-trnV regions, with 1 SNP, a 6 bp insertion, and a 2 bp deletion, respectively ( Figure S1).
Phylogeny construction with the cp genome sequences of C. comosa and C. latifolia. To examine the phylogenetic positions of the C. comosa and C. latifolia species and their relationships within Zingiberales (Table S10), neighbor-joining (NJ) phylogenetic analyses were performed using 40 cp genomes from 40 species belonging to 6 families of Zingiberales. In this analysis, six families in Zingiberales were divided into two clades with 100% bootstrap support (BS) values. One clade was composed of ve families, including Musaceae, Heliconiaceae, Strelitziaceae, Cannaceae, and Costaceae, while the other clade included only Zingiberaceae. The clade containing Zingiberaceae was divided into 2 groups. The rst group included four genera (Wurfbainia, Amomum, Lanxangia, and Alpinia) (BS =100%), and the second group included ve genera (Curcuma, Stahlianthus, Hedychium, Kaempferia, and Zingiber). The second group was further divided into 4 subgroups (BS = 72-100%). Subgroup II was the most complex, with 17 species, including the species of interest, C. comosa and C. latifolia, on the same branch as C. elata and C. aromatica (BS = 100%). (Fig. 8).

Discussion
The accurate discrimination of herbal material using only morphological characteristics is di cult. DNA barcoding has become a conventional technology used for such discrimination. Although DNA barcoding technology has been developed signi cantly, no barcode has yet achieved the goal of reliable identi cation of all plant species. In fact, the identi cation of species with close genetic relationships continues to pose great challenges 10 . In this study, DNA regions including the rbcL, matK, psbA-trnH spacer and ITS2 of two authentic C. comosa and C. latifolia were established; their accession numbers have been submitted to GenBank (Table S11). Unfortunately, there were no variations observed in any of these core barcode regions. This result is consistent with previous studies showing low variation in the nucleotide sequences of these core barcode regions within closely related species, including species within Curcuma 6 , Chrysanthemum 14 , Cymbidium 48 and Ligularia 7 . Therefore, we developed speci c markers for the discrimination of C. comosa from C. latifolia and other related species belonging to Zingiberaceae based on mutation hotspot regions identi ed in cp genomes. In previous reports, closely related plant species in Hedyotis 49 , Gentiana 51 , and Fritillaria 47 were successfully identi ed at the species level based on divergent sequences at hotspots in the cp genome. The cp genomes of many Curcuma spp. have been studied, and divergent hotspots have been proposed as authentication markers, 26 but these prior efforts excluded C. comosa and C. latifolia. In this research, no signi cant variation in the numbers of total genes, protein-coding genes, tRNAs or rRNAs was shown between C. comosa and C. latifolia. Similar ndings were observed in other Curcuma spp. 13,26 and other plant species in Zingiberaceae 8, 23,25 . Codon usage compares the frequencies of each three-nucleotide sequence that codes for a particular amino acid 4 . Codons are used in transmitting genetic information and serve as the building blocks of proteins 24 .
Codon usage is a factor that has shaped the evolution of cp genomes due to biases in mutation, and it varies across species 22 .
The results of the codon usage analysis showed that codon usage bias was low in the cp genomes of C. comosa and C. latifolia, indicating their similar evolutionary paths. Repeat structure plays an important role in genomic rearrangement, recombination and sequence divergence in plastomes 40 . However, we did not nd signi cant variations in repeat distribution, especially in Curcuma genera (Fig. 4). Despite the fact that the organization of the cp genome is highly similar among these Zingiberaceae species, we discovered several regions with interspeci c polymorphisms, mostly located within intergenic regions and at the LSC, SSC and IR regional boundaries of the cp genomes ( Fig. 5-7). Among the 20 Curcuma cp genomes analyzed, including C. comosa, C. latifolia and Curcuma spp. sequences retrieved from the NCBI database, three divergent hotspots, rps16-trnQ, petN-psbM and ndhA, were found. This nding agreed well with a previously published report on the cp genome of Curcuma spp. in Zingiberales 26 . Based on our alignment of the cp genome, C. comosa has one unique sequence at ndhA. However, the sequences of rps16-trnQ and petN-psbM could not discriminate C. comosa from C. latifolia. Therefore, SSRs, indels and SNPs were examined for use as molecular markers. After searching, indel-variable loci at trnT-trnL and ndhC-trnV were discovered and used for differentiation of C. comosa from other related species, including C. latifolia. In our study, three regions, ndhA, trnT-trnL, and ndhC-trnV, were successfully de ned as species-speci c DNA markers for C. comosa. In previous publications, these same three regions were recommended for the identi cation of plants in the genera Dendrobium 50 , Actaea 31 , and Entandrophragma 28 . Then, nineteen crude drug samples claimed to be "Wan Chak Motluk" (C. comosa or C. latifolia) purchased from different herbal markets (Table S8) were examined using our developed species-speci c DNA markers, including ndhA, trnT-trnL, and ndhC-trnV regions ( Figure S1; Table S9). Surprisingly, only 5 out of 19 samples were con rmed as C. comosa, while 11 samples were identi ed as C. latifolia. Three samples were neither C. comosa nor C. latifolia. Our crude drug identi cation indicated that these two species are widely mislabeled in herbal markets. Thus, the Thai Food and Drug Administration (Thai FDA) must become involved and be noti ed about the quality of herbal materials as well as consumer safety. Therefore, the species-speci c DNA marker developed through cp genome sequencing could be used as one tool for authenticating the origin of these plant species. This marker can help to resolve the limitations of core DNA barcoding in discriminating closely related plant species. With the genetic variation discovered, we also analyzed the phylogenetic relationships of the 20 species of Curcuma and other species of Zingiberales using cp genomes. The phylogenetic analysis revealed that all of the species of Curcuma formed a complex monophyletic clade with high bootstrap support values. As expected, C. comosa and C. latifolia exhibited a very close relationship in our phylogenetic analysis.

Conclusion
Cp genome sequences represent a valuable genomics resource for exploring diversity in any plant family. To the best of our knowledge, this is the rst study of the cp genome sequences of C. comosa and C. latifolia, plants in Zingiberaceae. Within these cp genomes, species-speci c nucleotide sequences that can be used as DNA markers for discriminating C. comosa from other related species, including C. latifolia, were discovered in the ndhA, trnT-trnL, and ndhC-trnV regions. Species-speci c DNA markers can be used for species authentication when the origin or morphological characteristics of plants or herbal materials are not available or su cient for evaluation. The constructed phylogenetic tree of C. comosa and C. latifolia and related species provides a deeper understanding of the relationships among plants in the genus Curcuma and the family Zingiberaceae and can be exploited as a reliable resource for further Zingiberaceae research. To ensure the safe and effective use of herbal or raw materials, species identi cation and quality control steps should be performed before their release onto the markets.

Materials And Methods
Plant materials and crude drugs. Two cultivated authentic species, C. comosa Roxb. and C. latifolia Roscoe, were collected from Zingiberaceae Collection, Sireeruckhachat Nature Learning Park, Nakhon Pathom, Thailand and identi ed by taxonomist Dr. Bhanubong Bongcheewin at Mahidol University. Those collections are permitted and legal. Voucher specimens of C. comosa and C. latifolia were deposited at Sireeruckhachat Nature Learning Park and assigned as PBM005645 and PBM005639, respectively. All the experiments were performed in accordance with relevant guidelines and regulations.
Nineteen crude drug samples claimed to be C. comosa and C. latifolia were purchased from various herbal markets. They were purchased in different forms, as dried herbs and powders. Sample ID of 19 crude drugs in our collection was provided in Table S8.
Chloroplast genome sequencing. Intact chloroplasts were isolated from 10 g of fresh leaves of C. comosa and C. latifolia using 40% (v/v) Percoll solution and a Chloroplast Isolation Kit according to the manufacturer's protocol (Sigma-Aldrich, USA). Chloroplast genome assembly and annotation. The raw Illumina paired-end whole cp genome data of C. comosa and C. latifolia were trimmed and assembled into contigs using FASTP 7 and GetOrganelle 15 , respectively. All paired-end reads were mapped to the assembled cp genomes with over 28 coverage. The automatic annotator GeSeq was used to annotate the cp genome with BLAST searches against the cp genomes of sibling species 42 . The online program OGDRAW 12 was used to draw the circular cp genome map and then manually edited. The nal assembly was submitted to GenBank (https://www.ncbi.nlm.nih.gov) with corresponding accession numbers.
Codon usage analysis. The cp genomes of C. comosa and C. latifolia were analyzed for relative synonymous codon usage (RSCU) in protein-coding genes using Mega-X software 20 .
Repeat element analysis. The cp genome sequences (Table S1) belonging to plants of ve different genera in the family Zingiberaceae, Curcuma, Zingiber, Hedychium, Kaempferia, and Amomum, were retrieved from the NCBI database for repeat element analysis along with our authentic C. comosa and C. latifolia cp sequences. The MIcroSAtellite (MISA) 1 program was used for simple sequence repeat (SSR) analysis with the minimum number of repeats set to 10, 5, 4, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta-and hexa-nucleotides, respectively. REPuter software 21 was used to detect the size and location of long repeat sequences, including forward, reverse, complement and palindromic repeat units. Parameters were set as a Hamming distance of 3, maximum computed repeats of 150 and minimal repeat size of 30.
Sequence divergence analysis. The cp genomes (Table S1) were compared and analyzed for variable regions by using the mVISTA tool in Shu e-LAGAN mode 11 . The annotated cp genome of C. comosa was used as the reference. Boundaries between the LSC, SSC and IR regions were manually de ned to detect the variation in gene rearrangement between regions. The cp genomes (Table S5) belonging to plants of nine different genera in the family Zingiberaceae, Curcuma, Zingiber, Hedychium, Kaempferia, Amomum, Alpinia, Wurfbainia, Stahlianthus and Lanxangia, were retrieved from the NCBI database for sequence divergence analysis. DnaSP v.6 35 was used to calculate nucleotide variability (Pi). The parameters for sliding window analysis were a 600 bp window length and 200 bp step size.
SNP and indel detection. The cp genomes (Table S1) were aligned using MAFFT v.7.0. 17 , and the sequences were edited using Mega-X software 20 . The SNP/indel was detected by using DnaSP v.6 35 . The cp genome of C. comosa was used as the reference.
Species-speci c DNA markers for discrimination of C. comosa from C. latifolia and other related species. Three variable sequence regions, ndhA, trnT-trnL, and ndhC-trnV, from the twenty Curcuma spp. were aligned using MAFFT v.7.0. 17 (Table S5). The speci c primers were designed based on the conserved sequence encompassing the variable sites of C. comosa and C. latifolia ( Figure  S1).
Testing the authenticity of crude drugs. The gDNA of nineteen samples (Table S8) was ampli ed by PCR with our speci c primers (Table S9). PCR was performed in a volume of 12.5 µL containing 20-50 ng template gDNA, 1 PCR buffer, 3 mM MgCl 2 , 0.2 mM dNTP mix, 0.5 µM of each primer and 0.5 U Platinum Taq DNA polymerase (Invitrogen, USA). The cycling conditions consisted of predenaturation at 94°C for 3 min followed by 30 cycles of denaturation at 94°C for 30 sec, annealing at 55°C (trnT-trnL and ndhA) or 60°C (ndhC-trnV) for 30 sec and extension at 72°C for 20 sec (trnT-trnL) or 25 sec (ndhA and ndhC-trnV) followed by a nal extension at 72°C for 5 min. The PCR products were puri ed and sequenced by Macrogen, South Korea. Sequences from amplicons were BLASTed against our authentic ndhA, trnT-trnL, and ndhC-trnV sequences from C. comosa and C. latifolia in MAFFT v7.0. 17 .
Phylogenetic analysis. The cp genomes (Table S10) belonging to plants of six different families in Zingiberales, Musaceae, Heliconiaceae, Strelitziaceae, Cannaceae, Costaceae and Zingiberaceae, were used for phylogenetic analysis. Typha latifolia L., an angiosperm, was used as an outgroup. The cp genome sequences were aligned using MAFFT v7.0. 17 . Subsequently, a neighborjoining (NJ) tree was constructed, and the relative support for the branches of the NJ tree was assessed via 1,000 bootstrap replicates.   Sequence alignment of 10 Zingiberaceae cp genomes using mVISTA. The C. comosa cp genome was used as a reference. Gray arrows and thick black lines above the alignment indicate the gene orientation. Purple bars represent exons, sky-blue bars represent transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), and red bars represent noncoding sequences (NCSs). The horizontal axis indicates the coordinates within the cp genome. The vertical scale represents the percent identity, ranging from 50 to 100%. White peaks represent regions with sequence variation among the 10 species.