Chloroplast genome organization and features of C. comosa and C. latifolia. Illumina high-throughput sequencing was used to obtain the cp genome sequences of C. comosa and C. latifolia. The C. comosa and C. latifolia cp genomes were 162,272 bp and 162,289 bp in length, respectively (Fig. 2). Both cp genomes had the typical quadripartite structure, consisting of a large single-copy (LSC) region (87,074 bp in C. comosa and 87,089 bp in C. latifolia), a small single-copy (SSC) region (15,698 bp in C. comosa and 15,700 bp in C. latifolia), and two inverted repeat (IR) regions of 29,750 bp each in both species. The GC contents of the two cp genomes were identical, at 34% (LSC), 29.7% (SSC), 41.2% (IR), and 36.2% (total). In addition, the cp genomes of the two species were predicted to encode 133 genes, including 87 protein-coding genes, 38 tRNA genes and 8 rRNA genes (Table 1). These genes were classified into 4 major groups according to their functions: self-replication (4 ribosomal RNA genes, 30 transfer RNA genes, 12 small ribosomal subunit genes, 9 large ribosomal subunit genes, and 4 DNA-dependent RNA polymerase genes), photosynthesis (5 photosystem I genes, 14 photosystem II genes, 6 cytochrome b/f complex genes, 6 ATP synthase genes, 1 ATP-dependent protease gene, 1 Rubisco large subunit gene, and 11 NADH dehydrogenase genes), other (maturase, envelope membrane protein, acetyl-CoA-carboxylase, c-type cytochrome synthesis, and translation initiation factor), and unknown (4 genes) (Table 2). Introns were found in 18 genes. Of these, 13 genes (rps16, rpoC1, atpF, petB, petD, rps12, rpl16, ycf3, clpP, trnK-UUU, trnL-UAA, trnV-UAC, and trnG-UCC) were in the LSC region, 1 gene (ndhA) was in the SSC region, and 4 genes (ndhB, rpl2, trnI-GAU, and trnA-UGC) were in the IR regions (Fig. 2; Table 2).
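As a consistency check on the figures reported above, the total length of a quadripartite cp genome should equal LSC + SSC + 2 × IR, and the three gene classes should sum to the predicted gene total. The following is a minimal sketch using only the values stated in the text; the dictionary and function names are illustrative and are not part of the original analysis.

```python
# Region lengths (bp) and totals as reported in the text.
regions = {
    "C. comosa": {"LSC": 87_074, "SSC": 15_698, "IR": 29_750},
    "C. latifolia": {"LSC": 87_089, "SSC": 15_700, "IR": 29_750},
}
reported_total = {"C. comosa": 162_272, "C. latifolia": 162_289}

# Gene classes reported for both species.
gene_classes = {"protein-coding": 87, "tRNA": 38, "rRNA": 8}


def genome_length(parts):
    """Length of a quadripartite cp genome: LSC + SSC + two IR copies."""
    return parts["LSC"] + parts["SSC"] + 2 * parts["IR"]


for species, parts in regions.items():
    total = genome_length(parts)
    status = "consistent" if total == reported_total[species] else "MISMATCH"
    print(f"{species}: {total} bp ({status})")

print("Predicted genes:", sum(gene_classes.values()))
```

Both region sums agree with the reported genome lengths, and the gene classes sum to 133.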
Codon usage. The relative synonymous codon usage (RSCU) of ten species belonging to five different genera in the family Zingiberaceae (Table S1), including C. comosa and C. latifolia, was calculated. Methionine (AUG, the start codon) and tryptophan (UGG), each of which is encoded by a single codon, showed no usage bias (RSCU = 1). The 87 protein-coding genes contained approximately 28,393 codons. UUA-encoded leucine had the highest RSCU, at approximately 1.94, and GCG-encoded alanine had the lowest RSCU, at approximately 0.39. Preferred synonymous codons with A or U at the third position showed stronger bias (RSCU > 1) than those with G or C (Table S2).
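For reference, the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so single-codon amino acids such as Met and Trp always yield RSCU = 1. The sketch below illustrates this calculation under stated assumptions: it uses the standard codon assignments, takes in-frame coding sequences as plain strings, and is not the software used in this study. The function name `rscu` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard codon assignments, in TCAG order (assumed here for illustration).
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for in-frame coding sequences (stop codons ignored)."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group sense codons into synonymous families.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # equal usage of all synonyms
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy example: Met, Leu (TTA x2, TTG x1), Ala (GCT, GCG), Trp.
toy = rscu(["ATGTTATTATTGGCTGCGTGG"])
for codon in ("ATG", "TGG", "TTA", "TTG", "GCT"):
    print(codon, toy[codon])
```

Values above 1 indicate codons used more often than expected under equal usage, and values below 1 indicate codons used less often.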
Repeat structure analysis. The cp genome sequences of the plants in Zingiberaceae (Table S1) were retrieved for simple sequence repeat (SSR) and long repeat analysis using MISA software and the REPuter program, respectively. A total of 78-121 SSRs were found in the cp genomes of the ten species (Fig. 3; Table S3). Among the different types of SSRs, mononucleotide repeats were the most abundant, accounting for 27-58 loci, followed by dinucleotide (32-34 loci), tetranucleotide (17-21 loci), trinucleotide (3-8 loci), pentanucleotide (1-4 loci), and hexanucleotide (0-2 loci) repeats (Fig. 3A; Table S3). Mononucleotide SSRs were especially rich in A/T repeats (239-280 loci) (Fig. 3B; Table S3). The SSRs were mainly distributed in the LSC regions (51-79 loci), while only a small portion were located in the SSC regions (13-22 loci) and IR regions (5-8 loci) (Fig. 3C; Table S3). The long repeat analysis identified a total of 39-79 long repeat sequence types (Fig. 4; Table S4). Among the different types of long repeats, forward repeats (9-28 loci) were the most abundant, followed by palindromic (8-28 loci), reverse (4-16 loci), and complement (1-10 loci) repeats (Fig. 4A; Table S4). Repeat lengths of 30-39 bp were the most abundant among the ten cp genomes used in this study (Fig. 4B; Table S4).
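To illustrate how perfect SSRs of each motif length can be located in a sequence, the following is a simplified sketch and not a reimplementation of MISA. The minimum repeat counts in `MIN_REPEATS` are assumed values chosen for illustration (the thresholds used in this study are not stated in this section), compound SSRs are not handled, and all function names are hypothetical.

```python
import re

# ASSUMED minimum repeat counts per motif length (mono- to hexanucleotide).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return perfect SSRs as dicts with 0-based start, end (exclusive), unit, repeats."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # The lookahead lets every position be tried as a candidate start.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit_len, min_rep - 1))
        last_end = 0
        for m in pattern.finditer(seq):
            start, end, unit = m.start(1), m.end(1), m.group(2)
            if start < last_end or not is_primitive(unit):
                continue  # inside a reported tract, or a shorter motif in disguise
            hits.append({"start": start, "end": end, "unit": unit,
                         "repeats": (end - start) // unit_len})
            last_end = end
    return sorted(hits, key=lambda h: h["start"])


# Toy sequence containing one A-homopolymer and one AT dinucleotide tract.
example = "GC" + "A" * 12 + "CGC" + "AT" * 6 + "GG"
for hit in find_ssrs(example):
    print(hit)
```

Counting the returned motifs by length and by the genome region in which they fall would give tallies comparable in form to those summarized in Fig. 3, although exact counts depend on the thresholds chosen.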
Highly variable sequences in noncoding regions of the cp genome of C. comosa and C. latifolia. To assess the sequence divergence of C. comosa and C. latifolia, the cp genome sequences of the 10 species in Zingiberaceae (Table S1) were compared using the mVISTA program, with C. comosa as the reference. Overall, the coding regions were more conserved than the noncoding regions among the 10 species of Zingiberaceae; however, rpoC2, rpoB, ycf1, ycf2, and ndhF exhibited some degree of variation. The two IR regions were less divergent than the LSC and SSC regions. In contrast, high levels of divergence were found in the intergenic regions of trnK-rps16, rpoB-trnC, rps4-trnT, trnT-trnL, ndhC-trnV, and ndhF-rpl32 (Fig. 5). In addition to nucleotide divergence, the expansion and contraction of the border regions were also analyzed for the 10 species of Zingiberaceae (Fig. 6; Table S1). The four junctions, LSC/IRa, LSC/IRb, SSC/IRa and SSC/IRb, were found to be almost the same across the species (29,642 bp to 29,797 bp). The rpl22 and rps19 genes flanked the LSC/IRb boundary in each cp genome. The rpl22 gene was located on the left side of the LSC/IRb boundary, at a distance of 21 bp to 48 bp. The rps19 gene was located on the right side of the LSC/IRb boundary, at a distance of 129 bp to 148 bp. The ycf1 and ndhF genes were located at the IRb/SSC boundary. The IRb/SSC junction was located within ycf1, which extended 7 bp to 205 bp into the SSC region. The ndhF gene was located on the right side of the IRb/SSC boundary, at a distance of 8 bp to 218 bp. The SSC/IRa junctions in the cp genomes were embedded in the ycf1 genes, which extended 3,705 bp to 3,899 bp into the IRa region. The rps19 and psbA genes flanked the IRa/LSC boundary. The rps19 gene was located on the left side of the IRa/LSC boundary, at a distance of 129 bp to 148 bp, while psbA was located on the right side of the IRa/LSC boundary, at a distance of 109 bp to 125 bp.
ndhA, trnT-trnL, and ndhC-trnV are DNA signature sites for the development of species-specific markers. The cp genome sequences of 33 species of Zingiberaceae (Table S5) were analyzed to identify species-specific DNA markers. As expected, the sliding window analysis showed the most variation in the LSC and SSC regions and lower variation in the IR regions (Fig. 7). The average nucleotide diversity (Pi) was 0.0096 among the 33 Zingiberaceae species (Table S6). Mutational hotspots were found in 6 regions, rps16-trnQ, ycf1, ndhA, ndhI, ndhD, and RF19; these sites exhibited Pi values higher than 0.03 (Fig. 7A). In addition, the average nucleotide diversity (Pi) among 20 species in Curcuma was 0.0018 (Table S6), and there were 3 mutational hotspots, in rps16-trnQ, petN-psbM, and ndhA, that exhibited Pi values higher than 0.01 (Fig. 7B). Additionally, there were 6 SNPs and 41 indels between the cp genomes of C. comosa and C. latifolia (Table S7). When comparing the cp genomes among the 10 selected species of Zingiberaceae (Table S1), SNP/indel variation sites were found in the ndhA, trnT-trnL, and ndhC-trnV regions, with 1 SNP, a 6 bp insertion, and a 2 bp deletion, respectively (Figure S1).
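Nucleotide diversity (Pi) is the average number of nucleotide differences per site between all pairs of aligned sequences, and a sliding window reports this value along the alignment so that hotspots stand out. The following is a minimal sketch of that calculation, not the software used in this study. It assumes pre-aligned sequences of equal length and excludes any column containing a gap or ambiguous base; the default window and step sizes are illustrative assumptions, since this section does not state the values used.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Pi: mean pairwise differences per site, ignoring columns with gaps/ambiguity."""
    seqs = [s.upper() for s in aligned]
    n = len(seqs)
    columns = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    if n < 2 or not columns:
        return 0.0
    n_pairs = n * (n - 1) // 2
    diffs = sum(a != b for col in columns for a, b in combinations(col, 2))
    return diffs / (n_pairs * len(columns))


def sliding_pi(aligned, window=600, step=200):
    """Return (window midpoint, Pi) pairs; window and step are ASSUMED defaults."""
    length = len(aligned[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


# Toy alignment of three sequences with two variable sites.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
print(nucleotide_diversity(toy))
print(sliding_pi(toy, window=4, step=4))
```

Windows whose Pi exceeds a chosen cutoff, such as the 0.03 and 0.01 thresholds cited above, would then be flagged as candidate mutational hotspots.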
Validation of the species-specific DNA markers in crude “Wan Chak Motluk” sold at market. We analyzed 19 crude drug samples sold as “Wan Chak Motluk”, comprising 14 samples labeled as C. comosa and 5 samples labeled as C. latifolia, designated CD-01 to CD-19 (Table S8). All samples were purchased from various herbal markets. PCR amplification of ndhA, trnT-trnL, and ndhC-trnV with our developed species-specific primers yielded products of 330 bp, 264 bp, and 370 bp in length, respectively (Table S9). Of the 14 samples labeled as C. comosa, 5 samples (CD-01, CD-07, CD-16, CD-18, and CD-19) were confirmed as C. comosa, 8 samples (CD-02, CD-03, CD-04, CD-05, CD-06, CD-08, CD-09 and CD-17) were identified as C. latifolia, and 1 sample (CD-10) was neither C. comosa nor C. latifolia (Table 3). Of the 5 samples labeled as C. latifolia (CD-11, CD-12, CD-13, CD-14, and CD-15), only 3 samples (CD-12, CD-13, and CD-14) were confirmed as C. latifolia, and the remaining 2 samples (CD-11 and CD-15) were neither C. latifolia nor C. comosa (Table 3).
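The authentication outcomes above can be cross-tabulated by vendor label and marker-based identification. The sketch below simply encodes the sample assignments listed in the text (the helper `ids` and the variable names are illustrative) and tallies how many samples matched their label.

```python
from collections import Counter

COMOSA, LATIFOLIA, NEITHER = "C. comosa", "C. latifolia", "neither"


def ids(*numbers):
    """Format sample numbers as CD-01 ... CD-19."""
    return [f"CD-{n:02d}" for n in numbers]


# Vendor labels, as stated in the text.
labeled = {s: COMOSA for s in ids(*range(1, 11), 16, 17, 18, 19)}
labeled.update({s: LATIFOLIA for s in ids(11, 12, 13, 14, 15)})

# Marker-based identifications, as stated in the text.
identified = {s: COMOSA for s in ids(1, 7, 16, 18, 19)}
identified.update({s: LATIFOLIA for s in ids(2, 3, 4, 5, 6, 8, 9, 17, 12, 13, 14)})
identified.update({s: NEITHER for s in ids(10, 11, 15)})

# Cross-tabulate (label, identification) pairs and count label matches.
tally = Counter((labeled[s], identified[s]) for s in sorted(labeled))
matches = sum(labeled[s] == identified[s] for s in labeled)

for (label, result), count in sorted(tally.items()):
    print(f"labeled {label:13s} -> identified {result:13s}: {count}")
print(f"{matches} of {len(labeled)} samples matched their label")
```

By this tally, 8 of the 19 samples matched their label.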
Phylogeny construction with the cp genome sequences of C. comosa and C. latifolia. To examine the phylogenetic positions of C. comosa and C. latifolia and their relationships within Zingiberales (Table S10), neighbor-joining (NJ) phylogenetic analyses were performed using 40 cp genomes from 40 species belonging to 6 families of Zingiberales. In this analysis, the six families in Zingiberales were divided into two clades with 100% bootstrap support (BS) values. One clade was composed of five families (Musaceae, Heliconiaceae, Strelitziaceae, Cannaceae, and Costaceae), while the other clade included only Zingiberaceae. The clade containing Zingiberaceae was divided into 2 groups. The first group included four genera (Wurfbainia, Amomum, Lanxangia, and Alpinia) (BS = 100%), and the second group included five genera (Curcuma, Stahlianthus, Hedychium, Kaempferia, and Zingiber). The second group was further divided into 4 subgroups (BS = 72-100%). Subgroup II was the most complex, with 17 species, including the species of interest, C. comosa and C. latifolia, which were on the same branch as C. elata and C. aromatica (BS = 100%; Fig. 8).