3.1 Chloroplast genome features of EC
After sequencing, assembly, and annotation, we obtained the complete chloroplast (cp) genome of EC. The cpDNA of EC is 178,808 bp in total length. Like most plant cp genomes, it is divided into four regions (Table 1). The large single-copy (LSC) region is the longest, at 88,300 bp, and the small single-copy (SSC) region is the shortest, at only 148 bp (Fig. 1). The two inverted repeats (IRa and IRb) are each 45,180 bp long.
The cp genome of EC contains 99 protein-coding genes (PCGs), 37 transfer RNA (tRNA) genes, and 8 ribosomal RNA (rRNA) genes. Among them, 4 rRNAs (rrn16s, rrn23s, rrn4.5s, and rrn5s), 6 tRNAs (trnL-CAA, trnV-GAC, trnA-UGC, trnR-ACG, trnN-GUU, and trnL-UAG), and 19 protein-coding genes (ndhA, ndhB, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, psaC, rpl2, rpl23, rpl32, rps12, rps15, rps7, ccsA, ycf1, ycf15, and ycf2) are present in two copies, trnE-UUC is present in three copies, and trnM-CAU is present in four copies (Table 2). Regarding gene structure, 21 genes contain a single intron (Table 3), including 7 tRNA genes (trnK-UUU, trnS-CGA, trnL-UAA, trnE-UUC×2, and trnA-UGC×2) and 14 protein-coding genes (rps16, atpF, rpoC1, petB, petD, rpoA, rpl16, rpl22, rpl2×2, ndhB×2, and ndhA×2), and one protein-coding gene (ycf3) contains two introns.
Table 1
Start and end positions of the four regions of the EC cp genome
Species | Region | Start (bp) | End (bp) |
Eomecon chionantha | IR | 88,301 | 133,480 |
Eomecon chionantha | IR | 133,629 | 178,808 |
Eomecon chionantha | SSC | 133,481 | 133,628 |
Eomecon chionantha | LSC | 1 | 88,300 |
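The region lengths quoted in the text follow directly from the coordinates in Table 1. The short Python sketch below is only an illustrative check, assuming the coordinates are 1-based and inclusive; the variable and function names are ours and are not part of the original analysis.

```python
# Region coordinates (1-based, inclusive) as listed in Table 1, ordered by position.
regions = [
    ("LSC", 1, 88_300),
    ("IR", 88_301, 133_480),
    ("SSC", 133_481, 133_628),
    ("IR", 133_629, 178_808),
]


def region_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


lengths = [(name, region_length(start, end)) for name, start, end in regions]
total = sum(length for _, length in lengths)

# Consecutive regions should tile the genome without gaps or overlaps.
contiguous = all(
    regions[i][2] + 1 == regions[i + 1][1] for i in range(len(regions) - 1)
)

for name, length in lengths:
    print(f"{name}: {length:,} bp")
print(f"total: {total:,} bp, contiguous: {contiguous}")
```

Under these assumptions the four regions sum to the reported genome length of 178,808 bp.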
Table 2
Gene composition of the EC chloroplast genome
Category of genes | Group of genes | Name of genes |
RNA | rRNA | rrn16s×2, rrn23s×2, rrn4.5s×2, rrn5s×2 |
| tRNA | trnH-GUG, trnK-UUU, trnQ-UUG, trnS-GCU, trnS-CGA, trnR-UCU, trnC-GCA, trnD-GUC, trnY-GUA, trnT-GGU, trnS-UGA, trnG-GCC, trnS-GGA, trnT-UGU, trnL-UAA, trnF-GAA, trnW-CCA, trnP-UGG, trnL-CAA×2, trnV-GAC×2, trnA-UGC×2, trnR-ACG×2, trnN-GUU×2, trnL-UAG×2, trnE-UUC×3, trnM-CAU×4 |
Photosynthesis | Subunits of ATP synthase | atpA, atpB, atpE, atpF, atpH, atpI |
| Subunits of photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ, ycf3 |
| Subunits of NADH-dehydrogenase | ndhA×2, ndhB×2, ndhC, ndhD×2, ndhE×2, ndhF×2, ndhG×2, ndhH×2, ndhI×2, ndhJ, ndhK |
| Subunits of cytochrome b/f complex | petA, petB, petD, petG, petL, petN |
| Subunits of photosystem I | psaA, psaB, psaC×2, psaI, psaJ |
| Subunit of rubisco | rbcL |
Self replication | Large subunit of ribosome | rpl14, rpl16, rpl2×2, rpl20, rpl22, rpl23×2, rpl32×2, rpl33, rpl36 |
| DNA-dependent RNA polymerase | rpoA, rpoB, rpoC1, rpoC2 |
| Small subunit of ribosome | rps11, rps12×2, rps14, rps15×2, rps16, rps18, rps19, rps2, rps3, rps4, rps7×2, rps8 |
Other genes | Subunit of acetyl-CoA carboxylase | accD |
| c-type cytochrome synthesis gene | ccsA×2 |
| Envelope membrane protein | cemA |
| Protease | clpP |
| Translational initiation factor | infA |
| Maturase | matK |
Unknown | Conserved open reading frames | ycf1×2, ycf15×2, ycf2×2, ycf4 |
Table 3
Genes with introns in the EC chloroplast genome and the lengths (bp) of their exons and introns.
Gene | Strand | Start | End | Exon I | Intron I | Exon II | Intron II | Exon III |
trnK-UUU | - | 1,932 | 4,471 | 37 | 2,468 | 35 | | |
rps16 | - | 5,520 | 6,624 | 40 | 840 | 225 | | |
trnS-CGA | + | 10,755 | 11,522 | 31 | 675 | 62 | | |
atpF | - | 13,432 | 14,728 | 145 | 742 | 410 | | |
rpoC1 | - | 22,693 | 25,507 | 453 | 754 | 1,608 | | |
ycf3 | - | 45,437 | 47,386 | 124 | 690 | 230 | 753 | 153 |
trnL-UAA | + | 50,394 | 50,984 | 35 | 506 | 50 | | |
petB | + | 78,844 | 80,274 | 6 | 783 | 642 | | |
petD | + | 80,468 | 81,692 | 8 | 721 | 496 | | |
rpoA | - | 81,916 | 82,981 | 939 | 40 | 87 | | |
rpl16 | - | 85,248 | 86,652 | 9 | 997 | 399 | | |
rpl22 | - | 87,456 | 88,034 | 431 | 57 | 91 | | |
rpl2 | - | 88,439 | 89,919 | 385 | 662 | 434 | | |
ndhB | - | 98,609 | 100,841 | 775 | 700 | 758 | | |
trnE-UUC | + | 106,423 | 107,434 | 32 | 940 | 40 | | |
trnA-UGC | + | 107,499 | 108,372 | 37 | 801 | 36 | | |
ndhA | + | 120,586 | 122,683 | 553 | 1,006 | 539 | | |
ndhA | - | 144,426 | 146,523 | 553 | 1,006 | 539 | | |
trnA-UGC | - | 158,737 | 159,610 | 37 | 801 | 36 | | |
trnE-UUC | - | 159,675 | 160,686 | 32 | 940 | 40 | | |
ndhB | + | 166,268 | 168,500 | 775 | 700 | 758 | | |
rpl2 | + | 177,190 | 178,670 | 385 | 662 | 434 | | |
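Each row of Table 3 can be checked for internal consistency: with 1-based, inclusive coordinates, the span of a gene should equal the sum of its exon and intron lengths. The sketch below illustrates this check on a few rows copied from the table, reading the rps16 end coordinate as 6,624; the helper name `span_matches` is hypothetical.

```python
# A few rows copied from Table 3:
# (gene, start, end, [exon and intron lengths in the order listed]).
rows = [
    ("trnK-UUU", 1_932, 4_471, [37, 2_468, 35]),
    ("rps16", 5_520, 6_624, [40, 840, 225]),
    ("ycf3", 45_437, 47_386, [124, 690, 230, 753, 153]),
    ("ndhA", 120_586, 122_683, [553, 1_006, 539]),
]


def span_matches(start: int, end: int, parts: list[int]) -> bool:
    """True if the inclusive span equals the summed exon and intron lengths."""
    return end - start + 1 == sum(parts)


# Genes whose coordinates disagree with their exon/intron lengths.
mismatches = [gene for gene, s, e, parts in rows if not span_matches(s, e, parts)]
print(mismatches)
```

An empty list indicates that the listed coordinates and segment lengths agree for the rows tested.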
3.2 Repetitive sequence and codon usage analysis
A total of 54 SSRs were detected in the cp genome of EC. Dinucleotide and trinucleotide repeats were each represented by a single SSR, whereas mononucleotide (A/T) repeats were the most abundant (Luo et al. 2023; Nemati et al. 2023; Pei et al. 2023; Savoia et al. 2023). Among the longer repeat sequences, EC had only three types: palindromic (P), forward (F), and reverse (R); no complement (C) repeats were detected. Forward repeats were the most numerous (25), followed by palindromic (24) and reverse (7) repeats (Fig. 2).
Using CodonW, we obtained the codon counts and relative synonymous codon usage (RSCU) values for the 20 amino acids in EC. RSCU is the ratio of the observed frequency of a codon to the frequency expected if all synonymous codons of that amino acid were used equally (Fig. 3), so we used it to describe the codon usage pattern. An RSCU value greater than 1 means that a codon is used more often than expected, that is, it is preferred over its synonymous codons. Analysis of the RSCU values of EC (Li et al. 2023b; Tang et al. 2022; Zhang et al. 2022b) showed that three amino acids, Leu, Ser, and Arg, have the highest summed RSCU value of 6, because each is encoded by six codons. In Leu, three codons have RSCU > 1 (UUA, UUG, and CUU); in Ser, three codons have RSCU > 1 (UCU, UCC, and UCA); and in Arg, four codons have RSCU > 1 (AGA, AGG, GGA, and GGG). EC contains 59,602 codons in its protein-coding genes, of which the three most frequent are Phe-UUU (2,300), Lys-AAA (2,179), and Asn-AAU (2,109).
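To make the idea of SSR detection concrete, the sketch below scans a sequence for perfect mono-, di-, and trinucleotide repeats with regular expressions. It is a simplified illustration and not the tool used in this study; the minimum repeat thresholds in `MIN_REPEATS` and the toy sequence are hypothetical, and a dedicated SSR program handles overlapping and compound repeats more carefully.

```python
import re

# Hypothetical minimum repeat counts per motif length (not taken from this study).
MIN_REPEATS = {1: 10, 2: 5, 3: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (1-based start, motif, repeat count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are only a single base repeated (e.g. "AA", "TTT");
            # those runs are reported as mononucleotide SSRs instead.
            if unit_len > 1 and len(set(motif)) == 1:
                continue
            hits.append((match.start() + 1, motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence for illustration only.
toy_seq = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "GG" + "TTC" * 4 + "G"
ssrs = find_ssrs(toy_seq)
print(ssrs)
```

Counting the hits by motif length then gives the kind of mono-, di-, and trinucleotide tallies reported here.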
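The RSCU of a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid, so the RSCU values of an amino acid with six codons sum to 6. The sketch below illustrates the calculation for two codon families only; the counts in `toy_counts` are hypothetical and are not EC data, and CodonW's own output should be treated as authoritative.

```python
from collections import Counter

# Synonymous codon families for two amino acids (standard genetic code);
# a full analysis would include every amino acid.
FAMILIES = {
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Phe": ["UUU", "UUC"],
}


def rscu(counts, families=FAMILIES):
    """Return {codon: RSCU}, i.e. observed count / mean count of its synonymous family."""
    result = {}
    for codons in families.values():
        total = sum(counts.get(codon, 0) for codon in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        expected = total / len(codons)
        for codon in codons:
            result[codon] = counts.get(codon, 0) / expected
    return result


# Hypothetical codon counts, for illustration only.
toy_counts = Counter(
    {"UUA": 30, "UUG": 20, "CUU": 20, "CUC": 5, "CUA": 10, "CUG": 5,
     "UUU": 30, "UUC": 10}
)
values = rscu(toy_counts)
preferred = sorted(codon for codon, value in values.items() if value > 1)
print(preferred)
```

Codons with RSCU > 1 in the output are the preferred codons of their family.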
3.3 Nucleotide polymorphism
We examined nucleotide polymorphism between EC and species from other genera. The sequences of five species were aligned with the MAFFT web server and exported in FASTA format, and the alignment was imported into DnaSP v5.0 to calculate nucleotide diversity (Pi) (Fig. 4). The cp genome sequence of EC is relatively conserved, with only two loci showing Pi > 0.1 (Wonok et al. 2023); these loci correspond to the psbN-petB and ndhH-ndhD regions, respectively. We also compared the sequences of the five species on the online platform mVISTA (Li et al. 2021; Liu et al. 2022; Wu et al. 2022). With EC as the reference, we identified a highly divergent region between approximately 115 kb and 134 kb that did not match any of the related species (Supplement Fig. 1).
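Nucleotide diversity (Pi) in a window is the mean number of pairwise differences per site among the aligned sequences. The sketch below shows a minimal sliding-window version of this calculation; the window width, step size, and toy alignment are hypothetical, the input is assumed to be upper-case, and columns containing gaps or ambiguous bases are simply skipped, which may differ from how DnaSP treats them.

```python
from itertools import combinations


def window_pi(seqs, start, width):
    """Mean pairwise differences per site in one alignment window."""
    columns = [
        col
        for col in zip(*(s[start:start + width] for s in seqs))
        if all(base in "ACGT" for base in col)  # skip gaps and ambiguous bases
    ]
    if not columns:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))


def sliding_pi(seqs, width, step):
    """Return (window start, Pi) for each window along the alignment (0-based starts)."""
    length = len(seqs[0])
    return [
        (start, window_pi(seqs, start, width))
        for start in range(0, length - width + 1, step)
    ]


# Toy alignment of three sequences, for illustration only.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
profile = sliding_pi(toy_alignment, width=4, step=4)
hotspots = [start for start, pi in profile if pi > 0.1]
print(profile, hotspots)
```

Windows whose Pi exceeds a chosen cutoff, such as the 0.1 used here, are candidates for highly variable regions.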
3.4 Inverted repeats contraction and expansion
Variation in the IR region can help elucidate phylogenetic relationships between species, because contraction and expansion of the inverted repeat (IR) alter the length of the cpDNA (Yang et al. 2023b).
EC belongs to the genus Eomecon and is found only in China. To investigate its relationships with other species, we compared the structure of its cp genome with those of related species (Fig. 5). We selected 15 other species based on phylogenetic clustering and analyzed them together with EC. We compared the four regions of the genome (LSC, SSC, IRa, and IRb) and the four junctions between them (JLB, JSB, JSA, and JLA). EC resembles the other species at the LSC/IRb boundary, where genes such as rpl22, rpl2, and rps19 are located at the JLB junction. However, EC differs from its relatives particularly around the IRa region, where genes such as trnH, psbA, and rpl2 are present. Notably, the SSC region of EC is extremely short, at only 148 bp. Further analysis indicated that this results from expansion of the IR regions, which now contain duplicated copies of many genes (Table 2) and leave only a short SSC. This finding provides a potential explanation for the unique features of the EC cp genome and sheds light on the evolution of this region.
3.5 Phylogenetic analysis
PhyloSuite was used to extract protein-coding sequences from 31 species and to construct a maximum likelihood (ML) tree (Yi et al. 2023). Eomecon is a monotypic genus endemic to China, so EC forms a separate branch on the ML tree (Li et al. 2023a; Wang et al. 2023a). Although EC forms its own branch within the poppy family (Papaveraceae), it is closely related to four species (Macleaya microcarpa, Coreanomecon hylomeconoides, Hylomecon japonica, and Chelidonium majus). All nodes have high bootstrap support (Fig. 6).
3.6 Kimura's two-parameter (K2-P) analysis
We chose five species for the K2-P analysis based on the phylogenetic analysis, which showed that the genus Eomecon is closely related to four other genera (Fig. 7). Highly variable regions can be used to distinguish species with close phylogenetic relationships (Toshino et al. 2022). By screening 82 intergenic regions in the five species with the K2-P model, we identified the five regions with the highest genetic distances: ycf4-cemA, ycf3-trnS-GGA, trnC-GCA-petN, rpl32-trnL-UAG, and psbI-trnS-UGA. These regions are potential candidates for future molecular marker development.
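The K2-P distance between two aligned sequences is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively. The sketch below is a minimal pairwise implementation for illustration; the function name is ours, gapped or ambiguous sites are skipped, and the software used for the reported distances may treat such sites differently.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned, equal-length sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument of the logarithm <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy example: one transition (A -> G) in ten sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

Computing this distance for each intergenic region across all species pairs and ranking the regions by their mean distance is one way to shortlist candidate markers.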
3.7 Selective pressures analysis
Changes in DNA sequences are shaped by natural selection, which can be assessed by the ratio of non-synonymous (amino acid-changing, Ka) to synonymous (Ks) substitution rates (Ka/Ks) (Zhang et al. 2022a) (Fig. 8). A Ka/Ks value below 1 indicates that synonymous substitutions predominate, meaning that the encoded amino acids tend to remain unchanged and the gene is under purifying selection. Conversely, a Ka/Ks value above 1 indicates a high rate of non-synonymous substitution, meaning that amino acid changes are favored and the gene is under positive selection. In this study, most of the Ka/Ks values were below 1, implying that synonymous substitutions are more common in the protein-coding regions. Of the 80 protein-coding genes examined, only clpP, psbK, and ycf2 had Ka/Ks values greater than 1, indicating that these three genes have the highest rates of non-synonymous substitution and are under positive selection (Lin et al. 2023). Mutations at these sites may contribute to accelerated evolution of this species in the future.
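The interpretation of Ka/Ks described above can be written as a simple classification rule. The sketch below is illustrative only: the function name, the tolerance used to call a ratio neutral, and the (Ka, Ks) pairs are hypothetical and are not values from this study.

```python
def classify_ka_ks(ka: float, ks: float, tol: float = 1e-9) -> str:
    """Label a gene by its Ka/Ks ratio using the conventional threshold of 1."""
    if ks == 0:
        return "undefined (Ks = 0)"  # the ratio cannot be computed
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "positive selection" if ratio > 1.0 else "purifying selection"


# Hypothetical (Ka, Ks) pairs, for illustration only.
examples = {
    "geneA": (0.002, 0.020),
    "geneB": (0.030, 0.015),
    "geneC": (0.010, 0.0),
}
labels = {gene: classify_ka_ks(ka, ks) for gene, (ka, ks) in examples.items()}
print(labels)
```

Genes with Ks = 0 are kept separate because their ratio is undefined rather than high.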