General features of cp genomes
Using Illumina Hiseq sequencing platforms, 5.38–5.89 G clean data from each Pilea species were obtained, with the number of clean reads are ranged from 17,935,118 to 19,627,967 (Additional File 1: Table S1). The chloroplast was then assembled based on these data. The 4 cp genomes of Pilea are characterized by a typical circular DNA molecule with the length ranging from 150,398 − 152,327 bp. They all have a conservative quartile structure, which is composed of a LSC region (82,063–83,292 bp), a SSC region (17,487 − 18,363 bp) and a pair of IR regions (25,180 − 25,356 bp) (Table 1). The lengths of cp genomes are conserved in this genus. The GC content analysis showed that the overall GC contents ranged from 36.35–36.69% in the 4 cp genomes. Note that the GC contents in IR regions (42.56% − 42.73%) are significant higher than that in LSC (33.87% − 34.36%) and SSC regions (30.01% − 30.81%). The four cp genomes have been deposited to NCBI (Accession number: MT726015 to MT726018).
Table 1
Basic features of the 4 chloroplast genomes from Pilea.
Species | P. glauca | P. mollis | P. peperomioides | P. serpyllacea |
Accession number | MT726015 | MT726018 | MT726016 | MT726017 |
Length (bp) | Total | 151,210 | 150,587 | 152,327 | 150,398 |
LSC | 82,662 | 82,063 | 83,292 | 82,551 |
SSC | 17,836 | 17,864 | 18,363 | 17,487 |
IR | 25,356 | 25,330 | 25,336 | 25,180 |
GC content (%) | Total | 36.69 | 36.72 | 36.35 | 36.41 |
LSC | 34.31 | 34.36 | 33.87 | 33.96 |
IR | 42.64 | 42.65 | 42.73 | 42.56 |
SSC | 30.81 | 30.76 | 30.01 | 30.23 |
Gene numbers | Total | 133 | 133 | 133 | 133 |
Protein-coding gene | 88 | 88 | 88 | 88 |
tRNA gene | 37 | 37 | 37 | 37 |
rRNA gene | 8 | 8 | 8 | 8 |
Genome annotation
The cp genomes of four Pilea species all comprises 133 genes, among which, 113 are unique genes, including 79 protein-coding genes, 4 rRNA genes and 30 tRNA genes (Additional File 1: Table S2). The gene order and gene numbers of these four species are highly similar, showed conserved genomic structures. Figure 1 shows the schematic diagram of the cp genomes of Pilea. Introns play a significant role in selective gene splicing [21]. Among the 79 protein coding genes annotated, nine unique genes (rps16, rpoC1, atpF, petB, petD, rpl16, rpl2, ndhB, ndhA) contain one intron and two unique genes (ycf3, clpP) contain two introns. Six unique tRNA genes (trnK-UUU, trnG-UCC, trnL-UAA, trnV-UAC, trnI-GAU, trnA-UGC) contain one intron. There are seven protein-coding genes, four rRNA genes, and seven tRNA genes completely duplicated in the IR regions, so they have two copies. The gene rps12 is a trans-spliced gene, the 5’ end is located in LSC region. However, the 3’ end was found in the IRa and IRb region. These results are similar to other species in the nettle family [22].
Repeats analysis
Simple sequence repeats (SSRs), also referred to as the microsatellite sequences, provide a large amount of genetic information [23–25]. Because of its high genetic polymorphism, SSRs are often used for the development of molecular markers and play an important role in the identification of species [26, 27]. In this study, we detected 68, 75, 71, 80 SSRs in the 4 analyzed species, respectively (Fig. 2A, Additional File 1: Table S3). Most SSRs are mononucleotide homopolymers, particularly for A/T, which accounts for 70.75% of the total. Hexanucleotide repeats are detected only in P. mollis. These SSR showed high polymorphism, suggesting great potential in the identification of Pilea species.
In the cp genomes of Pilea species, we detected four types of interspersed repeats. Most of them are forward repeats and palindromic repeats (Fig. 2B). By contrast, there are only two reverse repeats and one complement repeats. The only complement repeats were found in P. peperomioides. The detailed sequences showed in Additional File 1: Table S4. Moreover, the length of these short interspersed repeats mainly ranged from 30 to 34 bp. We note that one forward repeats with a length of 102 bp, (detected in P. serpyllacea, R13). These longer interspersed repeats thought to be essential for promoting cp genome rearrangements [28, 29]. Whether or not these repeats have caused the rearrangement of the cp genomes of Pilea species are interesting questions.
Contraction and expansion analysis of IR regions
The contraction and expansion of IR regions are considered to be an important reason for the length diversity in cp genomes [30]. Besides, with the expanded/contracted of the IR regions, genes near the border have an opportunity access to IR or single-copy regions [31]. We retrieved the published cp genomes of six species from Urticaceae, and compared them with the four Pilea species. We observed several genes span or near the boundary of IR and single copy regions. They mainly are rps19, rpl22, rpl2, ycf1, ndhF and trnH (Fig. 3). It's worth noting that, an abnormal expansion of IR regions was observed in Gonostegia hirta. The IR regions are over 30,000 bp in G. hirta and more genes access to the IR regions (e.g. rpl36 and rps19). However, the length of IR regions in the other nine species are about 25,000 bp, and rps19 gene span the LSC/IRb boundary except for D. iners and H. tenella. The former rps19 gene is in LSC region, while the latter is completely in IR regions. Besides, trnH gene completely accesses to IR regions in H. tenella, obtained two copies. It can be seen that the genomic structure, gene order and numbers of some species in Urticaceae have changed obviously.
Furthermore, gene ycf1 crosses the SSC/IRa boundary, most of which located in the SSC region. The length of ycf1 gene in the four Pilea species varies widely, indicating the possibility of sequence differences. Surprisingly, we annotated two copies of ycf1 in the four Pilea plants, they cross IRb/SSC boundary and are not annotated in other species. Sequence alignment found that the two copies of ycf1 are existed in other taxa, indicating that previous annotation are imperfect, although one of the two copies is a fragment of ycf1, and are generally considered to be a pseudogene. Interestingly, except for E. dissectum, a small fraction of the ndhF gene (less than 100 bp) crosses the IRb/SSC regions, which means that fragments of ycf1 have an overlap with ndhF in Pilea species. The overlapping areas are 108 bp. These results are also observed in Arabidopsis, the overlaps are about 30 bp [32]. Whether or not these overlaps affect the transcription or translation of these proteins is also an interesting subject.
Genome divergence
To evaluate genomic divergence, a sequence identity analysis based on mVISTA [33, 34] was performed between 4 Pilea species, with the reference being the cp genome of P. peperomioides. We observed varying degrees of sequence divergence, especially in LSC and SSC regions. In contrast, the IR regions are more conservative. Most of these highly variable regions are observed in Conserved Non-Coding Sequences (CNS) (Fig. 4). However, the regions with the greatest sequence divergence are found in protein-coding regions, which is gene ycf1. The coding regions of ycf1 in the four Pilea species showed significant differences, and the similarity are even less than 50% in some fragments. Overall, the analyzed genomic sequences showed high levels of sequence divergence, suggesting a high level of genetic diversity in the genus Pilea.
To quantify the levels of DNA polymorphism, the 4 cp genomes were aligned and analyzed by using DnaSP v6.0 [35]. We detected 8 hypervariable regions with Pi value exceed 0.06 (Fig. 5), including petN-psbM (Pi = 0.06067); psbZ-trnG-GCC (Pi = 0.07067); trnT-UGU-trnL-UAA (Pi = 0.06433); accD-psbI (Pi = 0.06003); ndhF-rpl32 (Pi = 0.06100); rpl32-trnL-UAG (Pi = 0.06800); ndhA-intron (Pi = 0.06533) and the most regions of gene ycf1 (Pi values are ranged from 0.07367 to 0.17067). The Pi values are listed in the parentheses. It's noteworthy that most regions of the cp genome sequences had Pi values more than 0.02 (except for IR regions), exhibiting abundant polymorphism of cp genome sequences.
Nucleotide variations in protein-coding genes
The protein-coding regions are highly conserved in cp genomes [36]. We analyzed the protein-coding sequences of 79 identified unique orthologous genes in 4 Pilea taxa. Surprisingly, these protein-coding genes also showed high levels of variation (Fig. 6A, Additional File 1: Table S5). Of the 79 shared genes, 63 had a mutation rate of more than 2%, and 30 had a mutation rate of more than 4%. The gene with the highest mutation rates is ycf1 (16.62%), then followed by matK (10.54%), ccsA (8.74%) and rps15 (8.42%). Only two genes (psbJ and psbL) showed extreme conservation without any variable sites. Moreover, we observed a total of 11 genes (ycf1, ndhF, rps19, accD, rpoC2, rps16, rpoA, rpl20, ndhD, rpoC1 and ycf2) with InDels in nucleotide sequences by using DNASP 6.0 [35]. Among which, ycf1 had 35 InDels, then followed by ycf2 (9), accD (4) and rpoC2 (3). Considering that the protein-coding regions are highly conserved, protein-coding sequences with high nucleotide mutation rates are usually infrequent in the same genus, these results indicated that the Pilea plants had high genetic diversity and there are great differences among species.
In this study, synonymous (dS) and nonsynonymous (dN) substitution rates, along with dN/dS were estimated for the 79 shared genes in parallel by using PAML v4.9 [37]. Among the 79 genes, ycf1, matK, ccsA and rps15 had relatively higher dN values, and rps16, rpl32, ndhF and psaJ had higher dS values (Fig. 6B, Additional file 1: Table S6). Most genes exhibited considerably low values of dN/dS (less than 0.6), implying most of the protein-coding genes were under purifying selection during evolution. However, the dN/dS ratio of three genes (rpl36, clpP and accD) was between 0.6 and 1.0. Moreover, the dN/dS ratio was greater than 1.0 for petL, rps12, ycf1 and ycf2, indicating that they have been under positive selection during evolution. This results clearly indicate that cp genes in different plant species of Pilea may be subjected to diverse selection pressures.
Phylogenetic analysis
Compared to nuclear and mitochondrial genomes, the cp genomes are highly conserved, and it has been widely used in phylogenetic and evolutionary studies [38–40]. With the development of high throughput sequencing technology, the chloroplast genome sequence plays an important role in species identification as a super barcode [41, 42]. In this study, we constructed the maximum likelihood (ML) trees by using the complete cp genome sequences as the data sets (detailed materials are shown in Additional File 1: Table S7). The phylogenetic tree has high bootstrap supports in all nodes, shows the reliability of the phylogeny recovered (Fig. 7).
Our phylogenetic tree displayed two clades clearly, and then further diversified into four subclades with 100% bootstrap supports (ML). These four subclades correspond to four subfamilies, they are Boehmerioideae, Cecropioideae, Lecanthoideae and Urticoideae, respectively. This is consistent with the traditional classification [5]. All the 4 Pilea species clustered together (all nodes have BS = 100 for ML), and form a monophyletic group, which is a sister group to Elatostema. They all are belonging to subfamily Lecanthoideae.