Characteristics of the P. cistena cp genome
The complete cp genome of P. cistena was a typical circular double-stranded DNA molecule with a quadripartite structure and a length of 157,935 bp. It consisted of one LSC region (85,947 bp), one SSC region (19,116 bp), and a pair of IR regions (26,436 bp each) (Fig. 1). The overall GC content was 36.72%, and the IR regions had a higher GC content (42.53%) than the LSC (34.59%) and SSC (30.22%) regions. This GC content distribution pattern was similar to that of other plants [18–20].
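The region lengths and GC contents reported above can be cross-checked with a few lines of Python. This is a minimal sketch, not the pipeline used in the study: the function names are illustrative, and the weighted average assumes that the reported IR GC content applies to both IR copies.

```python
def gc_content(seq: str) -> float:
    """GC content (%) of a nucleotide sequence, counting only A, C, G and T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


def weighted_gc(regions):
    """Length-weighted mean GC (%) from (length_bp, gc_percent) pairs."""
    total = sum(length for length, _ in regions)
    return sum(length * gc for length, gc in regions) / total


# Values reported in the text for P. cistena
LSC, SSC, IR = 85_947, 19_116, 26_436
print(quadripartite_length(LSC, SSC, IR))  # 157935

regions = [(LSC, 34.59), (SSC, 30.22), (2 * IR, 42.53)]
print(round(weighted_gc(regions), 2))  # 36.72
```

Both checks agree with the reported totals: the four regions sum to 157,935 bp, and the length-weighted GC content of the regions reproduces the overall value of 36.72%.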
The cp genome encoded a total of 130 genes, including 85 protein-coding genes (PCGs), 37 tRNA genes, and 8 rRNA genes (Table 1). Of the 85 PCGs, 9 encoded proteins of the large ribosomal subunit, 12 encoded proteins of the small ribosomal subunit, 4 encoded RNA polymerase subunits, 20 encoded photosystem subunits, and 6 encoded ATP synthase subunits. Twenty genes were duplicated in the IRs, including seven PCGs, nine tRNA genes, and four rRNA genes (Table 1). Of the 23 intron-containing genes, 21 contained a single intron and 2 (ycf3 and clpP) contained two introns (Table S1). The longest intron, 2547 bp, was in the trnK-UUU gene in the LSC region, and the shortest, only 514 bp, was in the trnL-UAA gene. The rps12 gene was trans-spliced, with a single 5′-end exon in the LSC region and duplicated 3′-end exons in the IRs (Fig. 1 and Table S1).
Table 1
Annotated genes in the Prunus cistena cp genome
Category | Gene group | Gene name |
Photosynthesis | Subunits of photosystem I | psaA, psaB, psaC, psaI, psaJ |
Photosynthesis | Subunits of photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ |
Photosynthesis | Subunits of NADH dehydrogenase | ndhA*, ndhB*(2), ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK |
Photosynthesis | Subunits of cytochrome b/f complex | petA, petB*, petD*, petG, petL, petN |
Photosynthesis | Subunits of ATP synthase | atpA, atpB, atpE, atpF*, atpH, atpI |
Photosynthesis | Large subunit of rubisco | rbcL |
Self-replication | Proteins of large ribosomal subunit | rpl14, rpl16*, rpl2*(2), rpl20, rpl22, rpl23(2), rpl32, rpl33, rpl36 |
Self-replication | Proteins of small ribosomal subunit | rps11, rps12**(2), rps14, rps15, rps16*, rps18, rps19, rps2, rps3, rps4, rps7(2), rps8 |
Self-replication | Subunits of RNA polymerase | rpoA, rpoB, rpoC1*, rpoC2 |
Self-replication | Ribosomal RNAs | rrn16(2), rrn23(2), rrn4.5(2), rrn5(2) |
Self-replication | Transfer RNAs | trnA-UGC*(2), trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnG-GCC, trnG-GCC*, trnH-GUG, trnI-CAU(2), trnI-GAU*(2), trnK-UUU*, trnL-CAA(2), trnL-UAA*, trnL-UAG, trnM-CAU, trnN-GUU(2), trnP-UGG, trnQ-UUG, trnR-ACG(2), trnR-UCU, trnS-GCU(2), trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC(2), trnV-UAC*, trnW-CCA, trnY-GUA, trnfM-CAU |
Other genes | Maturase | matK |
Other genes | Protease | clpP** |
Other genes | Envelope membrane protein | cemA |
Other genes | Acetyl-CoA carboxylase | accD |
Other genes | c-type cytochrome synthesis gene | ccsA |
Genes of unknown function | Conserved hypothetical chloroplast ORF | ycf1(2), ycf2(2), ycf3**, ycf4 |
*: contains one intron; **: contains two introns; (2): gene with a copy number greater than 1, the number of copies in parentheses.
The relative usage of synonymous codons in the coding sequences of the P. cistena cpDNA was calculated from 26,526 codons. The four most frequently used codons were AUU-I (1117), AAA-K (1064), GAA-E (1029) and AAU-N (998), accounting for 4.21%, 4.01%, 3.88% and 3.76% of all codons, respectively (Table S2 and Fig. 2). Consistent with previous studies, most codons ending in A or U had relative synonymous codon usage (RSCU) values greater than 1, whereas most codons ending in C or G had RSCU values less than 1 [21–23].
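For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The following is a minimal sketch of that calculation, assuming the standard codon-to-amino-acid assignments and DNA-alphabet coding sequences as input; the function name `rscu` and the toy input are illustrative and do not reproduce the software used in the study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard codon-to-amino-acid assignments in TCAG order ('*' marks stop codons)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} computed over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families (stop codons form their own family)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # equal usage of synonymous codons
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy example: isoleucine encoded twice by ATT and once by ATC
toy = rscu(["ATTATTATC"])
print(toy["ATT"], toy["ATC"], toy["ATA"])  # 2.0 1.0 0.0

# Share of AUU among the 26,526 codons reported in the text
print(round(100 * 1117 / 26526, 2))  # 4.21
```

An RSCU above 1 means the codon is used more often than expected under equal usage, which is the pattern reported here for most A- or U-ending codons.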
Analysis of repeat sequences and simple sequence repeats (SSRs)
In the cpDNA of P. cistena, a total of 49 long repeats were identified, comprising 19 forward, 23 palindromic, 6 reverse, and 1 complement repeats (Fig. S1). Repeat lengths ranged from 30 to 26,436 bp. A total of 27, 23 and 7 long repeats were detected in the LSC, IR and SSC regions, respectively.
In addition, 253 simple sequence repeats (SSRs) were detected in the cpDNA, of which 161, 14, 65, 11, 1 and 1 were mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats, respectively (Fig. 3A). Mononucleotide SSRs were the most abundant, accounting for 63.64% of all SSRs, followed by trinucleotide SSRs at 25.69%. There were 170 SSRs in the LSC region, 45 in the SSC region and 38 in the IR regions (Fig. 3B). In the LSC region, 35 SSRs were located in exons, 32 in introns and 103 in intergenic regions. In the SSC region, 26 SSRs were located in exons, 4 in introns and 15 in intergenic regions. In the IR regions, 19, 4 and 15 SSRs were located in exons, introns and intergenic regions, respectively. This variation in SSR number and distribution may provide useful information for molecular marker development and plant breeding.
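SSR detection of this kind can be sketched as a regular-expression scan for perfect tandem repeats. The example below is only an illustration: the minimum repeat numbers in `MIN_REPEATS` are hypothetical placeholders, because the thresholds used in the study are not stated here, and the function names are not taken from any SSR tool.

```python
import re
from collections import Counter

# Hypothetical minimum repeat numbers per motif length (1 to 6 bp);
# the thresholds actually used in the study are not given in this section.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq: str, min_repeats=None):
    """Return sorted (start, end, motif, repeat_count) tuples, 1-based inclusive."""
    thresholds = min_repeats or MIN_REPEATS
    seq = seq.upper()
    hits = []
    for k, n in sorted(thresholds.items()):
        # A k-bp motif followed by at least n-1 further copies of itself
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start() + 1, m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy sequence with one mononucleotide and one dinucleotide SSR
toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG"
ssrs = find_ssrs(toy)
print(ssrs)  # [(3, 14, 'A', 12), (17, 28, 'AT', 6)]
print(Counter(len(motif) for _, _, motif, _ in ssrs))  # one hit each for lengths 1 and 2
```

Tallying the hits by motif length, as in the last line, gives the kind of per-class counts reported above; assigning each hit to an exon, intron or intergenic region would additionally require the genome annotation.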
Adaptive evolution analysis
The Ka/Ks ratio was used to evaluate the selective constraint on each protein-coding gene. Ka/Ks > 1 indicates that a gene is under positive selection, Ka/Ks = 1 indicates neutral evolution, and Ka/Ks < 1 indicates purifying selection [24].
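The interpretation rule above can be written as a small helper. This is a sketch only: the function name, the tolerance used to treat a ratio as equal to 1, and the handling of Ks = 0 are assumptions made for illustration, and the example Ka and Ks values are invented rather than taken from Table S3.

```python
def classify_selection(ka: float, ks: float, tol: float = 1e-9):
    """Return (Ka/Ks, label) following the thresholds described in the text.

    When Ks is zero the ratio is undefined, so (None, "undefined") is returned.
    """
    if ks == 0:
        return None, "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return ratio, "neutral"
    return ratio, "positive" if ratio > 1.0 else "purifying"


# Invented example values, for illustration only
print(classify_selection(0.0, 0.012))   # (0.0, 'purifying')
print(classify_selection(0.02, 0.02))   # (1.0, 'neutral')
print(classify_selection(0.03, 0.015))  # (2.0, 'positive')
```

Note that a ratio of 0 arises when Ka is 0, that is, when no non-synonymous substitutions are observed between the two sequences.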
In the present study, KaKs Calculator was used to calculate the ratio of non-synonymous (Ka) to synonymous (Ks) substitution rates (Ka/Ks) for 78 protein-coding genes shared by P. cistena and five other Prunus species (Table S3). The Ka/Ks values between P. cistena and the other Prunus species ranged from 0 (ndhC) to 1.84989 (matK). The Ka/Ks values of the atpE, ccsA, petA and rps8 genes between P. cistena and P. padus were > 1, indicating positive selection. The matK gene was positively selected among P. cistena, P. salicina, P. japonica and P. simonii. The Ka/Ks values of the other genes were < 1, indicating strong purifying selection pressure in the genus Prunus (Fig. 4).
Phylogenetic analysis
The cpDNAs of P. cistena and 29 other species in the Rosaceae family were selected to explore the genetic relationships between P. cistena and its relatives, with Punica granatum as the outgroup. The 31 cpDNAs were aligned using MAFFT, and a maximum likelihood (ML) tree was then inferred using RAxML under the GTRGAMMA model. All relationships were strongly supported, with bootstrap values ranging from 93 to 100 (Fig. 5).
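An alignment and tree-building workflow of this kind could be scripted roughly as follows. The sketch assumes that the `mafft` and `raxmlHPC` executables are on the PATH; the file names, the random seed, the number of bootstrap replicates, the rapid-bootstrap mode (`-f a`) and the outgroup label are all placeholders, since only MAFFT, RAxML and the GTRGAMMA model are stated in the text.

```python
import subprocess


def mafft_command(fasta_in: str):
    """MAFFT call with automatic strategy selection; the alignment goes to stdout."""
    return ["mafft", "--auto", fasta_in]


def raxml_command(alignment: str, run_name: str, outgroup=None,
                  bootstraps: int = 1000, seed: int = 12345):
    """RAxML call: rapid bootstrapping plus ML search under GTRGAMMA.

    The bootstrap count and seed are hypothetical defaults, not values from the study.
    """
    cmd = ["raxmlHPC", "-f", "a", "-m", "GTRGAMMA",
           "-p", str(seed), "-x", str(seed), "-#", str(bootstraps),
           "-s", alignment, "-n", run_name]
    if outgroup:
        cmd += ["-o", outgroup]  # taxon label exactly as written in the alignment
    return cmd


def run_pipeline(fasta_in: str, alignment_out: str, run_name: str, outgroup=None):
    """Align the sequences, then infer the ML tree (requires both programs installed)."""
    with open(alignment_out, "w") as handle:
        subprocess.run(mafft_command(fasta_in), stdout=handle, check=True)
    subprocess.run(raxml_command(alignment_out, run_name, outgroup), check=True)


# Show the commands without running them (file and taxon names are placeholders)
print(" ".join(mafft_command("cp_genomes.fasta")))
print(" ".join(raxml_command("cp_genomes.aln.fasta", "prunus_cp", outgroup="Punica_granatum")))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact calls before committing to a long run on whole cp genomes.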
The phylogenetic tree, with high support for most branches, showed three divergent clades. The first clade comprised the Prinsepia, Crataegus, Pyrus, Photinia, Malus and Prunus species. The second clade comprised three Rosa species and two Rubus species. The outgroup, Punica granatum, alone formed the third clade (Fig. 5). The Prunus species clustered into one group, in which P. cistena was closely related to P. jamasakura.
Comparative chloroplast genomic analysis
To better understand the cpDNAs of the genus Prunus, the IR/SSC and IR/LSC border positions were compared among six selected Prunus species to assess the degree of IR expansion or contraction. Differences at the boundaries were observed among the six cpDNAs, whose LSC, IRa/b and SSC regions had average lengths of 86,351 bp, 26,367 bp and 19,035 bp, respectively (Fig. 6A).
The rpl22 gene of P. cistena, P. salicina, P. jamasakura, P. japonica and P. simonii was located entirely within the LSC region, next to the junction (Fig. 6A). The LSC/IRb border lay inside the rps19 gene in all six cpDNAs. In P. cistena and P. jamasakura, a 217 bp fragment of rps19 was located in the IRb region and the remaining 62 bp in the LSC region. In P. padus, a 240 bp fragment of rps19 was located in the LSC region and the remaining 39 bp in the IRb region. The IRb/SSC boundary lay inside the ycf1 and ndhF genes. The ycf1 fragment in the IRb region ranged from 1036 bp to 1051 bp, and the remaining section was 3 bp to 94 bp long in the six cpDNAs. A 1 bp to 10 bp fragment of the ndhF gene was located in the IRb region, and the remaining 2219 bp to 2250 bp in the SSC region. The ycf1 gene spanned the SSC/IRa junction, with 4561 bp to 4614 bp located in the SSC region. The trnN-GUU and rpl2 genes were located entirely within the IRa region, but in P. padus rpl2 was also present in the IRb region. The trnH-GUG gene was located entirely within the LSC region.
To assess the degree of genome divergence, mVISTA and Mauve were used for sequence alignment and collinearity analysis of the six selected species (Fig. 6B, Fig. S2). The nucleotide sequence similarity of the six cpDNAs was extremely high, suggesting little variation in the cp genome of P. cistena compared with its ancestral species. Nevertheless, some divergence was found even in the highly conserved regions, and the coding and IR regions were more conserved than the noncoding regions.
The chloroplast genome structure of P. cistena and closely related species was also analyzed with CGView (Fig. S3), which indicated a high degree of similarity between the genomes. The nucleotide diversity (pi) values of the selected eight cpDNAs, calculated in sliding windows, ranged from 0 to 0.02398, with an average of 0.00296 (Fig. S4). Three highly variable regions had pi values higher than 0.01: rps18 and rbcL in the LSC region and ycf1 in the SSC region. The overall low pi values suggested that the cpDNA sequences were highly conserved throughout the genus.
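The sliding-window calculation of nucleotide diversity can be sketched as follows, where pi for a window is the mean proportion of pairwise differences per site. This is an illustrative implementation rather than the software used in the study: the default window length and step size are hypothetical because they are not stated above, and alignment columns containing gaps or ambiguous bases are simply skipped.

```python
from collections import Counter

VALID = set("ACGT")


def column_pi(column):
    """Proportion of sequence pairs that differ at one alignment column."""
    n = len(column)
    counts = Counter(column)
    identical_pairs = sum(c * (c - 1) for c in counts.values())  # ordered pairs
    return 1.0 - identical_pairs / (n * (n - 1))


def sliding_pi(alignment, window=600, step=200):
    """Return (window_start, pi) pairs over aligned sequences of equal length.

    window and step are hypothetical defaults; window_start is 1-based.
    """
    seqs = [s.upper() for s in alignment]
    if len(seqs) < 2 or any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("need at least two aligned sequences of equal length")
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        values = []
        for i in range(start, min(start + window, length)):
            column = [s[i] for s in seqs]
            if all(base in VALID for base in column):  # skip gaps and ambiguity codes
                values.append(column_pi(column))
        pi = sum(values) / len(values) if values else 0.0
        results.append((start + 1, pi))
    return results


# Toy alignment of three sequences, scanned with non-overlapping 2-bp windows
toy = ["AAAA", "AAAT", "AATT"]
for start, pi in sliding_pi(toy, window=2, step=2):
    print(start, round(pi, 4))
```

In the toy alignment the first window is invariant (pi = 0), while in the second window two of the three sequence pairs differ at each site, giving pi of about 0.67. Windows with pi above a chosen cutoff, such as the 0.01 used above, would be flagged as highly variable.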