Chloroplast Genome Features of N. lappaceum
The N. lappaceum chloroplast genome has the typical quadripartite structure found in most plant chloroplast genomes. We assembled a closed circular chloroplast genome of 161,321 bp for N. lappaceum. It contains a pair of inverted repeat regions (IRs) of 28,550 bp each, a large single-copy region (LSC) of 86,068 bp and a small single-copy region (SSC) of 18,153 bp (Fig. 1). Within Sapindoideae, the N. lappaceum chloroplast genome is slightly larger than those of S. mukorossi (160,481 bp), P. tomentosa (160,818 bp) and D. longan (160,833 bp), and smaller than that of L. chinensis (162,524 bp) (Table 1). N. lappaceum has 132 chloroplast genes, the same number as D. longan and L. chinensis (Table 1). In addition, GC content differed little among the five Sapindoideae genomes analyzed (37.66% to 37.87%; Table 1).
Table 1
Comparison of the general features of the five Sapindoideae chloroplast genomes.
| Genome feature | Dimocarpus longan | Litchi chinensis | Pometia tomentosa | Sapindus mukorossi | Nephelium lappaceum |
|---|---|---|---|---|---|
| GenBank accession | MG214255 | KY635881 | MN106254 | KM454982 | MT936934 |
| Size (bp) | 160833 | 162524 | 160818 | 160481 | 161321 |
| LSC (bp) | 85707 | 85750 | 85666 | 85650 | 86068 |
| SSC (bp) | 18270 | 16568 | 18360 | 18873 | 18153 |
| IR (bp) | 28428 | 30103 | 28396 | 27979 | 28550 |
| Total genes | 132 | 132 | 133 | 135 | 132 |
| Protein-coding genes | 87 | 87 | 88 | 88 | 87 |
| tRNA genes | 37 | 37 | 37 | 39 | 37 |
| rRNA genes | 8 | 8 | 8 | 8 | 8 |
| GC (%) | 37.79 | 37.80 | 37.87 | 37.66 | 37.77 |
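As a quick consistency check on Table 1, the total length of a quadripartite chloroplast genome should equal LSC + SSC + 2 × IR. The sketch below copies the region lengths from Table 1 and checks this for each of the five genomes. The helper name `quadripartite_length` is hypothetical and introduced only for illustration.

```python
# Region lengths (bp) copied from Table 1: (total size, LSC, SSC, IR).
TABLE1 = {
    "Dimocarpus longan": (160833, 85707, 18270, 28428),
    "Litchi chinensis": (162524, 85750, 16568, 30103),
    "Pometia tomentosa": (160818, 85666, 18360, 28396),
    "Sapindus mukorossi": (160481, 85650, 18873, 27979),
    "Nephelium lappaceum": (161321, 86068, 18153, 28550),
}


def quadripartite_length(lsc, ssc, ir):
    """Length of a circular genome with one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


for species, (total, lsc, ssc, ir) in TABLE1.items():
    computed = quadripartite_length(lsc, ssc, ir)
    print(f"{species}: reported {total}, computed {computed}, "
          f"match={total == computed}")
```

The reported sizes agree with the sum of their regions for all five genomes, which is a cheap sanity check worth running whenever region boundaries are re-annotated.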
The overall nucleotide composition of the rambutan chloroplast genome is 30.79% A, 31.44% T, 19.27% C and 18.50% G, giving a total GC content of 37.77%. In total, 132 genes were annotated on this chloroplast genome, representing 111 unique genes: 78 protein-coding genes, 29 transfer RNA (tRNA) genes and 4 ribosomal RNA (rRNA) genes. Of these, 21 genes are duplicated in the IR regions: nine protein-coding genes (rps3, rps7, rps12, rps19, rpl2, rpl22, rpl23, ndhB and ycf2), eight tRNA genes (trnA-UGC, trnI-CAU, trnI-GAU, trnL-CAA, trnM-CAU, trnN-GUU, trnR-ACG and trnV-GAC) and four rRNA genes (rrn4.5S, rrn5S, rrn16S and rrn23S) (Additional file 1: Table S2). Gene structure analysis showed that 21 genes contain introns: 19 of them (11 protein-coding genes and 8 tRNA genes) have one intron, whereas two genes (ycf3 and clpP) have two introns (Additional file 1: Table S3).
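The composition and gene-count figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: `base_composition` is a hypothetical helper, the toy sequence is not rambutan data, and the gene tally assumes that the 78, 29 and 4 genes are unique counts to which the 21 IR duplicates are added.

```python
from collections import Counter


def base_composition(seq):
    """Percentage of A, T, C and G in a DNA sequence, plus total GC content."""
    counts = Counter(b for b in seq.upper() if b in "ATCG")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/T/C/G bases")
    comp = {b: 100.0 * counts[b] / total for b in "ATCG"}
    comp["GC"] = comp["G"] + comp["C"]
    return comp


# Toy sequence for illustration only (not taken from the rambutan genome).
print(base_composition("ATGCATTAAT"))

# Reported percentages: C + G should reproduce the stated GC content.
reported = {"A": 30.79, "T": 31.44, "C": 19.27, "G": 18.50}
gc_from_reported = reported["C"] + reported["G"]
print(f"GC from reported C and G: {gc_from_reported:.2f}%")

# Gene tally under the assumption stated above: unique genes plus IR duplicates.
unique_genes = {"protein-coding": 78, "tRNA": 29, "rRNA": 4}
ir_duplicates = {"protein-coding": 9, "tRNA": 8, "rRNA": 4}
total_genes = sum(unique_genes.values()) + sum(ir_duplicates.values())
print(f"Total annotated genes: {total_genes}")
```

Under that reading, the per-class totals (87, 37 and 8) also match the N. lappaceum column of Table 1.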
Characterization of SSRs and repeat sequences
A total of 63 SSRs were detected in the rambutan chloroplast genome, of which 45 were mononucleotide, 3 dinucleotide, 8 trinucleotide, 5 tetranucleotide and 2 pentanucleotide repeats (Additional file 1: Table S4). We also compared the number and distribution pattern of SSRs with those of eight other chloroplast genomes in the Sapindaceae family (Additional file 1: Table S5). Mononucleotide repeats outnumber all other types combined (Fig. 2A), and the number and types of chloroplast SSRs vary among species. S. mukorossi (91 SSRs) has the highest number of SSRs, whereas E. cavaleriei (62 SSRs) has the lowest. The chloroplast genomes of D. longan, L. chinensis, P. tomentosa, D. viscosa, K. paniculata and X. sorbifolium contain 79, 75, 74, 77, 87 and 83 SSRs, respectively (Fig. 2B). In addition, a total of 98 larger repeats (> 10 bp) were identified in the N. lappaceum chloroplast genome using REPuter[12], comprising 42 forward, 11 reverse, 41 palindromic and 4 complement repeats (Additional file 1: Table S6). The largest was a palindromic repeat of 48 bp.
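To illustrate what an SSR scan involves, the following minimal sketch finds perfect tandem repeats with regular expressions. It is not the tool used in this study, which is not named in this section. The minimum repeat thresholds in `MIN_REPEATS` and the function names are assumptions chosen for illustration.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative, not the study's settings).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (1-based start, motif, repeat count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One motif of unit_len bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        pos = 0
        while pos < len(seq):
            m = pattern.match(seq, pos)
            if m and is_primitive(m.group(1)):
                hits.append((m.start() + 1, m.group(1), len(m.group(0)) // unit_len))
                pos = m.end()
            else:
                # No usable repeat starts here (or it belongs to a shorter motif class).
                pos += 1
    return sorted(hits)


# Toy sequence for illustration only.
toy = "GG" + "A" * 12 + "CGC" + "AT" * 6 + "CC"
for start, motif, n in find_ssrs(toy):
    print(f"start={start} motif={motif} repeats={n}")
```

Filtering out non-primitive motifs keeps a run such as poly-A from being counted again as a di- or trinucleotide repeat, so each SSR is assigned to a single class, as in the per-class tallies reported above.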
Codon usage analysis and RNA editing site prediction
We used 53 protein-coding sequences from the rambutan chloroplast genome to calculate codon usage frequency and relative synonymous codon usage (RSCU) (Additional file 1: Table S7). Together, these sequences contain 21,434 codons. Leucine is the most abundant amino acid and cysteine the least abundant, with 2,232 codons (approximately 10.41% of the total) and 236 codons (approximately 1.10% of the total), respectively. Met (ATG) and Trp (TGG), each encoded by a single codon, showed no usage bias (RSCU = 1). Thirty codons had RSCU values greater than 1, indicating biased usage (Fig. 3). Of these, only the leucine codon UUG ends in G; the remaining 29 codons with biased usage in N. lappaceum all end in A or T at the third position. In addition, 49 potential RNA editing sites were predicted across 18 protein-coding genes in the N. lappaceum chloroplast genome, with the ndhB gene containing the most (9) (Additional file 1: Table S8). All predicted RNA editing events were C-to-U conversions and occurred at the first (30.6%) or second (69.4%) codon position, suggesting that editing at the third codon position is lost more quickly than editing at the first or second position. Furthermore, serine codons were edited more frequently than codons of other amino acids, and serine-to-leucine was the most frequent conversion.
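For readers unfamiliar with RSCU, it is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is a minimal illustration of that definition, assuming the standard codon-to-amino-acid assignments. The function name `rscu` and the toy sequence are hypothetical, and this is not the software used in the study.

```python
from collections import Counter
from itertools import product

# Standard codon assignments, listed in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """RSCU per sense codon: count * (number of synonymous codons) / (amino acid total)."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in synonymous)
        if total == 0:
            continue  # amino acid absent from the input, RSCU undefined
        for codon in synonymous:
            values[codon] = counts[codon] * len(synonymous) / total
    return values


# Toy coding sequence for illustration only: ATG TTA TTA TTG CTT TGG.
toy_values = rscu(["ATGTTATTATTGCTTTGG"])
for codon in ("ATG", "TTA", "TTG", "CTT", "CTC", "TGG"):
    print(codon, toy_values[codon])

# Share of leucine and cysteine codons among the 21,434 reported codons.
print(f"Leu: {100 * 2232 / 21434:.2f}%  Cys: {100 * 236 / 21434:.2f}%")
```

By construction, an amino acid with a single codon (Met, Trp) always has RSCU = 1, which is why those two codons cannot show usage bias.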
Comparative genome analysis
A comparative analysis of the chloroplast genomes of rambutan and four other Sapindoideae species was performed with mVISTA, using the annotated D. longan chloroplast genome as the reference. The five Sapindoideae chloroplast genomes range in length from 160,481 to 162,524 bp. L. chinensis has the largest chloroplast genome, whereas S. mukorossi has the smallest. Interestingly, L. chinensis has the shortest SSC region (16,568 bp), whereas S. mukorossi has the longest (18,873 bp) (Fig. 4). The IR (A/B) regions exhibited less divergence than the SSC and LSC regions. In addition, the coding regions were more highly conserved than the non-coding regions. Among the five chloroplast genomes, the four rRNA genes (rrn16S, rrn23S, rrn5S and rrn4.5S) were the most conserved, while seven genes (matK, rpoC2, psbB, rpoA, ndhF, ndhD and ycf1) showed the most diversity among the coding regions. The most divergent regions were found in intergenic spacers and introns, including trnH-GUG-psbA, trnR-UCU-atpA, petN-psbM, psbZ-trnG-GCC, ndhC-trnV-UAC, psbE-petL, rpl16-rps3 and rpl32-trnL-UAG.
Expansion and contraction of IR regions
We compared the IR regions and their junctions with the LSC and SSC regions in nine Sapindaceae chloroplast genomes, including N. lappaceum (Fig. 5). IR length varies among the chloroplast genomes, ranging from 26,923 bp in E. cavaleriei to 30,103 bp in L. chinensis. In our study, the ycf1 gene was located at the SSC/IRa junction in all nine chloroplast genomes, and the ycf1 fragment within the IRa region ranged from 962 bp to 3,183 bp. Moreover, most junctions between LSC and IRa were located downstream of trnH-GUG, the exception being S. mukorossi. In three species (D. viscosa, E. cavaleriei and K. paniculata), the LSC/IRb junction was located within the coding region of rpl22, leaving a 110, 40 or 63 bp segment of the gene at the LSC/IRb border. The remaining chloroplast genomes share a similar pattern, with the LSC/IRb junction located in the intergenic region between rpl16 and rps3. The junction between the IRb and SSC regions (JSB) of five species (S. mukorossi, X. sorbifolium, D. viscosa, E. cavaleriei and K. paniculata) was located between ycf1 and ndhF, whereas in the other four chloroplast genomes only ndhF was located at or near the JSB.
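The junction names used here follow from how the four regions are laid out around the circle. As a rough sketch, the coordinates of each region can be derived from the region lengths alone, assuming the genome is linearized at the first base of the LSC and ordered LSC, IRb, SSC, IRa. That ordering is an assumption consistent with the junction names above, and `region_coordinates` is a hypothetical helper.

```python
def region_coordinates(lsc, ssc, ir):
    """1-based inclusive (start, end) of each region, assuming the order LSC-IRb-SSC-IRa."""
    coords = {}
    start = 1
    for name, length in (("LSC", lsc), ("IRb", ir), ("SSC", ssc), ("IRa", ir)):
        coords[name] = (start, start + length - 1)
        start += length
    return coords


# N. lappaceum region lengths (bp) from Table 1.
coords = region_coordinates(lsc=86068, ssc=18153, ir=28550)
for name, (start, end) in coords.items():
    print(f"{name}: {start}-{end}")

# Each junction lies between the last base of one region and the first of the next.
order = ["LSC", "IRb", "SSC", "IRa"]
for left, right in zip(order, order[1:] + order[:1]):
    print(f"{left}/{right} junction after position {coords[left][1]}")
```

Because the genome is circular, the final IRa/LSC junction wraps from the last base of IRa back to position 1 of the LSC.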
Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis
To explore the molecular evolution of orthologous genes shared by the nine Sapindaceae species, and in particular to identify genes under purifying or positive selection, we calculated the Ka/Ks ratio for 622 orthologous pairs from 78 protein-coding genes (Additional file 1: Table S9). Overall, the average Ka/Ks ratio of the nine chloroplast genomes was 0.20. A total of 612 orthologous pairs had a Ka/Ks ratio less than 1 in the nine comparison groups, of which 546 had a Ka/Ks ratio less than 0.5 (Fig. 6), suggesting that most genes are under strong purifying selection. The other 66 pairs, from 31 genes, had a Ka/Ks ratio between 0.5 and 1. Ten orthologous pairs from 6 genes (ccsA, rpoA, rps12, psbJ, clpPc and rps19) had a Ka/Ks ratio greater than 1, suggesting that these genes might have experienced positive selection during evolution. The Ka/Ks ratio of the ycf1 gene was greater than 0.5 in eight comparison groups, and those of rpoA and ycf2 were greater than 0.5 in seven and six groups, respectively. In addition, clpP, matK and rps15 had Ka/Ks ratios greater than 0.5 in four of the eight comparison groups.
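The three Ka/Ks categories used above can be expressed as a simple binning rule. The sketch below is illustrative: the function name `classify_ka_ks` and the example ratios are hypothetical, and placing a ratio of exactly 1 in the middle bin is an assumption, since the text does not say how that boundary was handled.

```python
from collections import Counter


def classify_ka_ks(ratio):
    """Bin a Ka/Ks ratio using the thresholds in the text (0.5 and 1)."""
    if ratio > 1:
        return ">1"      # candidate for positive selection
    if ratio >= 0.5:
        return "0.5-1"   # weaker constraint; a ratio of exactly 1 is placed here by assumption
    return "<0.5"        # strong purifying selection


# Hypothetical ratios for illustration only (not values from Table S9).
example_ratios = [0.05, 0.20, 0.49, 0.50, 0.80, 1.00, 1.30]
print(Counter(classify_ka_ks(r) for r in example_ratios))

# Reported counts per bin should add up to the 622 orthologous pairs analysed.
reported_counts = {"<0.5": 546, "0.5-1": 66, ">1": 10}
print(sum(reported_counts.values()))
print(reported_counts["<0.5"] + reported_counts["0.5-1"])  # pairs with Ka/Ks below 1
```

The reported bins are mutually consistent: 546 + 66 gives the 612 pairs with Ka/Ks below 1, and adding the 10 pairs above 1 gives 622.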
Phylogenetic analysis
We performed multiple sequence alignment of the whole chloroplast genome sequences of nine Sapindaceae species, with two Anacardiaceae species as outgroups (Fig. 7). All nodes in the ML tree have 100% bootstrap support, and the 11 chloroplast genome sequences cluster into three groups. The five Sapindoideae species (D. longan, L. chinensis, P. tomentosa, N. lappaceum and S. mukorossi) form one group, the four Dodonaeoideae species (K. paniculata, D. viscosa, E. cavaleriei and X. sorbifolium) form a second group, and the two Anacardiaceae species (A. occidentale and M. indica) form a third. Within the Sapindoideae group, N. lappaceum is most closely related to P. tomentosa, followed by D. longan and L. chinensis, and is most distant from S. mukorossi. The three groups in this phylogenetic tree are consistent with traditional taxonomy, suggesting that the chloroplast genome can effectively resolve the phylogenetic positions and relationships of species.