Identification of RNA editing sites
We selected a series of plants to represent distant evolutionary lineages judging by two criterions, one is enough transcriptomic data in the SRA database at NCBI, another is availability of sequenced chloroplast genome. Hence, 21 plants, consisting of 6 ferns, 4 gymnosperms and 11 angiosperms, and corresponding 317 SRA accessions were chosen finally. Detailed information of SRA data and chloroplast genome accessions were listed in Additional file 1: Table S1. Based on results of RNA-seq data mapping and SNP calling, an automated bioinformatics pipeline implemented in REDO tools was conducted under default thresholds. Consequently, there were a total of 6,011 raw editing sites located in leaf chloroplast genome were detected. However, in spite of multiple stringent filters, sequence mismatches that accord with RNA editing occasionally appear. Hence, to eliminate these false positive mismatches, we manually examined all mismatches, only C-to-U and U-to-C editing types were kept, therefore, a total of 5,389 RNA editing sites, consisting of 264 U-to-C and 5,125 C-to-U edit sites, were finally screened out for further analysis.
For the reason that RNA-seq data volume varies among different species, for several species, such as Adiantum aleuticum, Histiopteris incisa, lacked enough RNA-seq data, caused the read depth of these two species was lower than that of other species’, consequently result in less number of editing sites identified in these two species. Furthermore, the read density also varied widely between different genomic regions, ranges from less than 30 to more than 800 in a few species, demonstrating various expression level of genes, the average depth distribution across 21 species was shown in Additional file 6: Figure S1. Compared with angiosperms, ferns and gymnosperms have more lower depth, indicating the actual number of RNA editing in ferns and gymnosperms might be underestimated. To evaluate the reliability of editing sites number, PREPACT webserver was also used to predict editing sites, we found that the distribution of predicted number of RNA editing based on PREPACT accorded with that of our prediction based on RNA-seq data basically (see Additional file 6: Figure S2), reflecting our pipeline offered a high performance with reliable results. To describe the attributes of RNA editing sites, we illustrated by one example of samples of Adiantum aleuticum (see Additional file 6: Figure S3), which depicts the reliability of RNA editing sites in each species by REDO tools statistically.
Characteristics of the statistics for RNA editing sites
The statistics of raw results showed that C-to-U was the dominate editing type (nearly ~ 95.1% ), the next is U-to-C type, other mismatches types were rare (see Fig. 1c). After manual inspection of raw results and elimination of mismatches, filtered RNA editing sites across 21 plants were screened out, detailed corresponding annotation information files were produced simultaneously, as listed in Additional file 2: Table S2, the statistical result was also listed in Additional file 3: Table S3, Table 1 and Fig. 1. In terms of codon position, 21.97%, 67.16%, and 10.58% of sites were edited at first, second, and third codon positions respectively (Fig. 1b). Furthermore, the statistics of editing types showed that the majority (~ 95%) of the editing events resulted in non-synonymous codon changes, and the amino acids changes tend to be hydrophobic, the change from hydrophilic to hydrophobic was the highest, followed by the change from hydrophobic to hydrophobic. The most common amino acid change types were Ser-to-Leu and Pro-to-Leu, as shown in Fig. 1d, Serine is hydrophilic, whereas Leucine and Proline are both hydrophobic. The above results demonstrated that the RNA editing exhibited a selective advantage in overall increase in hydrophobicity of the resulting proteins, which was also in good agreement with previous studies [10].
Table 1
The statistical result of RNA editing sites across 21 plants
| Species | Genome | Raw | Filtered | Percent of edited Genes | Depth | | |
Fern | Pteris vittata | MH500228 | 349 | 324 | 0.74 | 102 | | |
Adiantum aleuticum | MH173079 | 422 | 380 | 0.82 | 23 | | |
Selaginella moellendorffii | HM173080 | 1316 | 1316 | 0.96 | 73 | | |
Histiopteris incisa | NC_040220 | 588 | 466 | 0.85 | 27 | | |
Cibotium barometz | NC_037893 | 562 | 549 | 0.85 | 32 | | |
Cyrtomium fortunei | NC_037510 | 778 | 610 | 0.91 | 40 | | |
Gymn | Ginkgo biloba | NC_016986 | 301 | 291 | 0.81 | 228 | | |
Picea abies | HF937082 | 123 | 121 | 0.61 | 450 | | |
Cycas revoluta | NC_020319 | 173 | 140 | 0.44 | 27 | | |
Pinus massoniana | MF564195 | 96 | 93 | 0.59 | 182 | | |
Angi | Liriodendron tulipifera | NC_008326 | 320 | 265 | 0.75 | 113 | | |
Nelumbo nucifera | NC_025339 | 154 | 134 | 0.63 | 295 | | |
Nicotiana tabacum | Z00044 | 107 | 88 | 0.23 | 134 | | |
Glycine max | NC_007942 | 64 | 59 | 0.34 | 195 | | |
Populus tremula | KP861984 | 93 | 89 | 0.43 | 244 | | |
Arabidopsis thaliana | KX551970 | 50 | 45 | 0.24 | 389 | | |
Gossypium hirsutum | NC_007944 | 127 | 100 | 0.37 | 171 | | |
Helianthus annuus | NC_007977 | 59 | 57 | 0.29 | 148 | | |
Phoenix dactylifera | NC_013991 | 121 | 118 | 0.42 | 781 | | |
Zea mays | NC_001666 | 105 | 79 | 0.37 | 814 | | |
Oryza sativa | NC_001320 | 103 | 65 | 0.26 | 550 | | |
All the filtered editing sites were located in 1,038 genes across all species, the average percent of edited genes in the three lineages (ferns, gymnosperms and angiosperms) are 0.85, 0.61, and 0.39 respectively (Table 1). Compared to the latter two lineages, more sites and genes were edited in chloroplast transcripts from ferns remarkably, see Fig. 1a. In ferns, the largest number of editing sites was found in Selaginella moellendorffii, with 1,316 editing sites, exclusively of the C-to-U type, which is nearly 100-fold more abundant than that of flowering plants, a total of 77 genes (~ 96%) were edited, and the overwhelming majority (94.3%) of the 1,241 silent editing events altered codon directly. Differ from Selaginella moellendorffii that exclusively belonging to taxa of lycopsida, the other five fern plants are members of Leptosporangiopsida, has relatively smaller numbers of editing sites, represented by Cyrtomium fortune, owning second-largest number of editing sites among all specifies, with 610 editing sites and 79 edited genes (~ 91%). Whereas for gymnosperms, the average number of editing sites and percent of edited genes were all less than that of ferns. Compared with other three species of the same taxa, Ginkgo biloba has the most editing sites, with 291 editing sites and 68 edited genes (~ 81%) in chloroplast, reflecting its ancient nature of “living fossil”. On the opposite end, angiosperms have the lowest average numbers of editing sites and only a part of genes were effectively edited. It was noticeable that Liriodendron tulipifera and Nelumbo nucifera distinguished them from other angiosperms with more editing events. Liriodendron tulipifera is one of the most ancient flowering trees as the genus dates back about 65 million years, a total of 265 editing sites were detected, which is nearly 3-fold more abundant than that of other angiosperms and gymnosperms except Ginkgo biloba, the percent of edited genes was up to 75%, well above the average, reflecting early angiosperms possess much more diversified editing sites. The average number of editing sites among other 9 angiosperms showed no significant differences among them, but they were significantly lower than those of Liriodendron tulipifera and Nelumbo nucifera. The above results illustrated the differential distribution of RNA editing among varied plants, the cases of Selaginella moellendorffii and Liriodendron tulipifera also implies independent origins and subsequent evolutionary trajectories of editing processes.
Variability of RNA editing among plants and genes
To further explore the evolutionary trajectory of RNA editing among different species, for each gene, we summed the number of the editing sites across 21 plants (see Additional file 4: Table S4), and picked out the top 30 genes with most editing sites across 21 plants for cluster analysis, as shown in Fig. 2. We found that not every gene were edited in all the species, for certain species, the lack of editing at a few genes may be explained by two reasons, one is absence of the genes that annotated in chloroplast genome, another is no RNA editing occurred in the genes actually. By grouping the genes based on their function, genes encoding membrane subunits of the chloroplast NDH complex and RNA polymerase exhibited the largest average numbers of editing sites, while ribosomal subunits showed the lowest numbers, this is consistent with previous studies that RNA editing occurred preferentially in genes encoding membrane-bound proteins under strong selection [24]. Due to the well-studied background and abundant editing sites in plant, ndhB gene is assumed to be a good case for the study of RNA editing evolution, in our study, ndhB was also confirmed to possess the most editing sites based transcriptome data, with 333 editing sites spread across 17 species. In spite of this, there is a biased distribution of RNA editing sites in ndhB among three clades, in fern group, 50 sites were detected in Selaginella moellendorffii, and about 10 sites in the other ferns, in angiosperms, there were about 20 editing sites in each of the 11 angiosperm plants. In gymnosperms, RNA editing in ndhB was only detected in Ginkgo biloba, for Picea abies and Pinus massoniana, no ndhB gene annotated in their chloroplast genome, whereas for Cycas revolute, no RNA editing events were detected in its ndhB gene, which may result from loss of editing or too low depth around genomic regions of its ndhB gene.
To compare the number of editing sites among the top 30 genes, a matrix of numbers of editing sites across 21 species was produced, a hierarchically-clustered heatmap was plotted in Fig. 2, which showed that the clustering relationships agreed with their phylogenetic tree based on sequence alignments roughly. 21 plants were divided into three clustering groups with two exceptions, Selaginella moellendorffii and Liriodendron tulipifera were clustered far away from their own clades respectively, further implied independent origins and subsequent evolutionary trajectories of editing processes. The above results suggest that RNA editing in chloroplast may break out in early-branching plants from different clades simultaneously and suffer a lot of loss during evolution. Hierarchically clustered heatmaps of numbers of all RNA editing genes in chloroplast across 21 plants and each clade were shown in Additional file 6: Figure S4-7 respectively.
Uneven distribution of RNA editing extent
RNA editing extent was used to measure to what extent the edited transcripts among all transcriptome for one gene, if one site was edited, the C/G base (wild type) should be changed to the T/A base (edited type), thus its editing extent can be calculated by the formula: depth of edited bases (T and A)/total read depth of bases. In this study, we explored the distribution of editing extent among codon positions, species and edited genes. First, the comparison between codon positions showed that the distribution of RNA editing extent among them was uneven, and did not comply with the normal distribution, featuring an peak around ~ 0.2 and fat tails, as shown in Fig. 3. The average editing extent in second codon position (~ 0.78) is higher than that of first (~ 0.69) and third codon positions (~ 0.66), suggesting non-synonymous substitution occurred in second codon position tend to be effectively edited, it was higher editing extent that dominated the landscape of RNA editing. Second, The average editing extent also varied widely across 21 plants, ranging from 0.43 to 0.87 (see Fig. 4a), Selaginella moellendorffii has the lowest editing extent (~ 0.43), far below that of other species. In gymnosperms and angiosperms, Ginkgo biloba and Nelumbo nucifera had the lowest editing efficiencies (~ 0.6) respectively, it seemed that abundant editing sites detected in those early-branching plants might have a negative impact on their editing extent. However, as an ancient plant in angiosperms, Liriodendron tulipifera was one exception, with editing extent up to 0.81. The editing extent was also analyzed in each gene individually, we averaged the RNA editing extent among the top 30 genes across 21 plants and the result also demonstrated an uneven distribution, as shown in Fig. 4b, matK gene has the lowest editing extent (~ 0.5), oppositely, editing extent of atpA gene was the highest, up to 0.88.
Reduced cytosines content with the evolution of plants
Considering the large differences in the scale of RNA editing events along with the evolution of plants, we analyzed the nucleic acid base content of involving genes shared by the 21 plants. There were a total of 51 genes annotated in all the chloroplast genome of 21 plants, for each plant, its corresponding gene sequences were extracted, the percent of cytosines for each gene was calculated by the formula: number of cytosines (C)/total number of bases (A/T/C/G), as listed in Additional file 5: Table S5. Afterwards, comparison between each two clades were conducted, the cytosines content of each gene were averaged across all the members of each clade, and two-tailed Wilcoxon rank-sum test was used to perform pairwise comparisons. The statistical result showed that a remarkable significance (p < 0.05) was detected in the comparison of cytosines content between each two clades except for gymnosperms-angiosperms, the percent of cytosines of ferns was far below than that of angiosperms, followed by gymnosperms, as shown in Fig. 5a. The percent of cytosines of each gene across the 21 plants were further plotted in Fig. 5b, which also demonstrated that the percent of cytosines dramatically declined roughly along with the evolution, with a few exceptions (such as rpl23 gene). One striking example was psbI gene, which has the highest percent of cytosines in Selaginella moellendorffii (~ 0.37), and dropped to about 0.18 in other species. While the highest average of 51 shared genes was found in Selaginella moellendorffii (~ 0.26), the smallest average was found in Glycine max (~ 0.17). Furthermore, Ginkgo biloba and Liriodendron tulipifera have higher average in angiosperms, correspond to ~ 0.192 and ~ 0.188 respectively, showed a positive correlation with their high numbers of RNA editing sites. The above results indicated the number of editing sites declined dramatically with the evolution of plants, which maybe due to loss of cytosine content in chloroplast gene for later-branch plants.
We illustrated one RNA editing example of atpA gene that may help to understand the evolution trajectory across plants vividly. The gene sequences of atpA across 21 species were collected, intersection of all the species’ RNA editing sites of atpA was concatenated for alignment and annotated in Fig. 6. We marked all editing sites identified in atpA gene by yellow color, the distribution of its editing sites demonstrated that numbers of editing sites as well as cytosine content declined from ferns to angiosperms. We found that editing sites at third codon positions were poorly conserved, for example, in Fig. 6a, RNA editing in site a was only occurred in Adiantum aleuticum in spite of existence of cytosine in other fern members, indicating that synonymous substitution at third codon positions were not active in re-establishing functional proteins. However, sites b and c are both located at second codon position, in contrast to site b that occurred in all three clades, RNA editing in site c occurred in several members of ferns and angiosperms, was absent in gymnosperms where the cytosine have already corrected to thymine in the genome. Whereas in site d, RNA editing occurred in two members of gymnosperms, and the base types of other 19 species were all corrected to thymine in the genome. The above results demonstrated the diversity of RNA editing evolution, further validated that RNA editing in plant might evolve independently in distant evolutionary lineages, and in certain higher plants, new editing sites may occur occasionally.