A comprehensive study on chloroplast RNA editing by performing a broad-spectrum RNA-seq analysis

RNA editing is a post-transcriptional modication that complement variation at the DNA level. Until now, different RNA editing systems were found in the major eukaryotic lineages. However, the evolution trajectory in plant chloroplast remains unclear. To gain a better understanding of RNA editing in plant chloroplast, in this study, based on publicly available RNA-seq data across three clades (fern, gymnosperm, and angiosperm), we provided a detailed analysis of RNA editing events in plant chloroplasts and discussed the evolution of RNA editing in land plants.

editing predominantly take place at the rst or second positions of codons, thereby affects the translated regions of protein coding transcripts. Occasionally, it can produce a new protein with a different function from the genome-encoded transcript. The amino acids speci ed by the altered codons generated by editing are generally conserved in evolution, suggesting that most RNA editing events can restore the evolutionarily conserved amino acid residues in mRNAs [6]. RNA editing thereby is an essential process to maintain essential functions of encoded proteins at the RNA level. For example, pigment de ciency in nightshade/tobacco cybrids is caused by the failure of editing the plastid ATPase alpha-subunit mRNA [7].
Numerous studies have demonstrated that RNA editing plays an important role in various plant fundamental processes, such as organelle biogenesis, adaptation to environmental changes, and signal transduction [8][9][10]. A number of factors are involved in plant RNA editing, and considered to interact with one another to form a large protein complex, termed as editsome, PLS subfamily members of pentatrico peptide repeat (PPR) proteins function in site recognition of the target C, multiple organelle RNA editing factors (MORF) family members are components of the RNA editosome and also required for RNA editing at multiple editing sites in plants [11,12]. Almost all the PPR proteins are localized in either chloroplasts or mitochondria where those proteins participate in different facets of RNA metabolism such as RNA splicing, RNA stability, and translational initiation [13]. Despite progress in increasing editing sites identi ed and the mechanism underlying editing target recognition, the biological function of RNA editing in plants remains the fundamental question.
More and more recent studies demonstrated that RNA editing is a widespread phenomenon that occurred in the various land plants, including the liverworts, mosses, hornworts, lycopods, ferns and owering plants. However, no instance of RNA editing has yet been detected in Marchantiidae and algae, suggesting that RNA editing may have evolved in organelles only after the green plants established themselves on the land [6]. Many excellent studies in different plants have recently appeared and described their mechanistic and functional aspects [14][15][16]. The frequency of organellar RNA editing varies from zero to thousands of sites across plant kingdom, among land plants. The early-diverging lineages show the highest numbers of editing sites compared with higher plants that undergoing an extensive loss of editing sites through the substitution of genomic editable cytidines to thymidines [17].
The evidence gathered to date suggests that RNA editing in plant organelles evolved independently from RNA processes in distant evolutionary lineages of animals, fungi, and protozoans when the rst plants moved from the aquatic environment onto the land. Evolutionary studies can help to understand the puzzling nature of RNA editing in plants.
The straightforward way to detect RNA editing sites is to compare RNAs with their corresponding DNA templates. Sanger sequencing of cDNAs was widely used in the last two decades, though it is timeconsuming and prone to underestimate editing sites. In recent years, the advent of next-generation sequencing technologies and the availability of large quantities of RNA sequencing data makes it possible to identify RNA editing sites and quantify their editing extent on a large scale. This strategy allows a transcriptome-wide fast detection of editing sites and has enormous potential to deepen our knowledge of transcriptional processes in plant. Indeed, the number of complete plant organellar genomes and related transcriptome data have considerably grown in the last decade, and tens of thousands of editing sites have been identi ed in more and more plants [18,19]. However, this strategy is also a challenging task due to its accuracy of mapping the RNA-seq reads against genomic sequence, hence, so different bioinformatic strategies have been introduced to improve the detection of RNA editing sites [20][21][22].
To gain a better understanding of RNA editing in plant chloroplast, we investigated RNA editing located in chloroplast genome across diverse plants that distributed in three clades: fern, gymnosperm, and angiosperm. Through systematically exploiting enormous RNA-seq data derived from public database, thousands of RNA editing sites were detected, with a lot of these sites have not been previously reported, the data set provides valuable information for the research community. Additionally, we provided a detailed analysis of data in land plant chloroplasts and propose a model for the evolution of RNA editing in land plants.

Identi cation of RNA editing sites
We selected a series of plants to represent distant evolutionary lineages judging by two criterions, one is enough transcriptomic data in the SRA database at NCBI, another is availability of sequenced chloroplast genome. Hence, 21 plants, consisting of 6 ferns, 4 gymnosperms and 11 angiosperms, and corresponding 317 SRA accessions were chosen nally. Detailed information of SRA data and chloroplast genome accessions were listed in Additional le 1: Table S1. Based on results of RNA-seq data mapping and SNP calling, an automated bioinformatics pipeline implemented in REDO tools was conducted under default thresholds. Consequently, there were a total of 6,011 raw editing sites located in leaf chloroplast genome were detected. However, in spite of multiple stringent lters, sequence mismatches that accord with RNA editing occasionally appear. Hence, to eliminate these false positive mismatches, we manually examined all mismatches, only C-to-U and U-to-C editing types were kept, therefore, a total of 5,389 RNA editing sites, consisting of 264 U-to-C and 5,125 C-to-U edit sites, were nally screened out for further analysis.
For the reason that RNA-seq data volume varies among different species, for several species, such as Adiantum aleuticum, Histiopteris incisa, lacked enough RNA-seq data, caused the read depth of these two species was lower than that of other species', consequently result in less number of editing sites identi ed in these two species. Furthermore, the read density also varied widely between different genomic regions, ranges from less than 30 to more than 800 in a few species, demonstrating various expression level of genes, the average depth distribution across 21 species was shown in Additional le 6: Figure S1. Compared with angiosperms, ferns and gymnosperms have more lower depth, indicating the actual number of RNA editing in ferns and gymnosperms might be underestimated. To evaluate the reliability of editing sites number, PREPACT webserver was also used to predict editing sites, we found that the distribution of predicted number of RNA editing based on PREPACT accorded with that of our prediction based on RNA-seq data basically (see Additional le 6: Figure S2), re ecting our pipeline offered a high performance with reliable results. To describe the attributes of RNA editing sites, we illustrated by one example of samples of Adiantum aleuticum (see Additional le 6: Figure S3), which depicts the reliability of RNA editing sites in each species by REDO tools statistically.

Characteristics of the statistics for RNA editing sites
The statistics of raw results showed that C-to-U was the dominate editing type (nearly ~ 95.1% ), the next is U-to-C type, other mismatches types were rare (see Fig. 1c). After manual inspection of raw results and elimination of mismatches, ltered RNA editing sites across 21 plants were screened out, detailed corresponding annotation information les were produced simultaneously, as listed in Additional le 2: Table S2, the statistical result was also listed in Additional le 3: Table S3, Table 1 and Fig. 1. In terms of codon position, 21.97%, 67.16%, and 10.58% of sites were edited at rst, second, and third codon positions respectively (Fig. 1b). Furthermore, the statistics of editing types showed that the majority (~ 95%) of the editing events resulted in non-synonymous codon changes, and the amino acids changes tend to be hydrophobic, the change from hydrophilic to hydrophobic was the highest, followed by the change from hydrophobic to hydrophobic. The most common amino acid change types were Ser-to-Leu and Pro-to-Leu, as shown in Fig. 1d, Serine is hydrophilic, whereas Leucine and Proline are both hydrophobic. The above results demonstrated that the RNA editing exhibited a selective advantage in overall increase in hydrophobicity of the resulting proteins, which was also in good agreement with previous studies [10]. All the ltered editing sites were located in 1,038 genes across all species, the average percent of edited genes in the three lineages (ferns, gymnosperms and angiosperms) are 0.85, 0.61, and 0.39 respectively (Table 1). Compared to the latter two lineages, more sites and genes were edited in chloroplast transcripts from ferns remarkably, see Fig. 1a. In ferns, the largest number of editing sites was found in Selaginella moellendor i, with 1,316 editing sites, exclusively of the C-to-U type, which is nearly 100-fold more abundant than that of owering plants, a total of 77 genes (~ 96%) were edited, and the overwhelming majority (94.3%) of the 1,241 silent editing events altered codon directly. Differ from Selaginella moellendor i that exclusively belonging to taxa of lycopsida, the other ve fern plants are members of Leptosporangiopsida, has relatively smaller numbers of editing sites, represented by Cyrtomium fortune, owning second-largest number of editing sites among all speci es, with 610 editing sites and 79 edited genes (~ 91%). Whereas for gymnosperms, the average number of editing sites and percent of edited genes were all less than that of ferns. Compared with other three species of the same taxa, Ginkgo biloba has the most editing sites, with 291 editing sites and 68 edited genes (~ 81%) in chloroplast, re ecting its ancient nature of "living fossil". On the opposite end, angiosperms have the lowest average numbers of editing sites and only a part of genes were effectively edited. It was noticeable that Liriodendron tulipifera and Nelumbo nucifera distinguished them from other angiosperms with more editing events. Liriodendron tulipifera is one of the most ancient owering trees as the genus dates back about 65 million years, a total of 265 editing sites were detected, which is nearly 3-fold more abundant than that of other angiosperms and gymnosperms except Ginkgo biloba, the percent of edited genes was up to 75%, well above the average, re ecting early angiosperms possess much more diversi ed editing sites.
The average number of editing sites among other 9 angiosperms showed no signi cant differences among them, but they were signi cantly lower than those of Liriodendron tulipifera and Nelumbo nucifera. The above results illustrated the differential distribution of RNA editing among varied plants, the cases of Selaginella moellendor i and Liriodendron tulipifera also implies independent origins and subsequent evolutionary trajectories of editing processes.

Variability of RNA editing among plants and genes
To further explore the evolutionary trajectory of RNA editing among different species, for each gene, we summed the number of the editing sites across 21 plants (see Additional le 4: Table S4), and picked out the top 30 genes with most editing sites across 21 plants for cluster analysis, as shown in Fig. 2. We found that not every gene were edited in all the species, for certain species, the lack of editing at a few genes may be explained by two reasons, one is absence of the genes that annotated in chloroplast genome, another is no RNA editing occurred in the genes actually. By grouping the genes based on their function, genes encoding membrane subunits of the chloroplast NDH complex and RNA polymerase exhibited the largest average numbers of editing sites, while ribosomal subunits showed the lowest numbers, this is consistent with previous studies that RNA editing occurred preferentially in genes encoding membrane-bound proteins under strong selection [24]. Due to the well-studied background and abundant editing sites in plant, ndhB gene is assumed to be a good case for the study of RNA editing evolution, in our study, ndhB was also con rmed to possess the most editing sites based transcriptome data, with 333 editing sites spread across 17 species. In spite of this, there is a biased distribution of RNA editing sites in ndhB among three clades, in fern group, 50 sites were detected in Selaginella moellendor i, and about 10 sites in the other ferns, in angiosperms, there were about 20 editing sites in each of the 11 angiosperm plants. In gymnosperms, RNA editing in ndhB was only detected in Ginkgo biloba, for Picea abies and Pinus massoniana, no ndhB gene annotated in their chloroplast genome, whereas for Cycas revolute, no RNA editing events were detected in its ndhB gene, which may result from loss of editing or too low depth around genomic regions of its ndhB gene.
To compare the number of editing sites among the top 30 genes, a matrix of numbers of editing sites across 21 species was produced, a hierarchically-clustered heatmap was plotted in Fig. 2, which showed that the clustering relationships agreed with their phylogenetic tree based on sequence alignments roughly. 21 plants were divided into three clustering groups with two exceptions, Selaginella moellendor i and Liriodendron tulipifera were clustered far away from their own clades respectively, further implied independent origins and subsequent evolutionary trajectories of editing processes. The above results suggest that RNA editing in chloroplast may break out in early-branching plants from different clades simultaneously and suffer a lot of loss during evolution. Hierarchically clustered heatmaps of numbers of all RNA editing genes in chloroplast across 21 plants and each clade were shown in Additional le 6: Figure S4-7 respectively.
Uneven distribution of RNA editing extent RNA editing extent was used to measure to what extent the edited transcripts among all transcriptome for one gene, if one site was edited, the C/G base (wild type) should be changed to the T/A base (edited type), thus its editing extent can be calculated by the formula: depth of edited bases (T and A)/total read depth of bases. In this study, we explored the distribution of editing extent among codon positions, species and edited genes. First, the comparison between codon positions showed that the distribution of RNA editing extent among them was uneven, and did not comply with the normal distribution, featuring an peak around ~ 0.2 and fat tails, as shown in Fig. 3.  Table S5. Afterwards, comparison between each two clades were conducted, the cytosines content of each gene were averaged across all the members of each clade, and two-tailed Wilcoxon rank-sum test was used to perform pairwise comparisons. The statistical result showed that a remarkable signi cance (p < 0.05) was detected in the comparison of cytosines content between each two clades except for gymnospermsangiosperms, the percent of cytosines of ferns was far below than that of angiosperms, followed by gymnosperms, as shown in Fig. 5a. The percent of cytosines of each gene across the 21 plants were further plotted in Fig. 5b, which also demonstrated that the percent of cytosines dramatically declined roughly along with the evolution, with a few exceptions (such as rpl23 gene). One striking example was psbI gene, which has the highest percent of cytosines in Selaginella moellendor i (~ 0.37), and dropped to about 0.18 in other species. While the highest average of 51 shared genes was found in Selaginella moellendor i (~ 0.26), the smallest average was found in Glycine max (~ 0.17). Furthermore, Ginkgo biloba and Liriodendron tulipifera have higher average in angiosperms, correspond to ~ 0.192 and ~ 0.188 respectively, showed a positive correlation with their high numbers of RNA editing sites. The above results indicated the number of editing sites declined dramatically with the evolution of plants, which maybe due to loss of cytosine content in chloroplast gene for later-branch plants.
We illustrated one RNA editing example of atpA gene that may help to understand the evolution trajectory across plants vividly. The gene sequences of atpA across 21 species were collected, intersection of all the species' RNA editing sites of atpA was concatenated for alignment and annotated in Fig. 6. We marked all editing sites identi ed in atpA gene by yellow color, the distribution of its editing sites demonstrated that numbers of editing sites as well as cytosine content declined from ferns to angiosperms. We found that editing sites at third codon positions were poorly conserved, for example, in

Discussion
As a post-transcription process, RNA editing can modify the genome template to produce a different transcript [6]. For plants, signi cant progress in RNA editing has been made in recent years, numerous studies have proved that RNA editing occurred in nearly all plants in the kingdom, and demonstrated that RNA editing played roles not only in abiotic stress tolerance but also likely in the plant development, such as ower development and male sterile [9, 10, 14, 19], RNA editing may also cause secondary structure transformation of transcripts [25]. Until now, there are two viewpoints about the nature of RNA editing, one is contribution to variations in proteomic sequence, and play roles in modulating gene expression; another point thinks that RNA editing in plants might be a repair mechanism to correct genomic point mutations at post-transcription level, thus increase the substitutional rate that is extremely low in organellar genome [26,27].
With prior knowledge, to detect editing sites, RT-PCR primers should be designed to ank a region that contains the sites of interest, the editing sites can be determined by comparing PCR products with the genomic DNA sequence. However, this traditional approach is a time-consuming procedure requiring larger experiments to perform, and prone to underestimate number of editing sites and overestimate editing extent, because for site with editing extent less than 10%, to nd one edited transcript, more than 10 clones have to be sequenced when comparing its cDNA with genomic sequences. With the advent of sequencing technology and bioinformatics, more and more RNA-seq data was generated, combined with the sequencing data and bioinformatics tools, much progress has been made in identifying all the potential RNA editing sites and quantifying their editing extent especially for the sites with very low extent. Hence, RNA editing sites were identi ed in more and more organisms based on RNA deep sequencing [14,19,28].
In this study, to gain a better understanding of RNA editing in plant chloroplast, we collected a large mount of RNA-seq data and performed a series of bioinformatics procedures to investigate RNA editing in 21 diverse plants that distributed in three clades. A total of 5,389 editing sites located in leaf chloroplast genes across 21 plants were identi ed and quanti ed their editing extent, demonstrating the powerfulness of bioinformatics approach, many of the identi ed sites have not been previously reported, thus provided a valuable data resource for future research. The statistics results revealed RNA editing sites that occurred in second codon position was mainly the largest, and majority (~ 95%) of the editing events resulted in non-synonymous codon changes, additionally, editing signi cantly increased the hydrophobic amino acid with a selective advantage. We found that the cluster relationship of numbers of RNA editing sites complied with the phylogenetic tree based on gene sequences approximately, further veri ed that the RNA editing across plant kingdom are comparatively conservative and accord with laws of evolution roughly. An uneven distribution of editing sites among species, genes, codon positions were found for numbers of editing sites as well as the average RNA editing extent. In total, numbers of editing sites declined with the evolution of plants, editing events occur more often in ancient plant than higher plant, such as Ginkgo biloba, Liriodendron tulipifera, which both owned highest number of editing sites in each clade. Compared with other species, fern plant Selaginella moellendor i was identi ed to own the highest number of sites and the lowest editing extent (~ 0.43). We also found that a reasonable percentage of editing sites occurred in certain clades, and lost in other clades whose cytosine already corrected to thymine in the site of genome, RNA-editing activities affecting third-codon positions showed a higher evolutionary variability. The decrease of cytosine content in chloroplast gene for later-branch plants might explain the reason for declined number of editing sites with the evolution of plants.
Previous studies revealed that organelle genomes have a more slower evolutionary rate than nuclear genome, thus accumulated a number of T-to-C mutation that constitutes a prerequisite for generation of plant organellar RNA editing [29], which could correct those T-to-C mutations to restore the evolutionarily conserved amino acid residues in mRNAs. During the evolution of land plants, most mutations would have been nally corrected to thymine in the genome especially for higher plants, but some sites still need to be edited to thymine or remained to be cytosine for coding for different amino acids. During the evolution of higher plants, genome mutations of C-to-T eliminate the need for editing at certain sites, thus the number of editing sites showed a remarkable reduction, since substitution of genome are more constantly than that of mRNA level. Hence, the genome mutations is actually the driving force behind the evolution of editing sites in plants, the increasing modi cation of C-to-T at the genome level might be more accurately to describe the evolution trajectory instead of loss of RNA editing sites, The uneven distribution of editing sites among species, genes, codon positions implied their independent origins across three clades, suggesting that in early-branching plants of different clades, chloroplast RNA editing may break out simultaneously and suffer a lot of loss during evolution.

Conclusion
To illuminate the evolution mechanism on a genome-wide scale, we performed a systematic characterization and comparison of RNA editing events across three major clades of plants. Based on a large amount of RNA-seq data, a relaxed automated approach combined with manual inspection that eliminated false positives was used, and nally screened out thousands of RNA editing sites, demonstrating the advantages of bioinformatics approach in detection of RNA editing sites. Many of the identi ed sites have not been previously reported so far and provide a valuable data set for future research. The distribution of editing sites showed an heterogeneity characteristics among species, genes, and codon positions. The average RNA editing extent also varied among different plants as well as genes. The genome-wide distributions of chloroplast RNA editing across three clades suggest that plants have undergone drastic changes in both the numbers and patterns of editing. Ancient plants have more editing sites and lower editing extent, while constant genome mutations occurring by replacing cytosines with thymidines might be the underlying force of loss of editing sites in higher plants. Our comparative study provided valuable information for evolution of RNA editing in plants. However, further mechanistic studies are still needed to characterize the RNA-binding proteins that involved in site recognition and editing across different clades.

Data collection
All data sets used in this study are publicly available. We selected 21 plants across three clades (fern, gymnosperm, and angiosperm) for analysis of chloroplast RNA editing events. For each plant, the corresponding transcriptome data was downloaded from SRA database at NCBI based on two criteria: rst, to increase reliability of editing sites, SRA accessions with paired-end reads that possess higher mapping speci city were preferred, second, for the reason that more chloroplast mRNA in leaf were extracted and sequenced compared with other tissues, RNA-seq data obtained from leaves of wild type individuals were only selected. Besides, for each plant, the reference le consisting of chloroplast genome sequences and corresponding gene annotation le in 'tbl' format were also downloaded from the GenBank database. Detailed information of SRA data and reference les used in our study were listed in Additional le 1: Table S1.

Read mapping and SNP calling
In total, the identi cation process of RNA editing sites can be decomposed into three steps: rst read alignment, second the SNP calling, and third detection of RNA editing sites. For each plant, in order to increase sequencing depth, we merged all the SRA accessions from the same species into one sample.
The quality control of paired-end Illumina sequencing data were evaluated rst by NGSQCToolkit, low quality sequence data were ltered out (cutOffQualScore < 20) [30], then transcriptome data from each plant was aligned against its chloroplast reference by hisat2 software under default parameters [31].
Afterwards, the alignment results were sorted, removed duplicates, indexed, and sorted by using SAMtools [32]. Finally, the bcftools tool was used to identify SNPs, and VCF les that describe transcriptome variation were generated [33].

Detection of RNA editing sites
The principles of RNA editing detection is similar to that of transcriptome variation in SAMtools. Thus, we used the preliminary output from variant calling software to identify RNA editing sites. For each plant, based on the SNP-calling results (in "VCF" format) and genome annotation les (in "tbl" format), the RNA editing sites were identi ed under default parameter values by using the REDO tool [21]. REDO  To minimize the number of false negatives with the automated approach, for the produced raw RNA editing sites, we manually examined all mismatches to eliminate false positives, and only kept the C-to-U and U-to-C editing types and excluded mismatches with other editing types, such as A-to-C, T-to-A, etc. In addition, to evaluate the reliability of editing sites number, for each plant, we also used PREPACT tools to predict potential RNA editing events supplying with entire chloroplast genomes as input les, with lter threshold at least 80% of the references under BLASTX mode, PREPACT originally relied on BLASTX hits in manually assembled collections of reference protein sequences [34].

Characteristic statistics
All the ltered RNA editing sites detected in 21 plants were used for further statistics and feature analysis, including statistics of editing number, editing type, codon position, amino acid changes, involved genes and so on. In order to decipher the distribution of RNA editing frequency across different species, the top 30 genes with most editing sites across 21 plants were selected, cluster analysis and heatmap plotting were also provided based on the matrix of RNA editing numbers of top 30 genes across 21 plant. The CDS sequences of top 30 genes across 21 plants were concatenated and subjected to alignments and phylogenetic tree construction using MEGA [35]. Meanwhile, the RNA editing extent of top 30 genes were also subjected to statistical analysis. In terms of RNA editing extent, its value at one site was expressed as the proportion between edited transcripts and total transcripts. If one site was edited, the C/G base (wild type) should be altered to the T/A base (edited type), since one editing site could be detected hundreds of times via sequencing, the number of wild type (C/G) or edited type (T/A) of bases could then be counted at this particular site, then the editing extent at one site could then be calculated by the formula: depth of edited bases (T and A)/total read depth of bases. Values of editing extent matrix were normalized by subtracting the row-wise mean from the values in each row of data and multiplying all values in each row of data by value of standard deviation. For each lineage ( fern, gymnosperm, and angiosperm), a heatmap was plotted across all of its species respectively using "pheatmap" function in R, the distance matrix of different samples was calculated using "dist" function with the default Euclidean method, and the hierarchical clustering was computed using "hclust" function.
Considering protein coding genes varied among different plant, we picked out shared edited genes for the statistics of cytosine content across 21 plant. For each shared edited gene, we extracted its CDS sequence, and calculated the ratio of cytosine content, and further performed pairwise comparisons between any two of lineages (fern, gymn, and angi). Two-tailed Wilcoxon rank-sum test was used. We illustrated atpA gene as an example, sequence logo of atpA gene was produced by WebLogo [36], alignment was constructed by using MEGA under default parameters [35].