Integrated analysis of four new sequenced fern Chloroplast Genomes: Genome structure and Comparative Analysis

Background: Dryopteris goeringiana (Kunze) Arthyrium ex Dryopteris crassirhizoma Nakai, and Polystichum tripteron (Kunze) Presl are fern species with their chloroplast genomes uncovered, except D. crassirhizoma . In this study, high throughput sequences were performed to get better understand of their chloroplast genomes and comparison analysis with other fern species. Methods: Fresh fern leaves were used for sequencing. Simple Sequence Repeat (SSR) analysis, high-variety regions comparison, IR border comparison, nucleotide diversity, RNA editing, and phylogenetic analysis were performed. Results: They have different genome structure characteristics in terms of genome and its area size, gene types and location. The quantity and type of SSRs in D. crassirhizoma is very similar to D. goeringiana , with (ATAA)2, (ATCT)1 and (TTTA)1 only present in D. crassirhizoma . SSRs contain more AT bases than GC. There were divergent genes existed in fern species. Ten genes have a Pi value >0.20. C-to-U RNA editing was most common. Phylogenetic analyses showed that species in the same genus clustered together. Conclusion: The genomic structure and genetic resources of D. goeringiana , A. brevifrons , D. crassirhizoma , and P. tripteron contribute to further studies on phylogenetics, population genetics, and conservation biology of ferns. Conceptualization, R.F. and Q.H.; methodology, W.M.; software, S.L.; resources, Q.H.; writing-original draft preparation, R.F.; writing-review and editing, Q.H., S.L., and

. From up to down: LSC: Large single copy; SSC: small single copy; IR: Inverted repeat.  (Table 3). We found 14 intron-containing genes in all these four chloroplast genomes, 11 of which contained one intron, the other three genes (clpP, rps12 and ycf3) contained two introns in D. crassirhizoma, A.
brevifrons and D. goeringiana (Table 4). Especially rps12, which contains one exon in LSC region and the other 2 reside in IR regions, was considered as a trans spliced gene separated by two introns.  Among all the 268 RNA editing events, the C-to-U and mutations were most common, reaching 120 (44.8%), followed by U-to-C 103 (38.4%), A-to-G 36 (13.4%) and G-to-A 9 (3.4%) of all the RNA editing events.

Phylogenetic analysis
We investigated the phylogenetic relatedness among the chloroplast genomes of 43 fern species, among which 35 nodes with support values greater than 90%, containing 27 nodes with support values of 100%. The fern species in the same genus were clustered together to a certain degree. D.
crassirhizoma, D. goeringiana were closer to D. decipiens. P. tripteron was identified as a sister genus to C. devexiscapulae. It was worth noting that A. brevifrons formed a single clade with Athyrium sinense, instead of P. glycyrrhiza ( Figure 6).

Discussion
This study reported four complete chloroplast genomes of D. crassirhizoma, A. brevifrons, D.
goeringiana and P. tripteron for the first time, ranging from 148,539 bp to 151,341 bp in length. The whole chloroplast genomes size of more than 20 ferns were compared in the study of Gao et al [26] and RuizRuano et al [1], which revealed that the length of ferns ranging between 131,760 bp (Equisetum hyemale) and 181,684 bp (C. devexiscapulae) based on the current researches, and the four ferns detected in our study is within this range. The genomes of these four ferns chloroplast exhibit a typical quadripartite structure, including LSC, SSC and two IRs, as reported in other spore plants [27,28]. Additionally, the gene number and order were largely similar with other ferns, but there were still some differences among species, which should let the length variation in the IR and SSC regions take the responsibility [29]. We compared the chloroplast genomes of D. crassirhizoma, D. goeringiana and D. fragrans [26], and found a genomes loss of 4033 bp in D. fragrans, which caused a longest SSC and shorter IRs. Therefore, the D. fragrans was found has dispersed gene distribution and longer sequence length, because of more intergenic sequences. As demonstrated by Gao et al [26] and Lu et al [16], they revealed that fern chloroplast genome extends the intergenic sequences but reduces the overlapping genes, thus, sequence utilization is more specific and genes are more independent.
Four fern genomes encoded 132-135 genes, composed with 89 protein-coding, eight ribosomal RNAs and 35-38 transfer RNA genes in this study. The genes clpP, ycf3, and rps12 were found containing two introns, was also reported in many other ferns, as Alsophila podophylla [30] and Alsophia gigantea [31]. matK only in P. tripteron was found had two introns in this study. It was worth noting that, a plastid research of D. crassirhizoma detected by Xu et al [32] had the different result with this study. They demonstrated that ycf2, rpoB, clpP and ndhF encoded four, three, two and two introns, respectively. As reported in other angiosperm cp genomes [29,33], rps12 is a coding unequally divided gene with its 5' terminal exon located in the LSC region, while two copies of the 3' terminal exon and intron are in IRs.
SSRs with its length ranging from 1-6 or more base pairs, play an important role in genetic molecular markers for population genetics [34,35] and are also widely applied for plant genotyping [36,37]. In this study, the number of SSRs ranging from 69 to 108, with lowest in D. goeringiana and highest in P.
tripteron. The quantity and type of SSRs showed a high similarity in D. crassirhizoma and D.
goeringiana, which might for the same genus they were belonging. Although A. brevifrons and P.
tripteron obtained the same quantity of SSRs, but the type of SSRs differed largely.
It had been reported that chloroplast genomes SSRs are commonly composed of short polyadenine polyadenine (Poly A) or polythymine (poly T) repeats, but rarely tandem guanine (G) or cytosine (C) repeats, which is consistent with this study [38]. However, this conclusion is not working in D.
fragrans, which have high GC content than AT in SSRs. Taken the living environment into consideration, the author speculated that high GC content of repeat structures may allow D. fragrans coping with heat and large temperature differences [26]. The kind and amount of SSRs in D. fragrans, D. crassirhizoma and D. goeringiana exist great differences though they all belonging to the Dryopteridaceae family. This implies that great differences exist within species of the same family, which might for the different conditions they survive.
We also revealed that the sequence of chloroplast genome was similar with each other in this study and other four fern species (Figure 3). However, relatively small variation was observed within them in several comparable genomic regions. Furthermore, LSC and SSC regions were less similar than IR regions as previously reported [39,40]. Similar result was also reported in chloroplast genome of higher plant, which demonstrated that the lower sequence divergence in IR regions was possibly owing to copy correction between IR sequences by gene conversion when compared with SSC and LSC regions [41]. Moreover, some previous reports revealed that LSC region contains most of the divergent genes, and display a trend towards more rapid evolution [29,32]. The divergent regions included rpoC2, rpoB, psbC, pasA, rbcL, ycf2, ycf1 and ndhB in this study, as reported previously [32].
In spite of high conservation in IR regions, the expansion and contraction are still take responsible for variations in chloroplast genome size and rearrangement [39,42], therefore, play a vital role in evolution [15,43]. chlL and ndhF extended into IR regions with different length. All these ferns have relatively similar boundary characteristics with the expansion and contraction of IR regions except D.
goeringiana, which presented opposite in the junctions of SSC/IRa and IRb/SSC with other six fern species in our study and other study [32].
The nucleotide diversity analysis also demonstrated that genes located in IR regions are less variable than in the SC regions. Additionally, genes with Pi value > 0.20 were mainly located at SC regions.
It was reported that fern and hornwort chloroplast genomes have a relatively high levels of RNA editing compared with seed plants [44,45], whereas only 30-40 RNA editing sites typically occurred.
Most editing events in these four chloroplast genomes were C-to-U events in this study, which is consistent with D. fragrans [26]. It has been reported that the excess of C-to-U RNA editing developed in early stages of vascular plant evolution [46].
Chloroplast genome data are valuable for resolving species definitions since organelle-based "barcodes" can be established for certain species and then applied to unmask interspecies phylogenetic relationship [47]. The phylogenetic relationships of A. brevifrons, D. goeringiana and P.
tripteron were rarely been studied for there was none studied had put focus on the chloroplast of them. In our study, we combined 43 species of ferns participated in the phylogenetic tree analysis, which contained almost all the ferns that been studied before. We found that D. crassirhizoma, D.
goeringiana were closer to D. decipiens. P. tripteron was identified as a sister genus to C.
devexiscapulae. Interestingly, D. decipiens and C. devexiscapulae were found clustered into one branch in the study of Wei et al [18], and have close relationship with D. crassirhizoma in the study of Xu et al [32]. According to the result, we could guess that D. crassirhizoma, D. goeringiana, D.

Conclusions
In this study, we conducted and compared chloroplast genome skimming of D. goeringiana, A.
brevifrons, D. crassirhizoma and P. tripteron. Through the study, we found that they all obtain the typical quadripartite structure with slightly differences in genes and gene orders. SSR features differed largely among different species, so it might be able to be used in developing molecular markers for genetic diversity and molecular identification. The genomes in this study also show some variations and IR region expansion and contraction when compared with the other fern species. In addition, nucleotide diversity provides a way to study the genetic variation of the four species.
Phylogenetic analysis gives a suggestion of the relationship between ferns species. In conclusion, the genomic characteristics, comparison and genetic resources presented in this study contribute to further studies on phylogenetics, population genetics, and conservation biology of ferns. Illumina HiSeq 4000 platform (Biozeron Co., Ltd., China) for sequencing [49].

Genome assembly and annotation
Prior to assembly, reads with low quality were removed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). After that, the chloroplast genomes were assembled in three steps [50]. First, the clean reads were assembled into contigs using SOAPdenovo 2.04 [51]. Second, clean reads were mapped to the contigs for assembly and optimization using SOAPGapCloser 1.12 [52]. Third, the redundant sequence is removed to get the final assembly result.

Phylogenetic analysis
A total of 43 fern species were used for the phylogenetic analysis. ClustalW2 was used to align the chloroplast DNA sequences under default parameters [60], and the alignment was checked manually.
The Maximum-likelihood (ML) methods were performed for the genome-wide phylogenetic analyses

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.