Sequencing of Taxodium Plastid Genomes
Illumina 150-bp paired-end sequencing of long-range PCR-amplified plastid DNA generated 5045–6946 Mb clean reads for the three sampled Taxodium species (Table 1). Using the combination of de novo and reference-guided assembly, we obtained complete plastid nucleotide sequences for all three species. The nucleotide sequences of the four plastid genomes range from 131,947 bp in T. distichum to 132,613 bp in T. ascendens (Table 2). Like the cp genomes of other cupressophyte species, they lack the IR region and have no distinct quadruple structure. The gene map of the T. ascendens plastid genome is presented in Fig. 1 as a representative. The three genomes encode an identical set of 120 genes (Additional file1), and the arrangements of these 120 genes are totally collinear (Additional file2). The 120 unique genes include 83 protein-coding genes (Table 2), 33 tRNA genes, and 4 rRNA genes. They also have similar GC contents of 35.22%-35.26%. This is similar to other gymnosperm plastid genomes.
Table 1
Sequencing and assembly results of three chloroplast genomes of Taxodium
Species | Raw data (Mb) | Clean data (Mb) | Clean data GC(%) | Clean data Q20(%) | Clean data Q30(%) | GC Content(%) | N rate (%) |
T. ascendens | 5329 | 4895 | 37.85 | 98.02 | 93.94 | 35.22% | 0% |
T. distichum | 5045 | 4695 | 35.74 | 98.11 | 94.2 | 35.26% | 0% |
T. mucronatum | 6946 | 6595 | 34.19 | 98.38 | 94.91 | 35.25% | 0% |
Note: Read length: read length of valid data; Clean data GC: average GC content of valid data; Clean data Q20: Q20 value of valid data; Clean data Q30: Q30 value of valid data. Total Length (bp): The total length of the sample assembly result; GC Content (%): GC content of the sample assembly sequence; N rate (%): the content of unknown base N in the sample assembly sequence. |
Table 2
Gene information statistics
Species | Accession No. | Genome size (bp) | Coding Gene number (#) | CDS total length (bp) | CDS average length (bp) | CDS length / Genome (%) |
T. ascendens | MN535012 | 132,613 | 83 | 74,469 | 897 | 56.16 |
T. distichum | MN535013 | 131,947 | 83 | 74,217 | 894 | 56.25 |
T. mucronatum | MN535011 | 132,037 | 83 | 74,217 | 894 | 56.21 |
Accession No.: Accession number of the complete chloroplast genome in genebank database. CDS: coding sequence. |
Tandem Repeats Analysis
A TR is a repetitive sequence of adjacent specific nucleic acid sequence patterns repeated twice or more, including simple sequence repeats (SSR), whose repeat motif is 1-6 nucleotides, and long sequence repeats, whose repeat motif is ≥ 7 nucleotides. A total of 639 TRs were detected in the T. ascendens cp genome using Phobos (Additional file 3); the total length of repeats was 8462 bp, and TRs were widely distributed in the coding and non-coding regions of the cp genome (Fig. 2, Circle 3). Repeated motifs ranged from mononucleotide to 95-nucleotide. Among these, 601 were SSRs and 38 were long sequence repeats. Mononucleotide repeats were the most abundant SSRs, accounting for 38.03% (243) of the total, of which 238 repeat units were A/T and only five were G/C (Additional file 4). Of the 55 (8.61%) dinucleotide repeats, 34 were AT/TA type, 17 were AG/TC type, four were AC/TG type, and none were GC/CG type. For trinucleotide repeats, except for four AGC/TCG and two AGG/TCC types, the rest were all repeat types with A/T ratio higher than GC. Among the 79 (12.36%) tetranucleotide repeats, 64 had a higher A/T ratio than GC. Because of the high A/T content of repeat motifs, an increase in repeats will lead to a low GC content of chloroplast genes. The total length of repeats ranged from 7 bp to 453 bp, of which 325 (50.86%) were short and less than 10 bp, 286 (44.76%) were medium-sized and TRs ranging from 10 bp to 20 bp, and 28 (4.38%) were long and TRs ranging from 20 bp to 50 bp. Long TRs were mostly distributed in non-coding regions (Fig. 2, Circle 3, Green dots). The total length of seven long TRs was more than 50 bp (Table 3). Their total lengths were 98 bp(ycf2), 110 bp(psbJ-clpP), 116 bp(rps18), 145 bp(trnI-ycf2), 152 bp(clpP-accD), 333 bp(ycf1) and 453 bp(clpP-accD) (Additional file 3).
Researches have shown that there are many TRs on the CDS of accD gene and its surrounding regions of gymnosperms[13]. In order to study the general features of ycf1 genes in cupressophytes, we analyzed the ycf1 gene sequences in 44 species(Fig 3). The length of ycf1 gene in Pinaceae is similar to that of Ginkgo biloba and Cycas, which were about 5000bp. However, the ycf1 gene length in cupressophytes experienced an extraordinary expansion, ranging from 6666bp to 8931 bp, with Taxus baccata and Sciadopitys verticillata has the shortest and longest CDS, respectivelly. There were no TRs in Ginkgo biloba and Cycas cp gemones, but, except for four species, TRs were detected in most conifers. In the Taxodium, TRs were only detected in T. ascenden. The same situation happened in Cupressus, with TRs are detected in Cu. chengiana and Cu. gigantea but not in Cu. jiangeensis. It can be seen from figure 3 that the insertion positions of TRs on the ycf1 CDS were family specific. For Cupressaceae, there are two major insertion positions, one located in the middle of the CDS, the other one is near the C-terminal region.
Dispersed Repeats
Fifty dispersed repeats were detected in the T. distichum, T. mucronatum, and T. ascenden cp genomes, respectively, using REPuter (Additional file 5). In the T. distichum cp genome, there were 24 forward repeats, 21 palindromic repeats, three complement repeats, and two reverse repeats. In the T. mucronatum cp genome, there were 26 forward repeats, 18 palindromic repeats, one complement repeat, and five reverse repeats. In the T. ascendens cp genome, only 36 forward repeats and 14 palindromic repeats were detected.
In cupressophyte cp genomes, the highly reduced IRs are replaced by short repeats that have the potential to mediate homologous recombination [6, 11, 16]. Three sIRs >100bp were detected in T. ascendens (Table 4), and their sequences were identical in all three Taxodium cp genomes. Among them, sIR1 and sIR3 contained complete trnQ-UUG and trnI-CAU genes, respectively. For sIR2, one copy was located in the intergenic region(IGS) region of petA-ccsA and the other in the psbJ-clpP IGS.
The trnQ-UUG gene was the only one in a sIR longer than 200 bp. If the 282-bp IR is able to mediate homologous recombination (HR), we would expect the presence of two type isomers. Semi-quantitative PCR with a variable number of cycles was conducted to verify the presence of the two isomers. The isomers illustrated in Fig. 1 is designated as the type I , and the other is the type II. All four reactions generated products, which verified the presence of both the I and II forms. It was also apparent that there were minor differences in amplification efficiency between the four PCR reactions. With 30 PCR cycles, the electrophoresis bands of type 1(rps4/chlB products) are very bright, while those of type 2(psbK/trnL products) are much weaker (Fig. 4). These results suggest that the type I is predominant in T. ascendens, in agreement with our assembly results.
To quantify the relative frequency of the two isomeric genomic forms, Illumina paired-end reads were mapped to the genome and isomer frequencies were calculated using the method of Guo[16]. There were 297 read pairs that spanned the trnQ-containing IR copies, of which 293 pairs (98.65%) supported the type I isomer while four pairs (1.35%) supported the type II isomer.
Phylogenetic and Rearrangements Analysis
Phylogenetic tree based on the 51 single copy coding genes of 44 species were constructed(Fig. 5). Among the genus Taxodium, T. mucronatum and T. distichum were clustered together. The genus Taxodium has the closest relationship with Glyptostrobus, and then forms a group with Cryptomeria. Unanimously, mauve alignment showed that no rearrangements occurred between Taxodium and Glyptostrobus pensilis, while at least four rearrangements occurred between Taxodium and Cryptomeria japonica (Additional file 6).
Previous studies have shown that cycads possess the oldest sequence of genes in seed plants. We conducted mauve alignment between Taxodium and Cycad taitungensis cp genomes. Compared with Cycas, there were 13 conserved gene clusters in Taxodium cp genomes, which were labeled S01 to S13 (Table 5) (Figure 2, Circle 2). Therefore, there were at least 13 rearrangements in the process of transformation from cycad chloroplast genome structure to T. ascendens genome structure. The size of the conversed gene blocks ranged from 1236 bp to 40489 bp. Five of the 13 inversion endpoints occurred near tRNAs, including trnI, trnT, trnQ, trnF, and trnM. Interesting, there is a sequence between S07 and S08 that can't find homologous sequence on Cycad, and its position (101807-102061) overlaped with the 63-nucleotide repeat TR493 (101809-102141) on ycf1(Additional file 3). Two other inversion endpoints were also in TRs. The inversion endpoint (99473) between conversed gene blocks S07 and S08 was in the mononucleotide repeat TR484 (99466-99478) located in petA-ccsA. The inversion endpoint (116761) between conserved gene blocks S10 and S11 was in the 95-nucleotide repeat TP571 (116478-116930) located on clpP-accD.
The accD gene or its adjacent region is a hot rearrangement area of cupressophyte cp genomes, Li et al. found that there are five types of gene order in cupressophytes, and speculated that many inversion events have occurred here during the evolution of cupressophyte cp genomes [13].We also analysis the sequence variability of the genes adjacent to the ycf1 gene in Taxodium. The gene order around ycf1 of the 44 analyzed species could be classified into twelve types (Figure 6). Cycas, Ginkgo, and Taxaceae have the same gene order: ndhH-rps15-ycf1-chlN-chlL gene order(type I). As we can see from fig.6, at least one side (right side/left side)of ycf1 gene in type II(Pinaceae), type III (Podocarpuceae), type IV(Podocarpuceae) and type V(Podocarpuceae) maintains the same gene order of type I. However, in Cupressaceae (type VI to XII), gene orders of both sides (right side and left side)of ycf1 gene were totally different from type I. Therefore, the arrangement frequency around ycf1 gene in Cupressaceae was much higher. In the Cupressaceae family, Glyptostrobus, Taxodium ,Metasequoia, Taiwannia and Cunninghamia have a conversed trnL(UAG)- trnP(GGG)-ycf1-rpl20 –rps18(type VI) gene order. And in Juniperus, Cupressus and Hesperocyparis, the gene order is mainly: trnL(UAG)-ccsA-ycf1-trnL(CAA)-ycf2(type XI). In view of the diversity of gene organization around ycf1 gene, it is speculated that ycf1 gene may be frequently involved in the rearrangement events of cupressophytes cp genomes.
Comparative Analysis of Genomic Structure
Compared with T. ascendens, 83 indels, including 43 deletions and 40 insertions of different origins were detected in T. distichum and T. mucronatum(Fig. 2, Circle 4). Among them, 82 indels occurred in IGS regions, and only one 252bp indel occurred in the CDS region of ycf1. Therefore, the total CDS length of T. ascendens is 252bp longer than of T. distichum and T. mucronatum. The indel did not caused frame shifts or stop codons. Among the 83 indels, only seven (8.43%) were located outside repeat regions, and the remaining 76 (91.57%) were located within 51 TRs(Additional file 7). Among these, 64 indel sequences were integer multiples of repeat motifs, that is, the generation of indel sequences created differences in the number of complete repeat motifs. Twelve indel sequences were non-integer multiples of repeat motifs, i.e., the indel sequences contained partial incomplete repeat motif sequences. Of the 51 TRs containing indels, 47 (92.16%) were SSR indels of 1-4 nucleotides. Among these, 30 belonged to mononucleotide repeat type A/T, 14 belonged to dinucleotide repeat type AT/TA, two belonged to trinucleotide repeat type AAT/ATT, and one belonged to tetranucleotide repeat type AAAT/ATTT. All SSR indels contained only A/T. The remaining four large indels were 19-, 22-, 28-, and 95-nucleotide repeats.
A total of 45 SNPs were detected in T. distichum compared with T. ascendens, representing 31 in IGS regions and 14 in CDS regions(Fig. 2, Circle 5), with six synonymous mutations and eight non-synonymous mutations (Table 6). A total of 50 SNPs were detected in T. mucronatum compared with T. ascendens, representing 35 in IGS regions and 15 in CDS regions, with six synonymous mutations and nine non-synonymous mutations. No mutations appeared on start/stop codon or caused the triplet codon of the site mutates into a termination codon. We merged SNPs occurring at the same site into one for statistical analysis, so a total of 69 SNPs of diverse origin were found. Among these, 45 (65.22%) SNPs were found in non-coding regions and 24 (34.78%) SNPs were found in 16 coding regions, including rps16, psbC, rpoB, rpoC2, ndhA, rps12, rpl2, rps3, cemA, ycf1, clpP, accD, rbcL, atpB, ndhK, and chlN. Of the 69 SNPs, 18 (26.09%) were located within the TR sequences, 32 (46.38%) were located within 100 bp windows adjacent to the repeats, and only 19 (27.54%) were located outside the 100 bp windows adjacent to repeats (Fig. 7).
Analysis of Chloroplast Genome Hypervariable Regions of Taxodium
Regions enriched with indels and SNPs can be considered as hypervariable regions in the complete cp genome. Since most indels show periodic variation of repeat motifs, the number of polymorphic sites contained in an indel is affected not only by the degree of periodic variation but also by the size of the repeat motif. When two TRs have the same number of sites, compared with repeats with longer repeat motifs, the repeat sequence with shorter repeat motif has more repeat cycles, indicating a higher polymorphic potential. Therefore, when we select hypervariable regions, these should not only be based on the number of polymorphic sites, but also regions with more indel/SNPs of different origins to ensure that the region has higher polymorphic potential. Based on this principle, the most variable regions of the Taxodium cp genome are clpP-accD, trnV-rps12, ndhF-trnN, trnV-ndhC, trnI-trnL, ycf1-rpl20, atpI-atpH, rpl36-rps11, ycf1, psbJ-clpP, and trnI-rrn16(Figure 2, HR01-HR11). These hypervariable regions contained at least four SNPs and/or indels of diverse origin in this study (Additional file 8). These regions can be considered interspecies mutational hotspots in Taxodium and could be potentially high-resolution DNA barcodes in the study of population genetics.