Cp genome sequencing and assembly
PacBio (10838 long reads, >5kb) and Mate-pair reads (7.04G of raw data were produced with 2×150 bp pair-end read lengths) were used to assemble the cp genome of CWN into a circular contig of 156,762 bp length. Circular genome maps were drawn with OGDRAW software (Fig. 1). Raw reads have been deposited in the NCBI Sequence Read Archive (SRA, SRR12002624). Assembled cp genome sequences and accompanying gene annotations of C. sinensis cv. Wuyi Narcissus have been deposited in the NCBI GenBank (Accession numbers: MT612435).
Comparative analysis of four cp genomes
Gene content
All four complete Camellia cp genomes displayed the typical quadripartite structure of most angiosperms, including the large single copy (LSC), the small single copy (SSC) and a pair of inverted repeats (IRa and IRb). Among these cp genomes, genome size ranged from 156,762 bp (CWN) to 157,353 bp (CIA). The length varied from 86,301 bp (CWN) to 87,214 bp (CIA) in the LSC region, from 18,079 bp (CWN) to 18,285 bp (CSA) in the SSC region, and from 26,030 bp (CIA) to 26,090 bp (CSS or CWN) in IR region (Tab. 1).
Four plastomes are highly conserved in gene content, gene order, and intron number. Each cp genome contained a total of 137 genes, including 92 protein-coding genes, 37 transfer RNA (tRNA) genes and 8 ribosomal RNA (rRNA) (Supplementary Tab. S1). Of them, 60 protein-coding and 22 tRNA genes were located within LSC, 16 protein-coding genes, 14 tRNA coding genes and eight rRNA coding genes were located within IRs and 11 protein-coding and one tRNA gene were located within SSC. The rps12 gene was a divided gene with the 5′ end exon located in the LSC region while two copies of 3′ end exon and intron were located in the IRs. The ycf1 was located in the boundary regions between IRa/SSC, leading to incomplete duplication of the gene within IRs. The rps19 genes in CSS, CSA, and CWN were crossed the LSC/IR region while in CIA was located within LSC. There were 18 genes containing introns, including 6 tRNA genes and 12 protein-coding genes. Except for two introns in the ycf3 and clpP genes, all other genes contained only one intron. MatK gene was located within the intron of trnK-UUU with the largest intron (2,489 bp). Overlaps of adjacent genes were found in the complete genome, rps3-rpl22, atpB-atpE, and psbD-psbC had a 16 bp, 4 bp, and 53 bp overlapping region, respectively. Unusual initiator codons were observed in rps19 with GTG and orf42 with ATC in four cp genomes. The initiation codon of ndhD in CIA was ATG, while that of other three cp genomes was GTG.
Sequence variation and IR expansion/contraction
To elucidate the level of the genome divergence, the sequences were plotted to check their identity using the program mVISTA by aligning the four Camellia cp genomes with CWN (Fig. 2) as a reference. The whole aligned sequences showed high similarities with only a few regions below 90%, suggesting that tea plant plastomes were rather conserved (Figure 4).
Sequence divergence analyses of four cp genomes revealed Pi values in the range from 0 to 0.01917 with an average of 0.00093, indicating moderate genetic divergence existed within the four cp genomes. However, four regions (including rp12/trnH-UGU, psaA/ycf3, atpB/rbcL and psbT/psbH) had relatively higher divergence values (Pi > 0.006) (Fig. 3). These four regions were all located within LSC, indicating that the LSC region had more gene mutations than the IR region and the SSC region.
Mutations may cause changes in the length of the coding gene sequence, leading to changes in the coding and non-coding regions. Therefore, the number and distribution patterns of variable characters in coding and non-coding regions among four cp genomes were further analyzed. The result showed that the proportion of variability in non-coding regions was with a mean value of 1.82%, while in the coding regions was 1.15%. Five coding genes had over 4% variability proportion, such as rps19, ndhF, ndhD, ndhI and ycf1. Five non-coding regions had over 10% variability proportions, such as rpl12/trnH-GUG, trnE-UUC/trnT-GGU, ndhD/psaC, ndhI/ndhA and rps15/ycf1(Fig. 4). These divergence hotspot regions might provide information for marker development in phylogenetic analyses of tea plants, but further verification is needed. Among them, the coding gene rps19, ndhF, ycf1 and non-coding region rpl2/trnH-GUG, rps15/ycf1 were located in the junctions of IR/SC region, which supported that the length of angiosperm cp genomes was variable primarily due to the expansion and contraction of IR/SC boundary regions [35].
To further elucidate the potential contraction and expansion of IR regions, the gene variation at the IR/SSC and IR/LSC boundary regions of the four plastomes was compared (Fig. 5). The genes rps19, ycf1-5'end/ndhF, ycf1 and rp12/trnH-GUG were located in the junctions of LSC/IR and SSC/IR regions. The rps19 gene in CSS, CSA, and CWN was 279 bp, and crossed the LSC/IRa region by 46 bp while the rps19 gene in CIA was just 150 bp, and all located in the LSC region, 1bp away from the IRa region. The ycf1-5'end gene in CSS, CSA, and CWN was 1071 bp, and crossed the IRa/SSC region by 2 bp while in CIA was 1065 bp, and crossed the IR/SSC region by 33 bp. The ndhF gene in all four cp genomes was located in the SSC region. The ndhF gene in CSA, CIA, and CWN was 2247 bp while in CSS was 2139. The ndhF gene in CSS was 165 bp away from the IRa region, in CSA or CWN was 57 bp away from the IRa region while in CIA was 88 bp away from the IRa region. The ycf1 gene in CSS or CWN was 5622 bp, in CSA was 5628 bp while in CIA was only 1038 bp. The ycf1 genes in all four cp genomes crossed the IR/SSC region. The ycf1 gene in CSS or CWN was with 4553 bp located in the SSC region and 1069 bp in IRb region, in CSA was with 4559 bp located in the SSC region and 1069 bp in IRb region while in CIA was with only 6 bp located in the SSC region and 1032 bp in IRb region. The rpl2 gene in CSS, CSA or CWN was 107 bp away from the LSC region while in CIA was 82 bp away from the LSC region. The trnH-GUG gene in CSS, CSA or CWN was 2 bp away from the IRb region while in CIA was 637 bp away from the IRb region.
Among four species, three Chinese tea varieties (CSS, CSA, and CWN) were similar in both gene sequence and IR/LSC boundary pattern, except for length variations in ndhF and ycf1. The triploid CWN was closer to CSS, indicating that triploidy did not cause significant changes in cp genome. However, there were relatively obvious differences in IR/LSC boundary of cp genomic between Chinese tea and Indian tea, suggesting that environmental selection may be responsible for the differences.
Repeat and indel sequence analyses
SSRs are small repeating units of cpDNA, and have been widely used to characterize genetic variation among plant genotypes [36-38]. A total of 671 SSRs were identified in four cp genomes, of which 57% were in IGS, 34% were in CDS, and 9% were in Intron (Fig. 6C). 74.0% of these SSRs were monomers, 19.3% of dimers, 0.5% of trimers, 5.3% of tetramers, 0.9% of hexamers and no pentamers found (Fig. 6A). Comparing the four genomes, except for 167 SSRs of CIA, the other three were all 168. A total of 128 SSRs were fully shared among four cp genomes (Fig. Additional file 1: Table). There were 47 loci with different SSR types, most of which existed in the LSC region. Among them, CSS had 7 unique types, CSA had 18 unique types, CIA had 9 unique types, and CWN had 14 unique types. (Fig. 6B, Supplementary Tab. S2).
Long repeat sequences of plastomes had been reported to play roles in genomic rearrangement and sequence variation [39,40]. In total, 270 repeats were detected in four plastomes, including three categories of long repeats: tandem, forward and palindromic. The number of the three repeated types is consistent in CSS and CWN, as follows: 23, 20, 23. However, it is 19, 20, 23 in CSA and 21, 23, 32 in CIA. The sizes of repeats ranged from 11 to 82 bp (Fig. 7A, 7C). The four cp genomes have a total 57 identical repeat sequences. In addition, CSS had 1 unique repeat, CIA had 18 unique repeats, CWN had 2 unique repeats, while CSA had no unique repeats (Fig. 7B). These unique repeats were found mainly in the intergenic psaA/ycf3, atpB/rbcL, trnW-CCA/ trnP-UGG, rps19/rpl2, psbT/psbN, rpl2/trnH-GUG and gene rpl2, ycf1, ycf2. Only one repeat was in the intron regions (ndhA) (Supplementary Tab. S3).
Indels played an important role in sequence evolution [41]. In indels analysis, the indel events in simple sequence repeats were filtered out. By comparing four cp genomes, a total 67 indels were found. Indels ranged in size from 1 to 637 bp (Fig. 8A), and most of the Indels events occurred in IGS regions (72%), with 24% in CDS regions and only 4% in Intron regions (Fig. 8B). As expected, single-nucleotide Indels (1 bp) were the most common, but 16 long Indels (>10bp) were found. The longest one was an insertion of 637 bp in CIA (intergenic rp12/trnH-GUG), followed by a 335 bp deletion in CWN (intergenic trnE-UUC/trnT-GGU) and a 107 bp deletion in CIA (gene rps19). The largest number of long inserts were found in intergenic psaA/ycf3, where CSA had a unique 10 bp deletion, CIA had a unique 12 bp insertion, and CSA and CWN had a common 17 bp deletion. Secondly, two deletions occurred in the gene coding region of nhdA in CIA, which were respectively 53bp and 66bp deletions (Fig. 8C, Supplementary Tab. S4).
The above results showed that the divergent regions of cp genomes were almost all associated with the long repeat sequences (intergenic rp12/trnH-UGU, atpB/rbcL, psbT/psbN and gene ycf1) and the indel sequences (intergenic rp12/trnH-UGU, psaA/ycf3, ndhI/ndhA, trnE-UUC/trnT-GGU, psbN/psbH and gene rps19, ndhF, ndhD, ndhI, ycf1, ycf2). Furthermore, these regions also contained genes from all IR/SC boundary (rps19, ycf1-5'end/ndhF, ycf1 and rp12/trnH-GUG). Therefore, the sequence repeats and indel events might play an important role in the expansion and contraction of the border regions. These repeats and indels might be further served as genetic markers for phylogenetic and population genetic studies [42,43].
Codon usage pattern analyses
Codon use bias can be used to reflect the origin, evolution and mutation mode of species or genes [44]. The ENc plots showed only a few points lie near the curve, however, most of the genes with lower ENc values than expected values lay below the curve (Fig. 9). Therefore, the codon usage bias of the chloroplast genome was slightly affected by the mutation pressure, but natural selection and other factors play an important role.
To further investigate the extent of influence between mutation pressure and natural selection on the codon usage patterns, Neutrality plot (GC12 vs. GC3) was performed. The correlation between GC1 and GC2 was strong (CSS: r = 0.445; CSA: r = 0.453; CIA: r = 0.445; CWN: r = 0.464, p <0.01). However, no significant correlation was found for GC1 with GC3 (CSS: r =0.141; CSA: r = 0.139; CIA: r = 0.078; CWN: r = 0.141) or GC2 with GC3 (CSS: r = 0.146; CSA: r = 0.143; CIA: r = 0.078; CWN: r = 0.152), which suggested mutation pressure had a minor effect on the codon usage bias. Moreover, the slope of Neutrality plot showed that mutation pressure accounts for only 0.52% - 8.42% on the codon usage patterns in four cp genomes, while natural selection accounts for 91.58% - 99.48% (Fig. 10). These results further suggested that natural selection played an important role in the codon usage patterns.
The patterns of codon usage are strongly correlated with GC content [45]. The overall GC content was almost identical with each other among the four cp genomes, about 37.3%, indicating a higher AU content. Furthermore, the distributions of codon usage in the heatmap for 4 Camellia species showed that about half of the codons with low RSCU values were infrequently used and almost all codons with RSCU >1 ended with A/U (Fig. 11). These data indicated that the four cp genomes tended to use A/U bases and A/U-ending codons. In addition, we also found that the codon usage patterns of the three Chinese teas were more similar, and the cp genome of CWN has a closer relationship with that of CSS in RSCU cluster analysis. In contrast, the codon usage patterns of Indian tea (CIA) were quite different from that of Chinese teas. The RSCU values of the 36 codons (36/64, 56.25%) were identical in the three Chinese teas, but different from those in Indian tea (Fig. 11, Tab. 2), indicative of a different selection on codon usage in the cp genomes between Chinese tea and Indian tea. This also suggested that the two Assam tea plants (Chinese Assam type and Indian Assam type) might undergo different domestications.
Phylogenetic analysis
To further understand the evolutionary mechanism and taxonomic classification of Camellia, a most comprehensive phylogenetic analyse was performed based on complete cp genome from 37 Camellia species with Apterosperma. oblata as an outgroup so far, including Chinese type tea (CSS), Chinese Assam type tea (CSA) and Indian Assam type tea (CIA) and a nature triploid species (CWN) (Fig. 12). Phylogenetic trees were generated by Bayesian inference (BI) and Maximum likelihood (ML) based on the complete cp genome sequence have similar topologies, and almost all relationships have high bootstrap values. Both trees showed that natural triploid CWN was formed into a monophyletic clade with a bootstrap value of 100%, suggesting a polyploid event in the evolution of tea plants. Further, our results clearly indicated that the ancestors of tea plants differentiated into two branches at a certain node, with one branch continuing to differentiate into Chinese Assam type tea and the other branch continuing to differentiate into Indian Assam type tea and Chinese type tea, respectively. This result suggested that these three species might have different domestication origins, which supported an earlier idea that there are three independently domesticated tea species. The one is Chinese type tea, and the other two are distinct Assam tea: a Chinese tea from the southwestern province of Yunnan (Chinese Assam type tea) and an Indian tea from the Assam region (Indian Assam type tea) [46]. The existence of two distinct domestication origins of Assam tea had long been a major topic of debate [5]. Our study further suggested that there were two distinct domesticated origins of Assam tea.
Because of the frequent hybridization and other factors, the classification of the genus Camellia based on morphology had been controversial. Some previous studies had shown that cp genomes can provide useful information for solving complex taxonomic problem of Camellia species [7,47]. In terms of morphological classification, Chang et al. classified the genus Camellia into 4 subgenera with 22 sections [48], while Ming et al. revised the classification of Chang and classified the genus Camellia into 2 subgenera with a total of 14 sections. In Ming et al. classification revision, the section Chrysantha established in Chang’s classification was merged into the section Archecamellia, and the section Heterogenea was separately established [49, 50]. Corresponding to Ming's classification, our results showed that not all subgenus tea or subgenus camellia were individually clustered, suggesting that this classification needed to be reconsidered to revise: (i). at the subgenus level, both BI tree and ML tree (BS=100%) showed that C. danzaiensis should belong to subgenus camellia, rather than subgenus tea, which was consistent with a previous study by Huang et al. 2014. In addition, (ii). C. nitidissima and C. petelotii of sect. Archecamellia of subgenus tea (Syn. sect. Chrysantha Chang) were clustered with C. szechuanensis of sect. Heterogenea of subgenus Camellia in BI tree (BS=88%), and not all species of sect. Heterogenea clusters came together. This result supported a previous study which pointed out that sect. Chrysantha Chang should not be merged into sect. Archecamellia, and sect. Heterogenea should not be recognized in taxonomic treatments of Camellia species by analyzing the secretory structure [51]. Moreover, (iii). The positions of C. yunnanensis, C. luteoflora and C. tachangensis were with low bootstrap values in ML tree, which were also in doubt and need further verification. More cp genomes of Camellia would be needed for further studies to solve these problems as phylogenomic analysis tends.