Chloroplast genome features of cultivated tea
The lengths of the whole genomes of cultivated tea (CSSA, CSSL and CSAY) were slightly different, ranging from 157,025 bp to 157,100 bp. However, compared with CSSA and CSSL, the genome of CSAY was different. Both CSSA and CSSL contained 81 unique CDS genes, 30 tRNA, 4 rRNA and 3 pseudogenes (ycf1, ycf2 and ycf15). Among them, atpF, ndhA, ndhB, petB, petD, rpl2, rpl16, rpoC1, rps16, trnG-GCC, trnI-GAU, trnL-UAA, and trnV-UAC contained a single intron, while clpP and ycf3 contained two introns. However, in CSAY, orf42 and ycf15 were lost, and rps12 and trnA-UGC had an inserted intron sequence (Fig. 1).
Comparison of chloroplast genomes between cultivated tea and wild tea
In our study, first, we compared CSS with its two cultivated species (CSSA and CSSL). These species were defined as the Chinese cultivated type. Then, we compared CSA with its one cultivated species (CSAY). These species were defined as the Assam cultivated type. Finally, we compared CSS, CSA and 12 wild but related species: Camellia azalea (CAZ), Camellia crapnelliana (CCR), Camellia cuspidate (CCU), Camellia grandibracteata (CGR), Camellia impressinervis (CIM), Camellia petelotii (CPE), Camellia pitardii (CPI), Camellia pubicosta (CPU), Camellia reticulata (CRE), Camellia sinensis var. pubilimba (CSP), Camellia taliensis (CTA) and Camellia yunnanensis (CYU). These species were defined as the wild type (Table 1 & 2).
Chloroplast genomic similarity
In the Chinese cultivated type, the average length across the cultivated species was 62 bp smaller than CSS. In the Assam cultivated type, the genome length of CSAY was 72 bp larger than CSA. In the wild type, the average length of the wild species was 156,923 bp, which was 194 bp and 105 bp variation compared with CSS and CSA, respectively. This showed that there was less length variation when comparing cultivated species with wild species (Table 1). Similarly, the number of genes and the GC content of cultivated species were more stable than that of wild species. After comparing the genes and introns insertion or deletion among the Chinese cultivated type, Assam cultivated type and wild type, we found that introns of the rps12 gene were deleted in CSS and its two cultivated species. The orf42, ycf1 and ycf15 genes were deleted in CSA and CSAY. However, these events occurred randomly in wild species. The differences in the GC content of the CDS, intron and IGS in the Chinese cultivated type and Assam cultivated type were approximately 0.01% to 0.03%, and 0% to 0.02%, respectively, but we found that the differences of the CDS, intron and IGS in the wild type were 0.02% to 1.05%.
mVISTA and Blast Ring Image Generator (BRIG) were used to compare the genomic sequence identity. Comparing CSS and CSA with their cultivated types, the regions with relatively low identity were psaA_ycf3, petL_petG and ycf1_ndhF. Comparing CSS and CSA with other wild types, the regions with relatively low identity were atpH_atpI, trnE-UCC_trnT-GGU, psaA_ycf3, ycf15_trnL-CAA, ycf1_ndhF and ndhG_ndhI (Figs. 2 & 3). In conclusion, at the genomic level, the cultivated species were more conserved than the wild species.
The expansion and contraction of IR regions
The locations of inverted repeat (IR) regions were extracted via a self-BLASTN search, and the characteristics of the IR/Large single copy region (LSC) and IR/Small single copy region (SSC) boundary regions were analyzed. The IRs boundary regions of the 17 complete Camellia cp genomes were compared, showing slight differences in junction positions (Fig. 4). In order to detect possible IR border polymorphisms, first of all, we compared the four IR boundaries of the Chinese cultivated type. No difference was found at the LSC/IRb or IRa/LSC border; meanwhile, only minor differences were discovered at the IRb/SSC and SSC/IRa borders. Next, we compared the four IR boundaries of the Assam cultivated type, and the results were similar. Then, we compared the cp genome boundaries of the wild type. The rps19 gene at the LSC/IRb boundary expanded 52 bp from the LSC region to the IRb side in CPU, while it stopped at 46 bp from the LSC region in the rest of the species. On the other side of the IRa/LSC boundary, the lengths of the spacers between the IRa/LSC junction and the rpl2 gene (in IRa) were 112 bp for CPU, while those of the rest of the species were all 106 bp. Consistently, in all of the compared cp genomes, the ycf1 gene spanned the SSC/IRa region and the length of ycf1 ranged from 963 bp to 1069 bp in IRa. Remarkably, most species have an ycf1 pseudogene at the IRa/LSC junction, while this was not observed in CSA, CTA, CIM, CPI, CCR, CCU, or CYU. Similar to most plants, the ndhF gene involved in photosynthesis was located in the SSC region. However, the ndhF gene was located at the IRb/SSC boundary of CRE, and there was a 35 bp overlap between ndhF and ψycf1.
Nucleotide diversity
Comparisons based on the nucleotide diversity (Pi) values of the Chinese cultivated type, Assam cultivated type, and wild type were presented, including the intergeneric regions (IGS), protein-coding genes and introns (Table S1, Fig. 5). In our study, the average Pi values for the genes, introns and IGS in wild type were approximately 6.6, 3.5 and 9.1 times that of the Chinese cultivated type. In addition, the Pi values for all regions in the Assam cultivated type were 0. Comparing Chinese cultivated type with wild type, the Pi values of most genes, introns and IGS in the wild species were higher than those of in the cultivated species. For example, rps12, petD, rps19, trnI-CAU_rpl23, trnI-CAU_ycf2, trnI-GAU_rrn16, clpP_intron, rps16_intron, and atpF_intron were highly variable in the wild species, but they were not variable in the three cultivated species. For the photosynthetic genes, except for ndhD, ndhF, ndhH and psbC, the Pi values of the photosynthetic genes of three cultivated tea were 0. The Pi values of these genes were smaller than that of the wild species. These results indicate that these genes and noncoding regions were more conserved among the cultivated species than among the wild species.
Furthermore, although the average Pi values of the cultivated species were lower, we still found that the Pi values of rps16, rps4, trnL-UAA_intron, rps4_trnT-UGU, ndhC_trnV-UAC, cemA_petA, rpl33_rps18, psbN_psbH, rpl36_infA, rpl14_rpl16, rps7_rps12, ndhG_ndhI, trnV-GAC_rps12, and rps12_rps7 in the Chinese cultivated type were higher than those in wild species, and these difference sequences were mainly located in the LSC region (Fig. 5).
Phylogenetic analysis of cultivated tea and wild tea
We constructed three phylogenetic trees of cultivated and wild tea, namely, the complete cp genomic tree (complete cp-Tree), all shared protein coding genes among all species tree (SCDS-Tree) and the ycf1 gene tree (ycf1-Tree) (Fig. 6-8). All phylogenetic trees supported the hypothesis that the Thea subgenus could be divided into two clades: clade Ⅰ, including CSS, CSSL, CSSA, CSA, CSAY, CGR, CPU and CSP, and clade Ⅱ, including CPE CIM, CTA and CCU. Clade I was strongly supported, because the posterior probabilities or bootstrap values obtained by neighbor-joining (NJ), maximum parsimony (MP), Bayesian inference (BI) and maximum likelihood (ML) were very high for each lineage. These results suggested that the evolutionary direction of the seven species in clade I was the same. All phylogenetic trees proved that CSS was the closest relative to CSSA and CSSL, and CSA was the closest relative to CSAY. In particular, in the ycf1-Tree, the posterior probabilities or bootstrap values of these species were lower than those of the complete cp-Tree and the SCDS-Tree. The value of CSSA was less than 50%. These results suggested that the ycf1 gene has evolved in cultivated tea.
In addition, we found conflict among the three trees (Fig. 6-8). The topological structures consisting of the Camellia subgenus (CPI, CRE, CAZ, CCR, and CYU) and the Thea subgenus (CPE, CIM, CTA and CCU) were poorly supported by the complete cp-Tree, SCDS-Tree and ycf1-Tree, because most bootstrap values or posterior probabilities were less than 50% for each lineage. These results may be caused by unbalanced sampling.
The cp-Tree showed some structural variations among the Camellia cp genomes (Fig. 6). The clade, which was made up of CSS, CSSL, CSSA, CSA, CSAY, CGR, CPU, CSP and CPE, was characterized by the rps12 intron deletion, the pseudo ycf1 gene, and the pseudo ycf15 gene (except for CSA and CSAY). The other species, except for CRE and CAZ, had lost the pseudo ycf1 gene and the orf42 gene.
Chloroplast genome variation and evolution in cultivated tea
To explain the changes in the cp genome structure of the cultivated tea group, we detected single nucleotide polymorphism (SNP) and insertion/deletion (indel) in the cp genome of cultivated tea. In the Chinese cultivated type, after comparing the whole cp genome of three species, 67 SNPs and 46 indels were found. The LSC, IRb, SSC and IRa regions contained 43, 3, 13, and 8 SNPs and 37, 2, 5, and 2 indels, respectively (Table S2). Most of the SNPs and indels were located in the noncoding region (IGS and intron). There were 39 SNPs and 41 indels in this region, while 28 SNPs and 5 indels were found in the protein coding region. The two ycf1 genes, which are located at the junction of SSC and IRa, contained the most SNPs and indels, 6 and 2, respectively. For the photosynthetic genes, psbC, ndhD, ndhF and ndhH presented SNP variations, while the psbI gene presented indel variation. For the 14 sequences with higher Pi values in cultivated species than in wild species, trnV-GAC_rps12 and ndhG_ndhI contained the most abundant SNPs, with 5 and 2 respectively (Fig. 5). In the Assam cultivated type, after comparing the whole cp genome of two species, 4 indels were found, but no SNPs. All indels were located in the IGS region. In particular, a long sequence (77 bp) was inserted into the IRb/SSC boundary region.
To have a clear view of the evolution of cultivated species, we used their 80 shared protein coding genes to calculate their nonsynonymous nucleotide substitution (Ka) rates, synonymous nucleotide substitution (Ks) rates and Ka/Ks ratio. First, we compared CSS and its cultivated species. The results showed that only 16 protein coding genes had synonymous or nonsynonymous mutations (Fig. 9, Table S4). Among them, there were nonsynonymous mutations in matK, rps16, rpoC2, rpoB, accD, clpP, rps8, ycf1, ndhD, ndhH and rps15. The genes with the highest rate of nonsynonymous mutations were rps16, rps8 and rps15. There were synonymous mutations in rpoB, psbC, rps4, ycf4, rpoA and ndhF. The highest mutation rates were rps4, ycf4 and rpoA. Of the 80 genes, 79 had a Ka / Ks value of 0, and only rpoB, had a Ka/Ks value of 0.3004< 0.5, suggesting very strong purifying selective pressure. Then, we compared CSA and its cultivated species. However, no protein coding genes had synonymous or nonsynonymous mutations, suggesting very strong purifying selective pressure.
The site specific selection events of 16 genes with synonymous or non-synonymous mutations were analyzed by Bayesian Empirical Bayes (BEB), and we found that some amino acid sites of ycf1 and rps15 exhibited site-specific selection (Table S6). In ycf1, there were six sites under positive selection, and in rps15, there was one site under positive selection. For example, in the rps15 gene, the codon ACC (threonine) of CSS was mutated to AAC (asparagine) in two cultivated species.