Mapping of clean reads to the reference genome ‘Shuchazao’
CSS ‘Shuchazao’ has been observed to have significant differences in bud, leaf and budding flower size compared with CSA ‘Yunkang 10’ (Fig. 1). The completion of the two reference genome sequences (‘Shuchazao’ and ‘Yunkang 10’) is a notable advance for comparative genomic studies on tea plants in Thea section. Therefore, genome-wide genetic variations were identified between the two genome assemblies. After filtering the raw data, a total of 324,154,064 clean reads from the CSA whole genome sequencing data were generated; these reads had a coverage depth of 10.4X the ‘Yunkang 10’ genome with a 100 bp length and 43% GC content. Through alignment, a total of 317,878,025 clean reads were mapped to the reference genome, accounting for 98.1% of total reads. The mapped clean reads contained two types of sequencing reads: pair-end and single-end reads. The former was predominantly type (317,063,284, 99.7%), while single-end reads accounted for only 0.3% (814,741 clean reads).
Identification and distribution of SNP and InDel loci
After a series of filtering, a total of 7,071,433 SNP loci were generated, with an average SNP density in the tea genome being estimated to be 2,341 SNPs/Mb. Based on nucleotide substitutions, the detected SNPs were classified as transitions (Ts: G/A and C/T) and transversions (Tv: A/C, A/T, C/G, and G/T), which accounted for 77.46% (5,818,773) and 22.54% (1,692,958), respectively (Fig. 2A), with a Ts/Tv ratio of 3.44. In transitions, the number of A/G is equivalent to the C/T type, which included 2,905,203 and 2,913,570, respectively. For transversions, the number of four types (A/C, A/T, C/G and G/T) are almost evenly distributed with an insignificant difference among them, which accounted for 27.23% (460,988), 24.72% (418,536), 20.84% (352,802) and 27.21% (460,632), respectively (Fig. 2A).
A total of 255,218 InDels were identified, with an average density of 84.5 InDels/Mb. The length distribution of InDels was analyzed by dividing the lengths into different groups and calculating the ratios for the corresponding length groups (Fig. 2B). It is obvious that mononucleotide InDels is the most abundant type, accounting for 44.27% (112,976) of the total number. The length of InDels ranging from 1 to 20 bp was predominant, accounting for more than 95.5% (243,749) of the total InDels. A clear tendency was that the number of InDels gradually decreased with increasing InDel length.
Location and functional annotation of SNPs and InDels
The annotation of the ‘Shuchazao’ reference genome was used to uncover the distribution of SNPs and InDels within distinct genomic regions. According to the gene structure of the reference genome, the overwhelming number of SNPs (94%) was identified in intergenic regions, while only 6% (440,298) of SNPs were located in genic regions (Fig. 3A). Among the SNPs located in genic regions, 89,511 SNPs were detected in the CDs region, which contained 38,670 synonymous and 50,841 non-synonymous SNPs, respectively. Similarly, a small proportion of InDels were located in the genic regions, which accounted for only 12% (31,130) of the total number (Fig. 3B). Remarkably, 3,406 InDels were located in the CDs region, which can be regarded as the preference for developing InDel markers.
To better understand the potential functions of these genetic variations within genes, GO term enrichment analysis of genes containing SNPs/InDels within CDs region was performed. These genes were classified into biological process, cellular component and molecular function categories (Fig. S2). Regarding the genes containing SNPs, the GO terms of cellular process, metabolic process and single-organism process were dominantly abundant in the biological process (Fig. S2A). In the cellular component category, the top three enriched GO terms were membrane, cell and cell part. Based on the molecular function category, catalytic activity and binding are predominantly enriched, while others accounted for a small proportion (Fig. S2A). Interestingly, a nearly consensus result was obtained for GO terms analysis of genes containing InDels, nothing but the number of genes is less compared with the number of genes containing SNPs (Fig. S2B).
Validation and polymorphism of newly-developed InDel markers
Initially, all InDels were used for designing primer pairs using Primer3.0. To validate the InDels and develop polymorphic InDel markers, we selected 100 InDel markers that were distributed on different scaffolds. To facilitate the screening and development of more practical markers, the lengths of all selected InDels ranged from 5 to 20 bp in length. To determine the reliability and polymorphisms of the primers, six tea cultivars were selected for testing their amplified fragments using Fragment AnalyzerTM 96. Of the total primer sets tested, 48 primer pairs were successfully amplified with unambiguous bands and length polymorphisms among the six tea cultivars, 19 primer sets generated non-polymorphic or empty amplifications, and 33 primer pairs yielded non-specific amplification or ambiguous bands. Consequently, the 48 primer sets were regarded as elegant InDel markers and used for further analysis.
To test cross-cultivars/subspecies transferability, the 48 InDel markers were conducted on a panel of 46 tea cultivars belonging to section Thea of genus Camellia. The detailed information of the 46 tea cultivars is listed in Table S1. The results of 18 InDel markers testing on various tea cultivars are shown in Fig. 4, demonstrating that unambiguous and polymorphic bands were obtained based on these markers. The amplified results of the remaining 30 markers were also demonstrated (Fig. S3). For the newly developed markers, 20, 25 and 3 InDel markers generated high polymorphism, moderate polymorphism, and low polymorphism in the 46 tea cultivars, respectively. The PIC value of each InDel marker was presented in Table 1. The amplified allele sizes across them were within the ranges detected in the donor tea cultivar, implying that the amplified fragments were derived from the same loci and that the primer binding sites of the alleles were highly conserved among distinct tea cultivars/subspecies. Several crucial parameters for evaluating polymorphism of markers were subsequently conducted, such as the number of alleles (Na) per locus ranged from 2 (CsInDel15, CsInDel16, CsInDel21, CsInDel24, CsInDel25, CsInDel33, CsInDel35, CsInDel39, CsInDel41, CsInDel46, and CsInDel47) to 14 (CsInDel38) with an average of 4.02 alleles, the major allele frequency (MAF) ranged from the lowest 0.266 (CsInDel20) to the highest at 0.957 (CsInDel41 and CsInDel47) with an average of 0.585, the observed heterozygosity (Ho) ranged from 0.021 (CsInDel24) to 1.000 (CsInDel15, CsInDel19, and CsInDel29) with an average of 0.524 and the expected heterozygosity (He) ranged from 0.082 (CsInDel41 and CsInDel47) to 0.869 with an average of 0.528, the polymorphic information content (PIC) values were from the lowest value 0.078 (CsInDel41 and CsInDel47) to the highest 0.849 (CsInDel38) with an average of 0.457 (Table 1). Notably, the value of He has a similar variation trend as the PIC value, while it has a distinct variation trend with Ho values. The primer sequences and genomic locations of these newly developed markers are listed in Table S2. These results showed that these newly developed InDel markers are informative and possess good transferability among various tea subspecies/cultivars.
Population structure and genetic relationship analysis
Population structure analysis was performed on the 46 tea cultivars using Structure2.3.3 software based on 48 newly-developed InDel markers. The Q-plot output presented our grouping results, indicating that the two groups were the optimal classification at K = 2 (Fig. 5A). Apparently, tea cultivars from southern and southwestern China (Guangxi, Guangdong, Yunnan and Sichuan Provinces) belonging to Camellia sinensis var. assamica were clustered tightly together. In comparison, the tea cultivars possessing smaller leaf sizes and shorter heights that were cultivated in several other provinces were classified into another group (Fig. 5B).
To further confirm the applicability of the developed InDel markers for classification, we constructed a phylogenetic tree based on their genetic distances (Fig. 5C). Two major branches were generated (designated as α and β groups), which contained 17 and 29 tea cultivars, respectively. Group α can be further divided into two subgroups, which were designated as α-1 and α-2 subgroups and consisted of 13 and 4 members, respectively. The dendrogram reflects that the phylogenetic relationships among them are highly consistent with their backgrounds or places of origin, as well as displaying consistency with the results from population structure analysis although a small discrepancy was observed (Fig. 5C).
Identification of genetic variation in catechin/caffeine biosynthesis-related genes
Tea cultivars belonging to Camellia sinensis var. assamica possess significant differences in phenotypes (plant height, leaf size and flower) and major characteristic secondary metabolites (such as catechin and caffeine, which contributed tremendously to tea quality) compared with Camellia sinensis var. sinensis. Therefore, we detected the contents of catechin (flavan-3-ols) and caffeine in both ‘Shuchazao’ and ‘Yunkang 10’ based on HPLC analysis. The total content of catechin in both buds and the second leaf from ‘Yunkang 10’ was higher than from ‘Shuchazao’ (Fig. 6A). To understand the potential molecular mechanism of difference, we performed the catechin biosynthesis pathway based on several previous studies (Fig. 6B). After search, we identified a number of SNPs and InDels in some crucial genes that are involved in the catechin biosynthesis pathway, including phenylalanine ammonia-lyase (PAL), cinnamic acid 4-hydroxylase (C4H), 4-coumarate-CoA ligase (4CL), chalcone synthase (CHS), chalcone isomerase (CHI), flavanone 3-hydroxylase (F3H), flavonoid 3'-hydroxylase (F3'H), flavonoid 3',5'-hydroxylase (F3'5'H), dihydroflavonol 4-reductase (DFR), leucoanthocyanidin reductase (LAR), anthocyanidin synthase (ANS), anthocyanidin reductase (ANR), and 1-O-galloyl-β-D-glucose O-galloyltransferase (ECGT, which belongs to subclade 1A of serine carboxypeptidase-like (SCPL) acyltransferases) (Table 2).
Detection of caffeine content in the two tea varieties demonstrated that the caffeine in both bud and the second leaf from ‘Yunkang 10’ is lower than that from ‘Shuchazao’ (Fig. 7A). In Figure 7B, the well-studied caffeine biosynthesis pathway was also performed based on previous studies [10, 28-31]. Similarly, a number of genetic variations within some critical regulatory genes were also detected, such as in IMP dehydrogenase (IMPDH), guanosine synthase (GMPS), 5′-nucleotidase (5’-Nase) and tea caffeine synthase (TCS) genes (Fig. 7C and Table 2). Collectively, these results indicate that certain genetic variations within these genes may explain the significant difference in catechin/caffeine synthesis between ‘Shuchazao’ and ‘Yunkang 10’.