Sequencing and variant discovery
A total of 253 cultivated-type tea plant accessions, including 172 ancient landraces and 81 modern landraces distributed in PR and YR Basins of Guizhou Plateau [14], were used in this study. Of these, the geographic distribution of 249 accessions in the PR and YR basins of Guizhou Plateau was shown in Fig. 1A, and the other four accessions introduced from other provinces and cultivated in many tea gardens in Guizhou Plateau were also included in this study. GBS analysis of 253 cultivated-type tea plant accessions was performed using Illumina HiSeq Xten platform. A total of 255.2 GB clean data with an average of 1.00 Gb clean data for each accession was obtained (Additional file 1: Table S1). We mapped the clean reads in to released tea reference genome sequence (http://tpia.teaplant.org/). Our results showed that 29,393,327 single nucleotide polymorphisms (SNPs) were identified using GATK (v3.7.0). After filtering, 112,072 high-quality SNPs were identified and calculated heterozygosity values, fund that the average of heterozygosity of each accession is 7.89% (Additional file 1: Table S2). Based on the nucleotide substitution, 112,072 SNPs were classified into transitions and transversions. Transitions were observed in 1,585,601 (77.87%) and transversions in 450,567 SNPs (22.13%). The frequency of substitutions was 137,380 (6.75%) A/T, 112,861 (5.54%) A/C, 117,244 (5.76%) G/T, 83,082 (4.08%) C/G, 805,072 (39.54%) C/T, and 780,529 (38.33%) A/G, with the transitions to transversions ratio of 3.51 (Table 1).
Table 1
Genetic diversity parameters of 253 cultivated-type tea accessions in Guizhou plateau.
| Transitions | | Transversions |
| CT | AG | | AT | AC | CG | GT |
Numbers of allelic sites | 805,072 | 780,529 | | 137,380 | 112,861 | 83,082 | 117,244 |
Percentage of allelic sites | 39.54% | 38.33% | | 6.75% | 5.54% | 4.08% | 5.76% |
Total (Percentage) | 1,585,601(77.87%) | | 450,567(22.13%) |
Estimation of genetic diversity
The average heterozygosity (Pi), observed heterozygosity (Ho), minor allele frequency (MAF) and inbreeding coefficient (Fis) were used as indicators of genetic diversity. In this study, the Pi, Ho, MAF, and Fis of 253 cultivated-type tea plant accessions were 0.230, 0.082, 0.149 and 0.657, respectively (Table 2). We first compare the genetic diversity of two tea plant populations distributed in PR Basin and YR Basin of Guizhou Plateau, and the result showed that Pi, Ho, and MAF values in tea population of PR Basin were significantly higher than that in tea population of YR Basin except the Fis value (Table 2, Fig. 1A). In the PR Basin, the Pi, Ho and MAF values in tea population of WS02 were significantly higher than those in other water systems. The Fis value in WS04 was higher than those in other water systems. In the YR Basin, the Pi and MAF in WS06 were significantly higher than those in WS05 and WS07 except that Ho value. The Fis value in WS07 was higher than other water systems. We estimated the genetic diversity of two cultivated statuses (ancient landrace and modern landrace) of the cultivated-type tea population, the Pi and MAF was significantly higher in ancient landrace than in modern landrace. The Ho was significantly higher in modern landrace than in ancient landrace (Table 2).
The positive Tajima’s D value of all test tea populations suggested that the evolution of these tea populations displayed the population bottlenecks, and/or balancing selection (Table 2). The analysis of the genetic differentiation coefficient (Fst) and genetic distance (GD) of seven water systems showed that the pairwise Fst value were less than 0.05 among seven water systems. The highest pairwise genetic distance (GD) was in WS04 vs WS02, WS05. The lowest pairwise genetic distance (GD) was in WS01 vs WS06, WS07. (Table 3).
Table 2
Genetic diversity parameters of 253 cultivated-type tea accessions in Guizhou plateau.
Type | | Number | Tajima D | Pi | Ho | MAF | Fis |
Basins | PR | 75 | 0.752 | 0.234a | 0.091c | 0.151a | 0.625ab |
YR | 174 | 1.048 | 0.228d | 0.079g | 0.148bc | 0.664ab |
Water systems | WS01 | 11 | 0.387 | 0.212g | 0.103b | 0.142d | 0.460c |
WS02 | 26 | 0.380 | 0.232ab | 0.111a | 0.151a | 0.563b |
WS03 | 27 | 0.321 | 0.225e | 0.081ef | 0.147c | 0.653ab |
WS04 | 11 | 0.337 | 0.218f | 0.066h | 0.142d | 0.695a |
WS05 | 9 | 0.408 | 0.216f | 0.083de | 0.142d | 0.612ab |
WS06 | 151 | 0.999 | 0.229cd | 0.081f | 0.149bc | 0.658ab |
WS07 | 14 | 0.216 | 0.204h | 0.058i | 0.133e | 0.704a |
Cultivation status | ML | 81 | 0.593 | 0.218f | 0.084d | 0.142d | 0.627ab |
AL | 172 | 1.098 | 0.232ab | 0.082ef | 0.151a | 0.662ab |
All | all | 253 | 1.236 | 0.230bc | 0.082ef | 0.149b | 0.657ab |
Note: Pi heterozygosity, Ho observed heterozygosity, MAF minor allele frequency, Fis inbreeding coefficient; The different letters indicate a significant difference in p= 0.05 levels by the T-test; PR, Pearl River Basin contains WS01 Liujiang River System, WS02 Hongshui River System, WS03 Beipanjiang River System and WS04 Nanpanjiang River System. YR Yangtze River Basin contains WS05 Yuanjiang River System, WS06 Wujiang River System, WS07 Chishui River System. |
Table 3
pairwise Fst and genetic distance among seven water systems of 253 accessions in Guizhou plateau
| WS01 | WS02 | WS03 | WS04 | WS05 | WS06 | WS07 |
WS01 | | 0.211 | 0.211 | 0.215 | 0.211 | 0.209 | 0.209 |
WS02 | 0.037b | | 0.220 | 0.225 | 0.224 | 0.221 | 0.219 |
WS03 | 0.034d | 0.020h | | 0.220 | 0.219 | 0.218 | 0.219 |
WS04 | 0.033d | 0.019i | 0.004m | | 0.225 | 0.221 | 0.222 |
WS05 | 0.040a | 0.027e | 0.011l | 0.013k | | 0.221 | 0.223 |
WS06 | 0.027e | 0.021g | 0.011l | 0.002o | 0.011l | | 0.217 |
WS07 | 0.035c | 0.016j | 0.013k | 0.011l | 0.023f | 0.003n | |
Note: The bottom left is the value of pairwise genetic differentiation coefficient (Fst); The upper right is the value of pairwise genetic distance; The different letters indicate a significant difference in p= 0.05 levels by the T-test; WS01 Liujiang River System, WS02 Hongshui River System, WS03 Beipanjiang River System, WS04 Nanpanjiang River System, WS05 Yuanjiang River System, WS06 Wujiang River System, WS07 Chishui River System. |
Population structure, PCA, and phylogenetic analysis
A total of 112,072 high-quality SNPs were used to analyze the population structure, PCA and phylogenetic analysis of 253 cultivated-type tea plant accessions. Population structure analysis revealed that the minimum cross-validation (CV) error occurred when K equals 3, indicating that three ancestral groups and one admixture group were identified (Fig. 2A, Additional file 2: Fig. S1). The accessions with the membership coefficients greater than 0.80 were assigned to corresponding pure group, while those with the membership coefficients less than 0.80 were assigned to the admixture group.
The first pure group contained 35 accessions, including 32 (91%) modern landraces and 3 (9%) ancient landraces, and the four introduce varieties were classed into this group (Additional file 3: Table S2) (referred to as ‘the modern landraces group’ or ‘CG-1’ from now). The second pure group contained 25 accessions (2 modern landraces and 23 ancient landraces), and further analysis showed that these ancient landraces mainly derived from PR Basin (Additional file 3: Table S3) (referred to as ‘the PR ancient landraces group’ or ‘CG-2’ from now). The third pure group was composed of 60 ancient landraces, including 53 (88%) from YR Basin and 7 (12%) from PR Basin (Additional file 3: Table S3) (referred to as ‘the YR ancient landraces group’ or ‘CG-3’ from now). In addition, 133 (52.56%) accessions were assigned to admixed group (referred to as ‘the ancient hubs group’ or ‘CG-4’ from now), which contained 47 modern landraces and 86 ancient landraces. Among them, 38 accessions were collected from PR Basin, which contained 11, 6, 13, and 8 accessions in WS01, WS02, WS03, and WS04, respectively. Ninety-five accessions were sampled from YR Basin, including 8 accessions from WS05, 84 accessions from WS06, and 3 accessions from WS07 (Additional file 3: Table S3).
To further verify the stability of potential population structure, the principal component analysis (PCA) and NJ phylogenetic tree were used to explore the cluster relationships among the 253 cultivated-type tea accessions using 112,072 SNPs. PCA and NJ tree result reveals four major clusters corresponding to the clusters CG-1, CG-2, CG-3 and CG-4, which further verifies the accuracy of population structure. (Fig. 2B, Fig. 2C).
Linkage disequilibrium analysis
Linkage disequilibrium (LD) was commonly used to reveal much about domestication and breeding history. We report LD estimation in a population of 253 accessions using 29,393,327 non-LD pruned SNPs. Our results showed that the LD decayed rapidly with the increasing physical distance. LD dropped to half of its the maximum value at 10 kb (Fig. 2D). Moreover, the lowest LD decay was observed in CG-2, while the most rapidly LD decay was found in CG-4 among 4 inferred populations (Additional file 4: Fig. S1).
Genetic differentiation analysis of inferred populations
Based on the population structure analysis, the genetic diversity among the inferred populations (CG-1, CG-2, CG-3, and CG-4), including Tajima’s D, Pi, Ho, and MAF were calculated (Fig. 3). Our results showed that the Pi, Ho, and MAF values in CG-4 population were significantly higher than those in pure populations (CG-1, CG-2, and CG-3). Except MAF value was no significant difference detected in CG-4 and CG-3. Moreover, the Pi, Ho, and MAF values in CG-3 population were higher than that in CG-1 and CG-2 populations. The Pi, Ho, and MAF values in CG-2 population were higher than that in CG-1 population. The positive Tajima’s D values were observed in four inferred populations suggested that the evolution of these four inferred populations displayed the population bottlenecks, and/or balancing selection (Fig. 3).
We analyzed the pairwise genetic differentiation coefficient (Fst) across four inferred populations. The mean Fst between CG-4 and CG-1 was 0.060, suggesting that there is moderate divergence between CG-4 and CG-1 populations. In addition, the mean Fst in CG-1 vs CG-2, CG-1 vs CG-3 and CG-2 vs CG-3 were 0.090, 0.089, and 0.091, respectively, suggesting that there is not only moderate divergence between modern landraces population (CG-1) and ancient landraces populations (CG-2 and CG-3), but also exists in ancient landraces of CG-2 and CG-3 populations. The highest genetic distance (GD) was observed in CG-2 vs CG-3, and the lowest GD was found in CG-4 vs CG-1 (Fig. 3).
Development core collection
The core and mini-core sets were developed to choose a minimum number of accessions representing the maximum diversity of the original collection, which was used in molecular marker assisted breeding, GWAS and other purpose [33–35]. The maximum length subtree method implicated in DARwin v.6.0.17 was repeatedly used to remove the most redundant accessions until the pruned edge and sphericity index percentage tend to level corresponding to 195 accessions (Additional file 3: Table S1). The 195 accessions were selected to represent 253 cultivated-type tea accessions (referred ‘core set’ from now). On x-axis where the number of accessions decreased from 195 to 85, the pruned edge and sphericity index percentage were increasing stably and slowly, indicating that the information of 111 accessions had no significant difference, suggesting the sphericity index and pruned edge had no significant impact after removing these accessions (Additional file 5: Fig. S1). The mini-core set was constituted by the remaining 85 accessions (Additional file 3: Table S1).
To further estimate the quality of mini-core and core collections, we construction the NJ tree to verify whether the backbone of the NJ tree has changed. The 253 cultivated-type tea plant accessions could be further divided into seven clusters based on the topology of the NJ tree: cluster I~VII (Fig. 4A). The genetic distance matrix between individuals using a SNP database was used to construct the NJ tree. The cluster I contained one ancient landrace from WS02 of PR Basin, and the cluster II was consisted of 29 accessions, including 15 modern landraces and 14 ancient landraces, mainly distributed in YR Basin. The cluster III contained 69 accessions, including 68 ancient landraces and 1 modern landraces, mainly distributed in WS06. The cluster IV consisted of 12 accessions, including 11 ancient landraces and 1 modern landrace. ten ancient landraces from WS06, 1 modern landrace from WS02 and 1 ancient landraces from WS03. The cluster V contained 44 accessions, 37 accessions of which was ancient landraces and 7 accessions of which was modern landraces, 35 accessions from YR Basin and 9 accessions from PR Basin. The cluster VI was consisted of 42 accessions, including 36 ancient landraces and 6 modern landraces, thirty accessions from PR Basin and 12 from YR Basin. The cluster VII contained 56 accessions, including 51 modern landraces and 5 ancient landraces, 17 accessions from PR, 35 accessions from YR and 4 accessions from OT (Additional file 3: Table S1).
We further analyzed the MAF, Pi, and genetic distance among whole set (253 cultivated-type tea accessions), core set, and mini-core set. There is no significant difference in MAF and Pi between core set and whole set, while the mini-core set captured 97% Pi and MAF of the whole set. The genetic distance of the mini-core and core sets decreased slightly, while the minimum value (lower limit) of the genetic distance range of whole set, core set and mini-core set were 0.036, 0.076 and 0.076 (Fig. 4A, Table 4). The proportion of accession with pairwise genetic distance value in 0.200-0.250 among core set and mini-core set increased obviously (Fig. 4B). In general, these results suggested that the min-core and core sets contained accessions from seven clusters of NJ tree, two basins, seven water systems, and two cultivated statuses, and can represent the genetic diversity of whole set. (Fig. 4, Additional file 3: Table S1).
Table 4
genetic diversity of mini-core and core set of cultivated-type tea plant of Guizhou plateau.
group | simple size | Pi | MAF | AGD | GDR |
mini-core | 85 | 0.223b | 0.145b | 0.211c | 0.076-0.265 |
core set | 195 | 0.230a | 0.149a | 0.217b | 0.076-0.273 |
whole set | 253 | 0.230a | 0.149a | 0.265a | 0.036-0.347 |
Note: Pi heterozygosity, MAF minor allele frequency, AGD average genetic distance, GDR Genetic distance range; |