Population Structure Analysis Revealed Water or Ancient Hub-induced Genetic Diversity of Cultivated-type Tea Plants in Guizhou Plateau

We collected 253 cultivated-type tea plant accessions from Guizhou plateau and analyzed the genetic diversity, PCA, phylogenetic, population structure, LD, and development of core collection using the genotyping-by-sequencing (GBS) approach. A total of 112,072 high-quality SNPs were identied, which was further used to analyze the genetic diversity and population structure. In this study, we found that the genetic diversity in cultivated-type tea accessions of PR Basin were signicantly higher than that in cultivated-type tea accessions of YR Basin. Moreover, four groups, including three pure groups (CG-1, CG-2 and CG-3) and one admixture group (CG-4), were identied based on population structure analysis, which was veried by PAC and phylogenetic analysis. Our results showed that the highest GD and Fst values were found in CG-2 vs CG-3, followed by CG-1 vs CG-2 and CG-1 vs CG-3. The lowest GD and Fst values were detected in CG-4 vs CG-1, CG-4 vs CG-2, and CG-4 vs CG-3. allele frequency, Fis inbreeding coecient; letters indicate a signicant difference in p= levels by the PR, River WS01 River River WS03 Beipanjiang River System WS04 River River contains WS05 Yuanjiang River System, WS06 Wujiang River System, WS07 Chishui River

Previous studies revealed that wind [28], water [29], animals [30,31] and human activity [32] contributed to the distribution and genetic diversity of plants. To further explore the distribution characteristics and transmission model effect of the basins and ancient hubs on the distribution and genetic diversity of tea plants, we sampled 253 cultivated-type tea accessions from 32 regions distributed in seven water systems in PR and YR basins of Guizhou Plateau. Base on GBS analysis, the genetic diversity, population structure, and linkage disequilibrium were investigated using the SNPs data of 253 cultivated-type tea accessions. Moreover, we further explore the contribution of basins and ancient hubs to the genetic diversity of cultivated-type tea plant population distributed in PR and YR basins of Guizhou Plateau. In addition, the core collection of these tea plant accessions was constructed.

Sequencing and variant discovery
A total of 253 cultivated-type tea plant accessions, including 172 ancient landraces and 81 modern landraces distributed in PR and YR Basins of Guizhou Plateau [14], were used in this study. Of these, the geographic distribution of 249 accessions in the PR and YR basins of Guizhou Plateau was shown in Fig. 1A, and the other four accessions introduced from other provinces and cultivated in many tea gardens in Guizhou Plateau were also included in this study. GBS analysis of 253 cultivated-type tea plant accessions was performed using Illumina HiSeq Xten platform. A total of 255.2 GB clean data with an average of 1.00 Gb clean data for each accession was obtained (Additional le 1: Table S1). We mapped the clean reads in to released tea reference genome sequence (http://tpia.teaplant.org/). Our results showed that 29,393,327 single nucleotide polymorphisms (SNPs) were identi ed using GATK (v3.7.0). After ltering, 112,072 high-quality SNPs were identi ed and calculated heterozygosity values, fund that the average of heterozygosity of each accession is 7.89% (Additional le 1: Table S2). Based on the nucleotide substitution, 112,072 SNPs were classi ed into transitions and transversions. Transitions were observed in 1,585,601 (77.87%) and transversions in 450,567 SNPs (22.13%). The frequency of substitutions was 137,380 (6.75%) A/T, 112,861 (5.54%) A/C, 117,244 (5.76%) G/T, 83,082 (4.08%) C/G, 805,072 (39.54%) C/T, and 780,529 (38.33%) A/G, with the transitions to transversions ratio of 3.51 (Table 1). Table 1 Genetic diversity parameters of 253 cultivated-type tea accessions in Guizhou plateau.  Fig. 1A). In the PR Basin, the Pi, Ho and MAF values in tea population of WS02 were signi cantly higher than those in other water systems. The Fis value in WS04 was higher than those in other water systems. In the YR Basin, the Pi and MAF in WS06 were signi cantly higher than those in WS05 and WS07 except that Ho value. The Fis value in WS07 was higher than other water systems. We estimated the genetic diversity of two cultivated statuses (ancient landrace and modern landrace) of the cultivated-type tea population, the Pi and MAF was signi cantly higher in ancient landrace than in modern landrace. The Ho was signi cantly higher in modern landrace than in ancient landrace ( Table 2).
The positive Tajima's D value of all test tea populations suggested that the evolution of these tea populations displayed the population bottlenecks, and/or balancing selection ( Table 2). The analysis of the genetic differentiation coe cient (Fst) and genetic distance (GD) of seven water systems showed that the pairwise Fst value were less than 0.05 among seven water systems. The highest pairwise genetic distance (GD) was in WS04 vs WS02, WS05. The lowest pairwise genetic distance (GD) was in WS01 vs WS06, WS07. (Table 3).  Population structure, PCA, and phylogenetic analysis A total of 112,072 high-quality SNPs were used to analyze the population structure, PCA and phylogenetic analysis of 253 cultivated-type tea plant accessions. Population structure analysis revealed that the minimum cross-validation (CV) error occurred when K equals 3, indicating that three ancestral groups and one admixture group were identi ed ( Fig. 2A, Additional le 2: Fig. S1). The accessions with the membership coe cients greater than 0.80 were assigned to corresponding pure group, while those with the membership coe cients less than 0.80 were assigned to the admixture group.
To further verify the stability of potential population structure, the principal component analysis (PCA) and NJ phylogenetic tree were used to explore the cluster relationships among the 253 cultivated-type tea accessions using 112,072 SNPs. PCA and NJ tree result reveals four major clusters corresponding to the clusters CG-1, CG-2, CG-3 and CG-4, which further veri es the accuracy of population structure. (Fig. 2B, Fig. 2C).

Linkage disequilibrium analysis
Linkage disequilibrium (LD) was commonly used to reveal much about domestication and breeding history. We report LD estimation in a population of 253 accessions using 29,393,327 non-LD pruned SNPs. Our results showed that the LD decayed rapidly with the increasing physical distance. LD dropped to half of its the maximum value at 10 kb (Fig. 2D). Moreover, the lowest LD decay was observed in CG-2, while the most rapidly LD decay was found in CG-4 among 4 inferred populations (Additional le 4: Fig.  S1).

Genetic differentiation analysis of inferred populations
Based on the population structure analysis, the genetic diversity among the inferred populations (CG-1, CG-2, CG-3, and CG-4), including Tajima's D, Pi, Ho, and MAF were calculated (Fig. 3). Our results showed that the Pi, Ho, and MAF values in CG-4 population were signi cantly higher than those in pure populations (CG-1, CG-2, and CG-3). Except MAF value was no signi cant difference detected in CG-4 and CG-3. Moreover, the Pi, Ho, and MAF values in CG-3 population were higher than that in CG-1 and CG-2 populations. The Pi, Ho, and MAF values in CG-2 population were higher than that in CG-1 population. The positive Tajima's D values were observed in four inferred populations suggested that the evolution of these four inferred populations displayed the population bottlenecks, and/or balancing selection (Fig. 3).
We analyzed the pairwise genetic differentiation coe cient (Fst) across four inferred populations. The mean Fst between CG-4 and CG-1 was 0.060, suggesting that there is moderate divergence between CG-4 and CG-1 populations. In addition, the mean Fst in CG-1 vs CG-2, CG-1 vs CG-3 and CG-2 vs CG-3 were 0.090, 0.089, and 0.091, respectively, suggesting that there is not only moderate divergence between modern landraces population (CG-1) and ancient landraces populations (CG-2 and CG-3), but also exists in ancient landraces of CG-2 and CG-3 populations. The highest genetic distance (GD) was observed in CG-2 vs CG-3, and the lowest GD was found in CG-4 vs CG-1 (Fig. 3).

Development core collection
The core and mini-core sets were developed to choose a minimum number of accessions representing the maximum diversity of the original collection, which was used in molecular marker assisted breeding, GWAS and other purpose [33][34][35]. The maximum length subtree method implicated in DARwin v.6.0.17 was repeatedly used to remove the most redundant accessions until the pruned edge and sphericity index percentage tend to level corresponding to 195 accessions (Additional le 3: Table S1). The 195 accessions were selected to represent 253 cultivated-type tea accessions (referred 'core set' from now). On x-axis where the number of accessions decreased from 195 to 85, the pruned edge and sphericity index percentage were increasing stably and slowly, indicating that the information of 111 accessions had no signi cant difference, suggesting the sphericity index and pruned edge had no signi cant impact after removing these accessions (Additional le 5: Fig. S1). The mini-core set was constituted by the remaining 85 accessions (Additional le 3: Table S1).
To further estimate the quality of mini-core and core collections, we construction the NJ tree to verify whether the backbone of the NJ tree has changed. The 253 cultivated-type tea plant accessions could be further divided into seven clusters based on the topology of the NJ tree: cluster I~VII (Fig. 4A). The genetic distance matrix between individuals using a SNP database was used to construct the NJ tree.
The cluster I contained one ancient landrace from WS02 of PR Basin, and the cluster II was consisted  Table S1).
We further analyzed the MAF, Pi, and genetic distance among whole set (253 cultivated-type tea accessions), core set, and mini-core set. There is no signi cant difference in MAF and Pi between core set and whole set, while the mini-core set captured 97% Pi and MAF of the whole set. The genetic distance of the mini-core and core sets decreased slightly, while the minimum value (lower limit) of the genetic distance range of whole set, core set and mini-core set were 0.036, 0.076 and 0.076 (Fig. 4A, Table 4). The proportion of accession with pairwise genetic distance value in 0.200-0.250 among core set and minicore set increased obviously (Fig. 4B). In general, these results suggested that the min-core and core sets contained accessions from seven clusters of NJ tree, two basins, seven water systems, and two cultivated statuses, and can represent the genetic diversity of whole set. (Fig. 4, Additional le 3: Table  S1).

Discussion
Previous studies revealed that wind [28], water [29], animals [30,31] and human activity [32] contributed to population distribution, genetic exchange, species population expansion and so on. However, the distribution characteristics and transmission model of cultivated-type tea plants was still unclear. In this study, 253 cultivated-type tea accessions, distributed in PR and YR Basin, were collected from Guizhou Plateau for the rst time. The population structure analysis, genetic diversity analysis, core collection construction and genetic exchange mechanism of these cultivated-type tea accessions were performed, and further analysis revealed that the ancient hubs and basins played important roles in the distribution characteristics and transmission model of cultivated-type tea plant in Guizhou Plateau.
Genetic diversity of cultivated-type tea plants GBS was widely used for analysis the population structure in many plants, such as maize [36], common bean (Phaseolus vulgaris L.) [37], wheat [38], and tea plants [14,27]. Previous study showed that 390.3 Gb clean data was obtained from 415 tea accessions with an average of 0.94 Gb clean data per accession. A total of 79,016 high-quality SNPs were identi ed [14]. We generated 255.2 Gb clean data from 253 tea accessions with an average of 1.00 Gb clean data per accession, and identi ed 112,072 high-quality SNPs, which was higher than the data reported in the previous study. Moreover, the transition/transversion rate was 3.51, which was higher than those in common bean (Phaseolus vulgaris L.) (1.27) [37], apricot (1.78~1.79) [39] and lettuce (2.10) [40] and lower than the previous report of tea plant (4.02) [14]. This result suggested that transitions were better tolerated the natural resistance, which could be that they were synonymous mutations in protein-coding sequences [41].
The Pi, Ho and MAF values of cultivated-type tea population in PR Basin were signi cantly higher than those in YR Basin, and the Fis of cultivated-type tea population in PR Basin was lower than that in YR Basin ( Table 2, Fig. 1B). These results further indicated that the genetic diversity of cultivated-type tea population in PR Basin was signi cantly higher than that in YR Basin. In addition, the higher and lower genetic diversity were detected in WS02 and WS04 of PR Basin, respectively. And the higher and lower genetic diversity were detected in WS06 and WS07 of YR Basin, respectively ( Table 2). A plausible explanation of these results could be WS02 admixed a great of modern landraces on the basis of the initial ancient landraces, and frequent genetic exchange occurred among the individuals of these two cultivated statuses. As an important river tra c, Wujiang River run through the whole Wujiang water system (WS06) and promoted the frequent genetic exchange of cultivated-type tea plant population. And WS04 and WS07 located at the edge of two basins, and few corridors were provided to promote the genetic exchange. The genetic diversity level was higher in ancient landraces than in modern landraces, except Ho value. That could be the cultivation of the ancient landraces was not for breeding purposes [14], while in order to serve production, the modern landraces had suffered a certain degree selects [14,42].
Previous studies showed that there exist population bottlenecks and/or balancing selection when the positive Tajima's D values was detected in a population [43,44]. The positive Tajima's D values were observed in all these populations, suggesting that these populations existed population bottlenecks, and/or balancing selection ( Table 2). The Fst is widely used as a measure of population structure and Fst between 0.00-0.05 indicate little divergence and 0.05-0.15 moderate divergence [45][46][47]. All the pairwise Fst value of the seven water systems were in the range of 0.00-0.05, indicating there were little divergence in these water systems.
Population structure, PCA and phylogenetic tree analysis of cultivated-type tea plants Population structure analysis showed that 253 cultivated-type tea accessions from Guizhou Plateau were grouped into four groups: three pure groups (CG-1, CG-2, and CG-3) and one admixture group (CG-4) (Fig. 1A). Our results showed that the CG-1 contain the four varieties introduce from other Province, which further revels that there were similar domestication and arti cial selection direction between modern breeding and modern landraces. The highest GD and Fst values were detected in CG-2 vs CG-3, suggesting that the genetic background of the wild ancestors of CG-2 and CG-3 was very different owning to the geographical separation caused by YR and PR basins. The GD and Fst values in CG-1 vs CG-2 and CG-1 vs CG-3 were higher than those in CG-4 vs CG-1, CG-4 vs CG-2, and CG-4 vs CG-3, and lower genetic diversity was detected in CG-1, suggesting that the moderate divergence and higher genetic distant between modern landraces (CG-1) and ancient landraces (CG-2 and CG-3), narrow genetic diversity in CG-1 were mainly caused by arti cial selection and domestication. The highest level of genetic diversity was detected in CG-4, and minimum GD and Fst values were discovered in the admixture group (CG-4) vs other groups, suggesting that the members of CG-4 group were derived from the cross-breeding among CG-1, CG-2, and CG-3. Moreover, the members of CG-4 group were distributed in the nearby river, ancient hub road section, and the junction of two basins (Fig. 1B, Additional le 6: Fig. S1), which caused the frequently cross-breeding among CG-1, CG-2, and CG-3.

Development core collection
Low breeding e ciency and lack excellent varieties were the major challenges in global tea industry [8]. Tea germplasms, as the invaluable fundamental resources of biotechnology studies and variety improvement, enhanced the extraordinarily rapid development of tea plant genomics, genetics, and breeding [48][49][50][51][52]. The core collection was used for detecting the novel variation, selecting excellent varieties, and providing the excellent germplasms for breeders by using smaller populations and greater genetic diversity [53]. So far, development of core collection was widely used in many plants, such as cowpea [54], alpine plum [55], walnut [56], tea plant [27,48] and so on. However, development of the core collection of cultivated-type tea accessions from Guizhou Plateau has not been reported. In this study, the core set and mini-core set we developed contain 77.0% and 33.6% of number of individuals of initial set, separately. These accessions of mini-core and core sets accounting for two cultivated statuses (modern landraces and ancient landraces), two basins (YR Basin and PR Basin) and seven water systems (WS01~WS07). In addition, the contain number proportion of accession was consisted with the genetic diversity of modern landraces and ancient landraces. The core set were almost as the initial set in Pi, MAF and AGD. These results indicated that the core set can well represented 253 cultivated-type tea accessions for further researches. In order to save management costs, the mini-core set with smaller number represent most of genetic of initial set, which will be used in eld experiment, association analysis and providing parent lines for variety improvement [57].

Conclusions
Base on the GBS analysis of cultivated-type tea plant accessions, a total of 112,072 high-quality SNPs were identi ed, which was further used to analyze the genetic diversity and population structure. In this study, we found that the genetic diversity in cultivated-type tea accessions of PR Basin were signi cantly higher than that in cultivated-type tea accessions of YR Basin. Moreover, four groups, including three pure groups (CG-1, CG-2, and CG-3) and one admixture group (CG-4), were identi ed based on population structure analysis, which was veri ed by PAC and phylogenetic analysis. Our results showed that the highest GD and Fst values were found in CG-2 vs CG-3 owning to the geographical separation caused by YR and PR basins. The moderate divergence between modern landraces (CG-1) and ancient landraces (CG-2 and CG-3), and lowest genetic diversity in CG-1 were mainly only caused by arti cial selection and domestication. The members of CG-4 group were derived from the cross-breeding among CG-1, CG-2, and CG-3. Moreover, the members of CG-4 group were distributed in the nearby river, ancient hub road section, and the junction of two basins which caused the frequently cross-breeding among CG-1, CG-2, and CG-3.
Finally, we developed the mini-core and core collections of cultivated-type tea plant, this information will bene t the germplasm protection and management, genome-wide association studies and breeding.

DNA extraction, Library construction and Sequencing
The Plant Genomic DNA Rapid Extraction kit (Beijing Biomed Gene Technology Co., Ltd., Beijing, China) was used to extract genomic DNA from the samples according to the instructions provided by the manufacture. The DNA isolated from each sample was digested by restriction endonuclease SacI and MseI (5 U, NEB) and the adaptors "SacAD and MseAD" with unique barcodes were ligated with the DNA fragments, which were separated on 2% agarose gel and the length range from 500 to 550 bp was chosen for ampli cation before sequencing on an Illumina Hi-seq platform. The original paired-end sequence length was 150 bp [14,58].

Sequence alignment and SNP identi cation
The barcodes were used to de-multiplex the raw DNA reads with trimming the adaptors using a custom perl script. Only reads with quality values >5 were retained and mapped to the reference genome

Development of core collection
The NJ tree was generated based on the 112,072 SNPs. Then, the 'maximum length subtree function' was used to develop the core collection as described previously for tea. The threshold and steps of development the core collections were completely referred to our previous report [27].   [12]. For the three membership coe cients, CG-1 (the modern landraces group) was in yellow, CG-2 (the PR ancient landraces group) was in red and CG-3 (the YR ancient landraces group) was in blue in the pie chart.