Molecular identification of Lingyun Baihao wild and cultivated tea through genome-wide sequencing

The tea plants have been one of the most widely consumed and oldest nonalcoholic beverage crops in the world. Lingyun Baihao tea plants are unique species and widely distribute in Guangxi. It is well known that aroma usually determine the quality of tea. In recent years, molecular identification has been employed in the tea plant at the genetic level. However, the association of aroma loss issues to genetic diversities between wild and cultivated tea plants is not clear. Here, we identified 112,938,224 single nucleotide polymorphisms (SNPs) and 10,736,211 insertions/deletions (InDels) for the tea plants through the whole genome sequencing technology. Both SNPs and InDels have certain characteristics in the tea genome, which transition in SNPs type distributes wildly comparing with transversion and single base variation has absolute advantage in InDels. These markers demonstrated that DNA polymorphisms and high diversity. Collectively, we confirmed that the loss of aroma might due to genetic differences at the molecular level between wild and cultivated tea plants, which will be valuable resources for further genetic researches of Lingyun Baihao tea plant.


Introduction
The tea plant (Camellia sinensis (L.) O. Kuntze) is one of the most popular crops worldwide (Xia et al. 2017). Its leaves are rich in tea polyphenols, carotenoids and theanine, which are beneficial to human health. Among them, carotenoids are fat-soluble pigments in fresh tea leaves and become the aroma precursors determinant for aroma quality (Feng et al. 2019), which not only determine the quality of tea, but also give tea great medicinal benefits (Yang et al. 2009). Self-incompatibility and longterm allogamy genetically bring about the tea plant to its high heterogeneity and distinct genetic variation (Chen et al. 2005). Therefore, compared with other crops, such as tomato (Labate and Baldo 2005), soybean (Ramakrishna et al. 2018) and maize crops (Wang et al. 2020a, b), the research on tea plants by molecular biology is relatively backward, which hinders researchers from finding the reason for the loss of aroma and from further improving the tea quality. With two recent draft genome sequence successive releases of CSA 'Yunkang 10' (Xia et al. 2017) and CSS 'Shuchazao' , tea plant has recently rapidly become experimental model for functional genomic researches. The third-generation single nucleotide polymorphism (SNP) is gradually applied for pedigree analysis, origin and evolutionary analysis, population structure and diversity analysis, construction of linkage maps, QTL mapping, and marker-assisted selection Yamashita et al. 2019;Xia et al. 2019;Zhang et al. 2014;Hu et al. 2014;Zhang et al. 2017;Belaj et al. 2018;Xu et al. 2018;Sarkar et al. 2019). Besides, restriction fragment length polymorphisms (RFLPs), random amplification of polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs), cleaved amplified polymorphic sequences (CAPS), simple sequence repeats (SSRs), and inter-simple sequence repeats (ISSRs) molecular markers have been successfully applied in genomic studies in tea plant (Mukhopadhyay et al. 2016). Among them, single nucleotide polymorphisms (SNPs) identified simple sequence repeats and development by genomewide scan, based on DNA polymorphisms, are most useful and powerful tools for genetic researches (Ujihara et al. 2011;Zhang et al. 2014). For example, numerous SNPs found from cultivated and wild tea suggesting that artificial selective footprints during domestication of these tea accessions (Yang et al. 2016). Lingyun Baihao tea resources currently owned by Lingyun prefecture were surveyed, which covers a wide range of growing areas and has a long history of cultivation. The association of tea aroma with genetic difference between ancient and cultivated tea to date, however, is still unclear. Therefore, it is urgent to figure out the problem in the genetic level of Lingyun Baihao tea.
This research believes that genetic differences contribute to the difference in aroma between ancient and cultivated tea. In this study, we genetically analyzed several SNPs and InDels variations in Lingyun Baihao tea genome by adopting the whole genome sequencing technology. CSS 'Shuchazao' was the reference genome sequences and promoted to complete several principal objectives (Xia et al. 2020a, b). Then, whole genome genetic variation and distribution patterns of the tea plant were investigated. Thus, a number of SNPs and stable InDels were developed, identifying genetic variations and newly developing abundant InDel markers would provide most precious resource for Lingyun Baihao tea plant genetic and genomic studies. Collectively, we confirmed that the loss of aroma in cultivated tea plants is due to genetic differences at the molecular level, which will be valuable resources in further genetic researches of Lingyun Baihao tea plant.

Plant materials
In this study, we had collected 3 tea samples in total, which were from Lingyun, Baise, a southern city in China. As described in this paper, the sample named CWLS was an artificial breed and located in a tea plantation (E 106° 26′ 42″, N 24° 21′ 12″, Baise, China). Similarly, GYLC, was also a cultivar in other tea gardens (E 106° 32′ 35″, N 24° 31′ 43″, Baise, China). The last one was an ancient tea species, XFL, which was from a primitive ecological mountain (E106° 32′ 19″, N 24° 32′ 4″, Baise, China). To identify genetic variation in the tea genome, young leaves were harvested from these samples respectively and chilled immediately in dry ice, and they would be stored in 4 ℃ until use.

Phenotyping
The understanding of Lingyun Baihao tea is limited to the processing technology, and its detailed plant morphology is not very clear. To investigate the problem, we take some measures, such as field investigations, private interviews, local social survey, specimen collection, specimen identification and combined with data verification, to find out phenotype characteristic of the tea plant. Leaf samples from tea plants were scanned to measure leaflet length, leaflet width, serrations, and lateral veins number. Leaf samples collected were the fully expanded lateral older leaf; duplicate samples were collected from each plant. Scanned leaves were analyzed using excel spreadsheet. Main stem widths of the plants were recorded on Aug. 22, 2020.

DNA extraction
Total DNA genome was extracted from 0.25 g of each tea sample using the DNA isolation kit (MoBio Laboratories Inc., Carlsbad, CA, USA) following the manufacturer's instructions. DNA quality was determined by electrophoresis using a 0.8% agarose gel and also by DNA quantification using a NanoDrop spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE, USA). After that, DNA genome was stored at -80℃.

SNP and InDel analysis
We take paired end sequencing and read 150 bp in each end by Illumina HiSeq platform. CSS 'Shuchazao' as reference genome sequence was used for initial alignment of raw sequences. For three sequences from the experimental genome sequence using the BWA.0.7.5a (Li and Durbin 2009;Pabinger et al. 2013) aligner with the backtrack approach. SAMTOOLS (Li and Handsaker 2009) removes duplicates from the comparison results. After alignment of raw reads was used to identify SNPs and InDels by DePristo report (DePristo et al. 2011). The raw reads of all samples were deposited in the National Genomic Date Center (NGDC) under the BioProject number PRJCA009765.Please access it from the following link: https:// bigd. big. ac. cn/ gsa/ browse/ CRA00 7064.

Surviving ancient tea tree
Lingyun Baihao tea is one of the ancient tea trees with a long history in Guangxi. We found that a living Lingyun Baihao ancient tea plant with the largest trunk, about 130 cm tall unexpectedly (Fig. 1A). Interestingly, it was obvious that the aroma of wild Baihao ancient tea is stronger than that of cultivated tea (Fig. 1B, C). The main reason may be that the flavor of tea products varies with the age of the leaves and buds, as the chemical compositions change with age, which was in good agreement with previous study (Li et al. 2015). Then, we measured the leaflet length, leaflet width, serrations, lateral veins number of the Baihao cover in each sample, respectively ( Fig. 1D-G). Botanical characteristics show that the older leaf area of wild ancient tea plant is large, with more serrations and thicker of the Baihao cover. However, their numbers of lateral veins have no significant difference (Fig. 1G). These results clearly indicated that there are differences between cultivated and wild ancient tea trees, and the morphology of cultivated tea trees has changed mainly due to various complex factors. To analyze the organ tissues of the tea systematically, we selected leaf tissues for further genome-wide sequencing analysis in this study.

Genome-wide sequencing, statistical analysis
Since the genomes of the tea plant C. sinensis var. sinensis (CSS) and C. sinensis var. assamica (CSA) (Xia et al. 2020a, b;Zhang et al. 2020a, b) were deciphered recently, the genomic research on it has been further advanced and a lot of useful information has been excavated. The three tea samples, that is, CWLS, GYLC and XFL, respectively, were taken from Lingyun, Baise, Guangxi. These three tea plants were sequenced through HiSeq platform of Illumina. After that, we got enormous data, including the GC content and mapped reads of genome as described in Table 1. The total sequencing data was 242.58G, with an average of 80.8601G per sample. The amount of high-quality clean data was 241.382G, with an average of 80.4606G for each sample. In our datasets, the sequencing of CWLS sample produced more raw data than the other two, about 9.1 × 10 10 bps. And XFL, the wild sample, produced the least data. Moreover, we obtained more than 499 million total reads in each sample and the ratio of mapping to the genome was more than 95% in each sample. Similarly, reads generated by the wild tea plant were less than the cultivated samples. In addition, the average sequencing depth of CWLS, GYLC and XFL was 26.77X, 28.3X, 25.67X, respectively (Table 1). It was stable for the 3 samples to generate almost the same GC content, which was about 40%. This result was in line with the previous report ). The quality of sequencing data is mainly determined by the values of Q20 and Q30. Generally, a Q30 value greater than 85% indicates acceptable quality. In this study, all Q20s and Q30s were well above 90%, which were provided in Table 1. Therefore, our data was suitable for further analysis.

Detected SNPs mutation sites and annotation
Single nucleotide polymorphisms (SNPs), is used to identify the alteration of genome such as transition or transversion of a single base. After applying above criteria, we obtained 112,938,224 SNP's sites in total and average value was 37,646,074 of each sample in Table 2. SNPs located in different positions of genomes, such as upstream or downstream of a gene within 1 Kb, exonic, intronic and intergenic region ( Table 2). It was shown that the number of SNPs distributed in the intergenic region was more variable than any other regions of each sample. For instance, variants of XFL sample in the intergenic region were 27,057,718 larger than in the exonic or intronic region, which was 511,399 or 2,332,815, respectively. It seemed that SNP's loci in intergenic region might play a key role in the evolution of tea plant. In the meantime, variants in the intergenic region of wild tea plant are also less than the cultivated tea plants.
In general, SNPs can be divided into two genotypes: one is transition (Ts), namely G/A and C/T; the other is transversion (Tv), namely A/T, C/G, A/C, G/T. The numbers of Ts of CWLS, GYLS, XFL tea plants were 35,263,114,32,550,860 and 32,526,767,respectively. The Tv numbers were 10,333,791,9,610,388,9,607,801, respectively. The former accounted for 88% (100,340,741) in total SNP's sites which occupied almost all compared to the later suggested that transition (G/A, C/T) was dominant in SNPs of tea plant in our study (Fig. 2). This result was consistent with the view that transitions turned out to be more frequent than transversions (Wakeley 1996). Moreover, the ratio of Ts/Tv was 2.8 in average and the heterozygosity rate (Het rate) value ranged from 7.091% (XFL) to 9.641% (CWLS). Interestingly, it was a trend that Ts, Tv, Ts/Tv and Het rate values were all decreasing from CWLS to XFL, which means that the ancient tea plant was less genetically diverse than the cultivated tea plants. We speculated that it might be associated with the growing environment in evolutionary process of the tea plants. In order to observe all types of SNPs more intuitively, we plotted the SNPs mutation spectrum (Fig. 2). It showed that the variants of transition types outdistanced transversion types, which were more than two times. Collectively, this study showed that SNPs distributed widely and had large mutation sites in the Baihao tea plant genome and the wild tea plant has less SNPs variation than cultivated.

Detected InDels and annotation
In addition to SNPs, genetic variation also involves the insertion and deletion of small fragments (InDels). From the bioinformatics analysis, we obtained 10,736,211 InDels in total. The average number of InDels was 3,578,737 among the three samples (Table 3). Like SNPs, InDels are most widely distributed in the intergenic region, with an estimate of 8.9 million in total. It's worth noting that the trend of changes for all types in InDels is about the same as SNPs, which was a declining tendency. Unsurprisingly, Insertion/deletion ratio in the wild sample, which is 1,380,370/2,082,904, is less than cultivated samples. In these samples, the wild sample always generated fewer information than the other two samples and it may facilitate us understanding clearly with the tea plant. To further determine InDel length distribution, we showed 1-21 bp length InDels information. InDels divided into two parts, one was in CDS region; the other was in genome. In the CDS region, 1 bp InDel accounted for about 40% of the total, which dominates the entire length distribution (Fig. 3A). Similarly, more than 40% InDels had 1 bp length in the genome (Fig. 3B). It suggested that most genetic variations were deletions or insertions of single bases in the tea plant. It is obvious that the longer the base length of the mutation, the fewer InDels corresponding to it in the genome length distribution (Fig. 3B). However, it is not an evident tendency in CDS length distribution which is uneven ranged from 1 to 21bps. For instance, among 1 bp, 3 bp and 6 bp length, their InDels had about 40%, 15%, 10%, representatively, whereas both 2 bp and 4 bp had less percentage. Thus, as the length of the indel increases, it presents a distribution that was irregular fluctuation (Fig. 3). Taken together, our results not only uncovered a large polymorphic marker of the Lingyun Baihao tea through detecting InDels, but also Fig. 1 Phenotyping test of tea tree. A The tree trunk of ancient tea plant, XFL; two leaves and one bud of it was shown in middle; the two on the far right are the front and back of the leaf. B The tree trunk of cultivar tea plant, CWLS; two leaves and one bud of it was shown in middle; the two on the far right are the front and back of the leaf. C The tree trunk of cultivar tea plant, GYLC; two leaves and one bud of it was shown in middle; the two on the far right are the front and back of the leaf. D-G Three independent tea trees botanical characteristics. D leaflet length, E leaflet width, F serrations number, G lateral veins number. Data are means ± SD of biological replicates (n = 10). Different letters above the bars indicate significant differences (P < 0.05, Tukey's test) ◂ provided an important clue for further studying the cause of aroma loss in this accession.

Discussion
Tea plants have been one of the most widely consumed and oldest nonalcoholic beverage crops in the world (Chen et al. 2006;Li et al. 2016;Xu et al. 2019). Plenty of tea plant functional studies have been carried out in different laboratories, since the draft genome sequence of Camellia sinensis var. sinensis released . For example, Zheng et al. (2019) reported that non-additive action plays a major role in tea volatile heterosis. And a study revealed several key candidate genes and nucleotide variants for genetic improvement of quality characteristics in tea (Maritim et al. 2021). Moreover, Wang et al. (2020a, b) assembled a 3.26-Gb high-quality chromosome-scale genome for the 'Longjing 43' cultivar of Camellia sinensis var. sinensis.
The Lingyun Baihao tea is unique in the tea plant species which is in the Lingyun, Guangxi. However, no other study was carried out about its genetic variation. Besides, human activities also are the additional factors that have been impacting the genetic diversity of tea plants and the adverse effects of encroachment of humans are increasing continuously (Zhao et al. 2014). Therefore, nothing is more important than detecting genetic level of characteristics and discovering some special traits in order to provide a unique opportunity to understand the difference between wild and cultivated tea plants. In this study, we comprehensively identify genetic mutation through SNPs and InDels analysis, which was the first time to reveal the highly genetic diversity to date in the Lingyun Baihao tea and it confirmed that the loss of aroma might due to genetic differences at the molecular level in cultivated tea plants.
Blasting with reference genome, we obtained about 1.7 billion reads in total, including 112,938,224 SNPs and 10,736,211 InDels. Study results showed DNA polymorphisms and high genetic diversity in Lingyun Baihao tea genome. This study also showed that most of SNPs type variants were transitions in the tea genome (Fig. 2), which was similar to other plant species (Bai et al. 2013). And the wild sample was identified a set of 36,665,519 SNPs with about 73.8% located in intergenic regions, which was less than the cultivated samples. SNPs we obtained outdistanced others, such as Xia et al. (2020a, b) reported that there were 6,252,201 SNPs among 81 tea accessions. The main reason of this maybe using different analysis tools and setting different parameters for different studies, such as setting the mapping quality of SNP should not be less than 20 in this study, while Xia et al. set this value not less than 40. Similarly, in terms of InDels, single base variation had absolute advantage, which was about 40% in CDS region and more than 40% in the genome (Fig. 3). In other words, there were few InDels larger Fig. 2 SNPs mutation spectrum. T:A > C:G and C:G > T:A belong to transition type. T:A > G:C, T:A > A:T, C:G > G:C and C:G > A:T belong to transversion type. The abscissa represents the number of SNPs, and the ordinate represents the type of SNP mutation  . 3 The length of InDel distribution in the CDS region and genome. A. CDS InDel length distribution. B Genome InDel length distribution. The ordinate is the length of InDel, and the abscissa is the proportion of the number of InDel of this length to the total than 10 bp in the three tea samples. The difference between CDS region and genome is mainly because genome is very large, including coding regions and non-coding regions. We identified a more significant number of SNPs and InDels, but we didn't know which sites could be the most important one to regulate gene expression. And what is the relationship between phenotype and SNPs/InDels? Multiple quantitative trait loci (QTLs) and environmental factors contribute to phenotypic variation. A SNP-related variation explained more than 4.0% of the variance in tea plant across some environments by association mapping (Jin et al. 2016). According to the report, it seemed that natural tea plant had a less genetic variation comparing to the artificial cultivation (Niu et al. 2019). This view supported our data that XFL, wild sample, had less SNPs and InDels loci than CWLS/ GYLC, cultivated samples. Many genetic variations and significant allelic imbalance were found, including alleles associated with aromas and stress. Several biological processes, like reactive oxygen species (ROS) and terpenoids metabolism, have been enriched in genes . A total of six putsative terpene biosynthesis gene clusters had been identified from the tea plant genome, followed by the discovery of three new genes of the TPS-b subfamily, CsTPS08, CsTPS10, and CsTPS58 (Qiao et al. 2022).
There is a possibility that reducing expression of one or two of these genes may result in a loss of aroma in tea plant. And it was also mainly due to the change of tea plant development environment and the epigenetic mechanism influences the genotype of tea tree (Azizi et al. 2020;Zhang et al. 2021). In addition, during tea plant evolution, there were numerous differences in these variation characteristics compared with Arabidopsis (Jander et al. 2002) and many factors involved in the tea evolution process (Zhang et al. 2021). For instance, the molecular mechanism coupled with long non-coding RNAs in the tea plant. Moreover, Jeyaraj et al (2019) reported that miRNAs played an important role in tea plant immunity, such as stress responses against C. gloeosporioides stress. And these factors may be associated with the loss of aroma of the cultivated tea. Obviously, a bred tea plant would differ from a wild tea plant in terms of genotype. It may be due to factors such as incomplete sampling of wild teas and the presence of wild teas in cultivation (Zhang et al. 2020a, b). For example, there was a significant difference between the cultivation type and wild type in the percentage of genetic diversity (GD), observed heterozygosity (Ho) and polymorphic loci (PPL). Unexpectedly, the cultivation type was higher than the wild type in these aspects (Niu et al. 2019). However, there were also some defects in this study. For example, we have not enough samples to study the phylogenetic structure among Lingyun Baihao tea plants. But, it is not an obstacle for us to reveal that the loss of aroma might be due to tree age and genetic differences at the molecular level in cultivated tea plants comparing with the wild. It was obvious that the aroma of wild tea leaves was significantly stronger than cultivated in tea gardens after our observing. However, we don't reveal the exact mechanism of aroma loss in the cultivated tea plants. According to the report, tea aroma formation can be divided into three types, which are enzymatic reactions when the leaf cell is alive, enzymatic reactions when the leaf cell is disrupted and thermophysical and chemical reactions (Zeng et al. 2019). These reactions are associated with biotic stress and abiotic stress, which can enhance the expression of aroma synthetic genes. This view indirectly supports that the different environments might affect genetic diversity among Lingyun Baihao tea plants. Under stresses, tea plants could form some specialized metabolites as a response mean, such as catechins, amino acids, caffeine, and aroma compounds, which contribute to characteristic tea function and flavor . In addition, Yamashita et al. (2020) had developed a new genomic predictions (GPs) model to identify SNPs potential genes related to tea plant metabolism which was of great significance to the future development of tea industry. Although very large numbers of omics studies had been employed in the tea plants (Xia et al. 2017;Li et al. 2017;Wei et al. 2018;Xia et al. 2020a, b;Zhang et al. 2020a, b), little genetic information was available for the local characteristics of Lingyun Baihao tea plants in Guangxi and the important role of wild tea has relatively underappreciated.
In the future, further work is promising because the diversity of metabolic characteristics of tea plant are considered as vital for the flavor and quality of tea products Zhang et al. 2020a, b). Besides, it is very meaningful to study the molecular mechanism of tea plant evolution and aroma loss of cultivated tea tree, such as detecting the content of catechins (Jiang et al. 2020), identifying molecular markers related to the theanine (Fang et al. 2021), regulation of secondary metabolites (Shi et al. 2011;Li et al. 2015), epigenetic regulation of DNA methylation (Xia et al. 2020a, b), even using CRISPR-Cas9 technology to edit gene of plant . The rapid development of sequencing technology will promote the omics studies between wild and cultivated tea plants, revealing the special traits of wild tea plants and help people better understand the value of Lingyun Baihao wild tea plants.

Conclusion
This study first revealed the genetic diversities to the world through identifying and displaying an enormous number of SNPs and InDels in Lingyun Baihao tea, which indicates that the genome of the tea plant is highly genetically variable. SNPs and InDels markers provided a practical method not only to evaluate the genetic diversity between the wild tea plants and cultivars, but also provided a new insight for people to understand the value of wild tea plants and give a clue for further studies.