Comparative Transcriptome Analysis Reveals Putative Genes Responsible for High Theacrine Content in Kucha (Camellia kucha (Chang et Wang) Chang)

The purine alkaloid theacrine, synthesized in a high amount in Kucha (Camellia kucha (Chang et Wang) Chang), has shown multiple pharmacological effects such as anti-depression, dehydration and hypnotic activities. It is critical to better understand the theacrine metabolism at the molecular level for future breeding programs. To identify the putative genes involved in theacrine biosynthesis, we carried out comparative transcriptome analysis between Kucha (Kucha-6 and Kucha-11) and conventional varieties (Yinghong 9 and Qingxin 1). HPLC showed that Kucha synthesized more than 27-fold theacrine in comparison to conventional varieties. A total of ~ 671.61 million high quality clean reads were yielded, and 71.42 ~ 76.87% of these were aligned to the reference genome for each library. We identified 19,095 differentially expressed genes (DEGs) by a pairwise approach, among which 1,142 common DEGs between Kucha and conventional varieties were screened. Functional annotation allowed the identified of 20 oxidoreductase (OR) genes and seven methyltransferase (MT) genes with enhanced expression in Kucha, which might be associated with theacrine diversity in the examined materials. Particularly, two N-methyltransferase-encoding genes (TEA010054 and TEA022559) may catalyze the final methylation step during theacrine synthesis in Kucha. The information generated will be helpful in identifying candidate genes and breeding tea plant with high theacrine content.


Introduction
Purine alkaloids (PAs), existing in about 100 species in the plant kingdom, are secondary metabolites formed from purine nucleotides (Ashihara and Crozier, 1999). As one of the most important non-alcoholic beverages, tea plant synthesized diverse high-value pharmaceutically important PAs including fatigue resistance caffeine, antimicrobial agent theobromine and potential asthma drug theophylline. The theacrine, only accumulated in high amounts in Kucha, also displayed several beneficial effects in improving learning and memory (Li et al. 2015), anti-depressive (Guo et al. 2009) and hypnotic activities (Qiao et al. 2017).
To elucidate the rate-limiting steps of PAs biosynthesis in tea plant, numerous studies have been conducted, and a few genes involved in PAs turnover have been characterized. Among these genes, tea caffeine synthase 1 (TCS1) was first isolated in tea leaves (Kato et al. 2000) and shown to control caffeine accumulation by two molecular macular mechanisms: with reduced transcription level or the encoded proteins with only N-3 methylation activity (Jin et al. 2016). Recently, the theacrine synthase CkTcs with key residues that are required for the N9-methylation, was confirmed to catalyze the final methylation steps during theacrine synthesis in Kucha (Zhang et al. 2020ab), and thus were suggested as rate-limiting enzyme in PAs biosynthesis.
Usually, caffeine is the predominant purine alkaloids (2-5%) in young leaves of tea plant, followed by theobromine (Wan et al. 2013). In contrast, theacrine is the major purine alkaloid in Kucha, which is a special tea resource first discovered in Yunnan Province (Ye et al. 1999). Tracer experiments concluded that theacrine was synthesized from the downstream branch of caffeine metabolism pathway (Zheng et al. 2002). In this sense, Kucha can be considered as a class of mutants of normal tea and thus are ideal materials to elucidate PAs biosynthesis pathways and identify related candidate genes.
RNA sequencing (RNA-seq) has become a useful tool for identifying and characterizing secondary metabolism genes and their pathways in plants Guo et al. 2019;Wang et al. 2020b;Zhou et al. 2019). The recent release of complete tea genome sequence and detailed annotations allowed exploratory analysis of DEGs associated with important secondary metabolisms . In this context, we established complete transcriptome of Kucha (Kucha-6 and Kucha-11) and conventional varieties (YingHong 9 and Qingxin 1) using an Illumina HiSeq™ 2500 platform. Based on this data, potential candidate genes encoding oxidoreductase and methyltransferase were revealed. This study will provide useful gene resources for the investigations of PAs biosynthesis and genetic improvement of tea plant.

Quantification of Purine Alkaloids in Different Genotypes
Purine alkaloids determination showed that the total purine alkaloids content varied in the four genotypes, highest in YH9 (45.57 mg/g) and lowest in K11 (28.87 mg/g). The content of caffeine (11.33-42 mg/g) was relatively abundant in all genotypes, whereas the content of theobromine (2.46-4.56 mg/g) was low. On the other hand, the theacrine presented the greatest difference, with more than 27-fold higher in Kucha (13.16 mg/g in K6 and 14.94 mg/g in K11, respectively) compared with the conventional varieties (0.48 mg/g in YH9 and 0.37 mg/g in QX1, respectively) ( Fig. 1 and Table S1).
HQ clean reads, excluding rRNA, were mapped to the tea plant reference genome (cv. Shuchazao) . As a result, the total align ratio was ranked from 71.42% to 76.87%, with a uniquely mapped rate of 69.22% to 74.53%. We then assembled each transcriptome and compared with reference genome annotation, result showed that according to the 33,932 reference genes, a total of 30,369 known genes and 8,696 novel genes were identified in all samples (Table 1 and S2). Moreover, a total of 5,962 novel genes were shared by the four varieties, while 80 and 198 were uniquely expressed in Kucha and conventional genotypes, respectively (Fig. S1A). We further invested the functions of the unique novel gene in Kucha and found that only two annotated genes were assigned into molecular function and cellular component categories (Fig. S1B). According to the KEGG pathway analysis, those unique novel genes were enriched in valine, leucine and isoleucine biosynthesis and ether lipid metabolism pathways (Fig. S1C).
The pair-wise Pearson's correlation coefficients were computed to test the reproducibility of the sequencing data. As shown in Fig. 2a, the robust correlation values within each genotype indicated high repeatability between replicates. PCA analysis indicated that the major transcriptomic variation (86.8%) could be explained by the first two principal components (Fig. 2b). The PCA cleanly clustered the samples into four groups, with the assignment of QX1 similar to that of K6.

Identification of Differentially Expressed Genes (DEGs)
The DEGs were determined between Kucha and conventional varieties using a pairwise approach. In detail, 10,732 (4,298 unregulated and 6,434 downregulated) and 11,461 (5,077 upregulated and 6,384 downregulated) DEGs were identified in YH9 vs K6 and YH9 vs K11comparisons, respectively. The total number of DEGs in QX1 vs K6 and QX1 vs K11 comparisons were 6,032 (3,091 unregulated and 2,941 downregulated) and 9,290 (5,209 upregulated and 4,081 downregulated), respectively. Based on the analysis, the number of downregulated DGEs was higher than the upregulated DEGs in the comparisons of YH9 vs K6 and YH9 vs K11, but lower in QX1 vs K6 and QX1 vs K11 comparisons. Interestingly, the number of DEGs in QX1 vs K6 was far less than other comparisons (Fig. 3a). Moreover, we obtained 1,142 common DEGs from the four comparisons using a Venn diagram (Fig. 3b).

Functional Characterization of DEGs
To understand the functions of the obtained DEGs, GO subcategory and KEGG pathway analyses were performed. Among these DEGs, 6,606 annotated genes  d KEGG classification. The histogram represents the DEGs distribution into five major KEGG metabolic categories: metabolism; genetic information processing, environmental information processing, cellular processes and organismal systems were assigned into one or more GO terms under three domain categories (Fig. 3c). Within the biological process category, the top-enriched GO terms were metabolic process (3,110, 47.08%), cellular process (2,893, 43.79%) and single-organism process (2338, 35.39%). Within the molecular function, the terms related to catalytic activity (3,732, 56.49%) and binding (2,776, 42.02%) were the most enriched. For the cellular component, the dominant terms were membrane (2,342, 35.45%) and cell (2,105, 31.86%), followed by cell part (2,078, 31.46%). A total of 5,054 DEGs were distributed into 134 KEGG pathways of five main categories, among which the largest categories were metabolism and genetic information processing, followed by organismal systems, environmental information processing and cellular processes. (Fig. 3d). The most represented pathways were "Metabolic pathway" (1972, 39.02%), "Biosynthesis of secondary metabolites" (1298, 25.68%), "Plant-pathogen interaction" (467, 9.24%), "Biosynthesis of antibiotics" (440, 8.71%) and "Microbial metabolism in diverse environments" (362, 7.16%). Additionally, "Purine metabolism" pathway accounted for a considerable portion of the DEGs (125, 2.47%) (Table S3).

Putative Genes Involved in the Theacrine Biosynthesis
The putative pathway of theacrine biosynthesis was proposed according to the KEGG database (Fig. 4). In brief, the formation of theacrine is initiated by the assembly of xanthosine, which acts as methyl acceptor for caffeine biosynthesis, then conversion of caffeine to theacrine occurs by sequential oxidation and methylation steps with 1,3,7-trimethyluric acid acting as the intermediate. Based on this, we grouped genes that are related to theacrine accumulation into three categories: synthesis of xanthosine, synthesis of caffeine and synthesis of theacrine. A total of 38 genes related to enzymes of xanthosine synthesis were identified, including four genes encoding adenosine kinase (ADK), 12 genes encoding adenine phosphoribosyltransferase (ARPT), seven genes encoding AMP deaminase (AMPD), two genes encoding IMP dehydrogenase (IMPDH), 11 genes encoding 5′-Nase and two genes encoding guanine deaminase (GAD) ( Table S4). Expression pattern analysis showed that these genes were expressed quite differently among different genotypes (Fig. 5a). In particular, three genes assigned as common DEGs, including one ADK gene (TEA022322), one AMPD gene (TEA007189) and one 5′-Nase gene (XLOC_057370), were highly expressed in Kucha. Fig. 4 Theacrine biosynthesis pathway. The entire pathway was divided into three modules based on the formations of key intermediates and the final product: xanthosine synthesis pathway (grey background), caffeine synthesis pathway (blue background) and theacrine synthesis pathway (orange background). Abbreviation of enzymes: Anase, adenosine nucleosidase; ADK, adenosine kinase; ARPT, adenine phosphoribosyltransferase; AMPD, AMP deaminase; IMPDH, IMP dehydrogenase; 5′-Nase, 5′-nucleotidase; GAD, 7-NMT, 7-Methylxanthosine; guanine deaminase; MXN, methylxanthosine nucleotidase; TCS, caffeine synthase; CkTcS, theacrine synthase For the 13 caffeine synthase (TCS) genes involved in caffeine synthesis, TEA015791 and TEA028050 exhibited very high expression level in all experimental materials (FPKM > 700), while two common DEGs (TEA010054 and TEA022559) were highly expressed in Kucha ( Fig. 5b and Table S5). Although most of the five genes encoding S-adenosylmethionine synthetase (SAMS) in the methyl donor synthesis exhibited diverse expression patterns, none of them were identified in common DEGs ( Fig. 5c and Table S5).
In the case of the conversion of caffeine to theacrine, genes related to enzymes of oxidoreductase (OR) and methyltransferase (MT) might be bottlenecks, and it would require up-regulated in high theacrine content genotypes. Thus, we focused on common DEGs encoding the enzymes mentioned above. The data showed that 44 genes annotated as OR, of which 20 genes were highly expressed in Kucha ( Fig. 5d and Table S6). Meanwhile, a total of 21 genes annotated as MTs were identified and seven of them showed enhanced expression levels in Kucha ( Fig. 5e and Table S7). Furthermore, four of the above-mentioned up-regulated MTs were grouped into N-methyltransferase (TEA009098, TEA010054, TEA022559, XLOC_035717), two into O-methyltransferase (TEA030958, XLOC_047820) and one into C-methyltransferase (TEA023802).

Validation of Gene Expression Using qRT-PCR
To validate the quantification of the RNA-seq analysis in this study, eight genes including one IMPDH gene (TEA024175), three TCS genes (TEA010054, TEA022559 and TEA030024), two MT genes (TEA023802 and XLOC_047820) and two OR genes (TEA007843 and XLOC_026296) were selected for RT-qPCR analysis. As shown in Fig. 6, all the genes, except for TEA007843, shown a similar expression pattern in qRT-PCR analysis consistent with the RNA-seq results, indicating the reliability of our RNA-seq data.

Discussion
Kucha, a kind of special germplasm resource found in individual areas of China, is often used as traditional medicine by local resident (Wang et al. 2008). Though it is difficult to discern them from conventional varieties because of the morphological similarity, the remarkably bitter taste in both fresh leaves and made tea made Kucha very distinctive. The theacrine, in spite of showing varied content in different sources of Kucha, is defined as a characteristic substance and account for the bitterness in this unique germplasm (Ye et al. 1999;Li et al. 2012; Jin  Zhang et al. 2020ab). In recent years, theacrine has attracted increasing attention for its potential therapeutic applications in the pharmaceutical industry, and thus motivated research efforts to unveil the molecular basis underlying theacrine biosynthesis. For instance, two RNA-seq based studies have reported that MT genes were involved in the final step of theacrine formation (Wang et al. 2020a;Zhang et al. 2020b). However, oxidoreductase-encoding genes that catalyze the critical step before methylation have not been mentioned yet.
Guangdong Province, an important tea producing region in China, is rich in diverse tea germplasms. Earlier, we collected Kucha germplasms throughout this region and found that the first leaves of K6 and K11 showed the highest concentration of theacrine. Hence, we performed transcriptome analysis with interest in identifying possible genes involved in theacrine biosynthesis. A total of 671.61 million of HQ clean reads were generated in this study, which was adequate for accurate quantification of gene expression (Vijay et al. 2013). The alignment rate to the reference genome was quite comparable between Kucha and conventional varieties, and the percentage of annotated genes was similar to that recently reported in tea plant by Zhu et al. (2019) and Zhou et al. (2020), suggesting that those genes may have conserved functions among different tea plants and our project has captured the majority of the tea transcriptome. Comparative transcriptome profiling of Kucha and conventional varieties identified 19,095 DEGs and 1,142 of them was assigned as common DEGs. Surprisingly, the number of DEGs in QX1 vs K6 was nearly half that of other comparisons. Besides, the PCA analysis indicated that K6 was relatively close to QX1 rather than K11. Considering the cross-pollination nature of tea plant and the breeding process of those materials, the possible reason might be that tea plant has not undergone long-term artificial directional selection and frequent genetic exchange may occur between K6 and parental sources of QX1 due to the geographical proximity (Zhang et al. 2020a).
The current research on the metabolic pathway of theacrine in tea plant is relatively clean (Ashihara et al. , 2011Zheng et al. 2002). We divided the entire pathway into three modules, i.e. xanthosine synthesis pathway, caffeine synthesis pathway and theacrine synthesis pathway, according to the formations of key Fig. 6 qRT-PCR validation of eight selected genes putatively involved in theacrine biosynthesis. The normalized expression levels (FPKM) from the RNA-Seq results are indicated on the y-axis to the left and the relative qRT-PCR expression level is shown on the y-axis to the right intermediates and the final product (Fig. 4). In this paper, the expression pattern of xanthosine biosynthesis-related genes that encoding ADK, ARPT, AMPD, 5′-Nase and GAD were investigated. Considering the variation of total purine alkaloids content among the tested materials, we proposed that the dynamic expression level of these genes in different pathways may influence an adequate supply of xanthosine for the downstream pathway. In particular, IMPDH was previously reported to be an inhibitor of the enzyme which reduced the rate of caffeine biosynthesis in tea and coffee leaves (Keya et al. 2003). Our data showed that the expression pattern of an IMPDH gene (TEA024175) was consistent with the total purine alkaloids content among the four genotypes. TCS is the most important enzyme that catalyzed the methylation of xanthosine to caffeine (Kato et al. 2000;Jin et al. 2016Jin et al. , 2018. The RNA-seq data revealed high expression of two TCS genes (TEA015791 and TEA028050) (FPKM > 700) in all samples, implying their important and conserved roles in the formation of caffeine. SAM is the methyl donor for the three methylation steps in the caffeine biosynthetic pathway (Ashihara et al. 2008). However, the expression pattern of SAMS genes did not show a consistent trend with purine alkaloids in our study, it is possible that SAM may not be a principal factor in the control of caffeine biosynthesis in our four genotypes.
Caffeine converts to theacrine by oxidation at the C8 and methylation at N9 positions, and genes related to enzymes of OR and MT might control these steps (Zheng et al. 2002). Based on the annotation results, we screened 20 OR genes and seven MTs genes that were significantly enhanced in Kucha. A former RNA-seq analysis of Niedu Kucha from Jiangxi Province showed that an NMT gene (TEA024443) was highly expressed in high-theacrine sample and therefore was suggested to be involved in the final synthesis process of theacrine (Wang et al. 2020a). Unfortunately, this gene displayed consistent moderate expression levels in K6,K11 and QX1 (FPKM: 5.00,8.89,8.72,respectively), but relatively high in YH9 (FPKM: 14.92) in our study. Recently, the TCS gene (CkTcs) encoding N9-methyltransferase was proved to catalyze the last step of theacrine formation (Zhang et al. 2020ab). Among the four common DEGs encoding N-methyltransferase in the current research, two TCS genes (TEA010054 and TEA022559) that were highly expressed in Kucha seemed most likely to be involved in the last step leading to theacrine biosynthesis from 1,3,7-trimethyluric acid. Thus, further functional characterization of these genes and their substrate specificity assay is required to determine whether they are genotype-specific in catalyzing the rate-limiting steps of theacrine synthesis.

Plant Materials
Two Kucha genotypes (Kucha-6 (K6) and Kucha-11 (K11)) and two conventional varieties (Yinghong 9 (YH9) and Qingxin 1 (QX1)) with extremely contrasting theacrine content were used in this study. The Kucha genotypes were collected from native genetic resources in Guangdong Province, which were called "bitter tea" by local residents. And the conventional genotypes were the most emblematic varieties in this region. Moreover, QX1 was bred from local naturally growing populations, while YH9 was bred from exotic populations introduced from Yunnan Province. All the materials were grown in Germplasm Bank of Tea Plant in Guangdong Province (Yingde,Guangdong,China,24°18′N,113°23′E), and details on agro-techniques (fertilization and irrigation) are available upon request. The first leaves of each genotype were harvested on May 2018. Part of the materials were frozen immediately in liquid nitrogen and stored an -80 °C for RNA isolation. The remaining samples were subject to purine alkaloids (caffeine, theobromine and theacrine) analysis.

Extraction and HPLC Determination of Purine Alkaloids
The harvested leaves were freeze-dried and ground finely. Purine alkaloids was extracted with H 2 O (1/100, w/v) in a water bath at 100 °C for 30 min and centrifuged at 5000 × g for 5 min. Then the obtained supernatant was metered and filtered through a 0.22 μm membrane prior to HPLC analysis. Purine alkaloids were determined using Agilent 1200 HPLC system (Agilent Technologies, USA). The HPLC was made to run with a column temperature of 35 °C, a wavelength of 231 nm, an injection volume of 10 μL, and flow rate of 1.0 mL/ min. 5 μm Zorbax SB-C18 column (Agilent) with solvent A (0.1% formic acid) and solvent B (acetonitrile) used as mobile phase. The gradient programs performed as follows: 0-15 min, linear gradient from 4 to 6% A; 15-30 min, linear gradient from 6 to 12% A; 30-55 min, linear gradient from 12 to 18% A; 55-58 min, linear gradient from 18 to 4% A. All samples were analyzed using three biological replicates.

RNA Extraction, Library Preparation and Sequencing
Total RNA was extracted using a TRIzol Kit following the manufacturer's protocol (Promega, Beijing, China). DNase I (TaKaRa, Dalian, China) was used to remove residual gDNA. RNA quality was assessed on an Agilent 2100 Bioanalyzer (Agilent Tech-nologies, USA). RNA samples with 260 nm/ 280 nm ratios between 1.8 and 2.2 were used for further analysis. Next, mRNA was enriched by Oligo (dT) beads, and sequencing libraries were created using an Illumina Truseq™ RNA Sample Prep Kit (Illumina, SD, USA) according to the standard high-throughput protocol. In total, 12 libraries (i.e., four genotypes, with three biological replicates) were sequenced on the Illumina HiSeq™ 2500 platform by Genedenovo Biotechnology Co., Ltd (Guangzhou, China).

Sequencing Data Processing and Read Mapping
The original reads were filtered with the following parameters: 1) reads containing adaptors, 2) reads with N (unknown base) ratio more than 10%, 3) reads containing more than 50% of low quality (Q-value ≤ 20) bases (Jiang et al. 2019). Then, the obtained clean reads were aligned to rRNA database to remove the remaining rRNA reads using Bowtie2 (Langmead et al. 2012). The high quality clean reads were mapped to reference genome ) by TopHat2 (version 2.0.3.12) (Kim et al. 2013) with maximum read mismatch of 2, distance between mate-pair reads of 50 bp and error of distance between mate-pair reads of ± 80 bp. We adopted Cufflinks software to perform reconstruction of transcripts and the reference annotation based transcripts (RABT) program was preferred (Trapnell et al. 2012). Genes with classcode "u" (the transcripts were unknown or in the intergenic spacer region) with transcript length longer than 200 bp and exon number more than 2 were defined as reliable novel genes. Novel genes were further annotated through alignment with the Nr and KEGG databases. The accession numbers of the transcriptome data were GSE163231 and SRP271859, which were deposited in a public functional genomics database, the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) database.

Bioinformatic Analyses of Transcriptome Data
Gene abundances were quantified by software RSEM (Li et al. 2011). The gene expression level was normalized by using FPKM (Fragments Per Kilobase of transcript per Million mapped reads) method, which is able to eliminate the influence of different gene lengths and sequencing data amount on the calculation of gene expression. Person correlation coefficient and PCA on all samples were performed with R package gmodels (http://www.r-proje ct.org/) in this experience.

Quantitative Real-Time PCR (qRT-PCR) Validation
The qRT-PCR was performed to validate the transcriptomic data from the extracted total RAN. Eight genes related to theacrine synthesis were selected and analyzed. The corresponding primers were designed with Primer Premier 5.0 and listed in Table S8. The cDNA was synthesized using the PrimeScript™ RT Master Mix kit (Takara, Dalian, China). PCR amplification was performed using Applied Biosystems 7500 Real-Time PCR System according to the manufacturer's instructions. We chose GAPDH gene as the endogenous control (Lin et al. 2017). All reactions were run in three technical and three biological replicates. Statistical analysis was performed using the 2 −△△CT method (Livak et al. 2001).