Genome-wide transcriptome analysis of triterpene biosynthetic genes of Anoectochilus roxburghii plant

Background: Anoectochilus roxburghii is a medicinal plant and contains a variety of bioactive components, including triterpene, which exhibits important pharmacological properties with low toxicity. However, little is known about the biosynthetic pathway of triterpene or about the genome and transcriptome in A. roxburghii. Results: In order to analyze transcriptional determinants related to the biosynthesis of the bioactive components, we performed transcriptome sequencing in A. roxburghii (SRX1818644, SRX1818642 and SRX1818641) and annotated the sequences from three samples. In total, 137,679,059 clean reads were obtained, corresponding to 12.20 Gb of total nucleotides. They were then assembled into 86,382 contigs and 68,938 unigenes, which were further annotated according to sequence similarity with known genes in COG, EST, Nr, Pfam and Uniprot databases, leading to 10,04029,44239,55134,991 and 28,082 unigenes, respectively. GO analysis classified all unigenes into three functional categories, i.e. biological processes (43,206 unigenes in 22 categories), molecular functions (46,978 unigenes in 15 categories) and cellular components (20,951 unigenes in 18 categories). Candidate triterpenes biosynthetic genes ArHMGR1 in MEV pathway, ArDXS1, ArDXS4 ArDXS5, ArDXS8-10, ArDXR1-2 and ArHDR1-2 in MEP pathway and ArFDS1, ArSM and ArOCS were selected based on RNA-seq and gene-to-metabolites correlation analysis. Conclusion: The transcriptomes of A. roxburghii plant include 86,382 contigs and 68,938 unigenes. The assembled dataset allowed identification of genes encoding enzymes in the biosynthesis of bioactive components in A. roxburghii plant. Candidate genes that encode enzymes being important in triterpenes biosynthetic pathway were selected.

Arabidopsis thaliana [11]. As illustrated in Figure 1A, FDP is used to produce squalene, which is further converted by squalene monoxygenase (SM) to 2,3oxidosqualene, a common precursor of triterpenoid biosynthesis ( Figure 1A).
Although triterpenes are widely distributed in plant kingdom and known for their pharmaceutical benefits, understanding of their biosynthesis genes in Anoectochilus plants is still limited. Thus, transcriptome sequencing of Anoectochilus plant was carried out to reveal enzymes related to triterpenoids biosynthesis.

Results
Three cDNA libraries generated from RNAs of the whole A. roxburghii plants were subjected to paired-end sequencing on an Illumina platform. A de novo assembly strategy was used to construct A. roxburghii transcripts. In total, 165,643,100 raw reads with nucleotides of 15.43 Gb were generated from the sequencing platform (Table 1). After stringent quality check and data cleaning, 137,679,059 quality reads were obtained. The number of clean reads is 83.12% of raw reads ( Table 1).
The average length of the clean reads is 95.21 bp (Table 1) with a high Q20 value of 6 99.24%. All clean reads were assembled into 86,382 contigs, corresponding to 68,938 unigenes using Trinity. 74.23 % of the clean reads could be matched to assembled unigenes, and 87.13% of these clean reads are unique and match to unigenes (Table 1). BUSCO tool [14,15] was used to assess the quality and completeness of A. roxburghii plant transcriptome, the results showed that 1008 transcripts (70 %) of 1440 BUSCO genes were complete, among these complete genes, 978 were single copy and 30 were duplicated, 144 (10 %) were fragmented and 288 (20 %) were missing. Further analysis showed that the average length of the contigs and unigenes is 421 bp and 404 bp separately, and N50 lengths of contigs and unigenes are 464 bp and 434 bp, respectively (Table 1). All contigs and unigenes are longer than 200 bp, among those, 4012 contigs and 2822 unigenes are between 1000 and 2000 bp, and 245 contigs and 160 unigenes are longer than 2500 bp (supplementary Figure 1s). A. roxburghii transcriptome is 27874926 bp (around 27Mb) with a GC content of 46.81 %, which is higher than that of Arabidopsis (42.5 %), soybean (40.9 %) and chickpea (40.3 %) but lower than that of rice (55 %) [16].
Function annotations All unigenes were aligned to five public protein and gene databases, i.e. COG EST Nr Pfam and Uniprot. Blastx was performed against these databases, and proteins and genes sharing highest similarities (E≤10 -5 ) to query unigenes were adopted to annotate unigenes. Genes related to secondary metabolites biosynthesis, nucleic acid metabolic process, protein metabolic process and cellular metabolic process were annotated by GO. Cellular function group includes 15 categories in which catalytic processes hold the highest number of 46,978 unigenes. Cellular components can be divided into 18 categories, where cell part has the most unigenes of 20,951 ( Figure 2).
These GO annotations provide valuable information to investigate specific processes, molecular functions and cellular structures of A. roxburghii transcriptome.

Triterpenoid accumulation
Triterpene accumulation was analyzed using oleanolic acid as a general evaluation index. triterpenoid content was determined by test oleanolic acid using HPLC.

Relative expression level of triterpene biosynthetic genes in A. roxburghii
RT-qPCR was used to analyze relative expression of putative triterpene biosynthetic genes, including ArHMGR in MEV pathway, ArDXS , ArDXR and ArHDR in MEP pathway and ArFDS , ArSS and ArOCS of triterpene biosynthesis, respectively. All tested genes expression showed significant difference (P≤0.05) or extremely significant difference (P≤0.01) in root, stem and leaf.

Relative expressions of HMGR, DXS, DXR and HDR
HMGR is a rate-limiting enzyme in MVA pathway [9]. In A. roxburghii, two unigenes were annotated as ArHMGR1-2 (Supplementary 1, Supplementary Table S1-2). As shown in Figure 1B, ArHMGR1 showed highest expression in root, moderate expression in stem and scarcely expressed in leaf, while ArHMGR2 showed different expression pattern ( Figure 1B). Variable correlation coefficients between ArHMGR1-2 expression and triterpene accumulation are 0.877 and 0.051 respectively ( Table   2). The data suggest that, ArHMGR1 may play roles in triterpene biosynthesis as its relative expression level correlates with triterpene accumulation, while ArHMGR 2 may be irrelevant with triterpene biosynthesis.
High plants retain both MVA and MEP pathways [17]. They have similar functions in ginsenoside biosynthesis in ginseng root but MEP expressed a much higher level than MEV in ginseng Leaf [18]. To understand whether MEP pathway has any influence on triterpene biosynthesis in A. roxburghii, DXS, DXR and HDR in MEP pathway were analyzed. There are ten unigenes annotated as ArDXS1-10. Among them, ArDXS 1, ArDXS4 ArDXS5 and ArDXS10 were highly expressed in root, but scarcely expressed in stem and Leaf. ( Figure 1B). The relative expression levels of  Table   2). These high correlation coefficients suggest a strong relationship of triterpene accumulation with all the six ArDXS genes, and these ArDXSs are candidate genes in MEP pathway for triterpene biosynthesis. The levels of the remaining four genes  (Table 2) 10 DXR and HDR are also rate-limiting enzymes in MEP pathway in plant, but their roles vary with plants and conditions [10]. In this study, two unigenes were annotated as ArDXR1 and ArDXR2 and another two unigenes as ArHDR1 and ArHDR2 (Supplementary 1, Supplementary Table S1-2). ArDXR1-2 and ArHDR1-2 showed relative higher expression in root than in stem and leaf. Variable correlation coefficients between triterpene accumulation and the individual gene expression of ArDXR1-2 and ArHDR1-2 are 0.941, 0.944, 0.944 and 0.861 respectively, suggesting that ArDXR1-2 and ArHDR1-2 are positively correlated with triterpene accumulation.
Therefore, we believe that ArDXR1-2 and ArHDR1-2 are candidate genes contributing in triterpene biosynthesis.

Relative expression of FDS, SM and OSC
Farnesyl diphosphate synthase (FDS) catalyzes the formation of FDP ( Figure 1A Cyclization of 2,3-oxidosqulene is the committed step of triterpene biosynthesis to produce different triterpene skeletons by OSC [19]. In this study, one gene was annotated as ArSM (Supplementary 1, Supplementary Table S1-2), which showed highest expression in root, moderate expression in stem and lowest expression in leaf. The correlation coefficient between ArSM expression levels and triterpene contents was up to 0.955 (Table 2), suggesting that ArSM plays a role in the accumulation of triterpene and is a candidate gene catalyzing the formation of 2,3oxidosqulene.
Three genes were annotated as ArOCS1-3 ( Figure 1B, Supplementary 1, Supplementary Table S1-2). Although ArOCS1 showed higher relative expression in root, moderate in stem and lower in leaf, its expression level is significantly correlated with triterpene content. Therefore, we believe ArOCS1 is a candidate enzyme that catalyzes triterpene skeletons formation.

Discussion
RNA sequencing, is attractive for study of non-model medicinal pant without genomic sequence information [20]. With this technology, 15.43 Gb of raw reads of A. roxburghii transcriptome were obtained. Trinity achieved the best overall metric score of 95.9 comparing with the other assembly tools, such as SOAPdenovo-Trans and Trans-ABySS, and performed generally well in constructing full-length transcripts [21]. 86

Gene-to-metabolites correlation analysis
Triterpene and triterpene saponin are mainly accumulated in root in some plants roxburghii. The plant were grown at 25 ± 2 o C and with 80-100 % humidity for about three months to a height of around 10 cm. The whole plant was harvested and frozen in liquid nitrogen, then grounded into fine powder for RNA extraction.

RNA extraction and RNA-seq
Total RNA was extracted from A. roxburghii plant using Purelink TM Plant RNA Reagent (Invitrogen) according to the manufacture's instruction. A pool from five A. roxburghii plants was used as one sample. Each sample has triplicates. Genomic DNA was removed by the treatment with DNase I (Fermentas). RNA integrity was checked on 1 % agarose gel, and RNA concentration and purity were determined using NanoDrop 2000C spectrophotometer (Thermo Scientific, USA). mRNA was purified from total RNA using oligo (dT) magnetic beads and fragmented into small pieces in fragmentation at 37 o C for 10 min. First stranded cDNA was synthesized using Superscript II (Promega, USA) and random primers. The second stranded cDNA synthesis was subsequently performed using DNA polymerase I and RNaseH

Analysis of Illumina sequencing results
Raw reads generated from sequencing machine were filtered to obtain clean reads by removing the adapter sequences, repetitive, redundant and low-quality sequences from the raw reads. Predicted errors of raw reads were corrected using Rcorrector under default settings [52] and adapter sequences of all raw reads were removed using ILLUMINACLIP: adapters/TruSeq3-PE-2.fa:2:30. The quality of the trimmed and filtered reads were assessed using FastQC v0.10.1, low-quality reads with more than 40% of quality value ≤ 20 bases or the reads with unknown nucleotide percentage > 4% using NGS QC Toolkit v2.3. Clean read quality was evaluate with quality score (Q score) higher than 20 (Q20), the higher percentage of Q20 clean reads means the better quality. The data was uploaded to NCBI SRA (ID: SRP076090).
Clean reads were assembled into contigs with Inchworm with parameters of k value was 25 with step size 5 [53, 54]. The assembled contigs were clustered using Chrysalis [54], all overlapped contigs were connected into components using a de Bruijn graph approach. In a final step, butterfly simplifies all the generated graphs with parameters 'min kmer cov 2 -min contig length 100 -run' to generate unigenes [54]. Clean reads were aligned to assembled unigenes using SOAP2 [55], duplicated reads and multi-mapped reads were filtered from the alignment results to eliminate the PCR interference and ambiguous mapping. All unigenes were aligned by BLASTx database [57,58].The predicted protein sequences were submitted to the KEGG (http://www.genome.jp/tools/ kaas/) method. KEGG pathway annotation was performed by BLAST analysis against automatic annotation server (KAAS) KAAS (http://www.genome.jp/tools/ kaas/) with thebi-directional best-hit (BBH), Arabidopsis thaliana was used as reference species [59].

Analysis candidate triterpenes biosynthetic genes
The candidate triterpenes biosynthetic genes were obtained based on KEGG pathways and functional annotations. Unigenes involved in triterpene biosynthesis were aligned with plant protein sequences from public databases protein databases: Non-redundant (Nr) protein database, the Swiss-Prot protein database, the Kyoto Triterpene accumulation and QPCR data in this version collected in August, 2019.

Quantitative analysis of triterpene
Oleanolic acid was used as evaluation index for quantity analysis of triterpene.
Quantitative analysis of oleanolic acid was carried out according to the literature [60] with minor modifications. 0.5 g of fresh leaves was homogenized in 5 mL methanol and treated with ultrasonic waves for 90 min (60W, 4 s/5 s). After extraction, the sample was filtered through a 0.45 µm filter membrane before being injection for analysis. The quantitation analysis was done by means of standard curves. HPLC analyses were performed using a Waters™ 2695 HPLC pump separation module (Milford, MA, USA) equipped with an auto-injector system (100 µL). The mobile phase was acetonitrile and 0.5% acetic acid as a ratio of 85:15.
Flow rate was set to 0.8 mL/min. Agilent TC-C18 column (250 mm×4.6 mm,5 μm) was used and the column temperature maintained at 25℃. The effluent was 19 measured at a wavelength of 210 nm for the detection of ursolic acid. Acid content was calculated using a standard equation y = 71 820x -32 554, R2 = 0.9995, x was the concentration of extract while y was the absorbance of triterpenoids, linear range from 10 -400 μg / ml. Triterpene content was calculated according to the formula C= 5x/0.5, C was triterpene content. Each sample has three biological repeats and each biological repeat was analyzed in triplicates.

Quantitative real-time PCR analysis
Total RNA extraction and first DNA synthesis and quantitative real-time PCR (qPCR) was performed according to published protocols [61,62]. Each sample has three independent biological repeats, and technical triplets and a negative control of each sample were run. qPCR was performed using specific primers (Supplementary Table   S1-2), the cycle threshold (CT) values were used for further analysis. The primer efficiency of each amplicon was calculated by the program Linreg v. 12.1 based on the log linear phase of the amplification curve [63]. The Best-Keeper software [64] was used to search for stable reference genes among all genes tested. Based on Best-Keeper and related report, β-actin (Genbank ID: JF825425) and β-elongation factor (Genbank ID: JF825420.1) genes were used as reference [65], The relative expression level of triterpene biosynthetic genes in root, stem and leaf were compared with reference genes using the 2 -△△Ct method [66], relative expression of reference genes was set as 1.0. Expression profile of candidate triterpene biosynthetic genes in root, stem and leaf were shown as heatmap using the TBTools software (https://github.com/CJ-Chen/TBtools/releases).

Gene-to-metabolite correlation analysis
The expression of triterpene biosynthetic genes and the triterpene accumulation in different tissues were correlatively analyzed using Pearson's correlation analysis in 20 IBM SPSS Statistics 24 software (IBM, New York, USA). Gene-to-metabolite correlation was constructed to identify triterpene biosynthetic genes in A. roxburghii plant.

Statistics analysis
Data are presented as mean ± standard error of biological triplicates. Statistical significance was determined by using one-way ANOVA in IBM SPSS Statistics 24 software (IBM, New York, USA).

Consent for publication
The authors have full consent to publish. There are no person's data in the manuscript to report.

Availability of data and materials
De novo transcriptome sequencing result was deposited in NCBI SRA with ID of SRP076090. Sequences of genes encoding enzymes of triterpene biosynthesis were provided in the supplementary part.

Complemeting of interests
All authors declare there are no conflict of interests

Additional file 1:
Genes encoding enzymes that participate pentacyclic biosynthesis of A. roxburghii plant.