Transcriptome Assembly and Gene Expression Analysis of Leaf and Rhizome tissues of Dioscorea opposita


 Background: Dioscorea opposita is a monocotyledonous plant in the Dioscoreaceae family. The rhizome of D. opposita, known as “dioscoreae rhiaoma” or “Chinese yam”, is not only a well-known traditional Chinese medicine, but also a popular health food.Methods:In this study, the leaf and rhizome tissues of D. opposita were collected. Their transcriptome sequenceand were obtained using an Illumina Hiseq 4000 platform. The data was analyzed by bioinformatics software. Likewise, the consistent expression pattern of DEGs was validated with qRT-PCR analysis.Results: A total of 46,982,031 and 53,689,620 clean reads from leaf and rhizome tissues were acquired, respectively (SRP: PRJNA628588). 50,765 unigenes were obtained with an average length of 1050 base pairs (bp). A total of 13,259 significant differentially expressed genes (DEGs) were obtained from leaf versus rhizome tissues, including 5536 up-regulated and 7723 down-regulated unigenes. In addition, 20,607 simple sequence repeats (SSRs) were identified in this study. Conclusions:This study not only enriched the transcriptomic data of D. opposita, but also laid a foundation for the analysis of terpenoid synthesis pathway.

RNA-Seq is an efficient method for transcriptome analysis, which can also provide a highly sensitive and accurate platform at a low cost and is efficient for gene discovery and biosynthesis pathway analysis in nonmodel plants without knowing their genome sequences [14][15][16][17]. RNA-Seq is currently a very popular method that is applied in the exploration of plants at transcriptomic levels. The ultimate aim is often to reveal genes that are involved in the biosynthetic pathways in secondary metabolism of many species [18][19][20][21]. Until now, the genomic resources of D. alata [22], D. rotundata [23], and D. dumetorum [24] have been reported. Nevertheless, as a non-model species, there is lack of enomic resources, in particular, information on active substance metabolism of D. opposita.
In the study, the sample of leaf and rhizome tissues of D. opposita were collected from Wenxian County, called "Tie Gui Shan Yao". It is a local cultivar of D. opposita, belonging to one of the "four huaiqing Chinese medicines". There have reported that the content of terpenoids in"Tie Gun Shan Yao" was higher than other D. opposita [25].
In this study, transcriptome analysis was performed from leaf and rhizome tissues of D. opposita. The results will enrich the known genetic information for further gene function research and analyses, such as differential transcript abundance analysis. We also try to understand the transcript abundance difference and explore the molecular mechanisms of biosynthetic pathways in secondary metabolites in D. opposita.

Plant materials and RNA extraction
Leaf and rhizome tissues of D. opposita were collected from Wenxian County, Jiaozuo City, Henan Province of China, on October 2017Wenxian County Weather Station N34°56′49.54″, E113°10′18.54″; T:15℃. Three biological replicates from three leaf tissues (HSYY_1, HSYY_2, and HSYY_3) and three rhizome tissues (HSYK_1, HSYK_2, and HSYK_3) were used for the analysis. All samples were flash frozen in liquid nitrogen after cleaning and stored at -80 °C until RNA extraction.
Total RNAs were extracted from the leaf and rhizome tissues using the TRIzol® Reagent (Invitrogen, USA) following the product description with slight improvement [26,27]. The extracted RNAs were checked through NanoDrop™2000 spectrophotometer (NanoDrop Technologies, USA) and RNA integrity was assessed using electrophoresis in 1.0% agarose gel. Furthermore, the RNA concentration was determined by the RNA Agilent Bioanalyzer 2100 system (Agilent, CA, USA) to guarantee the RNA Integrity Number (RIN) values were not less than 7.0.

Transcriptome sequencing and de novo assembly
The cDNA libraries were constructed using the TruSeq-TM RNA sample preparation Kit (Illumina, San Diego, CA, USA) according to the manufacturer's instructions. The cDNA libraries were enriched through a polymerase chain reaction (PCR) with 15 cycles. Then, the target band was extracted and recycled by a 2% low range ultraagarose gel, and quantified by TBS380 Mini-Fluorometer. The suitable cDNA libraries were sent to Shanghai Majorbio Biopharm Biotechnology Co., Ltd. (Shanghai, China) where the RNA-Seq was performed using the Illumina HiSeq4000 SBS Kit (300 cycles, Illumina, San Diego, CA, USA) on an Illumina HiSeq4000 platform.
The raw reads were cleaned and assessed following the steps. Firstly, the adaptor sequence of reads was removed, and discarded the reads that have no inserted fragments due to the reason of adaptor self-connection; Secondly, low quality bases (quality value less than 20) were trimed off at the end of the sequence (3 '). If there are still bases with quality value less than 10 in the remaining reads, the whole reads are eliminated, otherwise it is retained.Thirdly, reads with "N" bases was also removed. Lastly, the reads less than 100bp in length were discarded after removing adapter and quality pruning. All the above mentioned processes were performed carried out using the SeqPrep (https://github.com/jstjohn/SeqPrep) and Sickle (Version 1.33) software [28]. The high quality clean reads of leaves and rhizomes of D. opposita were de novo assembled by Trinity (Version v2.8.5) software with operation parameters of k-mer size of 25, 27, 30, 31and 32(max normalize read 50 Min contig length 200). The assembly results were assessed based on the assembly size, number of contigs, N50 and complete BUSCOs. Then, 2x150bp reads were generated and filtered, and more than 200 bp contigs were produced. At last, assembly completeness and quality also were assessed with Benchmarking Universal Single-Copy Orthologs (BUSCO, version 3.0.2) and TransRate package (version 1.0.3) [29,30].
The functional classification of these unigenes was performed using Blast2GO and KEGG. The GO analysis was categorized into biological processes, molecular functions, and cellular component [34]. The pathway distributions were determined based on the KEGG database [35]. Unigenes were also aligned to the Evolutionary Genealogy of Genes: Non-Supervised Orthologous Groups (eggNOG) database for further functional prediction and classification [36].

Differential transcript abundance analysis
The clean reads were used to determine alignment counts and to quantify transcript abundance by the RSEM (RNA-seq by Expectation Maximizattion, http://deweylab.github.io/RSEM/) package [37]. The unigene and transcript abundance levels were analyzed by TPM (transcripts per million reads) [38].
The software DESeq2 (Version 1.24.0, http://bioconductor.org/packages/ stats/bioc/DESeq2/) was used to analyze the differentially expressed genes (DEGs) between leaf and rhizome tissues of D. opposita. The false discovery rate (FDR) can be used to assess p-value thresholds in multiple testing [39,40]. The significance of DEGs was defined with a threshold of FDR < 0.05 and an absolute value of log2-fold change of ≥2.0 (|log2fc|≥1.0) [41]. The DEGs enrichment analyses were carried out for GO and KEGG terms by the GOatools and KOBAS software [42,43].

Expression analysis of terpene biosynthetic related DEGs
We selected 2 DEGs, DN39197_c0_g1 and DN37057_c0_g1, which were the key enzyme genes in the terpene synthesis from D. opposita, to validate their expression using quantitative real-time PCR (qRT-PCR). Total RNAs from leaf and rhizome tissues of D. opposita were extracted through the same procedures as used for RNA-seq in the study. The RNA from each sample was reverse transcribed using a reverse transcription kit (PC17-TRUE script, Aidlab Biotechnologies Co., Ltd, China). The DoActin gene was used as an internal control. Primers for the 2 DEGs and DoActin were designed in the Primer Premier 5.0 software. All the primers sequences are listed in Table 1. The reaction system (10 µL) included 5 µL of SYBR Premix Ex TaqTM, 4 µL of cDNA template, and 0.5 µL of each primer (10 µM). The qRT-PCR reactions were conducted on a Roche Light Cycle 96 real-time PCR system (Roche, USA) with SYBR green fluorescent dye (TaKaRa, China). PCR amplification was initiated at 95 °C for 30 s, followed by 40 cycles of 95 °C for 5 s, 60 °C for 30 s, and 72 °C for 30 s. Three replicates were set for the experiment. Relative transcript abundance levels were calculated by the 2-ΔΔCt method [45]. Table 1 Primers for qRT-PCR of 2 DEGs and DoActin  Table 2.

Functional annotation of D. opposita unigenes
The functional annotation results indicated that a total of 24,840 unigenes were annotated (Table 3). Among them, different unigenes were matched in different databases, for example, 23,848 (46.98%) unigenes were matched in the NR database.In addition, 18,531 (36.50%) unigenes were matched in the Swiss-Prot database. Table 3 The . Long sequences of unigens were clearly more likely to be annotated than those with short sequences. A total of 7151 unigenes measuring over 2000 bp in length were annotated, and only 371 unigenes were unannotated.

Functional classification results of unigenes
To further evaluate the annotated unigene data, the unigene sequences were searched against the cluster of orthologous groups of proteins (COG) databases. In total, 23,139 unigenes assigned to 23 COG categories under the following types: information storage and processing, cellular processes and signaling, metabolism, and poorly characterized (Fig. 1 (Fig. S1).
In order to perform an in-depth analysis of unigene function based on the GO database,all of the three categories (BP, MF, and CC) were subdivided into level 3 and level 4 terms in order to determine their function. For example, cellular process is the main level 2 term from the BP category. In the level 3 terms, cellular process was assigned to the following specific functions: cellular metabolic process (GO: 0044237; 5615 unigenes; 28.06%), cellular component organization (GO:0016043; 1147 unigenes; 5.73%), cellular response to stimulus (GO:0051716; 555 unigenes; 2.78%), and signal transduction (GO:0007165; 526 unigenes; 2.63%), etc. (Fig. S2).

Expression analysis of unigenes from leaf and rhizome tissues
In order to explore the differential expression in leaf and rhizome tissues of D. opposita, the clean reads from the leaf and rhizome library were compared with the unigene database by using the RNA-seq by RSEM package. In this study, the sample correlation analysis was performed using RSEM software (Tables S1). A high correlation coefficient was obtained among the three biological replicates of each tissue. The correlation coefficients of rhizome_1 (HSYK_1) with rhizome_2 (HSYK_2) and rhizome_3 (HSYK_3) were 0.935 and 0.913, respectively. The correlation coefficients of leaf_1 (HSYY_1) with leaf_2 (HSYY_2) and leaf_3 (HSYY_3) were 0.919 and 0.935, respectively. The correlation coefficient between two different of leaf and rhizome tissues was significantly lower than that among the three biological replicates of each leaf and rhizome tissue. The results indicated that all three biological replicates of each tissue are all clustered together, thus suggesting that the sequencing data are suitable for further biological analysis (Fig. S3).
Principal component analysis (PCA) of the expression levels of total transcriptome (Fig. S5), revealed that the leaf (HSYY_1, HSYY_2, HSYY_3) and rhizome (HSYK_1, HSYK_2, HSYK_3) treatment groups could be obviously distinguished. In other words, the three biological replicates of each sampling group clustered well together. Hence, the sequencing data can be used for further analysis.

Identification and functional enrichment analysis of DEGs
To identify the differentially expressed genes (DEGs), we compared the expression unigenes in leaf and rhizome tissues based on the false discovery rate (FDR) through the DESeq2 software. The results showed that 13,259 significant DEGs (FDR< 0.05 and |log2FC|≥1) were identified in the leaf versus rhizome tissues, which included 5536 up-regulated unigenes and 7723 down-regulated unigenes (Fig. 3, Table S2).
To thoroughly understand the functions and interactions of DEGs between leaf and rhizome tissues, all DEGs were subjected to KEGG pathway enrichment analysis by mapping against the KEGG database using KEGG orthology based annotation system (KOBAS) software. The results indicated that 13,259 DEGs were involved in 144 metabolic pathways. There 23 pathways were enriched based on the p-value _uncorrected< 0.05 (Table S3).
The top 20 enriched pathways with a p-value of < 0.05 are presented in Fig. 5

Identification of unigenes involved in terpene biosynthesis
Comparison to the unigenes with the KEGG pathway database, indicated, a total of 174 unigenes related to terpene biosynthesis (Table S1), including terpenoid backbone biosynthesis (61), monoterpenoid biosynthesis (37), diterpenoid biosynthesis (19), sesquiterpenoid and triterpenoid biosynthesis (8), biosynthesis of terpenoids and steroids (3), and ubiquinone and other terpenoid-quinone biosynthesis (43). Many unigenes were involved in the pathway of terpenoid backbone biosynthesis, which belongs to the universal metabolic pathway map00900. Among these unigenes, DN2456_c0_g1 (DXS, K01662) is the first enzyme in the MEP pathway. DXS catalyze Dglyceraldehyde 3-phosphate (GA-3P) and pyruvate to form 1-deoxy-D-xylulose-5-phosphate (DOXP). The DXS gene is significantly less expressed in the rhizome than in the leaf. DN26885_c0_g1 (HMGR, K00021) is the main enzyme in the MVA pathway. It catalyzes 3-hydroxy-3-ethylglutaryl CoA (HMG-CoA) to form MVA. The expression of HMGR gene shows no difference in the leaf and rhizome tissues of D. opposita.

qRT-PCR validations of DEGs related to terpene biosynthesis
To further analyze the consistency of RNA-seq data in this study, the DN39197_c0_g1 and DN37057_c0_g12 genes involved in terpene biosynthesis were selected for qRT-PCR validation. The results showed that a similar expression pattern in qRT-PCR analysis was observed in RNA-seq data from the leaf and rhizome tissues of D. opposita (Fig. 6). Further analysis showed that DXR (DN4169_c0_g1, K00099) and GGPS (DN17971_c0_g1, K13789) were the rate-limiting enzymes in the MVP and MEP pathways, which are the basic processes of a terpene synthesis pathway.

Discussion
In this study, six transcriptome sequencing analyses of RNA from leaf (HSYY_1, HSYY_2, HSYY_3) and rhizome (HSYK_1, HSYK_2, HSYK_3) tissues of D. opposita were conducted using the Illumina platform. More than five million raw reads and four million clean reads were obtained from life and rhizome library, respectively, which indicated that there more reads were assembled in leaf than in rhizome. Based on the large number of clean reads, 50,765 unigenes were assembled after evaluating the assembly results, of which, 46.98% unigenes were annotated by BLASTX analysis using the publicly available protein database. Otherwise, there are 49.5% unigenes were annotated from the D. alata [22]. This phenomenon shows that genomic data of Dioscorea species is relatively scarce in publicly database. There only D. rotundata [23], and D. dumetorum [24] genomic sequences were reported. The N50 is one of the most important indices used to evaluate the assembly quality. In the study, the N50 length of unigenes was 1825 bp, indicating that the sequence assembly was high in quality and suitable for intensive research. Moreover, the N50 could be compared with those of other species [46][47][48]. The average length of the unigenes was larger (1050 bp) ( Table 2), than those in previous studies, such as those of Myrica rubra (531 bp) [49] and Sorbus pohuashanensis (770 bp) [50].
In this study, 24,840 and 25,925 unigenes were annotated and unannotated, respectively, in the NR, KEGG, COG, GO and Pfam database. However, the proportion of annotated long sequences was significantly higher than that of annotated short sequences. Similar results have been reported for other species [17,51].
The identified unigenes were compared among different Dioscorea species. A total of 5575 unigenes were matched with 315 Dioscorea species. Unigenes with the highese similarity were found in D. polystachya (16.41%), followed by D. bulbifera (5.70%), D. alatar (4.77%), and D. rotundata (3.28%), this result may indicate the evolutionary relationship among these species. The BLAST results (e-value 0.00001) of these Dioscorea species confirmed the correctness of transcriptome assembly and analysis in the present study. However, 296 Dioscorea species were matched to less than 100 unigenes of D. opposita. Although a small number of genes from Dioscorea species are available in several public database, complete transcriptome or genomic data are scarce. Indeed, to date, only the genomic resources for D. alata [22], D. rotundata [23], and D. dumetorum [24] have been reported. Therefore, enhancing research efforts on the genome of Dioscorea species is important and the main focus of the presenr study.
Most D. alata tubers have white flesh, and purple-flesh tubers with high anthocyanidin content are produced by spontaneous variation. In the present study, elite purple-flesh cultivar (DP) and conventional white-flesh cultivar (DW) of D. alata were performed using Illumina sequencing [22]. The transcriptomes were compared with the two species of D. alata [22], and we compared with two tissues of D. opposita. A total of 49.5% (60,02/125,123) [22] and 48.93% (24,840/ 50,765) unigenes were annotated in D. alata and D. opposita, respectively, which indicates that few genome sequences of Dioscorea species are available in public database; thus, the ratio of annotation is fairly low. For example, the alignment lengths of gene NC_039707.1 (AYI69196.1) in D. alata and gene DN1708_c1_g2 in D. opposita are identical at 2250 bp (750 aa), and the similarity of these genes is 100%. Thus, these genes are believed to be the psbA gene, which encodes Photosystem I P700 chlorophyll a apoprotein A1 protein. 97 unigenes that were compared with D. dumetorum. Among these unigenes, the unigene DN2638_c1_g5 in D. opposita was identified to be gene NC_039691.1 (YP_009528139.1) in D. dumetorum; these genes are 465 bp (155 aa), in length and encode ribosomal protein S7 in the chloroplast genome.
Through the Blast2GO database, a large number of unigenes GO categories were assigned ( Table 3), suggesting that the assembled unigenes represented a wide diversity of transcripts in the D. opposita genome, similar to in many previous reports [52,53]. In the three GO categories, reproduction and metabolic processes were the most abundant classes in BP cell and cell part activity were the most abundant classes in CC, which was consistent with some previous research [17,51,52].
Likewise, compared with the transcriptome of differentially expressed genes of leaf versus rhizome tissues by the DESeq2 software, there were 13,259 unigenes existing in significant DEGs, with 5536 up-regulated and 7723 down-regulated unigenes. These results showed that more unigenes abundance down-regulated in rhizome than in leaf tissue, which may be related to secondary metabolites of rhizome tissue. Moreover, most of these unigenes were related to transferase activity, membrane part, photosynthesis, plant-pathogen interaction, and plant hormone signal transduction.
Trinity can be used with different k-mer size and the standed flag for de novo transcriptome assembly [54]. The developers describe comprehensive workflow for assembly validation on the trinity software. In this study, analyses by using five k-mer sizes of 25, 27, 30, 31 and 32 were performed with Trinity (Version v2.8.5) website. The de novo transcriptome assembly was generated using the stranded flag and 2 × 150nt paired-end reads. The assembly results were evaluated on the basis of assembly size, number of contigs, and BUSCO (Table S4). The number of complete BUSCOs is highest for a k-mer size of 30. Meanwhile, the N50 of 1825 bp is in the expected range for a transcriptome assembly. So, 30 were selected as the best k-mer size for the assembly and analysis of the transcriptome data of leaf and enzymes tissues from D. opposita.
Based on previous reports [55,56], we speculate the expression levels of dioscin in different tissues of D. opposita may be different and are related to key enzymes of the biosynthetic pathway. Therefore, to explore the dioscin accumulation and translocation in leaf and rhizome tissues, we compared the differentially expressed genes (DEGs) and elementary identified terpene biosynthetic genes through the RNA-seq, which would lay a foundation to study the terpenoids biosynthesis pathway.
We analyzed SSR sequences with length of 1-30 bp, and detected 16,417 unigenes. Among these unigenes, 2338 and 5782 were protein coding and non-coding sequences, respectively. The results also indicated that SSRs with lengths of 3 or multiples of 3 were identified more often within coding sequences than SSRs of other lengths, likely because a complete codon would not lead to a frameshift, but to the addition/deletion of an amino acid.
Terpenoid biosynthesis requires common precursor substances, including isopentenyl diphosphate (IPP) and its allylic isomer dimethylallyl diphosphate (DMAPP), which are generated through the MVA and MEP pathways [55,56]. In the MEP pathway, DXR is a rata-limiting enzyme, that catalyze 1-deoxy-D-xylulose-5-phosphate (DXP) to generate MEP [57]. GGPS is the main enzyme in the terpene biosynthesis pathway, and can catalyze IPP to generate geranylgeranyl diphosphate (GGPP) [57]. In this study, many DEGs involved in terpenoids synthesi, such as diterpenoid biosynthesis, monoterpenoid biosynthesis and tropane, piperidine and pyridine alkaloid biosynthesis were detected. The expression patterns of the DXR (DN4169_c0_g1) and GGPS (DN17971_c0_g1) genes were studied through qRT-PCR, the results of which were consistent with the RNA-Seq tata. Our results also indicated that the abundance of DXR and GGPS genes is higher in the leaf tissue than in the rhizome tissue of D. opposita. This phenomenon may be related to the synthesis and accumulation of secondary metabolites in these tissues. Other enzymes, such as HMGR and hydroxymethylglutaryl-CoA synthase (HMGS) may also participate in the biosynthesis of terpenoids. These genes will be studied in great depth in future work.

Conclusions
In this study, the RNA-seq of leaf and rhizome tissues of D. opposita were obtained using an Illumina Hiseq 4000 platform. The results enriched the genomic data of D. opposita, and explore the different expression genes between leaf and rhizome tissues of D. opposita.        The top 20 enriched GO enrichment terms of differentially expressed genes (DEGs) of leaf (HSYY) and rhizome (HSYK) tissues. The upper x-axis indicates the number of unigenes, and the lower x-axis indicates the false discovery rate (FDR). The larger the FDR value, the more significant the enrichment. The y-axis indicates the different metabolic pathways. Each point on the fold represents the number of each metabolic pathway. The top 20 enriched GO enrichment terms of differentially expressed genes (DEGs) of leaf (HSYY) and rhizome (HSYK) tissues. The upper x-axis indicates the number of unigenes, and the lower x-axis indicates the false discovery rate (FDR). The larger the FDR value, the more significant the enrichment. The y-axis indicates the different metabolic pathways. Each point on the fold represents the number of each metabolic pathway. The top 20 enriched KEGG pathways of DEGs between leaf (HSYY) and rhizome (HSYK) tissues. The top 20 enriched KEGG pathways of DEGs between leaf (HSYY) and rhizome (HSYK) tissues.

Figure 6
Expression of two selected unigenes as determined through qRT-PCR in comparison to the RNA-seq data. Expression of two selected unigenes as determined through qRT-PCR in comparison to the RNA-seq data.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.