Illumina-Based Transcriptome Analysis of Four Genotypes of Pinellia ternata (Thunb.) Breit. CURRENT

widely used as a Chinese medicine. Currently, four genotypes of P. ternate have been identified according to their different leaf shapes, including Shaoye, Taoye, Zhuye and Liuye Type. In addition, the total alkaloids and organic acids content in tuber, tuber yields and other physiology and agronomic characteristics were significant difference. However, little is known about the evolution and genetic diversity of four genotypes of P. ternate . Moreover, the genomics of P. ternate remains uncharacterized. The recent development of RNA-Seq, a next generation sequencing technology, provides an opportunity to expand the identification of P. ternate genes through in-depth transcript profiling. In this study, we presented a comprehensive landscape of the transcriptome profiles of four genotypes of P. ternate using Illumina paired-end sequencing technology. Totally, 70,491,871, 72,819,035, 73,159,913 and reads , After Among these, assigned to and and detected.

An aliquot of 0.5 g sample powder (through 250 mesh screen) was accurately weighed and extracted by cold soaking for 10 h after sequential addition of 0.5 mL of ammonium hydroxide ethanol and 10 mL of chloroform. The extract was then extracted by sonication for 45 min and evaporated to dryness under a stream of nitrogen. The chemical standard of ephedrine hydrochloride was dissolved in water to a desired concentration of 0.2 mg/mL as the stock solution. The calibration curves were prepared by diluting appropriate volumes of stock solutions into seven different concentrations according to the following procedures: 1) Appropriate volume of stock solutions was diluted with water to a constant volume (2 mL); 2) After 8 mL of chloroform, 10 mL of citric acid-sodium citrate buffer and 1 mL of bromothymol blue standard solution were added, the mixed solution was shaken for 15 min and then chloro-form extract was collected; 3) The chloro-form extract was diluted with chloroform to 20 mL as working solution. All working solutions were measured as 418 nm and the calibration curve constructed by plotting UV absorbance (y) vs. the corresponding concentrations (x). was calculated as Y = 27.679X + 0.0654 with good linearity (R²=0.9991).

RNA Extraction
Total RNA was extracted from these samples using TRIzol (Thermo Scientific, MA, USA) according to the manufacturer's instructions. The extracted RNA was treated with RNase-free DNase I (New England BioLabs) for 30 min at 37°C in order to remove residual DNA. Equal amount of the extracted RNA from each sample were prepared for further experiments.

mRNA-seq Library Construction for Hiseq 2000 Sequencing
Extracted RNA samples were purified by oligo-dT beads (Dynabeads mRNA purification kit, Invitrogen), then polyA-containing mRNA were fragmented into 200-250bp with Fragment buffer (Ambion, TX, USA). First strand cDNA was synthesized in the mixture buffer containing N6 primer, First Strand Master Mix and Super Script II reverse transcription (Thermo Scientific, MA, USA) with the following reaction program: 10 min at 25℃, 30 min at 42℃, followed by 15 min at 70℃, and eventually hold at 4℃. Subsequently, Second Strand Master Mix was added and the double-stranded cDNA were generated in the reaction system for 2 h at 16℃. After purification using QIAquick PCR Purification Kit (QIAGEN, CA, USA), the End Repair Mix was added to convert the overhangs of the double-stranded cDNA into blunt ends and the mixture were incubated for 30 min at 20℃. Then, the purified DNA was mixed with A-Tailing Mix buffer before being incubated at 37℃ for 30 min. After purification and elution, the DNA samples were incubated with the single 'A' nucleotide in a thermal cycler for 30 min at 37℃ to protect DNA from ligating to one another during the adapter ligation reaction with Adapter and Ligation Mix. The adapters were ligated to the ends of the DNA fragments at room temperature for 20 min. The products of the ligation reaction were purified on a 2% agarose gel to remove unbounded adapters and adapters ligated to one another. A narrow size-range of 300-350 bp of sequencing library was selected and purified with QIAquick Gel Extraction kit (QIAGEN).
Then, the enrichment of cDNA fragments was performed using PCR and finally purified with Ampure

Sequence Data Assessment and Assembly
Clean reads were trimmed with the removing of reads in which the number of quality bases contained adaptor sequences, reads in which number of quality bases were more than 10% of the N-base, and reads in which more than half of the quality values of the bases were less than 5. The clean reads were then calculated with CycleQ 20 and the qualified reads were assembled into contigs by overlapping between the sequences using Trinity software. According to paired-end information of the sequences, the contigs were joined into transcript sequences to form isotigs, recovering full-length transcripts across a broad range of expression levels and displaying a similar sensitivity compared to the general genome alignments methods [10]. In this assembly, the overlap settings were 24 bp and 80% similarity, along with group pairs distance was set to maximum length 500 and all the other parameters were set to the default values. The longest transcriptions from the assembled component alternative splicing transcripts were chosen as unigene sequences. The transcript levels in Fragments per kilobase of exon mode per million mapped reads (FPKM) were quantified [11]. The measurement of FPKM of fragment density indicated the molar concentration of a transcript in the starting sample by normalizing for RNA length and the total fragment number.

Unigenes Annotation
The predicated amino acid sequences encoded by the assembled unigene sequences were annotated using the National Center for Biotechnology Information (NCBI) non-redundant protein (Nr) database and Swiss-Prot database. Using BlASTx (version 2.2.14), the cutoff criteria for E-value of annotation was set to <10 -5 . Based on the best BLAST hit (highest score), gene names were given and searches were limited to the first 10 significant hits for each query. EMBOSS software package was utilized for predicting the open reading frames (ORFs) for each unigene through the "getorf" program [12].
To further illustrate gene ontology (GO) terms of assembled unigenes, including molecular functions, biological processes and cellular components, the NCBI BlAST results were imported into Blast2GO software [13,14]. These GO terms were assigned to query sequences. A broad overview of groups of genes catalogued in the transcriptome for molecular functions, biological processes and cellular components were produced and further enriched and refined by ANNEX [15,16].
Moreover, the Clusters of Orthologous Group (COG) database were employed to predict and classify functions of unigene sequences [17]. By using the online KEGG Automatic Annotation Server (http://www.genome.jp/kegg/kaas/) and the bi-directional best hit method, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were assigned to the assembled unigenes and KEGG Orthology (KO) assignment was obtained [18]. The output of KEGG analysis contained KO assignments and KEGG pathways that were populated with the KO assignments.

EST-simple sequence repeat (SSR) Detection
The online program SSR Identification Tool (http://www.gramene.org/db/markers/ssrtool) was employed to identify SSRs in a given sequence [19,20]. The parameters were set to identify perfect di-, tri-, tetra-, penta-, and hexa-nucleotide motifs with a minimum of six, five, four, four, and four repeats, respectively. The information of the online report listed the total number of sequences containing SSRs among the submitted unigenes, sequence ID, SSR motifs, number of repeats (di-, tri-, tetra-, penta-, and hexanucleotide repeat units), repeat length, SSR starts, and SSR ends [21][22][23].

Statistical analysis
The experimental data were obtained from 3 independent tests and statistical analysis was performed by one-way analysis of variance (ANOVA) using SPSS Statistics (version 24.0, International Business Machines Corp, NY). P < 0.05 was considered as a statistically significant difference.

Analysis of Total Alkaloids and Organic Acids in Tubers of Four Genotypes of P. ternate
Preliminarily, we analyzed the total alkaloids and organic acids contents in tuber from four genotypes of P. ternate (SK, TK, ZK and LK). ZK was reported to have a preponderance of total alkaloids (0.047%) and organic acids (0.556%), while LK was shown the lowest content of total alkaloids (0.027%) and organic acids (0.363%) ( Table 1).  (Table 3). The sequence sizes distribution of unigenes was homogeneous (Fig. 3). Among these, 33,812 unigenes (26.75%) were longer than 1 kb while more than 57,781 unigenes (45.72%) were greater than 500 bp.

GO Classification
GO is an international standardized gene-function classification system that uses a dynamically updated, controlled vocabulary and a strictly defined concept to comprehensively describe the properties of genes and their products in any organism. The GO database comprises three ontologies, including molecular function, cellular components and biological processes. In this study, we obtained the GO functional annotations of the P. ternate unigenes by BLAST2 GO program [13]. Using the WEGO software [16], we performed GO functional classifications for all of the unigenes and examined the macro level distribution of gene functions of this species. Based on sequence homology, 15,534 unigenes were annotated in the GO database and categorized into the three main GO categories and 55 functional groups (Fig. 4). In the "biological process" category, the unigenes related to "cellular process" (49.33%) and "metabolic process" (54.08%) were predominant. The "cell" (51.07%) and "cell part" (51.07%) were found to be the most abundant classes. Under the "molecular function" category, the majority of unigenes were involved in "binding" (47.03%) and "catalytic activity" (51.30%).
hexa-nucleotide repeat SSRs of all the SSRs in this study were about 11.58%, 44.56%, 39.59%, 14.78%, 11.66% and 16.22%, respectively (Table 5). On the other hand, 65,527 SNP variations were also identified from the four genotypes. The high density SNP markers could be applied as useful tools for molecular research of P. ternate in case of SSR markers are not available.

Differential Expression of Unigenes of Four Genotypes of P. ternate
The four genotypes (SK, TK, ZK and LK) with different leaf shapes and different amounts of total alkaloids and organic acids in their tubers showed relative discrepancy of their transcript.
Consequently, we sought to analyze the differential expression of unigenes in the four genotypes of P.
ternate. The NOIseq and PossionDis program were employed to identify the differential expression of And for ZK vs. LK, 1,828 unigenes were differentially expressed, with 340 increased unigenes in ZK and 1488 down-regulated unigenes in LK (Table 6). To further analyze the possible function of unigenes with differential expression levels, we assessed their GO classifications. 1,206 unigenes with differential expression between SK and TK were  Table 7). The occurrence of unigenes with differential expressions among samples suggested that other differences may contribute to the relatively larger number of differentially expressed unigenes.

Discussion
The medication history of P. ternate in TCM has been more than 2,000 years, its phytochemistry and biological activities have been drawn the attention of researchers to constantly probe and further explore on it. The lack of genetic resources and genomic information has hampered the in-depth understanding of P. ternate in relationship between growth and medicinal properties. Herein, we first report the most comprehensive transcriptome analysis of four genotypes of P. ternate (Shaoye, Taoye, Zhuye and Liuye Type) using the Illumina platform. In our transcriptome analysis, the obtained total clean reads nucleotides, length distribution pattern and unigene lengths were similar to the previous Illumina based sequencing studies [34], indicating an effective de novo assembly of four genotypes of P. ternate transcriptome sequencing data. Among the 126,391 identified unigenes in transcriptome analysis, 46,053 (36.44%) unigenes were mapped to reference pathways and the rest (63.56%) still remains unknown.
GO functional classification is utilized to distribute gene functions at the macro level and predict the physiological role of each unigene [35,36]. In our studies, 15,534 unigenes were annotated and categorized into the three main GO categories and 55 functional groups. The annotated results revealed that the metabolic pathways the assembled unigenes involved are as diverse as the molecular functions they served. Among the biological process categorized by GO classification, metabolic process and cellular process emerged as the predominant positions, indicating the occurrence of important cellular and metabolic activities in P. ternate. For the molecular function category, catalytic activity and binding were the most highly represented GO terms, which are the similar results to reported studies [37,38].

Supplemental Information Note
The additional file mentioned on page 4 was omitted by the authors in this version of the paper.

Figure 3
Length distribution of P. Ternate unigenes.