Illumina sequencing and de novo assembly
To characterize the transcriptomes of Entada phaseoloides, we sequenced nine cDNA libraries prepared from the root, stem, and leaf tissues with three biological repeats by using the Illumina Hiseq 2500 platform. A total of 53.26 Gb clean data were obtained after removing adaptors, poly-A tails, and primer sequences, short (< 50 bp), and low-quality sequences. A total of 57–60 million for each tissue were generated (Additional file 1). The high-quality reads were assembled using the Trinity program [24] and the TGI clustering tool (TGICL) [25] to remove redundant sequences. Finally, 116,910 unigenes were identified, with an average N50 length of 1,218 bp (Additional file 2). The correlation indices between repeated samples were > 0.9 (Additional file 3), indicating that the Illumina sequencing results are credible.
Functional annotation
All assembled unigenes were searched against the Non-redundant (Nr), Uniprot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Pfam, Gene Ontology (GO), and Clusters of Orthologous Groups (COG) databases using the BLASTx program with E-value < 1e-5. Among the 116,910 sequences, 42,191 unigenes (36.1%) presented at least one putative function that matched an existing gene model. The detailed results are listed in Additional file 4. Based on the Nr database, the E-value distribution indicated that 70.67% of the matched unigenes ranged from 1e-5 to 1e-100 (Fig. 1a). For the similarity distribution, 46.43% unigenes exhibited a similarity of above 80%, whereas 50.79% of the unigenes showed a similarity of 40%–80% (Fig. 1b). Furthermore, 14.63% of the Entada phaseoloides unigenes shared high similarity with the genes of Cicer arietinum, 13.46% similarity with Cajanus cajan, 12.84% similarity with Glycine max, and 9.07% similarity with Medicago truncatula (Fig. 1c).
GO analysis included three main domains that describe biological processes, cellular components, and molecular functions.When GO was used to classify gene functions, 28,126 unigenes were assigned to 60 functional categories (Additional files 4 and 5). Within the biological process domain, the three most enriched categories were “biosynthetic process,” “cellular nitrogen compound metabolic process,” and “response to stress.” In the cellular component domain, the three most matched categories were “cellular component.” “nucleus,” and “protein complex.” In the molecular function domain, the three most common categories were “ion binding,” “molecular function,” and “kinase activity.”
To better understand the functions of specific metabolic pathways in Entada phaseoloides, we mapped the annotated unigenes to the reference biological pathways in the KEGG database. A total of 15,119 unigenes (13.0 %) could be assigned to five main categories and 31 sub-categories (Fig. 2, Additional file 6). These enzymes feature assigned functions in 28 secondary metabolic pathways in KEGG (Table 1). Among these unigenes, 147 encode key enzymes are related to the pathways for terpenoid biosynthesis, including the synthesis of the terpenoid backbone (63 unigenes), monoterpenoids (7 unigenes), diterpenoids (18 unigenes), sesquiterpenoids and triterpenoids (16 unigenes), and other terpenoid-quinone complexes (43 unigenes). Fifty unigenes are involved in alkaloid biosynthesis, including isoquinoline alkaloid (24 unigenes) and tropane, piperidine, and pyridine alkaloid biosynthesis (26 unigenes). Exactly 210 unigenes were associated with the flavonoid biosynthesis pathway, including the phenylpropanoid (162 unigenes), flavonoid (36 unigenes), flavone and flavonol (7 unigenes), and isoflavonoid (5 unigenes) biosynthesis pathways. Unigenes involved in these pathways should be further identified to understand their functions in the biosynthesis of active ingredients in leguminous plants.
Differentially expressed gene (DEG) analysis
The clean reads were mapped back onto the assembled unigenes by using the alignment via Burrows–Wheeler transformation (BWA) program to analyze the DEGs among different tissues. The fragments per kb per million reads (RPKM) value was calculated for each unigene in each tissue of Entada phaseoloides. The DEGs were identified (Additional file 7) using the Fisher's exact test with threshold of threshold of FDR≤0.01 and at least a 2-fold expression change [27]. The lowest number of DEGs was observed between the stem and leaf tissues, and the highest was noted between the root and leaf. Furthermore, the DEGs in one tissue were studied and compared with those in the other two tissues. The stem contained the largest number of highly expressed unigenes, having 8,962 unigenes more abundant in the stem. Fig. 3 shows the other expression differences among various tissues.
KEGG enrichment analyses were performed with the DEGs among the root, stem, and leaf tissues to investigate the genes regulating the distribution of triterpenoid saponin. These DEGs were evidently enriched in specific pathways. Meanwhile, the top 20 statistically pathways were analyzed based on the FDR≤0.01. Between the leaf and stem, plant hormone signal transduction, phenylpropanoid biosynthesis, photosynthesis, terpenoid backbone biosynthesis, and phosphatidyllnositol signaling system showed significant enrichment (Fig. 4a). Between the leaf and root, plant hormone signal transduction, phenylpropanoid biosynthesis, photosynthesis, cyanoamino acid metabolism, and ubiquinone and other terpenoid-quinone biosynthesis pathways first showed visible differential expression (Fig. 4b). Between the stem and root, plant hormone signal transduction, phenylpropanoid biosynthesis, photosynthesis, cyanoamino acid metabolism, and terpenoid backbone biosynthesis showed significant enrichment (Fig. 4c). In addition to the common pathways of primary metabolism, enriched secondary metabolic pathways, including terpenoid and phenylpropanoid biosynthesis, were also found between different tissues, indicating the possible distinct distribution of secondary metabolites in different tissues.
Putative genes involved in triterpenoid saponin backbone biosynthesis
Triterpenes are synthetized from a five-carbon isoprene unit through the cytosolic MVA pathway. Triterpenoid saponins are composed of six isoprene units and are derived from the C-30 hydrocarbon precursor, squalene. Squalene is synthesized from isopentenyl diphosphate (IPP) via the MVA pathway. All genes encoding the enzymes associated with the upstream regions of triterpenoid biosynthesis were successfully detected in the Entada phaseoloides transcriptome. Their expression value was monitored in three biological replicates along with their mean values (Table 2, Fig. 5). Most unigenes related to MVA pathway were specifically upregulated in the stem tissue. Hydroxymethylglutaryl-CoA reductase showed the highest expression, which is the rate limiting step MVA pathway for saponin biosynthesis.
The diversifying step in triterpenoid backbone biosynthesis is the cyclization of 2,3-oxidosqualene catalyzed by a class of OSCs. The major saponins in Entada phaseoloides are oleanane-type triterpenoid saponins derived from β-amyrin. The Illumina sequencing of Entada phaseoloides revealed 21 OSC sequences, among which eight unigenes were putative β-amyrin synthases. A full-length OSC sequence (EpBAS) with high identity to β-amyrin synthase was obtained (Additional file 8). The EpBAS cDNA included a 2,289 bp full open reading frame fragment. The deduced amino acid sequence of EpBAS (762 amino acids) shared 89.37% and 89.34% similarity with β-amyrin synthase in Abrus precatorius (ApBAS) and GiBAS in Glycyrrhiza inflata (Fig. 6), respectively. The relatively high similarities of the EpBAS protein with other β-amyrin synthases suggest that this gene encodes β-amyrin synthase in Entada phaseoloides.
CYP450s and UGTs
Earlier studies suggested that CYP450s and UGTs may account for the biosynthesis and accumulation of triterpene saponins in specific organs [28]. Tissue-specific transcriptome analysis of Entada phaseoloides suggests that the enzymes involved in triterpenoid saponin backbone are present in all the three tissues. Based on DEG analysis using transcriptome data, there is the possibility of further modifications such as oxidation and glycosylation using CYP450s and UGTs occur in the stem. Although the enzymes related to precursor biosynthesis are also present in root and leaf tissues which suggest the involvement of all the three tissues in the metabolic pathway. However, the metabolic analysis has indicated that it is the stem which mostly contains higher triterpenoid saponin content and utilized widely for its excellent pharmacological activity. In this study, in total of 326 CYP450s and 148 UGTs were found. Among the DEGs, 26 CYP450s and 17 UGTs were upregulated in the stem compared with the root and leaf tissues (Fig. 7a and b).
qRT-PCR validation of candidate genes involved in triterpenoid saponin biosynthesis
To verify the expression profiles obtained from Illumina sequencing, we performed qRT-PCR on nine selected genes related to triterpene saponin biosynthesis (Fig. 8). Consistent with the Illumina data, most of these genes showed strong expression levels in the stem compared with the root and leaf, and acetyl-CoA acetyltransferase, hydroxymethylglutaryl-CoA synthase, hydroxymethylglutaryl-CoA reductase and SQE genes were expressed abundantly. The expression fold changes were also close to the RNA-seq results. qRT-PCR results indicate that the RNA-seq data in this studty were reliable.