Quantitative analysis of A. elata aralosides
The saponins present within A. elata primarily contain oleanolic acid and hederagenin aglycone , with significant variations in total and monomer araloside accumulation among tissues in these plants. Specifically, the leaves of A. elata have been found to contain the largest quantity of these saponins, with progressively lower levels found in root and stem tissues. The roots of these plants contained higher oleanolic acid levels than did the other tested tissues, whereas hederagenin levels were highest in leaves relative to roots and stems (Table 1; Additional file 1: Figure S1). Two selected oleanane-type saponins (chikusetsusaponin IV and araloside X) and one hederagenin-type saponin (araloside VII) were also detected in these A. elata samples, with root chikusetsusaponin IV levels being fairly high, whereas they were minimal in leaf and stem tissues. In contrast, we detected high levels of aralosides VII and X in the leaves of these plants, suggesting that different UGTs were responsible for their generation. These tissue-specific saponin distribution patterns serve as a valueble reference for research efforts aimed at identifying those CYP450s and UGTs involved in araloside production in A. elata.
De novo A. elata sequence assembly
To identify genes pertaining to saponin biosynthesis in these A. elata plants, we next employed an Illumina HiSeq 4000 platform to sequence the total RNA transcriptomes in root, leaf, and stem tissue samples. In total, this approach yielded 448,112,618 reads that were assembled into 82,238 contigs, with the longest being 16,016 bp, and with an average contig length of 1,058 bp. We were then able to assemble these contigs into 66,713 unigenes with a 1,846 bp N50 length (Table 2). Next, these unigenes were annotated with the KEGG, UniProt, NCBI nonredundant nucleotide (Nt), and Nr databases via use of the BLASTN and BLASTX algorithms, leading to the annotation of 35,232 (52.81%) unigenes (Additional file 2: Figure S2). These transcriptome sequence data have been deposited in the NCBI Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) under the accession number SRR10428005-SRR10428013.
Enrichment of terpenoid backbone and triterpenoid biosynthetic pathways
After annotating 7,291 A. elata unigenes using the KEGG database, 79 and 47 unigenes were found to be associated with the terpenoid backbone and sesquiterpenoid/triterpenoid biosynthetic pathways respectively, containing the MVA and MEP pathways as well as the 2,3-oxidosqualene biosynthetic pathway (Fig. 1A). A KEGG pathway analysis indicated that 14 unigenes encoding 6 enzymes (AACT, HMGS, HMGR, MVK, PMVK, and MVD) were associated with the MVA pathway. Of these, 5 unigenes were annotated as HMGR, which is essential in this pathway as it catalyzes the conversion of 3-hydroxymethyl-glutrylCoA into meralonic acid (Fig. 1A). However, in the plastid, IPP is formed through the MEP pathway, which begins with pyruvate and glyceraldehyde-3-phosphate. A total of 22 unigenes encoding 9 enzymes (DXS, DXR, MEP-CT, COP-MEK, MECDPS, HMBPPS, HMBPPR, IDI, and GPS) were annotated as being involved in this pathway (Fig. 1A). Additionally, 28 unigenes encoding 4 putative enzymes (FPS, SS, SE, and bAS) were found to be associated with carbocyclic biosynthesis (Additional file 3: Table S1). For the majority of these genes, we were able to map >1 unigene to a given gene or gene family (Additional file 3: Table S1), suggesting that these sequences may correspond to multiple copies or partial fragments of a given gene . In each of enzymatic steps, those unigenes that showed high identity with functionally characterized genes and were differentially expressed in different tissues were next selected and arranged into a heat map. As shown in Fig. 1A, significant differential expression of these genes was evident in different A. elata tissues. Interestingly, all genes involved in the MEP pathway exhibited more robust expression in leaves relative to roots or stems, whereas genes involved in the MVA pathway other than HMGS, MVK, and SS were also expressed at higher levels in leaves. To confirm that these RNA-seq findings were accurate, 10 of these 19 genes were selected, and their expression profiles in a range of tissues were assessed via qRT-PCR. In agreement with our transcriptomic data, the selected genes were expressed at higher levels in leaves, with the exception of HMGS, which was expressed at higher levels in roots (Fig. 1B). These findings indicate that key triterpenoid skeleton biosynthesis reactions mainly occurred in leaves, explaining why leaves were found to contain higher saponin concentrations (Table 1).
A. elata CYP450 identification and phylogenetic analysis
Through our transcriptomic analysis, we were able to identify 254 CYP450 genes in these A. elata samples. This number of total CYP450 unigenes (254) was lower than the number identified in studies of Arabidopsis thaliana (272) and Panax ginseng (484). To obtain more comprehensive understanding of the CYP450 gene family, we classified 150 CYP450-encoding proteins >300 amino acids in length via alignment with the CYP450 database using allelic, subfamily, and family variant cutoff values of 97%, 55%, and 40%, respectively . Prof. David Nelson named these CYP450s according to reference sequences within a carefully annotated CYP450 reference database. We were ultimately able to classify these CYP450s into 9 clades, 40 families and 75 subfamilies, with 53% being A-type CYP450s and 47% being non-A-type CYP450s (Additional file 4: Table S2). The CYP71 clan represented all A-type CYP450s, containing 79 genes belonging to 17 families (CYP71, CYP73, CYP75-CYP78, CYP80-CYP82, CYP84, CYP89, CYP92, CYP98, CYP701, CYP706, CYP712 and CYP736).
To further explore the functional roles and evolutionary relationships for these A. elata CYP450s, 150 CYP540s along with 57 CYP450s from A. thaliana (37) and P. ginseng (20) were used to generate NJ phylogenetic trees for A-type (Fig. 2A) and non-A-type (Fig. 2B) CYP450s, separately. As shown in Fig. 2A, the 79 CYP71 members were assigned to 17 families forming a single clade along with 28 representative CYP450s. The phylogenetic tree for non-A-type CYP450s separated them into 8 clades, including three single-family clans (CYP51, CYP710 and CYP711) and five multifamily clans (CYP72, CYP74, CYP85-86 and CYP97), CYP85, CYP86, and CYP72 were the largest three clans, containing 23, 20, and 17 CYP450s, respectively, while CYP51, CYP710, and CYP711 each contained only a single member (Fig. 2B). A. elata UGT identification and phylogenetic analysis
We annotated 122 unigenes in the A. elata transcriptome as UGTs encoding proteins that were 77-704 amino acids in length. We then omitted those unigenes that encoded proteins shorter than 300 amino acids, yielding 92 UGTs that underwent blastp with those of functionally characterized plant UGTs, with those having >40% protein sequence similarity being incorporated into the same family (Additional file 5: Table S3). We separated 87 UGTs into 22 UGT families, while 5 UGTs (Unigene0019368; Unigene0025405; Unigene0034502; Unigene0034503; Unigene0060195) had deduced amino acid sequences < 40% identical to representative sequences and thus could not be assigned. The UGT73 family was the largest family, with 16 genes, with the next largest being the UGT 85 family, with 13 members.
We aligned A. elata UGT amino acid sequences with those of other functionally characterized plant UGTs from Arabidopsis, Panax ginseng, Medicago truncatula, Oryza sativa, Zea mays, Cicer arietinum, and Crocus sativus (Additional file 6: Table S4) to construct a phylogenetic tree. These A. elata UGTs were phylogenetically separated into 16 groups, including 14 conserved groups (A-N) that were identified in Arabidopsis and two novel groups (O and P; Fig. 3) identified in maize [25, 26]. No A. elata UGTs were incorporated into group Q. There were 16 UGT73 family members in group D, which was the largest A. elata UGT group, while groups C, I and N contained only one UGT each.
A. elata CYP450 and UGT gene tissue-specific expression patterns
To understand the tissue-specific expression of CYP450s and UGTs in A. elata, we next conducted an analysis of their expression profiles in the three different tissues that had been subjected to RNA-seq analysis. Of these 150 CYP450s and 92 UGTs, we found that 132 (87.42%) and 78 (84.78%) exhibited patterns of differential expression across these tissues, respectively. A hierarchical clustering analysis was used to assess CYP450 and UGT co-expression patterns in these different analyses, with an expression profile heat map for these genes being constructed according to their RPKM normalized expression values. This clustering analysis led to the assignment of 150 CYP450s and 92 UGTs to six clusters (C1-C6; Fig. 4A and B). The CYP450s that were most highly expressed in roots (31 genes), leaves (29 genes), and stems (15 genes) were grouped into clusters C2, C5, and C1, respectively. In addition, those CYP450s in clusters C3 (31 genes), C4 (8 genes), and C6 (36 genes) were expressed at the lowest levels in leaf, stem, and root tissues, respectively. An assessment of different A. elata tissues indicated that CYP450s were expressed at high levels in leaves (48.67%) and roots (41.33%) (Fig. 4A). However, a high proportion (63.04%) of these UGTs were expressed at high levels in leaves relative to roots (19.57%) and stems (17.39%) (Fig. 4B). This suggests that the modification of UGTs primarily occurs in leaves, explaining the greater variety of secondary metabolites in leaves.
A qRT-PCR approach was further used to validate these transcriptomic findings, with the expression of 12 randomly selected CYP450s being quantified (Additional file 7: Figure S3). These CYP450 expression profiles were consistent with the RPKM values, confirming the validity of our RNA-seq data. Previous research has shown that members of the CYP72A and CYP716A subfamilies are the primary CYP450s involved in pentacyclic triterpenoid saponin biosynthesis (Additional file 8: Table S5). As such, we next specifically focused on the expression patterns of the 3 CYP716A and 8 CYP72A genes identified in the A. elata transcriptome. The qRT-PCR profiles for these genes revealed that two CYP716A genes (CYP716A295 and CYP716A296) and two CYP72A genes (CYP72A762 and CYP72A764) were expressed at high levels in root tissues relative to stems and leaves. Expression of these four genes was consistent with measured oleanolic acid contents. In contrast, CYP72A759, CYP72A763, and CYP72A776 were expressed at higher levels in leaves and at lower levels in stems and roots, and the expression patterns of these three genes were consistent with hederagenin contents (Fig. 5). These results provided a reference for the selection of CYP450 candidates related to triterpenoid saponin biosynthesis.
Considering that the members of the UGT73 family primarily catalyzed glycosylation of oleanolic acid and hederagenin at the C-3 and C-28 sites (Additional file 9: Table S6), the tissue-specific expression profiles of 16 UGT73 members identified in this study were analyzed via qRT-PCR. As shown in Fig. 6, 15 out of these 16 UGT73 members exhibited significant differential expression profiles in three tissues. Of these, 7 UGTs (Unigene0000487, Unigene0006936, Unigene0016738, Unigene0043352, Unigene0045141, Unigene0045143, and Unigene0045144) were most highly expressed in leaves, while six UGTs (Unigene0003933, Unigene0033039, Unigene0034878, Unigene0047749, Unigene0060157, and Unigene0063855) were mostly expressed in roots, with progressively lower levels in stems and leaves. These findings indicate that most UGT73 members were highly expressed in A. elata roots and leaves.
Identification of candidate CYP450s involved in triterpenoid biosynthesis
Hederagenin aglycone and oleanolic acid are the major sapogenins in A. elata, and as such, we specifically assessed CYP450s related to the synthesis of these sapogenins. To date, a total of 36 CYP450s have been found to play roles in triterpenoid biosynthesis (Fig. 7; Additional file 8: Table S5). The CYP716A and CYP72A subfamilies are the primary CYP450 gene families involved in pentacyclic triterpenoid saponin diversification, with the CYP716A family being the largest multifunctional C28-oxidase family involved in such oleanane-type triterpenoid saponins biosynthesis [16, 27, 28] (Fig. 8). We were able to identify 3 CYP716A genes (CYP716A295, CYP716A296, and CYP716A306) and 8 CYP72A genes (CYP72A759-764, CYP72A776-777) in the A. elata transcriptome. To identify the most relevant unigenes involved in pentacyclic triterpenoid saponin biosynthesis for further analysis, we conducted BLASTx searches that compared these A. elata CYP450s to those 36 CYP450s known to be involved in triterpenoid biosynthesis. This analysis revealed that the A. elata CYP716A295 and CYP716A296 exhibited 93.97% and 94.39% sequence identity with P. ginseng CYP716A52v2, respectively, which is a β-amyrin 28-oxidase enzyme involved in oleanolic acid production  (Fig. 8). Moreover, CYP716A295 and CYP716A296 were expressed at higher levels in roots relative to stems and leaves, in line with oleanolic acid contents (Fig. 5, Table 1). As such, we selected CYP716A295 and CYP716A296 as the best candidate CYP450s likely to be involved in oleanolic acid biosynthesis in A. elata. Two CYP450s (CYP72A397 and CYP72A68v2) have thus far been identified as oleanolic acid 23-oxidases, catalyzing oleanolic acid oxidation into hederagenin [30, 31] (Fig. 8). In the present study, we observed higher expression levels of CYP72A759, CYP72A763, and CYP72A776 in leaf tissues, with progressively lower levels in stems and roots, consistent with the observed hederagenin distribution profile (Fig. 6). A BLASTx analysis further indicated that CYP72A763 and CYP72A776 encoded proteins with 54.92% and 70% identity to the oleanolic acid 23-hydroxylase CYP72A397, respectively. Given these results, we further selected CYP72A763 and CYP72A776 as the CYP450s most likely to be involved in hederagenin biosynthesis, although further functional validation will be necessary.
Phylogenetic analyses revealed CYP716A295 and CYP716A296 to be grouped in the CYP716A subfamily and to be most closely related to Maesa lanceolata CYP716A75 and P. ginseng CYP716A52v2, respectively, both of which encode β-amyrin 28-oxidase enzymes involved in oleanolic acid production [29, 32]. CYP72A763 and CYP72A776 clustered in the CYP72A group and were most closely related to M. truncatula CYP72A68v2 and Kalopanax septemlobus CYP72A397, respectively, both of which encode β-amyrin 23-hydroxylase enzymes [30, 31] (Fig. 7).
Identification of candidate UGTs involved in triterpenoid biosynthesis
To determine which UGTs were involved in araloside glycosylation, blastp was used to compare 92 A. elata UGTs to 16 functionally characterized UGTs. We determined that five unigenes (Unigene0003933, Unigene0034878, Unigene0047749, Unigene0060157, and Unigene0063855) exhibited 50%-60% identity to Barbarea vulgaris UGT73C10-13, which catalyzed the 3-O-glucosylation of oleanolic acid and hederagenin  (Fig. 8). This suggests that these unigenes may encode enzymes important for catalyzing oleanolic acid and hederagenin glucosylation at the C-3 position. We also used qRT-PCR to confirm that these five UGTs were expressed at high levels in roots, with lower expression levels in stems and leaves, consistent with observed oleanolic acid distributions in these tissues. This suggests that the 3-O-glucosylation of oleanolic acid occurs in roots. Phylogenetic analyses suggested that these five UGTs were grouped in the UGT73 family and were most closely related to Barbarea vulgaris UGT73C10 (Fig. 9). We therefore identified these UGTs as candidates involved in the glycosylation of oleanolic acid at the C-3 position, although future functional assays will be necessary to confirm this result.
Assessment of the subcellular localization of three CYP450:GFP fusion proteins
Almost all CYP450s are membrane-associated proteins that localize to the ER, with relatively few localizing to chloroplasts and mitochondria . As detailed above, all 150 of the A. elata CYP450s identified in this study were predicted to localize to the ER. To confirm this prediction, we conducted PEG-mediated transient expression of Arabidopsis protoplasts co-transformed with CYP450-GFP reporter proteins and DEP2-RFP (a ER-targeted marker) , with CYP716A295, CYP716A296, and CYP72A763 all being selected for this subcellular localization analysis. Consistent with what was observed in Arabidopsis protoplasts expressing DEP2-RFP, we found CYP450-GFP proteins to appear as a reticular ribbon upon microscopic examination with overlap between the CYP716A295, CYP716A296, and CYP72A763 GFP fluorescent proteins and DEP2-RFP fluorescence proteins. We therefore concluded that CYP716A295, CYP716A296, and CYP72A763 were enzymes bound to the ER membrane, consistent with the predictions made by Cell-PLoc (Fig. 10).