Quantitative analysis of A. elata aralosides
The saponins present within A. elata primarily contain oleanolic acid and hederagenin aglycone , with significant variation in total and monomer araloside accumulation among tissues in these plants. Specifically, the leaves of A. elata have been found to contain the largest quantity of these saponins, with progressively lower levels found in root and stem tissues. The roots of these plants contained higher oleanolic acid levels than did the other tested tissues, whereas hederagenin levels were highest in leaves relative to samples roots and stems (Table 1; Additional file 1: Figure S1). Two selected oleanane-type saponins (chikusetsusaponin IV and araloside X) and one hederagenin-type saponin (araloside VII) were also detected in these A. elata samples, with root chikusetsusaponin IV levels being fairly high whereas they were minimal in leaf and stem tissues. In contrast, we detect aralosides VII, and X showed a high level in the leaves of these plants, suggesting that different UGTs were responsible for their generation. These tissue-specific saponin distribution results offer significant value as a reference source when identifying those CYP450s and UGTs involved in araloside production in A. elata.
De novo A. elata sequence assembly
In order to identify genes pertaining to saponin biosynthesis in these A. elata plants, we next employed an Illumina HiSeq 4000 platform to sequence the total RNA transcriptome in root, leaf, and stem tissue samples. In total this approach yielded 448,112,618 reads that were assembled into 82,238 contigs, with the longest being 16,016 bp, and with an average contig length of 1,058 bp. We were then able to assemble these contigs into 66,713 unigenes with a 1,846 bp N50 length (Table 2). Next, these unigenes were annotated with the KEGG, UniProt, NCBI nonredundant nucleotide (Nt), and Nr databases via use of the BLASTN and BLASTX algorithms, leading to the annotation of 35,232 (52.81%) unigenes (Additional file 2: Figure S2). These transcriptome sequence data have been deposited in the NCBI Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) under the accession number PRJNA555256.
Enrichment of terpenoid backbone and triterpenoid biosynthetic pathways
Following the annotation of 7,291 A. elata unigenes with the KEGG database, we were able to assign unigenes to the terpenoid backbone and sesquiterpenoid/triterpenoid biosynthesis pathways, which contained the upstream MVA and MEP pathways and the 2,3-oxidosqualene biosynthesis pathway (Fig. 1A). In total, we mapped 79 unigenes to the terpenoid backbone biosynthesis pathway, while 47 were mapped to the sesquiterpenoid and triterpenoid biosynthesis pathways. The acetyl-CoA precursor is converted to IPP in the MVA pathway by 6 putative enzymes (AACT, HMGS, HMGR, MVK, PMVK, and MVD), with HMGR being essential in this pathway as it catalyzes the conversion of 3-hydroxymethyl-glutrylCoA into meralonic acid (Fig. 1A). A KEGG pathway analysis indicated that 14 unigenes encoding 6 enzymes were related to the MVA pathway. Of these, 5 unigenes were annotated as HMGR. However, in the plastid IPP is formed through the MEP pathway, which begins with pyruvate and glyceraldehyde-3-phosphate. A total of 22 unigenes encoding 9 enzymes (DXS, DXR, MEP-CT, COP-MEK, MECDPS, HMBPPS, HMBPPR, IDI, and GPS) were annotated as being involved in this pathway (Fig. 1A). Additionally, 28 unigenes encoding 4 putative enzymes (FPS, SS, SE, and bAS) were found to be associated with carbocyclic biosynthesis (Additional file 3: Table S1). For the majority of these genes, we were able to map >1 unigene to a given gene or gene family (Additional file 3: Table S1), suggesting that these sequences may correspond to multiple copies or partial fragments of a given gene . In each of enzymatic steps, those unigenes that were showed a high identity with functionally characterized genes and differentially expressed in different tissues were next selected and arranged into a heat map. As shown in Fig. 1A, significant differential expression of these genes was evident in different A.elata tissues, interestingly, all genes involved in the MEP pathway exhibited more robust expression in leaves relative to roots or stems, whereas genes involved in the MVA pathway other than HMGS, MVK, and SS were also expressed at higher levels in leaves. To confirm that this RNA-seq was accurate, 10 of these 19 genes were selected and their expression in a range of tissues was assessed via qRT-PCR. within line with our transcriptomic data, the selected genes were expressed at higher levels in leaves, with the exception of HMGS which was expressed at higher levels in roots (Fig. 1B). This indicates that key triterpenoid skeleton biosynthesis reactions mainly occurred in leaves, explaining why leaves contained higher saponin concentrations (Table 1).
A. elata CYP450 identification and phylogenetic analysis
Through our transcriptome analysis, we were able to identify 254 CYP450 genes in these A. elata samples. This number of total CYP450 unigenes (254) was lower than the number identified in a study of Arabidopsis thaliana (272) and Panax. ginseng (484). To obtain more comprehensive understanding for CYP450 gene family, we classified 150 CYP450s encoding proteins more than 300 amino acid by alignment with CYP450 database, using allelic, subfamily, and family variant cutoff values of 97%, 55%, and 40%, respectively . Prof. David Nelson named these CYP450s according to reference sequences within a carefully-annotated CYP450 reference database. We classified these CYP450s into 9 clades, 40 families and 75 subfamilies, with 53% being A-type CYP450s and 47% being non-A-type CYP450s (Additional file 4: Table S2). The CYP71 clan represented all A-type CYP450s, containing 79 genes belonging to 17 families (CYP71, CYP73, CYP75-CYP78, CYP80-CYP82, CYP84, CYP89, CYP92, CYP98, CYP701, CYP706, and CYP712, CYP736).
To further explore the functional roles and evolutionary relationships for these A. elata CYP450s, 150 CYP540s along with 57 CYP450s from A. thaliana (37) and P. ginseng (20). were used to generate NJ phylogenetic trees for A-type (Fig.2A) and non-A-type
(Fig. 2B) CYP450s, separately. As shown in Fig. 2A, the 79 CYP71 members that were assigned to 17 families forming a single clade along with 28 representative CYP450s. The phylogenetic tree for non-A-type CYP450s separated them into 8 clades, including three single-family clans (CYP51, CYP710 and CYP711) and five multifamily clans (CYP72, CYP74, CYP85-86 and CYP97), CYP85, CYP86, and CYP72 were the largest three clan, containing 23, 20, and 17 CYP450s, respectively, while CYP51, CYP710, and CYP711, containing only a single member each (Fig. 2B).
A. elata UGT identification and phylogenetic analysis
We annotated 122 unigenes in the A. elata transcriptome as UGTs, with these encoding 77-704 amino acid long proteins. We then omitted those unigenes that encoded proteins shorter than 300 amino acids, yielding 92 UGTs that underwent blastp functional characterization, with those having >40% protein sequence similarity being incorporated into same family (Additional file 5: Table S3). We separated 87 UGTs into 22 UGT families, while 5 UGTs (Unigene0019368; Unigene0025405; Unigene0034502; Unigene0034503; Unigene0060195) had a deduced amino acid sequences < 40% identical to representative sequences and thus could not be assigned. The UGT73 family was the largest family with 16 genes, with the next largest being the UGT 85 family with 13 members.
We aligned A. elata UGT amino acid sequences with those of other functionally characterized plant UGTs from Arabidopsis, Panax.Ginseng, Medicago truncatula, Oryza sativa, Zea mays, Cicer arietinum, and Crocus sativus in order to construct a phylogenetic tree(Additional file 6: Table S4). These A. elata UGTs were phylogenetically separated into 16 groups, including 14 conserved groups (A-N) that were identified in Arabidopsis and two novel groups (O and P; Fig. 3) identified in maize [25, 26]. No A. elata UGTs were incorporated into group Q. There were 16 UGT73 family members in group D which was the largest A. elata UGT group, while group C, group I and group N contained only one UGT each.
A. elata CYP450 and UGT gene tissue-specific expression patterns
In order to understand the tissue-specific expression of CYP450s and UGTs in A. elata, we next conducted an analysis of their expression in the three different tissues that had been subjected to RNA-seq analysis. Of these 150 CYP450s and 92 UGTs, we found that 132 (87.42%) and 78 (84.78%) exhibited patterns of differential expression across these tissues, respectively. A hierarchical clustering analysis was used to assess CYP450 and UGT coexpression patterns in these different analyses, with an expression profile heat map for these genes being constructed according to their RPKM normalized expression values. This clustering analysis led to the assignment of 150 CYP450s and 92 UGTs to six clustered (C1-C6; Fig. 4A and B). The CYP450s that were most highly expressed in roots (31 genes), leaves (29 genes), and stems (15 genes) were grouped into clusters C2, C5, and C1, respectively. In addition, those CYP450s in clusters C3 (31 genes), C4 (8 genes), and C6 (36 genes) were expressed at the lowest levels in leaf, stem, and root tissues, respectively. An assessment of the A. elata in different tissues indicated that CYP450s were expressed at high levels in leaves (48.67 %) and roots (41.33 %) (Fig.4A). However, a high proportion (63.04%) of these UGTs were expressed at high levels in leaves in relative to roots (19.57 %) and stems (17.39 %) (Fig.4B). This suggests that the modification of UGTs primarily occurs in leaves, explaining a greater variety of secondary metabolites in leaves.
A qRT-PCR approach was further used to validate these transcriptomic findings, with the expression of 12 randomly selected CYP450s being quantified (Additional file 7: Figure S3). These CYP450 expression profiles were consistent with the RPKM values, confirming the validity of our RNA-seq data. Previous research has shown that members of the CYP72A and CYP716A subfamilies are the primary CYP450s involved in pentacyclic triterpenoid saponin biosynthesis (Additional file 8: Table S5). As such, we next specifically focused on the expression patterns of the 3 CYP716A and 8 CYP72A genes identified in the A. elata transcriptome. The qRT-PCR profiles for these genes revealed that two CYP716A genes (CYP716A295 and CYP716A296) and two CYP72A genes (CYP72A762 and CYP72A764) expressed at a high levels in root tissues relative to stems and leaves. Expression of these four genes was consistent with measured oleanolic acid contents. In contrast, CYP72A759 , CYP72A763, and CYP72A776 were expressed at higher levels in leaves and at lower levels in stems and roots and the expression pattern of these two genes were was consistent with hederagenin contents (Fig. 5). These results provided a reference for selection of CYP450 candidates related to triterpenoid saponin biosynthesis.
Considering that the members of the UGT73 family mainly catalyzed glycosylation of oleanolic acid and hederagenin at the C-3 and C-28 sites (Additional file 9: Table S6), the tissue-specific expression of 16 UGT73 members identified in this study were analyzed via qRT-PCR. As shown in Fig. 6, 15 out of these 16 UGT73 members exhibited significantly different expression in three tissues. Of these, 7 UGTs (Unigene0000487, Unigene0006936, Unigene0016738, Unigene0043352, Unigene0045141, Unigene0045143, and Unigene0045144) were most highly expressed in leaves, while six UGTs (Unigene0003933, Unigene0033039, Unigene0034878, Unigene0047749, Unigene0060157, and Unigene0063855) were mostly expressed in roots with progressively lower levels in stems and leaves. These findings indicate that most UGT73 members were expressed well in A. elata roots and leaves.
Identification of candidate CYP450s involved in triterpenoid biosynthesis
Hederagenin aglycone and oleanolic acid are the major sapogenins in A. elata, and as such we specifically assessed CYP450s related to the synthesis of these sapogenins. To date, a total of 36 CYP450s have been found to play roles in triterpenoid biosynthesis (Fig. 7; Additional file 8: Table S5). The CYP716A and CYP72A subfamilies are the primary CYP450 gene families involved in pentacyclic triterpenoid saponin diversification, with the CYP716A family being the largest multifunctional C28-oxidase family involved in such oleanane-type triterpenoid saponins biosynthesis [16, 27, 28] (Fig. 8). We were able to identify 3 CYP716A genes (CYP716A295, CYP716A296, and CYP716A306) and 8 CYP72A genes (CYP72A759-764, CYP72A776-777) in the A. elata transcriptome. In order to identify the most relevant unigenes involved in pentacyclic triterpenoid saponin biosynthesis for further analysis, we conducted BLASTx searches that compared these A. elata CYP450s to those 36 CYP450s known to be involved in triterpenoid biosynthesis. This analysis revealed that the A. elata CYP716A295 and CYP716A296 exhibited 93.97% and 94.39% sequence identity with P. ginseng CYP716A52v2 respectively, which is a β-amyrin 28-oxidase enzyme involved in oleanolic acid production  (Fig. 8). Moreover, CYP716A295 and CYP716A296 were expressed at higher levels in roots relative to stems and leaves, in line with oleanolic acid contents (Fig. 5, Table 1). As such, we selected CYP716A295 and CYP716A296 as the best candidate CYP450s likely to be involved in oleanolic acid biosynthesis in A. elata. Two CYP450s (CYP72A397 and CYP72A68v2) have thus far been identified as oleanolic acid 23-oxidases, catalyzing oleanolic acid oxidation into hederagenin [30, 31] (Fig. 8). In the present study, we observed higher expression of CYP72A759, CYP72A763, and CYP72A776 in leaf tissues, with progressively lower levels in stems and roots, consistent with the observed hederagenin content distribution (Fig. 6). A BLASTx analysis further indicated that CYP72A763 and CYP72A776 encoded a protein with 54.92% and 70% identity to CYP72A397, respectively, which was a oleanolic acid C-23 hydrogenase. Given these results, we further selected CYP72A763 and CYP72A776 as the CYP450 most likely to be involved in hederagenin biosynthesis, although further functional validation will be necessary.
Phylogenetic analyses revealed CYP716A295 and CYP716A296 to be grouped in the CYP716A subfamily and to be most closely related to Maesa lanceolata CYP716A75 and P. ginseng CYP716A52v2, respectively, both of which encode β-amyrin 28-oxidase enzymes involved in oleanolic acid production [29, 32]. CYP72A763 and CYP72A776 clustered in the CYP72A group and was most closely related to M. truncatula CYP72A68v2 and Kalopanax septemlobus CYP72A397, respectively, both of which encode β-amyrin 23-hydroxylase enzymes [30, 31] (Fig. 7).
Identification of candidate UGTs involved in triterpenoid biosynthesis
In order to determine which UGTs were involved in araloside glycosylation, blastp was used to compare 92 A. elata UGTs to16 functionally characterized UGTs. We determined that five unigenes (Unigene0003933, Unigene0034878, Unigene0047749, Unigene0060157, and Unigene0063855) exhibited 50%-60% identity to Barbarea vulgaris UGT73C10-13, which catalyzed the 3-O-glucosylation of oleanolic acid and hederagenin  (Fig. 8). This suggests that these unigenes may encode enzymes important for catalyzing oleanolic acid hederagenin glucuronosylation at the C-3 position. We also used qRT-PCR to confirm that these five UGTs were expressed at high levels in roots, with lower expression in stems and leaves, consistent with observed oleanolic acid distributions in these tissues. This suggests that the 3-O-glucosylation of oleanolic acid occurs in roots. Phylogenetic analyses suggested that these five UGTs were grouped in the UGT73 family and were most closely related to Barbarea vulgaris UGT73C10 (Fig. 9). Were therefore identified these UGTs as candidates involved in glycosylating oleanolic acid at the C-3 position, although future functional assays will be necessary to confirm this result.
Assessment of the subcellular localization of three CYP450:GFP fusion proteins
Almost all CYP450s are membrane-associated proteins that localize to the ER, with relatively few localizing to chloroplasts and mitochondria . As detailed above, all 150 of the A. elata CYP450s identified in this study were predicted to localize to the ER. To confirm this prediction, we therefore conducted the PEG-mediated transient expression of Arabidopsis protoplasts co-transformed with CYP450-GFP reporter proteins and DEP2-RFP (a ER-targeted marker) , with CYP716A295, CYP716A296, and CYP72A763 all being selected for this subcellular localization analysis. Consistent with what was observed in Arabidopsis protoplasts expressing DEP2-RFP, we found CYP450-GFPe proteins to appear as a reticular ribbon upon microscopic examination with overlap between the CYP716A295, CYP716A296, and CYP72A763 GFP fluorescent proteins and DEP2-RFP fluorescence. We therefore concluded that CYP716A295, CYP716A296, and CYP72A763 were enzymes bound to the ER membrane, consistent with the predictions made by Cell-PLoc (Fig. 10).