SMRT- and Illumina‑based RNA‑seq Analyses Unveil Triterpenoid Saponin Biosynthesis and Related Key Genes in Platycodon Grandiflorus (Jacq.) A. DC.


 Background: Platycodon grandiflorus, a traditional Chinese medicine, contains considerable triterpene saponins with broad pharmacological activities. To date, information on the molecular mechanism of triterpenoid saponin biosynthesis in P. grandiflorus is limited. Here, single-molecule real-time (SMRT) and next-generation sequencing technologies were combined to comprehensively analyse the transcriptome and unveil triterpenoid saponin biosynthesis in P. grandiflorus.Results: We quantified four saponin monomers in P. grandiflorus, and found that the total content of the four saponins was the highest in the roots and the lowest in the stems and leaves. A total of 173,354 non-redundant transcripts generated from the PacBio platform were successfully annotated to seven functional databases, among which 1,765 transcripts were aligned to the "metabolism of terpenoids and polyketides" pathway in the KEGG database. Three full-length transcripts of β-amyrin synthase (β-AS), the key synthase of the β-amyrin, were identified. Furthermore, a total of 132,610 clean reads of BGISEQ sequences were utilised to explore key genes related to the triterpenoid saponin biosynthetic pathway in P. grandiflorus, and 96 differentially expressed genes (DEGs) involved were selected as candidates. Notably, 9 of the 96 DEGs showed the highest expression in the roots, which were considered key genes for synthesising triterpenoid saponins in P. grandiflorus. Furthermore, 3,469 genes encoding transcription factors (TFs) were identified and classified into 57 TF families, including MYB, bHLH, mTERF, and AP2-EREBP. The expression levels of genes were verified by quantitative real-time PCR.Conclusions: Our reliable transcriptome data provide valuable information on the related biosynthesis pathway and may provide new insights into the molecular mechanisms of triterpenoid saponin biosynthesis in P. grandiflorus.

Increasing investigations have focused on the transcriptome and genome levels with the understanding of plant secondary metabolic pathways. Concomitantly, next-generation sequencing (NGS) technology with high e cacy and low cost is widely used to investigate putative genes and construct genome and transcriptome resources [20][21]. RNAseq has been applied diffusely to identify gene function related to secondary metabolites and determine differential gene expression. Based on short-read RNA-seq, studies on the P. grandi orus transcriptome have been conducted [22]. Nevertheless, because of the short sequence reads of NGS, incompletely assembled transcripts may inevitably occur, as well as base mis-assembly and incorrect annotation. Fortunately, with the process of sequencing technology, singlemolecule real-time (SMRT) technology using the PacBio Iso-Seq protocol can compensate for the lack of NGS and has become a popular third-generation sequencing platform [23]. However, SMRT technology still has a high error rate, recti ed through short-read data [24,25]. Based on this, it provides an e cient combined approach to obtain full-length transcript sequences of diversi ed plants [26][27][28].
Here, the NGS and SMRT sequencing technologies were combined to characterise the comprehensive transcriptome of the roots, stems, and leaves (three duplicates) of P. grandi oras. Moreover, we also determined the contents of four saponin monomers to con rm the main accumulation organs of triterpenoid saponins. Comprehensive analysis was conducted to obtain key genes related to triterpenoid saponin synthesis in P. grandi orus. The reliability of the transcriptome data was veri ed using real-time quantitative PCR (qRT-PCR). The experiment may provide valuable information on the biosynthesis of triterpenoid saponins in P. grandi orus.

Determination of contents of saponin monomers
The raw dried extracts obtained from the roots, stems, and leaves of P. grandi orus were measured using high performance liquid chromatography with evaporative light scattering detection (HPLC-ELSD). The contents of four saponin monomers in three organs with three duplicates were expressed as the mean with standard deviation (Fig. 1). The contents of platycodin D, platycodin D3, and polygalacin D were higher in the root than in aerial parts. In addition, the sum of mean saponin contents in the root was also higher than that in the stem and leaf.

P. grandi orus transcriptome analysis via RNA-Seq and PacBio Iso-Seq
To obtain a comprehensive transcriptome pro le of P. grandi orus, BGISEQ-500 and PacBio Sequel platforms were combined. Nine RNA samples from different organs (stem, leaf, and root, each in triplicate) were sequenced separately using a BGISEQ-500 HiSeq platform, with total raw reads in three tissues produced (145.49 M, 143.74 M, and 140.15 M from stems, leaves, and roots, respectively). After data ltering, a total of 132,610 clean reads were obtained, and the number of total clean reads from stems, leaves, and roots was 131.57 M, 130.47 M, 127.65 M, respectively (Fig. 2a, Table S1). To obtain comprehensive coverage of the P. grandi orus transcriptome, a full-length transcriptome was generated using full-length cDNAs from pooled poly (A) RNA of three organs normalised and subjected to SMRT sequencing using the PacBio Sequel platform. Altogether, 717,560 polymerase reads were generated, and 23,173,163 subreads (32.64 Gb) were obtained from one SMRT cell after ltering, with a mean length of 1,408.61 bp and N50 length of 1,992 bp. A total of 685,102 reads of insert (ROIs) were obtained after data processing, with a mean length of 2,219 bp, mean read quality of 0.95, and 28 passes. Of these, 547,395 ROIs containing two primers and poly (A) tails were classi ed as full-length non-chimeric (FLNC) reads with a mean length of 1,341 bp (Fig. S1a). Furthermore, FLNC reads were clustered and corrected using the Interactive Clustering and Error Correction (ICE) algorithm and Quiver program. As a result, 180,602 full-length consensus isoforms were obtained, with a mean length of 1,449 bp and quality of 0.98 (Fig. S1b), of which, the high-quality isoforms were further combined, and redundant sequences were removed using CD-Hit. A total of 173,354 non-redundant transcripts were identi ed (Fig. 2b), with lengths ranging from 199 bp to 15,360 bp, N50 of 1,924 bp, N90 of 818 bp, and a GC content of 42.49% (Table 1). To explore the main biological processes in P. grandi orus, 111,280 transcripts were mapped to the KEGG database and classi ed into ve categories, including cellular processes, environmental information processing, genetic information processing, metabolism, and organismal systems (Fig. 3a, Table S2), where "transport and catabolism," "signal transduction," "translation," "global and overview maps," and "environmental adaptation" were the most abundant subcategories, respectively. Notably, 1,765 transcripts of P. grandi orus were enriched in "metabolism of terpenoids and polyketides"; speci cally, 493 transcripts were associated with "terpenoid backbone biosynthesis" (ko00900), followed by 268 transcripts and 106 transcripts involved in "sesquiterpenoid and triterpenoid biosynthesis" (ko00909) and "diterpenoid biosynthesis" (ko00904) pathways, respectively.
Based on the Nr annotation results, 77,618 transcripts were assigned to the GO database and classi ed into three functional categories, biological process, molecular function, and cellular component, using the Blast2GO program (Fig. 3c, Table S4). The cellular component was the largest cluster comprising 164,035 transcripts, whose major terms were "cell" and "cell part," followed by biological process containing 111,975 transcripts and molecular function with 91,034 transcripts. Under the biological process and molecular function categories, "metabolic process" and "catalytic activity" were the most abundant terms, respectively, indicating that the study might provide insight into the genes involved in secondary metabolite synthetic pathways.  (Fig. 4a). Moreover, among these DEGs, 14,029 common DEGs were identi ed in three tissues that might be vital for metabolism among the three organs. Comparing DEGs in diverse groups resulted in 8,233 speci c DEGs in leaf vs root, followed by 5,700 in leaf vs stem and 3,009 in stem vs root, indicating that the difference between leaf and root was the greatest (Fig. 4b).
For a thorough exploration of differential transcripts, KEGG and GO enrichment analyses were performed using the Phyper function. In KEGG analysis, 15,894 and 23,334 DEGs were identi ed in stem vs root and leaf vs root comparisons and mapped to 137 and 138 pathways, respectively, where the most enriched pathway for either comparison was "phenylpropanoid biosynthesis" (Fig. 4c, d).
There were 60,906 DEGs in the leaf vs stem comparison, mainly associated with "plant-pathogen interaction" (Fig. S2a). Moreover, 137 and 121 DEGs were identi ed and separately matched to "terpenoid backbone biosynthesis" and "sesquiterpenoid and triterpenoid biosynthesis", respectively, in stems compared with roots. There were 219 and 154 DEGs annotated to the above pathways in the leaf vs root comparison. The most expansive GO term among the different comparison groups was "cellular component" (Fig. S2b, c, d).

Identi cation of transcription factors
Transcription factors (TFs) play vital roles in the regulation of secondary metabolic processes. A total of 3,469 genes encoding TFs were identi ed through alignment with the Plant TF database and classi ed into 57 TF families, including 743 upregulated and 1,675 downregulated genes in the leaf vs root comparison and 878 upregulated and 1,168 downregulated genes in the stem vs root comparison (Table 3 and Table S5). A total of 313 genes encoded the mTERF family, accounting for the largest proportion, followed by 281, 262, and 259 genes encoding the MYB, bHLH, and AP2-EREBP families. We further annotated TFs with the KEGG functional database and obtained eight FHA TFs involved in terpenoid and polyketide metabolism; 14 zf-HD TFs, 5 MYB TFs, 2 RWP-RK TFs, and 1 C2C2-CO-like TF participated in the biosynthesis of other secondary metabolites. Among them, 4 zf-HD TFs, 2 RWP-RK TFs, and 1 MYB TF showed the highest expression in the roots, which are thought to be closely related to triterpene synthesis (Fig. S3, Table S6). Putative genes involved in triterpenoid saponin biosynthesis Triterpenoid saponins are the primary curative components of P. grandi orus. As previously described, many known enzymes participate in the synthesis of triterpenoid saponins, such as AACT, HMGS, DXS, and SQS. Although the biosynthetic pathway of platycodins has been widely studied, exhaustive information on the genes encoding relevant key enzymes is still limited. Platycodins are oleanane-type triterpenoid saponins synthesised from oleanolic acid through several modi cations by CYP450 (cytochrome P450 monooxygenase) and GTs. Subsequent modi cation pathways catalytically produce various platycodins (Fig. 5a).
Divergent gene expression has a vital in uence on platycodin biosynthesis. After screening of FPKM > 1, a total of 113 genes encoding 18 key enzymes, 96 of which were further identi ed as DEGs, were considered as putative genes related to triterpenoid saponin synthesis in P. grandi orus (Table 4 and Table S7).
DEGs distributed in MVA and MEP upstream pathways were greater than those in downstream pathways. In our transcriptome dataset, 22 DEGs encoding six enzymes relevant to the MVA pathway were identi ed, including nine PgAACT genes, eight PgHMGR genes, one PgHMGR gene, three PgMK genes, and one PgMDC gene. Moreover, 34 DEGs encoding seven enzymes involved in the MEP pathway were con rmed, including nine PgDXS genes, one PgDXR gene, two PgMCT genes, two PgCMK genes, seven PgMDS genes, six PgHDS genes, and seven PgHDR genes. The MEP pathway demonstrated that nearly all DEGs showed the highest expression levels in leaves and the least in roots. In comparison with the MEP pathway, highly expressed DEGs were concentrated in the root and stem in the MVA pathway.
Research has suggested that the biosynthesis of triterpenoids and sesquiterpenoids occurs via the MVA pathway, whereas monoterpenoids and diterpenoids are involved in the MEP pathway. Additionally, based on our transcriptome data, only one transcript for genes encoding PgHMGS and PgMDC identi ed as DEGs was obtained, and the expression level was the highest in the stem, followed by the root or leaf. For DEGs of PgHMGR, the expression levels of most transcripts in the stem and leaf were higher than those in the root, yet PgHMGR1 and PgHMGR3 showed their highest expression in the roots. As genes encoding key enzymes of the MVA pathway, PgAACT and PgMK displayed higher expression in stems or roots than in leaves. Among them, PgAACT2, PgAACT3, and PgMK1 all had the highest expression levels in the roots, especially for PgAACT3, which exhibited much higher expression than other PgAACTs, presumably vital for platycodin biosynthesis.
Most of the 13 PgFPPS DEGs were expressed at higher levels in leaves, except for PgFPPS9 and PgFPPS13 whose expression was the highest in the root and stem, respectively. IDI is a crucial enzyme that catalyses the conversion between the common precursors IPP and DMAPP. There were 11 PgIDIs found to be DEGs among the three tissues, and most were highly expressed in the stem and root; notably, PgIDI6 and PgIDI7 displayed the highest expression in the roots. Pertaining to "sesquiterpenoid and triterpenoid biosynthesis", 14 DEGs were found encoding PgSQS, PgSQE, and Pgβ-AS. Of the eight candidate genes of PgSQS, only PgSQS1 was identi ed as the DEG with the highest expression in the stems.
Additionally, differential transcripts of PgSQE and Pgβ-AS had similar expression trends, with the highest expression in the leaves. β-AS is a signi cant modi cation enzyme for 2,3-oxidized squalene for oleanolic acid formation, and Pgβ-AS1 was expressed at the highest level in the roots. Statistically, there were nine genes with the highest expression in the root, which were considered key genes for triterpenoid saponin biosynthesis in P. grandi orus (Fig. 5b). Characterisation of the β-amyrin synthase β-AS, belonging to the oxidosqualene cyclase (OSC) family, plays a vital role in regulating oleanane-type triterpenoid saponin biosynthesis. β-AS catalyses the conversion of 2,3-oxidosqualene into β-amyrin, the precursor of oleic acid. To date, ve types of OSCs have been isolated and identi ed from ginseng, including β-AS, dammarenediol synthase (DS), cycloartenol synthase (CAS), lupeol synthase (LUS), and lanosterol synthase (LS). From the screening, three full-length genes encoding putative β-AS proteins (isoform_10598, isoform_71380, and isoform 102853) in P. grandi orus were identi ed and characterised using the transcript library.
Domain analysis revealed that three β-AS isoforms contained one catalytic acid site and a conserved squalene cyclase domain belonging to the ISOPREN-C2-like superfamily. The ISOPREN-C2-like superfamily contains class II terpene cyclases, including squalene cyclase and 2,3-oxidosqualene cyclase (OSC) (Fig. S5). Ultimately, we selected two complete sequences encoding β-AS from ginseng (OSCPNY1 and OSCPNY2) to perform homology analysis with three isoforms. The comparison of ve amino acid sequences indicated that their consistency was 74.04%, and three conserved regions (motif QW, DCTAE, and MWCYCR) belonging to β-AS were also identi ed (Fig. 6d) [29][30]. The complete alignment result is shown in Fig. S6. The isoform_10598 has a mutation site, "L," in the DCTAE motif, and isoform_71380 has two mutation sites, "L" and "F," in the MWCYCR motif. Three isoforms were further selected to create 3D constructure models based on human OSC complexed with lanosterol (PDB ID: 1w6k.1.A) (Fig. 6a, b, c) [31].
The phylogenetic tree was divided into four branches, LUS, LS, CAS, and β-AS, where the typical genes were MtLUS, VrLS, CrCAS and GaβAS from Medicago truncatula, Vigna radiata var. radiata, C. reinhardtii, and G. arboreum, respectively (Fig. 7, Table S8). The results revealed that three isoforms were homologous to β-AS from other plants, consistent with our previous structure analysis and indicating that the three isoforms were β-AS genes from P. grandi orus. Additionally, isoform_71380 clustered together with β-AS from G. arboreum, and isoform_102853 clustered together with β-AS from Panax ginseng. These two sub-branches and isoform_10598 were further clustered.
qRT-PCR validation of DEGs from the RNA-seq analysis To validate the reliability of transcriptome analysis data, 20 DEGs related to triterpenoid saponin biosynthesis were veri ed using qRT-PCR with β-actin as the reference; all primers used are listed in Table S9. The results of RNA-seq and qRT-PCR revealed a high-ranked consistency, indicating that the RNA-seq data are dependable and accurate (Fig. 8).
More importantly, PgAACT2, PgHMGR1, and Pgβ-AS1 displayed higher expression in roots than the tested genes.

Discussion
As an important medicinal plant, P. grandi orus is extensively used worldwide for its therapeutic effects. Many studies have focused on the biological effects of P. grandi orus, especially of the inherent triterpenoid saponins. The contents of four saponin monomers, namely, platycoside E, platycodin D, platycodin D 3 , and polygalacin D, have been determined previously, with the contents of platycodin D, platycodin D 3 , and polygalacin D being the highest in the roots. Notably, platycoside E, platycodin D, and platycodin D 3 have the same nuclear structure as platycodigenin, and polygalacin D is enzymatically modi ed from polygalacic acid.
Platycoside E and platycodin D 3 are precursors of platycodin D with the conversion occurring under β-D-glucosidase hydrolysis, and the three components above belong to one synthetic pathway. Studies have recently focused on the genome level [32]; Previously, research on genes involved in platycodin biosynthesis was carried out based on RNA-seq.
Here, SMRT sequencing and RNA-seq of three tissues (root, stem, and leaf) were combined to generate a more comprehensive transcriptome of P. XXXrandi oras. A total of 685,102 CCS reads were obtained from PacBio ISO-seq, yielding 173,354 non-redundant transcripts (N50 = 2,517 bp) after correction using RNA-sEq. The BUSCO database was used to evaluate transcript quality, compared with conserved genes, indicating the integrity of transcriptome assembly. The combination of RNA-Seq and ISO-seq could provide a thorough understanding of triterpenoid saponin synthesis in P. grandi orus at the molecular level.
In this study, clean reads obtained from RNA-seq were further identi ed based on a Poisson distribution to generate DEGs. According to the KEGG annotation results, the differential genes were classi ed into various biological pathways. A total of 477 genes were mapped to the "metabolism of terpenoids and polyketides" pathway. We further screened the DEGs related to triterpenoid saponin biosynthesis and obtained 96 DEGs encoding 18 related key enzymes. Among all the DEGs, 34 were found to be related to the MEP pathway and 13 to the MVA pathway. By further analysing these genes, we obtained nine DEGs encoding AACT, HMGR, MK, IDI, FPPS, and β-AS expressed the highest in the roots. Four of these nine genes in roots were veri ed using qRT-PCR, where PgAACT2, PgHMGR1, and Pgβ-AS1 also showed the highest expression in the roots. Previous chemical experiments have con rmed that triterpenoid saponins of P. grandi orus primarily accumulate in the root. Therefore, we speculate that the nine genes expressed the highest in the roots are involved in triterpenoid saponin biosynthesis in P. grandi orus. The genes identi ed above are candidate genes for triterpenoid saponin biosynthesis in P. grandi orus, and the mechanism of their action in triterpenoid saponin biosynthesis regulation warrants further investigation.
Previous research also analysed the expression of one AACT and β-AS genes from P. grandi orus in different organs, including the roots, young stems, leaves, and owers, indicating that the unigene encoding β-AS was expressed at a much higher level in the leaves than in the roots, young stems, and owers. In contrast, the unigene encoding β-AS showed the highest expression in the roots, consistent with our study [22]. Furthermore, tissue-speci c expression pro les in different P. grandi orus tissues (leaf, root, stem, seed, petal, pistil, sepal, and stamen) were also revealed, and of the 24 bAS genes, four genes showed signi cantly higher expression in the roots than in other tissues [33]. β-amyrin is the oleanane-type backbone and the precursor of oleanolic acid, with synthesis controlled by β-AS [34,35]. One EsBAS has been isolated and cloned from the leaves of E. senticosus, and EsBAS was functionally characterised via heterologous expression in yeast and tobacco [36]. In addition, one PlgOSC1 gene encoding β-AS from P. grandi orus was cloned via RACE-PCR and then successfully expressed in heterologous yeast cells [37]. Methylome data of β-AS in P. grandi orus have also been presented in control and methyl jasmonate (MJ) treatments. The relative dominance of hypo-CG-DMCs in the MJ treatment was detected only at 48 h with bAS genes; the hypomethylation of bAS genes may affect platycoside biosynthesis [33]. We screened out three full-length sequences from the differential genes encoding Pgβ-AS. The length of the three full-length sequences was approximately 2,200 bp. Three transcripts were con rmed to encode β-AS after bioinformatics and phylogenetic analyses.
TFs, as sequence-speci c DNA-binding proteins, have a drastic in uence on plant metabolism and regulation [38]. In our study, a total of 3,469 genes were classi ed into 57 TF families, with the largest proportion of the mTERF family (313 genes). mTERFs are key regulators of organellar gene expression in mitochondria as well as chloroplasts and implicated in all organellar gene expression steps ranging from transcription modulation to tRNA maturation, hence translation [39]. Previous studies have demonstrated that VvMYB5b TF overexpression in tomato can upregulate terpenoid metabolism [40]. In our study, 281 MYB TFs were identi ed, ve of which participate in the biosynthesis of other secondary metabolites, after classi cation using the KEGG database. More notably, one of the ve MYB TFs had the highest expression in roots, which is speculated to play an essential role in triterpenoid saponin synthesis in P. grandi orus. Likewise, functional studies of some other TF families have also been conducted. Research has shown that overexpression of two bHLH TFs in Medicago truncatula hairy roots led to increased transcription levels of known triterpenoid saponin biosynthesis genes and dramatically increased the accumulation of triterpene saponins [41]. The overexpression of AP2/ERF gene clusters in tobacco and hairy roots activated nicotine and terpenoid indole alkaloid pathway genes [42]. In our study, 262 candidate TFs were identi ed as belonging to the bHLH family. Through further differential expression analysis, 74 and 100 bHLH TFs were downregulated in the leaf vs root and stem vs root comparisons, respectively. The 259 AP2-EREBP family TFs were also determined, and 71 AP2-EREBP TFs were upregulated in roots compared with stems, and 120 TFs were upregulated in roots compared with leaves. These TFs upregulated in the root may play an essential role in triterpenoid saponin biosynthesis.

Conclusion
Transcriptome analysis of P. grandi orus was performed using PacBio Iso-Seq combined with RNA-SEq. A total of 173,354 full-length transcripts and 3,469 TFs were obtained, and 173,354 full-length transcripts were successfully annotated to seven databases to obtain an improved annotation. Three full-length genes were con rmed to encode β-AS, providing a basis for subsequent functional research. Among the 3,469 TFs, 281, 262, and 259 TFs of the MYB, bHLH, and AP2-EREBP families, respectively, were analysed. Furthermore, we screened DEGs involved in the triterpenoid saponin biosynthetic pathway based on RNA-seq, and nine DEGs with the highest expression in the root were identi ed, encoding key enzymes, including PgAACT, PgHMGR, PgMK, PgIDI, PgFPPS, and Pgβ-AS. This article reports on the integrated transcriptome sequencing analysis of P. grandi orus; moreover, the sequencing results were demonstrated to be reliable and accurate using qRT-PCR. The obtained transcript information may provide a varied and well-grounded candidate pool to study functional genes involved in secondary metabolite biosynthesis.

Methods
Plant materials and total RNA extraction P. grandi orus (Jacq.) A. DC. was collected from Tongcheng city (Anhui province, China) and separated into leaves, stems, roots with three replicates per organ. The samples were snap-frozen in liquid nitrogen after a quick rinse with sterile water and stored at -80 °C for subsequent sequencing and saponin analysis. According to the manufacturer's instructions, total RNA isolation was performed using the ethanol precipitation protocol and CTAB-PBIOZOL. The

Analysis of contents of saponin monomers
Dried stem, leaf, and root samples (with three replicates per organ) were ground into a powder and sifted using an 80 holes per inch sieve. Nine weighed powder samples (1.0 g) were extracted with 5 mL of 70% methanol using an ultrasonic method for 1.5 h (80 W, 40 Hz). After ultrasonic treatment, the samples were ltered through a 0.45-µm syringe lter for subsequent HPLC-ELSD analysis [43]. Finally, four saponin monomers (platycoside E, platycodin D, platycodin D 3 , and polygalacin D) were chromatographically separated on an Agilent Eclipse XDS-C18 (4.6 mm × 250 mm, 5 µm) column (Agilent Technologies) at 35 °C; the mobile phase was acetonitrile-water at a ow rate of 1.0 mL/min. ELSD was used as the detector with a gas ow rate of 1.6 L/min, a drift tube temperature of 85 °C, and a nebulisation temperature of 50 °C.

BGISEQ-500 RNA-Seq library construction and sequencing
For sequencing, mRNA with Poly (A) of three organs (three replicates) was enriched from total RNA using oligo (dT) magnetic beads [44][45][46]. The DNA probe was digested using DNase I after hybridisation with rRNA to obtain the puri ed RNA. RNA was then converted into short fragments using a fragmentation buffer. First-strand cDNA was synthesised using random N6 primers, followed by second-strand cDNA synthesis. The ends of double cDNA were repaired, 5′ ends were phosphorylated, and 3′ ends formed cohesive ends with A-Tailing. Then, cDNA was ligated to the sequencing adapters. The ligation products were ampli ed using PCR to build a cDNA library and sequenced on the BGISEQ-500 platform (Beijing Genomics Institute, Shenzhen, China). After RNA-seq, raw reads containing sequencing adapters with more than 5% unknown base or more than 20% low-quality base were removed using SOAPnuke software (version 1.5.2) to obtain clean reads [47]. Furthermore, clean reads were aligned to the PacBio reference sequence, taking the comparison ratio and the distribution of the reference sequence as conditions for further analysis.
PacBio Iso-seq library construction and sequencing According to the PacBio Sequencing protocol, a Clontech UMI base PCR cDNA Synthesis Kit (BGI-Shenzhen) was used to synthesise rst-strand cDNA. The CDS Primer was rst annealed to the polyA + tail of transcripts, followed by fulllength rst-strand synthesis with Reverse Transcriptase. Subsequently, double-stranded cDNA was produced by largescale PCR. The cDNAs were measured using a Qubit HS (Life Technologies, Carlsbad, CA, USA) and an Agilent 2100 Bioanalyzer (Agilent DNA 12000 Reagents; Agilent Technologies) and then used for sequencing using a PacBio Sequel sequencer (BGI-Shenzhen) with a Sequel Sequencing Kit 2.1 and a Sequel SMRT Cell 1M v2 Tray.
Iso-Seq data processing The raw data from the PacBio Sequel (Paci c Biosciences, Menlo Park, CA, USA) were processed by the SMRT analysis suite (version 2.3.0) [48]. ROIs were recognised from sub-read data, and a circular consensus sequence (CCS) was generated according to the condition that the minimum predicted consensus accuracy was 0.75. Then, 3′ and 5′ primers of CCS were detected and sequence-oriented by internal scripting. After CCSs shorter than 300 bp in length were ltered, the rest were classi ed as full-length (FL) and non-full-length sequences depending on whether the 3′ and 5′ ends and poly (A) tail were simultaneously observed. The FL reads were corrected with RNA-seq reads using the LSC software [24]. Subsequently, the FL reads were clustered to generate de novo consensus isoforms via the iterative clustering for error correction algorithm and then polished using the Quiver quality-aware algorithm [49]. Two strategies were employed to improve the accuracy of consensus transcripts. First, the Quiver algorithm was used for isoform recti cation to distinguish between high-quality (the expected Quiver accuracy ≥ 0.95) and low-quality isoforms. Second, the de novo consensus isoforms of high quality were merged to remove redundancy using CD-HIT obtaining nal unique isoform sequences [50].

Functional annotation
To derive more integral annotation information, the nal unique full-length isoforms were mapped to seven functional databases, including NCBI  [51,52]. Isoforms were annotated in the NT database using the Blastn software. GO annotation and classi cation were performed in the Blast2GO program based on the Nr annotation results [53].

Analysis of differentially expressed genes
Quantitative analysis of gene expression levels in different organs was performed using RSEM software (version 1.2.8). Brie y, clean reads obtained from RNA-seq were mapped onto reference full-length transcripts using the Bowtie2 software (version 2.2.5) [54]. Subsequently, the expression level of each sample was calculated using RSEM software (version 1.2.12), and the read counts were normalised using fragment per kilobase of transcript per million fragments mapped (FPKM) [55]. DEGseq2 (version 1.4.5) was used to screen differentially expressed genes (DEGs) with a Q value of ≤ 0.001 [56,57]. DEGs were mapped to GO and KEGG databases to obtain annotated information by Phyper based on a hypergeometric test for further enrichment and classi cation analyses. The P-values were corrected to Q-values with a threshold Q value of ≤ 0.05 using a Bonferroni correction.
Identi cation of β-amyrin synthase (β-AS) genes The full-length sequences encoding β-AS were con rmed after alignment of their amino acid sequences to the NCBI BLAST database (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and their conserved domains were identi ed using NCBI Online tools (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The β-AS genes were further selected to create the 3D constructure models using SWISS-MODEL (https://swissmodel.expasy.org/) and PyMOL software. To further determine the β-AS genes, a phylogenetic tree was constructed in MEGA7.0 software using the neighbour-joining method. A bootstrap of 1000 replications was also used to test the con dence levels, and the scale bar represented a 0.2 amino acid substitution per site.

Quantitative real-time PCR validation
To verify the expression of genes from combined sequencing platforms, 20 DEGs were arbitrarily selected for qRT-PCR validation. Reverse transcription of isolated total RNA was processed using PrimeScript™ II 1st Strand cDNA Synthesis Kit (Waryong, China). A qRT-PCR analysis was performed in 96-well plates on an Agilent Mx3000P system (Agilent Technologies) using a TB Green® Premix Ex Taq™ II kit. The reaction conditions were as follows: 95 °C for 30 s, 40 cycles of 95 °C for 15 s, and 60 °C for 30 s. The melting curve was generated by heating the amplicon from 60 °C to 72 °C. The relative expression of each selected transcript was normalised to the expression of the internal reference gene β-actin and calculated via the 2 −ΔΔCt method [58]. Primers for all transcripts were designed using Primer software (version 5.0). Functional annotations for P. grandi orus. a. KEGG pathway classi cation. b. KOG functional classi cation. c. GO functional classi cation