Long-read sequencing of Chrysanthemum morifolium cv. ‘Hangju’ transcriptome reveals flavonoid biosynthesis and regulation

Abstract Background : The capitulum of Chrysanthemum morifolium cv. ‘Hangju’ has been widely used in China for antioxidant and anti-inflammatory. Flavonoids as one of the bioactive components in C . morifolium have a poor understanding in their biosynthesis and regulation. Nowadays, transcriptome sequencing as an effective method was used in capturing the transcripts information. So, single-molecule real-time (SMRT) sequencing was performed to obtain the full length of genes involved in flavonoid biosynthesis and regulation in C . morifolium . Results : The high-quality RNA was extracted from the capitulum of C . morifolium at different development stages, and it was constructed into two libraries (0-5 kb and 4.5-10 kb) for sequencing. Finally, 125,532 non-redundant isoforms with mean length of 2,009 bp were captured. Of which, 2,083 transcripts were annotated in the pathway related to the flavonoid biosynthesis and 56 isoforms were annotated as CHS , CHI , F3H , F3’H , FNS Ⅱ , FLS , DFR and ANS genes. Based on the gene expression level at different stages, we predicted the major genes involved in the flavonoid biosynthesis. And we found two candidate MYB factors (CmMYBF1 and CmMYBF2) activating the flavonol biosynthesis by phylogenetic analysis. Conclusions : Based on the full-length transcriptome data and further quantitative analysis, the major genes involved in flavonoid biosynthesis and regulation in C . morifolium were predicted in our study. The results provide a valuable theoretical basis for introduction and cultivation of C. morifolium cv. ‘Hangju’.

The variety the 'Hangju' of Flos Chrysanthemi, which has been cultivated for centuries and developed into several genotypes, and the genotype 'Zaoyang' has the largest cultivated area and yield among all the genotypes [9]. Furthermore, the genotype 'Zaoyang' from Tongxiang county (the traditional area) is considered to be advantage in higher contents of luteolin than the other cultivated areas [10], which suggest that flavonoids are the vital components in genuine medicinal materials.
Flavonoids is a large group of secondary metabolites in plants, whose biosynthetic pathway is well established as a useful model for studying metabolic regulation. The genes involved in the pathway were mainly divided into early and late biosynthetic genes.
Early biosynthetic genes (EBGs) including CHS, CHI, F3H, F3'H and FLS could catalyze the flavonol biosynthesis, whereas late biosynthetic genes (LBGs: DFR, ANS and ANR) lead to both anthocyanin and proanthocyanidin biosynthesis [11][12][13]. There are also other genes like FNS and IFS could catalyze the flavone and isoflavone biosynthesis [14,15]. Furthermore, the transcriptional factors regulating the biosynthesis were identified in many species. The AtMYB11, AtMYB12 and AtMYB111 in Arabidopsis thaliana could activate the expression of LEGs to produce flavonol [16,17] , the high expression of MdMYB1 and MdMYB10 increase the content of anthocyanin in Malus × domestica [18]. However, the flavonoid biosynthesis and regulation are poorly understood in C. morifolium because of lacking reference genome database. So, the transcriptome sequencing offered an alternative approach to obtain genes information. Of which, second-generation sequencing (SGS) was usually used to identified the differentially expressed genes in C. morifolium 4 under stress or at different tissues [19][20][21][22]. However, short read lengths generated by SGS decrease the accuracy of transcripts assembly, especially for the species without genome information. Single-molecule real-time (SMRT) sequencing as the third-generation sequencing (TGS) technology has an advantage of generating high-throughput long reads, so it avoids the inaccuracy resulting from assembling short sequences [23,24].
In this study, SMRT sequencing was preformed to get the full-length transcripts for better understanding the flavonoid biosynthesis and regulation in C. morifolium. The key genes were screened based on the annotation analysis, and their expressional profile were also analyzed in different development stages. Furthermore, the candidate MYB factors activating the flavonol biosynthesis were identified, which not only provide an understanding of flavonoid biosynthesis and regulation in C. morifolium, also better for the introduction and cultivation of C. morifolium.

The full-length transcriptome of C. morifolium
To identify the transcripts in C. morifolium during the development stages, SMRT sequencing technology was used to capture the complete and full-length transcriptome.
Totally, we obtained 15.56 Gb and 13.93 Gb raw data from 0-4.5K and 4.5-10K libraries, respectively. About 591,513 and 560, 454 ROIs (Reads of insert) were recognized with mean length of 3,328 and 3,587 bp ( Table 1). The clean data was upload to Sequence Read Archive (SRA) with accession SRR10054190 and SRR10054977. Of which, more than 50% reads were classified as full-length (FL) non-chimeric reads, less than 5% reads were classified into chimeric reads, which indicated the SMRTbell libraries have a high quality (Supply figure 1). After clustered and polished, 129,610 and 112,142 HQ isoforms were finally merged into 125,532 non-redundant isoforms with mean length of 2,009 bp and N50 of 2,274 bp (Table 1).

Functional annotation
All non-redundant isoforms were searched against to 7 functional databases, and about 115,807 (92.25%) transcripts could be annotated with at least one functional database ( Figure 2A). There were 107,700 (85.79%) isoforms were assigned to NR databases, and most of them have high similarity to the genes in Helianthus anuus (60.13%) Cynara cardunculus var. scolymus (24.18%), Chrysanthemum × morifolium (14.02%) and other species ( Figure 2B (Table 2). And every gene has many isoforms, especially for the PAL and 4CL gene families having more than 30 isoforms. Interesting, no sequence was similar to F3'5'H, which could catalyze to be as the substrate of blue anthocyanin. So, it corresponds the fact that the C. morifolium have no blue varieties. And ANR and LAR genes were also not found in the database, which suggest there were probably no proanthocyanins in the C. morifolium. Further analyzing the gene involved in the luteolin and quercetin biosynthesis, the isoforms were classified into 5 CHS , 3 CHI, 1 F3H, 1 F3'H, 1 FNS Ⅱ and 2 FLS genes based on the sequence similarity, (Supply table 1). isoforms were grouped into subgroup 4, which could repress the biosynthesis of phenylpropanoid metabolism, like lignin, phenolic acid and flavonoid. However, no isoforms were clustered into subgroup 5, which could activate the biosynthesis of proanthocyanidin. It is consistent with the results that no genes were annotated to LAR and ANR. Furthermore, we found that Isoform 90494 and 90874 have high similarity to same one gene different from Isoform 65995, so we infer that there were two genes activate the flavanol biosynthesis. Meanwhile, the isoforms in subgroup 4 and 6 were classified into 2 and 1 genes based on the sequence similarity (Supply table 3). When comparing the expression level of these two genes in the capitulum, we found CmMYBF1 Comparing the 63,854 unigenes with mean length of 741 isolated from'Chuju' by SGS technology, we found 125,532 isoforms in 'Hangju' with mean length of 2,009 bp, which was significantly long. And we found more genes in flavonoid biosynthesis pathway than 'Chuju' [25]. It obviously that SMRT sequencing was more accurate than SGS technology in the process of getting transcripts information. Furthermore, TGS was also an effective method to detect alternative splicing (AS) events [26,27]. In this study, we found there were many isoforms have high similarity with some fragments insert or loss. So, these isoforms perhaps were the AS of one gene.

Different expression of genes involved in luteolin and quercetin biosynthesis
According the gene annotation, we found 5 CHS and 3 CHI genes in the database of C. morifolium, which were less than the 17 CHS and 8 CHI genes identified in the genome of Chrysanthemum nankingense [28]. However, there were also no LAR and ANR genes like our results, which suggest that proanthocyanidins was not the functional component in the C. morifolium and C. nankingense. And C. morifolium have similary genes number with other species in Asteraceae. For instance, sunflower also have 3 CHI, 2 FLS and 1 F3H genes same as C. morifolium [28]. In our previous study, there was one FLS gene was reported in C. morifolium have a function of catalyzing the dihydroquercetin to quercetin [29]. Besides, we found another FLS gene have high similarity with predicted gene in Lactuca sativa and Nicotiana tabacum, it needs to verified at our next work. In the different stages, the genes have similar expression tendency with the genes in 'Chuju' [25], and have a high expression level at bud stage.
In the MYB factor analysis, the 5 transcripts in subgroup 6 were annotated as CmMYB6 gene (No: AKP06190), which was reported function as the activator of anthocyanin biosynthesis in C. morifolium [30]. To date, the ternary MYB-bHLH-WD40 (MBW) transcriptional regulator were recognized regulating the flavonoids biosynthesis, especially for anthocyanin and proanthocyanidin biosynthesis [31,32]. And no factors were clustered with bHLH (AtEGL3, AtGL3 and TT8) interacting with MYB to regulating anthocyanin biosynthesis (Supply figure 4) [33,34]. Besides, no gene were annotated to the CmbHLH2 gene (No: KT724056), which was reported had a function of activating anthocyanin biosynthesis in C. morifolium for ornament [35]. The TT8 also could promoting proanthocyanidin biosynthesis [31], the reason of no gene was similar to TT8 were a small amount of anthocyanin and proanthocyanidin in 'Hangju'. So, it reliable to find a candidate factor by phylogenic analysis. According the way, we found two genes belong to SG7 MYB, gene, which was proposed have function of regulating flavonoid biosynthesis but not promoting the DRF gene expression in C. morifolium for ornament [30], so it very likely regulate flavonol biosynthesis. But CmMYBF1 have a higher expression level than CmMYBF2, CmMYBF1 also might be the transcription factor regulating flavonol biosynthesis. So, which one was the major factor needs to be further verified.

Conclusions
In our study, the TGS platform (PacBio Sequel) was used to capture the full-length transcriptome of C. morifolium cv. 'Hangju'. A total of 29.49 Gb raw data and 125,532 deredundant isoforms were obtained. After gene annotation, there were 144 isoforms involved in the pathway of flavonoids biosynthesis. We also predicted the major gene participated in the luteolin and quercetin biosynthesis by expression analysis. The candidate genes had a similar expression pattern with the highest expression level at the bud stage. Furthermore, the MYB factors activating flavonol biosynthesis were screened and analyzed. All of these results provided a theoretical basis for further studying the quality formation of 'Hangju' and also help for its introduction and cultivation.

Plant material and RNA extraction
The 'Zaoyang' genotype of Chrysanthemum morifolium cv. 'Hangju' was formally identified by professor Qiaosheng Guo and collected from germplasm resource preserving center at Nanjing Agricultural University. And the plant material in our study was recognized by

PacBio SMRTbell library preparation and sequencing
The 800-1000 ng RNA was used to synthesized cDNA by Clontech SMARTer PCR cDNA Synthesis Kit with oligo-dT primers. After large-scale PCR amplification, the doublestranded cDNA was size-selected into 0-5 kb and 4.5-10 kb fractions by the BluePippin system (Sage Science, MA, USA). Additional PCR were preformed post size-selection to produce adequate templates, then size-selected cDNA was ligated with blunt adapter for SMRTbell library construction. Finally, a total of two SMRT cells Sequencing reactions were performed by the PacBio Sequel sequencer (BGI-Shenzhen, China).

Bioinformatics analysis of isoform sequencing
Reads of insert were generated by removing adapters and artefacts from raw SMRT sequencing data, and they were further classified into FL non-chimeric, chimeric, non-FL, Transdecoder was used to identify the candidate coding area and extract the longest ORF, then the Pfam protein homologous sequences was searched by blast to SwissProt and Hmmscan to predict the coding region.

Expression analysis of flavonoid biosynthesis genes in C. morifolium
According the functional annotation analysis, genes involved in luteolin and quercetin biosynthesis pathway were selected (including CHS, CHI, F3H, F3'H, FLS, FNS). Then the predicted CDS sequences were used to designed primers (Supply table 4

Availability of data and materials
The datasets generated and/or analyzed during the current study are not publicly available due to the data was analyzing for other research, but are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was supported by grants from the National Nature Science Foundation of China    Figure 1 The  Table.xlsx Supply Figure 1.pdf