The full-length transcriptome of C. morifolium
To identify the transcripts expressed in C. morifolium across developmental stages, SMRT sequencing was used to obtain a full-length transcriptome. In total, we obtained 15.56 Gb and 13.93 Gb of raw data from the 0-4.5K and 4.5-10K libraries, respectively. From these libraries, 591,513 and 560,454 reads of insert (ROIs) were recognized, with mean lengths of 3,328 and 3,587 bp, respectively (Table 1). The clean data were uploaded to the Sequence Read Archive (SRA) under accession numbers SRR10054190 and SRR10054977. More than 50% of the reads were classified as full-length (FL) non-chimeric reads, and fewer than 5% were classified as chimeric reads, which indicated that the SMRTbell libraries were of high quality (Supplementary figure 1). After clustering and polishing, 129,610 and 112,142 high-quality (HQ) isoforms were obtained and merged into 125,532 non-redundant isoforms with a mean length of 2,009 bp and an N50 of 2,274 bp (Table 1).
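For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows how the mean length and N50 can be computed from a list of isoform lengths. The function names and the toy lengths are illustrative assumptions; they are not taken from the pipeline used in this study.

```python
def mean_length(lengths):
    """Arithmetic mean of the sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (made up for illustration; not data from this study).
toy_lengths = [1000, 2000, 3000, 4000]
print(mean_length(toy_lengths), n50(toy_lengths))
```

Because the N50 weights long sequences more heavily than the mean does, an N50 above the mean length, as reported here, is the usual pattern for a transcript set.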
Functional annotation
All non-redundant isoforms were searched against seven functional databases, and 115,807 (92.25%) transcripts were annotated in at least one database (Figure 2A). A total of 107,700 (85.79%) isoforms were assigned to the NR database, and most of them had high similarity to genes in Helianthus annuus (60.13%), Cynara cardunculus var. scolymus (24.18%), Chrysanthemum × morifolium (1.66%) and other species (Figure 2B). A total of 33,901 (27.01%) transcripts were assigned to 53 subcategories in the GO database, and the largest subcategories of the three primary GO categories were cellular process, cell and catalytic activity (Supplementary figure 2). We also obtained annotations for 91,179 (72.63%) isoforms from the KEGG database, 2,411 of which were associated with the Biosynthesis of other secondary metabolites subgroup. Furthermore, we found that 1,447, 480, 54 and 102 transcripts were annotated as belonging to the Phenylpropanoid biosynthesis, Flavonoid biosynthesis, Anthocyanin biosynthesis, Isoflavonoid biosynthesis, and Flavone and flavonol biosynthesis pathways, respectively (Figure 2C).
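The annotation rates above are simple proportions of the 125,532 non-redundant isoforms, and "annotated in at least one database" corresponds to the union of the per-database hit lists. The sketch below illustrates both calculations; the helper names and the toy hit lists are assumptions for illustration only, while the counts in `reported` are those given in the text.

```python
def annotated_in_any(hits_by_database):
    """Union of isoform IDs that received a hit in at least one database."""
    annotated = set()
    for isoform_ids in hits_by_database.values():
        annotated |= set(isoform_ids)
    return annotated


def percent(count, total):
    """Percentage of `total` represented by `count`, to two decimals."""
    return round(100 * count / total, 2)


# Toy hit lists (hypothetical isoform IDs, not data from this study).
toy_hits = {"NR": ["iso_a", "iso_b"], "GO": ["iso_b", "iso_c"], "KEGG": []}
print(sorted(annotated_in_any(toy_hits)))

# Counts reported in the text, recomputed as percentages.
TOTAL_ISOFORMS = 125_532
reported = {"any database": 115_807, "NR": 107_700, "GO": 33_901, "KEGG": 91_179}
for name, count in reported.items():
    print(f"{name}: {count} ({percent(count, TOTAL_ISOFORMS)}%)")
```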
The candidate genes involved in the flavonoid biosynthesis pathway
By combining the annotation results from the seven databases, 143 isoforms were annotated as PAL, C4H, 4CL, CHS, CHI, F3H, F3’H, FLS, FNS, DFR and ANS genes in the flavonoid biosynthesis pathways (Table 2). Every gene was represented by many isoforms, particularly the PAL and 4CL gene families, which had more than 30 isoforms. Interestingly, no sequences were similar to F3’5’H, which catalyses the formation of the precursors of blue anthocyanins. This finding is consistent with the lack of blue varieties of medicinal C. morifolium. ANR and LAR genes were also not found in the dataset, which suggested a lack of proanthocyanidins in medicinal C. morifolium. Further analysis of the genes involved in luteolin and quercetin biosynthesis revealed that, based on sequence similarity, the isoforms belonged to 5 CHS genes, 3 CHI genes, 1 F3H gene, 1 F3’H gene, 1 FNS Ⅱ gene and 2 FLS genes (Supplementary table 1).
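Assigning isoforms to genes "based on sequence similarity" can be pictured as grouping isoforms whose pairwise identity exceeds a cutoff. The sketch below uses single-linkage grouping with a union-find structure. This is an assumption about one reasonable way to do it, not the procedure used in this study; the 0.95 threshold, the isoform names and the identity values are all hypothetical.

```python
def group_isoforms(isoforms, identities, threshold=0.95):
    """Single-linkage grouping: two isoforms end up in the same group if
    they are connected by a chain of pairs with identity >= threshold.

    `identities` maps (isoform_a, isoform_b) -> fractional identity.
    The default threshold is an illustrative assumption.
    """
    parent = {iso: iso for iso in isoforms}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for iso in isoforms:
        groups.setdefault(find(iso), []).append(iso)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical isoforms and pairwise identities (not data from this study).
toy_isoforms = ["iso_a", "iso_b", "iso_c"]
toy_identities = {
    ("iso_a", "iso_b"): 0.99,
    ("iso_a", "iso_c"): 0.71,
    ("iso_b", "iso_c"): 0.70,
}
print(group_isoforms(toy_isoforms, toy_identities))
```

Single linkage is a permissive choice: one high-identity pair is enough to merge two groups, so the number of inferred genes depends strongly on the cutoff.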
Differential expression of genes involved in luteolin and quercetin biosynthesis
Luteolin and quercetin are the marker compounds of ‘Hangju’ in the Chinese Pharmacopoeia; thus, the CHS, CHI, F3H, F3’H, FLS and FNS genes were selected and analysed. The expression profile of every gene at the different flowering stages is shown in Figure 3. Most of the genes had their highest expression level at the bud stage (stage A), which is an important stage for flavonoid biosynthesis. Expression of CHS1 was significantly higher than that of the other CHS genes at every stage, while CHS4 and CHS5 were barely expressed. Therefore, CHS1 might be a major gene controlling flavonoid production in the inflorescence of C. morifolium. Similarly, CHI1, CHI3 and FLS1 may also be key genes. Interestingly, the CHS2 gene had a high expression level at stage A but was not expressed at the other stages, which suggests that CHS2 expression is strongly stage-specific.
The MYBs regulating flavonoid biosynthesis in C. morifolium
Based on their protein domains, a total of 4,068 isoforms were identified as transcription factors, and MYB was the largest family, with 444 isoforms (Figure 4A). After alignment with MYB transcription factors isolated from A. thaliana, we found that 17 isoforms were clustered into R2R3-MYB subgroups 4-7 (Supplementary figure 3). The members of all four subgroups also aligned with other MYBs isolated from Malus × domestica, Vitis vinifera and other species (Figure 4B); the information on the genes is provided in Supplementary table 2. Finally, three isoforms were clustered into R2R3-MYB subgroup 7, members of which are regarded as activators of flavonol biosynthesis. Five isoforms were grouped into subgroup 6, members of which activate anthocyanin biosynthesis. Nine isoforms were grouped into subgroup 4, members of which repress the biosynthesis of phenylpropanoids, such as lignin, phenolic acids and flavonoids. However, no isoforms were clustered with subgroup 5, members of which activate the biosynthesis of proanthocyanidins. This is consistent with the result that no isoforms were annotated as LAR and ANR genes. Furthermore, we found that isoforms 90494 and 90874 had high similarity to a single gene, which was not similar to isoform 65995; therefore, we inferred that two genes activated flavonol biosynthesis. In addition, the isoforms in subgroups 4 and 6 were assigned to 2 genes and 1 gene, respectively, based on sequence similarity (Supplementary table 3). When comparing the expression levels of the two subgroup 7 genes, designated CmMYBF1 and CmMYBF2, in the inflorescence, we found that CmMYBF1 was more highly expressed than CmMYBF2 at every stage (Figure 4C) and that CmMYBF1 had an expression profile similar to those of the FLS genes. Therefore, CmMYBF1 probably activates flavonoid biosynthesis in the inflorescence of C. morifolium.
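The similarity between the expression profile of a transcription factor and that of a candidate target gene can be quantified, for example, with a Pearson correlation across flowering stages. The sketch below is one possible way to express that comparison and is not the analysis performed in this study; the function name and the expression values are hypothetical placeholders.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Placeholder expression values per stage (made up; not data from this study).
tf_profile = [8.0, 4.0, 2.0, 1.0]       # hypothetical transcription factor
target_profile = [16.0, 8.0, 4.0, 2.0]  # hypothetical biosynthesis gene
print(pearson(tf_profile, target_profile))
```

A coefficient close to 1 indicates that the two genes rise and fall together across stages. Co-expression alone does not demonstrate regulation, which is why the conclusion above is stated as probable.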