Transcriptome sequencing and sequence assembly
In total, 56 million raw reads (8.28G) and 55 million clean reads (8.24G) were obtained from six transcriptomes Z175 and R184 varieties in three biological replicates, respectively. In Z175, the percentage of reads with q20 and q30 quality scores was 97.32% and 92.61%, compared with 97.61% and 93.26% for R184 (Fig.1). After trimming adapters, removing low-quality raw sequences using Trimmomatic (http://www.usadellab.org/cms/index.php?page= trimmomatic), splicing and assembly (using Trinity), we obtained 187 million transcripts from the six leaf transcriptomes of A. carmichaelii , with an N50 value of 1,309 bp, an average length of 853 bp, and a maximal length of 11,085 bp. The six raw read datasets have been deposited in the NCBI SRA database under GenBank accession number SUB5660984 (biosample, RAW data is submitting to NCBI SRA with some submitting problem).
Homology analysis and Gene Ontology (GO) annotation
From 108,477 unigenes (61.35%), we obtained matches to entries in the NCBI non-redundant (nr) protein database using BLASTx with an E-value cut-off of 1e-5 (Fig. 2a). Meanwhile, 54,936 (31.07%) were aligned to the NCBI nucleotide sequences (Nt) database, 82,039 (46.40%) were aligned to the Swiss-Prot database, 77,612 (43.89%) were aligned to the protein family (Pfam) database, 78,500 (44.4%) were aligned to the GO database, and 27,522 (15.56%) were aligned to the Clusters of euKaryotic Ortholog Groups (KOG) database (Fig. 2a). Interestingly, 1,837, 24,380 and 40 unigenes were uniquely annotated to Nt, nr and KOG, respectively (Fig. 2c). Furthermore, 176,793 unigenes were annotated to at least one of the searched databases, and 35.4% of unigenes were successfully aligned to Nelumbo nucifera genes using BLASTx, followed by Vitis vinifera (12.5%), Theobroma cacao (2.9%), Jatropha curcas (2.4%) and Citrus sinensis (2.0%), and 44.8% were aligned with other species (Fig. 2b).
We used GO annotations to classify the 78,500 unigenes into functional groups using BLAST2GO where p-values calculated by hypergeometric distribution tests and E-values were <1×10–5. In the A. carmichaelii transcriptome data, biological process accounted for the majority of GO annotations (47.33%, 179,968 genes), followed by cellular component (28.10%, 106,856 genes) and molecular functions (24.57%, 93,447 genes). In the biological process category, GO terms related to cellular process (23.43%) and metabolic process (22.29%) were most abundant, followed by single-organism process (16.62%), biological regulation (7.14%) and regulation of biological process (6.61%). In the cellular component category, cell (19.88%) and cell part (19.87%) were dominant, followed by organelle (13.39%), macromolecular complex (12.69%) and membrane (10.48%). In the molecular function category, binding (48.70%) and catalytic activity (38.03%) were the most represented, followed by catalytic activity and transporter activity (4.67%), structural molecule activity (2.50%) and nucleic acid binding transcription factor activity (2.11%; Fig. 3a).
A total of 30,657 transcripts were classified into 26 KOG categories. General function prediction (7,602 transcripts, 24.80%) was the dominant KOG category, followed by posttranslational modification, protein turnover and chaperones (4,267 transcripts, 13.91%) and general function prediction only (3,335 transcripts, 10.88%; Fig. 3b).
To elucidate active biosynthesis pathways in A. carmichaelii, annotation of nr data using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database identified 43,695 genes assigned to five main categories representing 130 biological pathways. The highest number of KO identifiers were linked to metabolism (18,951 genes) followed by genetic information processing (9,809 genes), cellular processes (1,813 genes), organismal systems (1,410 genes) and environmental information processing (1,372 genes). Pathways with the largest number of KO identifiers were: translation (4,070); carbohydrate metabolism (3,808); folding, sorting and degradation (3133); overview (2,895); amino acid metabolism (2,210) and lipid metabolism (1,974; Fig. 3c).
A total of 4,866 transcripts (2.75%) were assigned to 80 TF families. Among these, MYB (308), C3H (249), AP2-EREBP (199), NAC (195), Orphan (190) and C2H2 (185) were the most abundant. Furthermore, 322 (6.62%) differentially expressed genes (DEGs) encoding TFs potentially involved in the regulation of various physiological processes and biochemical pathways in A. carmichaelii were identified (Additional file1,table S1).
Tissue-specific differential gene expression
Expression of read-mapped genes was analysed based on the fragments per kilobase of transcript per million mapped reads (FPKM) value for each clean read. Transcripts with |log2FoldChange| >1 and padj <0.05 were considered DEGs. Among the 176,793 identified unigenes, ~6,873 DEGs were detected, of which 3,470 were down-regulated and 3,403 were upregulated in the two A. carmichaelii varieties. A total of 4,175 DEGs were linked to 356 metabolic pathways (p-value <0.05). Based on DEGs between Z175 and R184 varieties, GO analysis showed that 2,076 GO accessions matched biological process, 454 matched cellular component, and 993 matched molecular function categories (Fig. 4). KEGG analysis identified 320 pathways related to 4,175 annotated DEGs (p-value <0.05; Additional file1, t able S2).
Characterization and expression analysis of unigenes involved in C19-type diterpene biosynthesis in Z175 and R184.
Based on the results of previous studies [13, 21], putative unigenes involved in the aconitine biosynthesis pathway were identified, including 49 unigenes encoding 1-deoxy-D-xylulose-5-phosphatesynthase (DXS), two of which were more highly expressed in Z175 than R184, 1 1-deoxy-D-xylulose-5-phosphate reductoisomerase (DXR), 10 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase (ISPD), five 4-(cytidine-50-diphospho)-2-C-methyl-D-erythritol kinase (ISPE), two 2-C-methyl-D -erythritol2,4-cyclodiphosphate synthase (ISPF), 1I (E)-4-hydroxy-3-methylbut- 2-enyldiphosphate synthase (ISPG), two (E)-4-hydroxy-3-methylbut-2- enyldiphosphate reductase (ISPH), and four isopentenyl diphosphate isomerase (IPPI) in the MEP pathway (Additional file 1,table S3). Furthermore, 14 acetoacetyl-CoA thiolase (AACT) and seven 3-hydroxy-3-methylglutaryl-CoA reductase (HMGR) involved in the MVA pathway were identified, including two more highly expressed in Z175, along with two 3-hydroxy-3-methylglutaryl-CoA synthase, (HMGS), three mevalonate kinase (MVK), six phosphomevalonate kinase (PMK), and two mevalonate diphosphate decarboxylase (MVDD; Additional file 1,table S4). Five geranylgeranyl pyrophosphate synthase (GGPPS), seven CPS, 24 KS (Cluster-219.18055 more highly expressed in Z175), 21 KOX, of which four are more highly expressed in Z175, 84 cyclases, of which Cluster-219.66822 and Cluster-219.62280 were higher expressed than Z175, and Cluster-219.80709 and Cluster-219.73820 higher than R184, 131 aminotransferases, of which Cluster-219.46073 was higher expressed than R184, and Cluster-219.51635 higher than Z175, 190 monooxygenases, of which Cluster-219.114355, Cluster-219.51666, Cluster-219.28553 were higher expressed in Z175 than R184, and Cluster-219.140395, Cluster-219.35262 , Cluster-219.119209 , Cluster-219.2681, Cluster-26892.0, Cluster-219.131122 higher than Z175 , 73 BAHD acyltransferases of which Cluster-219.50148, Cluster-219.20417, Cluster-219.64576, Cluster-219.19319, Cluster-219.90737, Cluster-20871.1, Cluster-219.44720, Cluster-219.48909, Cluster-219.4883 were higher expressed in Z175 than R184, and Cluster-219.10123, Cluster-219.138567, Cluster-219.32670, Cluster-219.138566, Cluster-219.99337 higher than Z175, and 200 methyltransferases, of which Cluster-219.14882, Cluster-219.17602 were higher expressed in R184 than Z175, and Cluster-22914.0 higher than R184, were also identified (Additional file 1,table S5). Expression of all the above higher or lower unigenes was statistically significant (log2 fold-change >1.0 and pdaj <0.05), and the sequence of them was in Additional file 2. Expression levels of unigenes involved in the aconitine biosynthesis pathway in the two varieties are shown in Additional file 1, table S3S5. Expression levels of unigenes in the isopentenyl diphosphate pathway were highest, followed by MEP and MVA pathways (Additional file 1, table S3S5).
Candidate genes encoding enzymes involved in the salsolinol biosynthesis pathway
Salsolinol is another important aconitine in A. carmichaelii. A schematic diagram of the proposed pathway for the biosynthesis of salsolinol is shown in Fig. 5. Precursors for the biosynthesis of salsolinol are derived from D-glucose through glycolysis that generates phosphoenolpyruvate, then through the shikimate pathway that synthesizes chorismate, then through dopamine, finally yielding salsolinol. In more detail, the glycolysis pathway includes 19 hexokinase (HK) with Cluster-219.117566 higher expressed in Z175, 13 glucose-6-phosphate isomerase (GPI) with Cluster-219.75274 higher expression in Z175 than R184, 30 6-phosphofructokinase (PFK) 31 fructose-bisphosphate aldolase (ALDO), of which Cluster-219.73817, and Cluster-28961.0 were higher expression in Z175 than R184, and Cluster-219.98335, Cluster-219.14890, Cluster-219.9488 and Cluster-219.71706 higher expressed in R184, nine glyceraldehyde-3-phosphate dehydrogenase (GAPHD), 32 phosphoglycerate kinase (PGK) with Cluster-219.80323 and Cluster-219.80326 higher expression in Z175, one 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase (PGAM) and 23 enolase (ENO) enzymes (Additional file 1,table S6). In the shikimate pathway, there are 22 phospho-2-dehydro-3-deoxyheptonate aldolase (AROF) enzymes, four of which, Cluster-219.50402, Cluster-219.4979, Cluster-219.95746, Cluster-219.78492 and Cluster-219.78493, were more highly expressed in Z175, two 3-dehydroquinate synthase (AROB), nine 3-dehydroquinate/shikimate dehydratase (AROD/AROE), 13 shikimate kinase (AROK), of which Cluster-219.54689 and Cluster-219.77462 were more highly expressed in R184 and Z175, two 3-phosphoshikimate 1-carboxyvinyltransferase (AROA), and 27 chorismate synthase (AROC; Table 1). The chorismate pathway includes seven chorismate mutase (CHMU), five bifunctional aspartate/glutamate aminotransferase (PAT), seven prephenate dehydrogenase (TYRC), 23 polyphenol oxidase (POX), including seven unigenes, Cluster-22607.0, Cluster-219.82880, Cluster-219.54918, Cluster-219.54917, Cluster-219.50283, Cluster-17242.0 and Cluster-219.92608 more highly expression in Z175 than Z184, and six tyrosine decarboxylase (TYDC) enzymes (Table 2). All sequences of expressed statistically significant unigenes between Z175 and R184 were in Additional file 3. Comparison of unigene expression levels in glycolysis, shikimate and chorismate pathways revealed highest expression levels in glycolysis, followed by shikimate and chorismate pathways.