Comparative analysis of transcriptional regulation of betalain biosynthesis based on SMRT sequencing of full-length transcriptome in two pitaya cultivars (red pulp and white pulp)

Background: In order to gain more valuable genomic information involved in betalain biosynthesis, the full-length transcriptome of pitaya was analyzed using Single-Molecule Real-Time (SMRT) sequencing corrected by RNA-seq in the present study. Two pitaya cultivars, ‘Zihonglong’ (red pulp) and ‘Jinghonglong’ (white pulp) were selected to analyze betalain transcriptome in four fruit developmental stages. Results: A total of 65,317 and 91,638 genes coding proteins were identified in ‘Zihonglong’ and ‘Jinghonglong’, respectively. A total of 11,377 and 15,551 genes with more than two isoforms were investigated from ‘Zihonglong’ and ‘Jinghonglong’, respectively. Also, 156,955 genes were acquired after elimination of redundancy , of which, 120,604 genes (79.63%) were annotated, and 30,875 (20.37%) sequences without hits to reference database were probably novel genes in pitaya. Totally, 31,169 and 53,024 SSRs were uncovered from the genes of ‘Zihonglong’ and ‘Jinghonglong’, and 11,650 lncRNAs in ‘Zihonglong’ and 11,113 lncRNAs in ‘Jinghonglong’ were obtained herein. Further, 104 genes involved in betalain metabolism were identified, and HpCYP76AD4 and HpDODA probably responded to betalains biosynthesis. Conclusions: Conclusively, this is the first study to perform SMRT sequencing of the fulllength transcriptome of pitaya, which provides a useful genomic clue for exploring the structure and function of genes in pitaya, particularly for betalain biosynthesis. Data from RNA-seq were mapped to the non-redundant SMRT reference by RSEM software. The expression abundance of unigene was represented as value of FPKM, and differential expression gene (FDR<0.01 and FC≥2) were obtained using EBSeq [46]. The differential expression gene venue was named by means of “A_vs_B”, among which, the expression level of the up-regulated gene in sample A was higher than that in sample B, and the reverse was the downregulation gene.


Background
Pitaya (Hylocereus),, originated from Latin America and West Indies [1], is one of economical and nutritional fruits cultivated in tropical and subtropical regions. H.
polyrhizus (with red pulp and peel) and H. undatus (with white pulp and red peel) are the two major species [2], and the red pigment in pitaya is mainly attributed to betanin pigments [3]. Betalains are tyrosine-derived pigments that occur solely in the order of caryophyllales, which largely replaced the anthocyanins in a mutually exclusive manner [4]. Betalains also have high nutritional value and positive effects in health and disease prevention by high antioxidant and anti-inflammatory capabilities [5][6]. Therefore, betalains synthesis has become a research hot spot on scientific interest as well as economic significance [7]. Curently, the metabolic pathway of betalains is more clearly defined [4], the identification of genes is important for betalains biosynthesis. Sequencing platforms is an efficient approach to illustrate putative genes, 9 key transcripts involved in betalains synthesis were identified based on second-generation sequencing (SGS) in pitaya [6]. Whereas, many questions remain open for limited sequence data of pitaya [4].
SGS is a powerful tool for the description of gene expression levels . However, it is difficult to identify full-length transcript using the SGS data [10]. High quality transcript sequences is crucial for plant biology research, fortunately, full-length transcriptome is being employed as an effective approach to obtain high quality transcript sequences.
Currently, a combination of SMRT sequencing and RNA-seq was used for red and white pulp to generate full-length and high quality pitaya transcriptome, and the PacBio transcriptome sequencing of pitaya was reported firstly. Based on the obtained transcriptome data, transcript functional annotation, SSRs analysis and lncRNAs prediction were performed. This study might be a valuable resource for further investigation of pitaya, and might provide a better understanding of betalains biosynthesis in pitaya fruit.

Pulp color change
Obviously, there is larger difference in pulp colour between the two cultivars. There were not significant differences in L*, a*, b*, C*, h° value of 'Zihonglong' and 'Jinghonglong' pulps on 22 DPA. The L*, a*, b*, C*, h° value of 'Jinghonglong' were relatively stable, while those of 'Zihonglong' changed significantly during the fruit development stages (Table 1).
With the development of 'Zihonglong' fruit, L* value was decreased gradually, a*, C*, h°i ncreased prominently on 25DPA and reached the highest values (27.90, 28.53, 343.05 respectively) on 28DPA, and then appeared a slight decline on mature fruit. b* value decreased from 2.11(yellow pulp) on 22 DPA to -6.73 (blue pulp) on 25DPA.

Transcriptome analysis using PacBio Sequel
The full-length transcriptome of pitaya fruit was generated by PacBio Sequel on 'Zihonglong' and 'Jinghoglong'. As shown in Table 2 Comparison of SMRT sequencing and next-generation sequencing The number of SMRT gene obtained from SMRT sequencing were less than that of unigene assembled from NGS reads, whereas, the mean length of SMRT gene is much longer than that of unigene assembled from NGS reads. Additionally, about 65% of the assembled transcripts from NGS reads were <500 bases in 'Zihonglong', while about 74% of that in 'Jinghonglong'. However, only 12% in 'Zihonglong' and 6% in 'Jinghonglong' of the transcripts from the PacBio Sequel reads were <500 bases. Approximately 80% of the transcripts from the PacBio Sequel reads ranged from 500 bases to 2000 bases (Table 3).
Hence, the SMRT sequencing offered significant advantages over NGS in the length of reads.

Clustering analysis
Multiple transcripts corresponded to one gene in the transcriptional group. PacBio long reads clustering analysis demonstrated that 65,317 and 91,638 genes were generated from polished consensus sequences transcripts in 'Zihonglong' and 'Jinghonglong', respectively. Various isoforms generated by a single gene were widely found among the tested samples. A total of 17.42% genes had more than one isoform in 'Zihonglong' pulp, a slightly higher than that (16.97%) of 'Jinghonglong'. In the former, 11,377 genes showed more than two alternative splice forms (isoforms), of which the majority corresponded to two-to-three isoforms, accounting for 74.78% of the total, and 516 genes contained over 10 isoforms. To the latter, 15,551 genes had more than two isoforms, among which the majority were two-to-three isoforms, accounting for 67.99% of the total, and 767 genes with over 10 isoforms were obtained ( Figure 1). Therefore, when alleles and associated homologs were grouped against these results, they typically shared the same alternative splicing patterns [9], indicating that a gene might generated different transcripts via alternative splicing.

Function annotation
Function annotation of pitaya non-redundant FLNC transcripts (genes) was investigated using different databases. As shown in Table 4 The homologous species of Hylocereus were predicted by sequence alignment on the basis of the NR database. Of all the genes hits to NR plant proteins from BLASTx, the pitaya genes gave the highest number of hits to the Beta vulgaris (51,879 hits), followed by Spinacia oleraces (21,876 hits) and Vitis vinifere (2,010 hits) (Fig. 2a). Most hits found in Beta vulgaris were probably for the reason that pitaya and Beta vulgaris belong to Caryophyllales, and Beta vulgaris database is better annotated than those of other species. As shown in Figure 2b Within these functional groups, the highest number of sequences were annotated with the metabolic process (35,263 sequences, 11.40%), cellular process (28,379 sequences, 9.18%) and catalytic activity (27,507 sequences, 8.89%). A total of 117 pathways with 28,796 genes were annotated by KEGG, associated with 23.88% of the whole annotated dataset (120,640 genes). Among these, 237 genes were identified in phenylalanine, tyrosine and tryptophan biosynthesis pathway, however, KEGG path way involved in betalain biosynthesis was not founded.

Genes involved in betalain biosynthesis
Taken account of the different expression levels of genes between 'Zihonglong' and 'Jinghonglg', the genes from PacBio sequel were used as the reference dataset. It was shown that 44,109 DEGs were found between 'Zihonglong' and 'Jinghonglong' during four development stages, among which, the most DEGs were investigated in R2_vs_W2,

Discussion
Fruit is a major product for plant-derived pigments, and the formation of pigment is closely related to the process of fruit development. The L* values decreased and a*, b* values increased with apple fruit development [22]. L* values declined with increasing betacyanin contents, the h° shift from 88.2 (yellow stems) to 350.3 (purple stems) [23].
High anthocyanin content may lead to the decrease of fruit brightness [22][23]. In the present work, variation tendency of color data were consistent with reports [22][23]. The color data were in accordance with the visual pigments appearance of 'Zihonglong' pulps.
It's worth mentioning that color data varied predominantly on 25 DPA in red pup cultivar while the fruit pulp was in color initiation stage, which indicated that the stage was a crucial period for the accumulation of red pigment.
Transcripts from RNA-seq require assembly and full-length transcripts proportion is very small, and inaccuracy in gene structure characterization resulted from mis-assembly is a common problem, which is exacerbated in the species without a reference genome sequence for the prediction of gene models [24]. Recently, SMRT sequel as a new TGS platform was carried out by PacBio sequencing, non-assembled long-read transcripts with error rate (10%) can be generated by SMRT sequel, and that of error rate can be overcome by correction of Illumina reads [25].For example, the mapping rate of long reads in maize can be increased from 11.6% to 99.1% after correction with Illumina read [14], whereas, the report of reference genome sequence or SMRT sequence on pitaya has not yet been found so far. In the present case, 65,312 and 91,638 genes (non-redundant reads) were generated by SMRT from pooled-stage pulp and corrected by Illumina RNA-seq, and the mean length of SMRT gene were up to 1,175 bp and 1,337 bp in 'Zihonglong' and 'Jinghonglong', whereas, the mean length of unigene from Illumina were only 681bp and 696bp in 'Zihonglong' and 'Jinghonglong' (Figure 3), respectively. Also, the pitaya genes had the highest number of hits to the B. vulgaris (50.63%), whereas, the species distribution with the greatest number of H. polyrhizus was Vitis vinifera (50.1%) by RNA-Seq [6]. Both pitaya and B. vulgaris belong to Caryophyllales plants, therefore, the result illustrated that SMRT data are of higher quality than unigenes from RNA-Seq.
Transcriptome reconstruction and annotation has been improved significantly with the development of sequencing techniques [24]. Longer reads are capable of sequencing complete transcripts and qualifying gene features PacBio corrected reads can provided an efficient reference sequences for plants without reference genome [9]. Different transcription isoforms in pitaya pulp were detected without a reference genome. A total of 17.42% and 16.97% of genes in the red and white pulp were identified, respectively, including more than 10 isoforms in red pulp (516, 0.79%) in comparison with that of the white pulp (767, 0.84%). LncRNAs has been found to be a key functional molecules that can regulate gene expression, which has been a hot topic in biology [26][27], and 11,046 lncRNAs were predicted in Salvia miltiorrhiza [13]. In maize, 867 transcripts with a mean length of 1.1 kb were identified as novel high-confidence lncRNAs [14]. A total of 417 and 531 lncRNAs were identified in sweet potato and I. trifida, respectively [26], 223 and 205 lncRNAs were obtained in leaf and root of Astragalus membranaceus respectively [24], 2,426 transcript sequences including 1,220 non-ORF transcript sequences candidate lncRNAs in sugarcane [17]. Currently, 11,650 and 11,113 lncRNAs were identified with four analytical methods inform the red and white pulp, respectively, which were more than that from other documented species. However, their functions in pitaya required further investigations.
The molecular mechanism of betalains synthesis has been well documented in previous studies. First, tyrosine is hydroxylated to L-DOPA and subsequently converted to cyclo-DOPA by one or more cytochrome P450s [28][29][30][31][32], alternatively, L-DOPA can be converted to betalamic acid by DOPA 4, 5-dioxygenase (DOD) [33]. Next, betalamic acid condensates spontaneously with cyclo-DOPA to form betanidin or with amino acids and other amines to form betaxanthins. Betanidin is further glucosylated by a betanidin glucosyltransferase to form the basic betacyanins betanin or gomphrenin [4]. BvMYB1 is the only currently known betalain-related transcription factor, which has an essential role as a positive regulator of betalain biosynthesis through activation of the CYP76AD1 and BvDODA1 genes [7].
Betelains, an important pigment in most Caryophyllales plants, can be used as a natural colorant in food [34], cosmetics and pharmaceuticals [35]. Intensive attempts have been focused on the betalain biosynthesis and genes function, and much more betalain-related candidate genes, such as TYR [36][37], BvMYB1 [38], CYP76AD1 [28] and BvDODA1 [39] were identified. Even so, research about betalain-related genes, especially on pitaya has been reported little so far. In current study, 104 DEGs involving in betalain biosynthesis were identified and analyzed by both SMRT and SGS. Betalamic acid is the chromophore molecule of both betacyanins and betaxanthins, and cDOPA as well as its derivatives are essential to produce betacyanin [40]. The formation of betalamic acid and cDOPA are crucial in betacyanin synthesis, especially, the absence of betalamic acid may block the production of betalain. As a white pulp cultivar, neither red betacyanins nor yellow betaxanthins were detected in white pulp. Hence, we hypothesized CYP76AD and DODA were crucial genes to the formation of betalains. In present case, the two novel pitaya genes HpCYP76AD4 (i1_HQ_R _c13003/f5p0/1979) and HpDODA (i1_LQ_R_c96099 /f1p0/1004) probably respond to the presence of betalains.

Conclusions
In summary, combining SMRT with SGS provide an efficient process to the research of genes. Currently, full-length transcripts were generated by both SMRT Squel and Illumina RNA-seq from pooled-stage pulp, and full-length transcriptome of pitaya pulp provided reference sequence of pitaya. Compared with the reported genes in previous studies on pitaya using cDNA cloning or RNA-Seq, the date from our study have higher quality and much more complete annotation, so as to provide a valuable resource for pitaya research and facilitate the identification of additional betalain-related genes in the near future. Up to now, the mining and utilization of the data was limited in this study, further research will focus on the structure and function of genes involved betalains biosynthesis.

Pitaya pulps
The cutting seedling of H. polyrhizus cvZihonghlong (certificate number, Qian guo shen no. Chromaportable colorimeter (Konica Minolta Sensing, inc., Osaka, Japan). All determinations were performed in duplicate. L* value represented the relative lightness of colors ranging from 0 (black) to 100 (white). Values of a* and b* ranged from −60 to 60, where a* was negative for green color and positive for red color, and b* was negative for blue and positive for yellow [22,41]. Chroma values (C*) expressing color saturation, higher C* value means less color type and brighter color, to the contrary, lower C* value means more color type and bleak color. Hue angle (h°) is expressed on a color wheel, where 0°/360° = red-purple, 90° = yellow, 180° = green, and 270° = blue [23].

Library preparation and SMRT sequencing
The same amount of RNA from each variety sample (pulp of four development stages) were mixed for SMRT sequencing ananlysis. Firstly, mRNA was enriched by Oligo (dT) magnetic beads. Then the enriched mRNA was reverse transcribed into full length 1st strand cDND using Clontech SMARTer PCR cDNA Synthesis Kit. PCR cycle optimization was used to determine the optimal amplification cycle number for the downstream large-scale application. The optimized cycle number was used to generate double-stranded cDNA, followed optional size ( 4kb) selection using the BluePippinTM for combined SMRT bell library. Full length cDNAs were performed DNA damage repaired, end repaired, and ligated to sequencing adapters, and then digested with exonuclease. Qualified libraries were sequenced on the PacBio Sequel (Pacific Bio-science Inc.) platform according to the effective concentration and data output requirements of the library.

Preprocessing of SMRT reads
First, the subreads were acquired from raw sequencing reads using the SMRT Link v5.0 (minLength = 200 minReadScore = 0.75 pipeline supported by Pacific Biosciences, and CCS reads were extracted out of subreads BAM file. Through RS_IsoSeq (minPasses = 1, minPredicted Accuracy = 0.8), CCS reads were classified into full-length non-chimeric (FLNC), non-full-length (NFL) based on cDNA primers and polyA tail signal. Subsequently, the FLNC reads were clustered by Iterative Clustering for Error Correction (ICE) software to generate the cluster consensus isoforms. Followed, NFL reads were used to polish the above obtained cluster consensus isoforms by Quiver software to finally obtain the FLNC polished high quality consensus sequences (accuracy≥99%). After corrected by SGS using LoRDEC, non-redundant high-quality full-length transcripts was generated by CD-HIT (c = 0.99) for further analysis.

Functional annotation of genes
Non-redundant transcript sequence as genes obtained after CD-HIT deduplication were grouped and mapped to nine protein and nucleic acid database to obtain the annotation information of the gene. These databases included NR, Nt, Swiss-Prot, GO, COG, KOG, Pfam, TrEMBL, and KEGG. GO annotation was analyzed by Blast2GO software with Nr annotation results of genes. Genes ranking the first 20 highest score and no shorter than Then, functional classification of genes was run using WEGO software.

lncRNAs prediction
The coding potential of transcripts were predicted by predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme(PLEK) [42] and Coding-Non-Coding Index (CNCI) [43], and then, transcriptional sequences predicted from PLEK and CNCI were blasted with the known protein database using Coding Potential Calculator (CPC) software [44]. The transcriptional sequences predicted by PLEK, CNCI and CPC software were performed hmmscan homologous search with Pfam [45] database, finally the LncRNA sequences were obtained.

Second-generation sequence
Total RNA (5μg) was digested by using DNase I (NEB, Frankfurt, Germany). The sample was purified with Agencourt RNAClean XP Beads and fragmented into 130-170 nt. First-

Different expressed genes (DEGs) analysis
Data from RNA-seq were mapped to the non-redundant SMRT reference by RSEM software.
The expression abundance of unigene was represented as value of FPKM, and differential expression gene (FDR<0.01 and FC≥2) were obtained using EBSeq [46]. The differential expression gene venue was named by means of "A_vs_B", among which, the expression level of the up-regulated gene in sample A was higher than that in sample B, and the reverse was the downregulation gene.

Phylogenetic analysis of DEGs involved in betalain synthesis
The sequences of 5 CYP76AD and 8 DODA genes were blasted in NCBI, 7 CYP76AD genes and 7 DODA genes from NCBI were selected according to sequence similarity(Query cover≥70% and ident≥80%), and an unrooted phylogenetic tree was constructed with     Figure 1 Isoform number per gene X-axis: isoform number Y-axis: gene number .  The distribution characteristics of SSRs motifs in 'Zihonglong' (a) and 'Jinghonglong' (b). The x-axis represents the SSR motif unit; and the y-axis represents the repeat counts; z-axis represents repeat type.   Neighbor-joining phylogenetic tree of CYP76AD (a) and DODA (b) using closely related genes from NCBI.