Unigene assembly and functional annotation
RNA-seq datasets of six samples from purple flowers (P1, P2 and P3) and white flowers (W1, W2 and W3) of phalaenopsis were sequenced on Illumina HiSeq platform, and generated 60,293,026, 48,567,124, 57,043,822, 53,586,804, 60,652,854, and 55,117,678 raw reads in P1, P2, P3, W1, W2 and W3, respectively (Table 1). The total raw data (335,261,308) was treated and removed joints, impurities and low-quality reads, then 329,491,250 clean reads were retained. Sample P2 and W2 had the most (58,673,380) and least (48,567,124) number of clean data, respectively (Table 1). Moreover, the GC content of each sample was around 45%, and the nucleotide bases with Phred (Phred = -10log10(e)) values greater than 20 and 30 account for 97% and 93% of the total nucleotide bases (Table 1). These results showed that the quality of clean data obtained from sequencing was adequate enough for subsequent analysis.
Table 1
Sequencing row data of the six samples
Sample
|
Raw reads No.
|
Clean reads No.
|
Clean bases (G)
|
Q20 (%)
|
Q30 (%)
|
GC (%)
|
W1
|
60,293,026
|
58,137,588
|
8.72
|
97.44
|
93.5
|
45.47
|
W2
|
48,567,124
|
48,567,124
|
7.29
|
98.13
|
95.09
|
45.20
|
W3
|
57,043,822
|
57,043,822
|
8.56
|
98.11
|
95.03
|
45.39
|
P1
|
53,586,804
|
53,586,804
|
8.04
|
98.16
|
95.14
|
45.56
|
P2
|
60,652,854
|
58,673,380
|
8.80
|
97.48
|
93.62
|
44.89
|
P3
|
55,117,678
|
53,482,532
|
8.02
|
97.39
|
93.43
|
44.78
|
Total of 215,191 unigene sequences with total nucleotide base number of 54,354,436 bp were sequenced. Among them, the maximum and minimum length of the sequences were 8,698 bp and 201 bp, respectively (Table 2). And the average and N50 length of assembled sequences were 253 bp and 703 bp, respectively (Table 2). Moreover, the length of these unigene sequences was mainly distributed between 200 bp and 4,000 bp, which accounted for 98% of the sequences reads (Figure S1). Then we used Nr, Swissprot, GO, KEGG and COG databases to obtain the functional annotation of the Amino acid sequence of the unigene. The number of annotations obtained in the Nr, Swissprot, GO, KEGG, and COG databases were 21,143(90.7%), 15,062 (64.6%), 10,827 (46.4%), 9,854 (42.3%) and 18,954 (82.7%), respectively (Table S1). |
Table 2
The statistical results of unigene sequences
Items
|
Assembled transcripts
|
Total sequence number
|
215,191
|
Total sequence base (bp)
|
54,354,436
|
GC%
|
43.84
|
Median contig length (bp)
|
139
|
Largest length (bp)
|
8,698
|
Smallest length (bp)
|
201
|
Average length (bp)
|
253
|
N50 length (bp)
|
703
|
Total Amino Acid Sequence number
|
23,314
|
COG database distribution
To further annotate homologous proteins of unigene, the 18,954 unigenes from COG database were further analyzed. About 12,436 (61.84%) unigene sequences were grouped into three categories included (1) Information Storage and Processing (3,947), (2) Cellular Processes and Signaling (4,432) and (3) Metabolism (4,057), which accounted for 19.63%, 22.04% and 20.17% of all sequences, respectively (Table 3 and Figure S2). Besides, 32.41% of the sequences were still functional unknown.
Identification of differentially expressed genes
The correlation between paired samples from six datasets was calculated by using the FPKM values of the transcripts. The three biological reduplications from purple and white flowers had a high degree of similarity (Fig. 2A, R2 > 0.95). There were 1,175 differentially expressed genes (DEGs) were detected between purple and white samples, 718 genes of them were up-regulated and 457 genes were down-regulated (Fig. 2B and C, Table S2). And DEGs were annotated in Nr, Swiss-Prot, GO, KEGG, and COG databases, and the numbers were 1,163 (98.9%), 898 (76.4%), 474 (40.3%), 431 (36.7%), and 882 (75.1%), respectively (Fig. 2D and Table S3).
Table 3
COG Categories Distribution
Information Storage and Processing
|
|
Transcription (K)
|
1,649
|
Replication, recombination and repair (L)
|
878
|
Translation, ribosomal structure and biogenesis (J)
|
699
|
RNA processing and modification (A)
|
555
|
Chromatin structure and dynamics (B)
|
166
|
Total
|
3,947 (19.63%)
|
Cellular Processes and Signaling
|
|
Posttranslational modification, protein turnover, chaperones (O)
|
1,871
|
Signal transduction mechanisms (T)
|
1,393
|
Intracellular trafficking, secretion, and vesicular transport (U)
|
1,522
|
Cytoskeleton (Z)
|
188
|
Cell wall/membrane/envelope biogenesis (M)
|
175
|
Cell cycle control, cell division, chromosome partitioning (D)
|
139
|
Defense mechanisms (V)
|
137
|
Extracellular structures (W)
|
1
|
Nuclear structure (Y)
|
6
|
Cell motility (N)
|
0
|
Total
|
4,432 (22.04%)
|
Metabolism
|
|
Carbohydrate transport and metabolism (G)
|
1,021
|
Energy production and conversion (C)
|
961
|
Secondary metabolites biosynthesis, transport and catabolism(Q)
|
526
|
Amino acid transport and metabolism (E)
|
512
|
Lipid transport and metabolism (I)
|
392
|
Inorganic ion transport and metabolism (P)
|
391
|
Coenzyme transport and metabolism (H)
|
157
|
Nucleotide transport and metabolism (F)
|
97
|
Total
|
4,057 (20.17%)
|
Poorly Characterized
|
|
Function unknown (S)
|
6,518
|
General function prediction only (R)
|
0
|
Total
|
6,518 (32.41%)
|
GO and KEGG pathway Enrichment analysis of DEGs
To further annotate the function of DEGs, GO enrichment analysis was performed on the DEGs. There were 1,247 genes involved in biological processes, 134 genes related with cell components and 314 genes involved in molecular functions were represented by DEGs (Fig. 3A and Table S4). The biological processes were significantly enriched in response to organonitrogen compound, in response to chitin, and in response to wounding and the secondary metabolite biosynthetic process etc (Fig. 3A). The enrichment of cell component included apoplast, peroxisome, microbody, plastoglobule and peroxisomal membrane etc (Fig. 3A). The significant enrichment of the molecular function mainly involved in calcium ion binding, CoA-ligase activity, ligase activity and forming carbon-sulfur bonds etc (Fig. 3A).
Furthermore, we used KOBAS v3.0 to analyze KEGG pathway enrichment from DEGs (Fig. 3B and Table S5). The results indicated that these differential genes were involved in 76 signaling pathways, which included Biosynthesis of secondary metabolites, Metabolic pathways, Phenylpropanoid biosynthesis, and Flavonoid biosynthesis and Carotenoid biosynthesis etc. Among them, the Biosynthesis of secondary metabolites pathway was the most significant (Fig. 3B).
Total of 4 DEGs (F3'H, C4H, CCoAOMT and UA3'5'GT) were identified and annotated in the Flavonoid biosynthesis metabolic pathway by KEGG, which had been reported to play an important role in flower color formation [18]. Among them, F3'H gene was up-regulation expression and C4H, CCoAOMT and UA3'5'GT were all down-regulation (Table 4 and Table S2). Besides, 10 DEGs were identified and annotated in the phenylpropanoid biosynthesis of KEGG pathway. And the expressions of CCoAOMT, C4H, PAL, 4CL, CCR, CALDH and Bglx were down-regulation except CAD, SGTase and E1.11.1.7 with up-regulation (Table 4 and Table S2). The phenylpropanoid biosynthesis was also an important bio-pathway in flower color formation [18].
Table 4
The annotation pathway of identified unigenes of phalaenopsis
Unigene id
|
Gene name
|
Definition
|
Pathway
|
LOC110024427
|
C4H
|
trans-cinnamate 4-monooxygenase
|
Flavonoid/ Phenylpropanoid biosynthesis
|
LOC110023518
|
CCoAOMT
|
caffeoyl-CoA O-methyltransferase
|
Flavonoid/ Phenylpropanoid biosynthesis
|
LOC110022396
|
F3’H
|
flavonoid 3'-monooxygenase
|
Flavonoid biosynthesis
|
LOC110030623
|
UA3'5'GT
|
anthocyanidin 5,3-O-glucosyltransferase
|
Flavonoid biosynthesis
|
LOC110031047
|
PAL
|
phenylalanine ammonia-lyase
|
Phenylpropanoid biosynthesis
|
LOC110038424
LOC110018262
LOC110026381
|
4CL
|
4-coumarate–CoA ligase
|
Phenylpropanoid biosynthesis
|
LOC110024447
|
CCR
|
cinnamoyl-CoA reductase
|
Phenylpropanoid biosynthesis
|
LOC110028720
|
CAD
|
cinnamyl-alcohol dehydrogenase
|
Phenylpropanoid biosynthesis
|
LOC110037950
LOC110019312
|
CALDH
|
coniferyl-aldehyde dehydrogenase
|
Phenylpropanoid biosynthesis
|
LOC110031840
|
SGTase
|
scopoletin glucosyltransferase
|
Phenylpropanoid biosynthesis
|
LOC110024203
|
bglx
|
beta-glucosidase
|
Phenylpropanoid biosynthesis
|
LOC110029660
|
E1.11.1.7
|
peroxidase
|
Phenylpropanoid biosynthesis
|
RNA SNP identification
Total of 207,759 SNPs were identified in transcripts (Table S6), and the transition (G-A and C-T) frequency was higher than transversion sites (Fig. 4A). The G->A, A->G, C->T, and T->C ratio of transition sites were 13.45%, 13.41% and 13.35%, 13.32%, respectively (Table S7), which was about 1.15 times of the transversion sites (T->A: 8.60%, A->T: 8.49%, A->C: 5.72%, G->T: 5.67%, C->A: 5.66%, T->G: 5.55%, G->C : 3.40% and C->G: 3.37%) (Table S7).
In addition, a total of 44,808 SNPs were identified in annotated genes (Table S8), among which 2,260 SNPs belonged to DEGs and 42,548 SNPs belonged to non-DEGs. And the average mutation frequencies of DEGs and non-DEGs were 0.0024 and 0.0029, respectively (Table S9 and Table S10). In 12 key genes, the total number of SNPs was 20, and the average mutation frequency was 0.0015 (Table S11), which was lower than that in DEGs and non-DEGs.
To further study whether SNP mutations in transcripts were associated with flower color, the mutation location of genes were identified (Table S8) and the mutation frequency of DEGs and non-DEGs from the unigene sequences were compared. The results showed that the mutation frequency of SNPs between DEGs and non-DEGs was extremely significant (P = 0.0021) (Fig. 4B). In DEGs group, the mutation frequency of SNPs in up- and down-regulation genes were highly significant (Fig. 4C), which was benefit for pigmentation based on the foldchange value. Furthermore, the genes of each group were divided into four categories: A, C, T and G in base level. And the mutation frequencies in four categories were compared between the two groups (DEGs and non-DEGs). The SNPs mutations in A, C, T and G categories between DEGs and non-DEGs were all significant (A: P = 0.0013, C: P = 0.0022, T: P = 0.039, G: P = 0.019) (Figure S3A, B, C and D). All above results illustrated that the RNA SNP mutations was strongly associated with color formation in Phalaenopsis.