Embryo developmental features and lignin accumulation during seed development
Zygotic embryos of P. armeniacum began to develop at 45 days after pollination (DAP) as previously described.3 At 66 DAP, the T-shaped pre-embryo with four cells was formed (Table 1). At 87 DAP, the globular embryo was formed. Around 108 DAP, there is a rapid lignification of the seed coat, the inner testa disappears, and lipid globule and starch accumulate. At 122 DAP, the embryo is fully mature with a compact testa and no further observed morphological changes. Five developmental stages (66, 87, 108, 122, and 150 DAP) of P. armeniacum embryos were selected for further lignin structure characterization and transcriptome profiles on the basis of lignin accumulation pattern and embryo developmental characteristics.
The germination was scored together with the lignin content during the seed development (Figure 1A). During the early stages of seed development (45-95 DAP), the germination rate gradually increased as the embryo formation. At 95 DAP, the globular embryo has been fully developed, and the germination rate reached 96.2 %. During this stage, lignin was maintained at a relatively low level (5-9 % of the dry mass). After that, even though the starch and lipid globules continue to accumulate as described in a previous study,3 the germination rate dropped sharply from 96 % to 5 %. Meanwhile, the amounts of lignin started to increase rapidly, with the seed coat changing from white to brown (Figure 1B). Lignin accumulation reached a plateau (~39 % of the dry mass) at 150 DAP, with the seed coats becoming dark and hard.
Identification of non-methylated lignin in seeds
To analyze the monolignol composition from various tissue types, the samples were subjected to pyrolysis-gas chromatography/ mass spectrometry (Py-GC/MS). The identities and relative molar abundances of the released compounds are listed in Table 2. The results showed that the lignin in the seeds contain high levels of C units and H units without any traces of G and S units. In contrast, lignins present in other tissues (stems, leaves, and pods residues after seed isolation) were composed mainly of G and S units with essentially no C units.
The composition of monolignol structure in seeds was further characterized for the assignments for C lignin and G/S lignin in Fourier transform infrared spectroscopy (FTIR) and Nuclear Magnetic Resonance (NMR) spectra in orchid seeds as previously established.11-13 FTIR spectra from the seeds at five developmental stages indicated C lignin deposited in the early stage with distinct C lignin bands at 1154 cm-1 and 823 cm-1 (Figure 2A). The rapid development of the two bands was observed at 108 DAP. In agreement with Py-GC/MS analysis, no typical G/S lignin was detected in P. armeniacum seeds during the five developmental stages. Analysis of the aromatic region of the 2D NMR spectra of mature seeds confirmed the presence of H and C lignin (Figure 2B). The results indicated that the monolignol composition varied among tissue types in P. armeniacum, and that the non-methylated C lignin and H lignin specifically deposited in the seed.
RNAseq analysis, de novo assembly, and functional annotation
RNA-seq analysis of P. armeniacum seeds was performed at five developmental stages (66, 87, 108, 122, and 150 DAP) to characterize the transcriptome dynamics of P. armeniacum seed development and to identify genes involved in the seed-specific lignification. Three biological replicate sequencing libraries were prepared from each stage. A total of 104.12 Gb clean data were generated from each library after filtering out low-quality data. The transcriptome details for each sample are shown in Table S1. Q30 of the raw data ranged from 92.10 % to 94.11 % indicating high-quality reads worthy of further analysis. Since no reference genome is available for Paphiopedilum, a de novo assembly of all 347,075,856 reads into 433,854 transcripts with an N50 length of 1180 bp and 183,737 unigenes with an average length of 860 bp. The transcripts and unigenes length distribution are shown in Table S2. The biological reproducibility was assessed using the Pearson’s correlation coefficient, which showed that the correlation between samples among the same biological replicates was high (Figure S1A). Principle component analysis (PCA) of all samples during seed development was also performed (Figure S2B). Consistent with their distinct developmental stages, samples from different biological replicates were clustered separately.
For annotation, 183,737 unigenes were subjected to BLASTX searches against the sequences in the NCBI non-redundant protein sequences (NR), Swissprot, Gene Ontology (GO), the Clusters of Orthologous Groups (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. As a result, a total of 92,235 unigenes (50.20% of all unigenes) could be assigned at least one putative function from one of these databases (Table S3). A total of 89,285 unigenes were annotated from the NR database. Further analysis of the matched sequences showed that the P. armeniacum transcript was highly similar to Dendrobium catenatum (29.86%), Apostasia shenzhenica (8.38%), and Phalaenopsis equestris (5.45%) as shown in Figure S2A. For GO annotation, a total of 29,301 unigenes from P. armeniacum seeds were annotated into three GO pathways, as shown in Figure 3B. The functions of unigenes in biological process classifications contained cellular process, metabolic process, and biological regulation. Cell part, cell, and organelle were the most abundant functions in terms of cellular component classifications. The most abundant biological process functions are metabolic process and cellular process. In the molecular function classification, binding and catalytic activity were more abundant. The main GO entries were consistent with the fact that the cells divided frequently during seed development. Other entries such as catalytic, metabolic and binding activities were relatively high.
Differential gene expression and KEGG enrichment analysis
With the restrictive conditions of False Discovery Rate (FDR) < 0.05 and log2 ratio ≥ 1.0, unigenes that were differentially expressed in the seeds at the five developmental stages were identified. In total 8,722 differentially expressed genes (DEGs) were identified among all the libraries. The number of DEGs increased with seed development and peaked among 122 DAP and 66 DAP seeds (Figure 3A). A total of 576 common DEGs were identified in 87 DAP, 108 DAP, 122 DAP, and 150 DAP among all developmental stages, implying that these DEGs might be associated with seed development (Figure 3B).
KEGG enrichment analysis with the DEGs in 87 DAP vs 66 DAP, 108 DAP vs 66 DAP, 122 DAP vs 66 DAP, and 150 vs 66 DAP (Figure 4 and Table S4) were performed to provide more insight into the DEGs regulating lignin deposition at the later stages of seed development. The top-enriched KEGG pathways of these DEGs were secondary metabolite biosynthesis pathways, including flavonoid, phenylpropanoid and flavone, and flavonol biosynthesis pathways, etc. Among them, flavone and flavonol biosynthesis (ko00944) and flavonoid biosynthesis (ko00941) are highly enriched throughout the five key developmental stages, suggesting these pathways are potentially important for seed development such as with pigment accumulation, but not specifically related to lignin deposition. Our premise is that genes involved in non-methylated lignin accumulation would be induced primarily during the lignin rapid accumulation stage (from 101 DAP to 150 DAP). The analysis showed that phenylpropanoid biosynthesis (ko00940) and phenylalanine metabolism (ko00360) are indeed highly enriched in 108 DAP vs 66DAP, 122DAP vs 66DAP, and 150 vs 66DAP. Other highly enriched pathways were DNA replication (ko03030) in 87DAP vs 66DAP, starch and sucrose metabolism (ko00500) in 87DAP vs 66DAP and 108DAP vs 66DAP, and plant hormone signal transduction (ko04075) in 150DAP vs 66DAP.
Identification of genes potentially involved in non-methylated lignin biosynthesis
The monolignol biosynthesis pathway is well established, and a series of enzymatic reactions catalyzed by specific enzymes have been identified, including phenylalanine ammonia lyase (PAL), cinnamic acid 4-hydroxylase (C4H), 4-coumarate: CoA ligase (4CL), cinnamoyl-CoA reductase (CCR), cinnamyl alcohol dehydrogenase (CAD), hydroxycinnamoyl CoA: shikimate hydroxycinnamoyl transferase (HCT), 4-coumarate 3-hydroxylase (C3H), caffeoyl shikimate esterase (CSE), ferulic acid 5-hydroxylase (F5H), caffeic acid O-methyltransferase (COMT), and caffeoyl-CoA 3-O-methyltransferase (CCoAOMT)10,15 (Figure 5). The annotated lignin pathway gene in the KEGG database identified 20 Paphiopedilum genes that are homologous to genes potentially involved in monolignol synthesis (Table S5). Phylogenetic trees for the individual protein families of Arabidopsis, Dendrobium, and Phalaenopsis were constructed using the CDS sequenced listed in Table S10. The results showed that the identified genes in lignin biosynthetic pathways had a close genetic relationship with those reported in other plant species (Figure 6).
Most lignin-related genes displayed a very similar transcript level between 66 DAP and 87 DAP, prior to the rapid accumulation of lignin (Figure 6 and Figure7). Several monolignol synthesis related genes PAL, 4CL, HCT, and CSE demonstrated increased expression levels from 87 DAP to 150 DAP. Despite the high expression levels of these genes, most CCoAOMT, COMT, and F5H were absent or down-regulated from developing seeds, and a few displayed a significant decrease in the expression level. CCoAOMT expression level is relatively high at 66 DAP, but the low expression level of F5H made the conversion of caffeoyl moieties to feruloyl moieties inefficient, which resulted in no G and S lignin production at the early stage of the seed development. Later, the decrease in the expression level of CCoAOMT and upregulation of upstream genes in monolignol pathway caused the accumulation of H and C lignins instead of G and C lignins.
qRT-PCR validation of candidate genes involved in lignin biosynthesis
Contigs corresponding to PAL, 4CL, HCT, C4H, CCoAOMT, F5H, and COMT in five development stages of seed development were selected for qRT-PCR validation to verify the accuracy and reproducibility of the RNA-seq results. Methylated lignin-rich tissue stems were also included to verify our predicted monolignol pathway. The sequences of the primers used are given in Table S6. Reliable estimation of gene expression by qRT-PCR requires stable reference genes for data normalization. Six housekeeping genes ACTIN1, ACTIN2, TUBLIN, UBIQUITIN, RPS3A1, and RPS3A2 were tested for suitability as reference genes in the five development stages of seeds, leaves, and stems. qRT-PCR was performed using the primers listed in Table S8 to evaluate the expression variation of these candidates across seeds and vegetative tissues. The average threshold (CT) values of the candidate genes ranged from 17.25 to 23.66 in seeds and vegetative tissues (Table S9). The coefficient of variation (CV) and ΔCT are parameters used to identify suitable reference genes. The lower the values, the more stable the gene expression. ACTIN1 and ACTIN2 were the most stable genes among the six candidate reference genes. Therefore, ACTIN1 was chosen as a reference gene for this study. The additional experimental details on qRT-PCR are listed in Fig S3 to demonstrate the reliability of the assay.
Consistent with the transcriptomic data, PAL, 4CL, and HCT showed increased expression level as the seed approached maturity (Figure 7). CCoAOMT displayed relatively high transcript levels at the early stages of seed development and low transcript levels during the later stages. F5H and COMT were also down-regulated during seed development. The results confirm the reliability of the transcriptomic data. Stems displayed relatively high transcript levels of CCoAOMT in contrast to the mature seeds, which suggests that down-regulation of the first lignin methylation gene CCoAOMT is crucial for non-methylated lignin production.