3.1 Dramatic morphological and reserves changes during seed development
Whole soybean developing seeds at four sequential time points (20, 25, 30, 40 days after flowering (DAF)) that cover major stages of the seed development were sampled in triplicate (Fig. 1A). This period of seed development is coincident with seed growth stages from R5 (beginning seed) to R6 (full-size seed) of soybean development, which are morphologically comparable to early and mid-maturation stages as described in JY Lin, BH Le, M Chen, KF Henry, J Hur, TF Hsieh, PY Chen, JM Pelletier, M Pellegrini, RL Fischer, et al. [7]. It is also cooccurring with rapid accumulation of lipid and storage proteins [20, 33]. thus the seed size and seed weight increased significantly as it grew towards a fully developed seed (Fig. 1A). Two major storage reserves, protein and oil, in soybean seeds were quantified during this period. As shown in Fig. 1B, accumulation of storage oil was approximately 14.11% in seed at D20 and continuously increased to 17.85% at D40, with the greatest increase by 1.85% occurring in the period from D20 to D25 relative to the later periods at D25-D30 by 1.11% and D30-D40 by 0.78%. In contrast, as the seed matured, the storage protein content was 50.87% in seed at D20 and decreased to 45.83% (5.04% decrease) in seeds at D25, and continuously declined to 42.26% at D30 and to 41.77% at D40. Thus, oil content in the developing seeds increased more dramatically at early and less dramatically at the late period of seed filling, and vice versa for protein decrease. Reduction in the relative percentage of protein content over the period is due to its negative correlation with oil content in soybean seeds [34], on the contrary, its actual protein levels increased rapidly between 10–40 DAF [35]. Hence, this period of seed development with rapid morphological changes (such as size, weight) and major metabolic changes (such as oil, protein) should be governed by a variety of cellular processes/genes in seeds [17, 20].
3.2 Transcriptomic abundance shifts during seed development
An in-depth understanding of biological processes associated with seed development and reserves accumulation in soybean seeds is important for improving seed quality and yielding potential. Given the multiplicity of genes controlling seed development coupled with seed filling, time-course transcriptomic profiling of four-stage seeds that covered the representative period of seed development was carried out with an aim to obtain a global view of regulatory mechanism underlying this complex development process. In total, ~ 313.4 million sequencing reads from all sampled seeds were generated, with 26.1 million reads per sample (Table S1). Over 97.43% (25.4 million), on average, were kept after trimming the low-quality reads and used for read alignment. Uniquely mapped reads were used to quantify gene expression abundance as normalized by Count Per Million (cpm). After removing reads/transcript < 5, we identified that, an average, 3,5000 (62.5%) out of 56,044 gene models were expressed in developing seeds. We observed that a gradient reduction (approximately 700 genes, 2% of total) in the number of expressed genes over the course of seed development. The reduction could be related to the shutdown in many transcription processes as the seeds grew toward dormancy [36] and it would be intriguing to determine if this reduction is associated with a programmed increase of methylation levels during soybean seed development [7].
Both clustering dendrogram analysis and multi-dimensional scale indicated that gene expression levels among the replicated samples were generally highly related relative to those between time-point samples (Fig. 2A, B), indicating the gene expression specificity to the time points. Based on transcript abundance, approximately 61.07% expressed gene were in the range of 0–10 cpm, 36.41% were in the range of 10–100 cpm, and only 2.53% of genes are highly expressed with cpm ≥ 100 (Fig. 2C). Interestingly, as seeds enlarged, the number of genes with an abundance ≥ 10 cpm generally decreased while the number of those with an abundance ≤ 10 cpm gradually increased. Among the expressed genes, 32,326 were expressed across all four-stage developing seeds. There were 942, 329, 200, and 546 genes that are specifically expressed at D20, D25, D30, and D40, respectively (Fig. 2D). These results suggested a temporal shift in transcript abundance for many highly expressed genes as seeds developed, which might be reflective of the foregoing morphological and/or metabolic changes during seeds development. All the raw sequencing data were deposited to the Sequence Read Archive of the National Center for Biotechnology Information (BioProject: PRJNA551573).
A cluster dendrogram analysis showed that replicates from each condition were closely clustered, supported by a similar separation pattern in a multidimensional scaling (MDS) plot of log-FC values. We further performed a qRT-PCR for randomly selected 16 differentially expressed genes (Table S4) to validate and a high correlation (r = 0.87) between the RNA-Seq and qRT-PCR indicates the robustness of our RNA-Seq result.
3.3 Genes with high transcript abundance during seed development and filling
We next characterized the genes that were constitutively, highly expressed across all four time points. In total, 443 genes with mRNA abundance cpm ≥ 100 were identified [37] (Table S3). GO terms enrichment analysis indicated that biosynthetic processes (BP) essential for nutrients accumulation were enriched, such as amino acid metabolism, carbohydrate metabolism, glycolysis, and hexose metabolism (Table S3). Accordingly, this gene set was also enriched for molecular function (MF) terms of nutrient reservoir activity cellular component (CC) terms related to lipid storage body, endoplasmic reticulum (ER), ribosome where fatty acids were primarily stored and synthesized [38, 39]. If we relaxed the threshold to cpm ≥ 50, the number of newly defined highly expressed genes was increased by three folds to 1,386 genes (Table S3). Accordingly, the numbers of the expressed genes that fell in the majority of those enriched terms were approximately doubled or tripled, such as hexose metabolic process (8 to 22), glycolysis (7 to 18) cellular nitrogen compound metabolic process (18 to 36), cellular amino acid biosynthetic process (15 to 32), Ribosomal proteins (47 to 132). It should be noted that no change in the numbers of the genes with the terms related to monolayer-surrounded lipid storage body (7), nutrient reservoir activity (11), and sulfur amino acid metabolic process (6 to 7) was observed, suggesting the importance of maintaining high transcript abundance of these genes for seed filling. Additional enriched GO terms for the 1,343 genes were identified, protein localization, glutamine family amino acid biosynthetic process, intracellular transport, gene silencing by RNA, vesicle-related components (COP-II-, COP-I, coated vesicle, transport vesicle, ER to Golgi transport vesicle, membrane-bounded vesicle).
A close investigation of the 443 genes (cpm ≥ 100) revealed several genes with demonstrated roles in the accumulation of storage reserves, such as oil, protein, and carbohydrates. It is not surprising that some genes participating metabolism of protein and oil were identified. However, a many of the highly abundant genes during seed filling was rarely reported (Table S3), such as oil accumulation-related genes, FAD2 (467.83–2000.69) catalyzing oleic acid accumulation [40] and OLEOSIN proteins (373.03–6190.57 cpm) encoding oil body surrounded lipid storage body [29, 39], metabolic genes for biosynthesis of leucine, thiamin (vitamin B1), and Ribosomal proteins. It is noted that genes encoding storages proteins such as cupin family proteins rank the highest transcript abundance, ranging 34000–100000 cpm, in developing seeds, followed by xylem bark cysteine peptidase 3 (Glyma.08G116300 with 28399.81 cpm), trypsin inhibitor 1 (222.79 ~ 18559.51 cpm), and oleosin family proteins. Oleosins were highly abundant (373.03–6190.57 cpm). T hose proteins in soybean appear to be redundant and serve as a surfactant of the oil body [41, 42]. A recent study on the oleosin on chr20 [29] indicates that it is likely regulated by LEC2 [15], by which oil production is enhanced through increasing oil body turnover. Investigation of the cis-regulatory sequences in the promoter region could be helpful to use this set of genes to achieve oil improvement. In contrast, a set of high-abundant trypsin inhibitor genes (222.79–18559.51 cpm) could be silenced by gene editing to inhibit their expression or mutate them for better seed quality because trypsin inhibitors have an antinutritional effect [43].
More attention could be given to nutrient transporters that mediate nutrient flux into the developing embryo, such as sugar transporter SWEETs (274.98–943.19 cpm), and proton pump interactor 1. SUCROSE SYNTHASEs (121.5–1186.21 cpm) controlling starch synthesis. Recently studies highlighted the essential role of SWEET proteins in transporting sugar from maternal to embryonic tissues to support the seed development and oil accumulation [12, 44] and indicated the agriculturally critical role in causing high oil accumulation in cultivated soybean seeds relative to low oil in wild soybean [11]. High abundances of these transporters together with those related studies suggested that the rate of sugar transport and sugar accumulation is important for nutrient reserves accumulation in seeds. Further investigation of downstream affected steps/components may help in identifying the major route of carbon flow for oil accumulation or carbon partitioning between oil and protein in soybean, and perhaps other plant species such as maize and rice [45, 46]. Several LOX1 genes (541.27–16226.05 cpm) were in high transcript abundance here but the homologous proteins were not detected during seed filling in wheat and maize endosperm [47, 48], suggesting a lineage-specific role of LOX proteins in soybean seeds, possibly related to seed storage protein and lipids such as unsaturated fatty acid [49, 50] or special tastes and odors [51].
Transcription factors such as ABA INSENSITIVE 3 (ABI3; Glyma.08G357600; Glyma.18G176100) and its interacting regulator bZIP67 were identified in high abundance. High abundance of both regulators may be due to increasing needs for their involved cell growth and accumulation of storage reserves as seeds enlarged. Indeed, both regulators have funcitons paritally in common and were essential for the initiation of gene expression changes for biological processes critical for seed development such as gibberellic acid signaling, photosynthesis and maturation related processes [16, 17, 52, 53]. Abundance of LEC2s (cpm = 0) were undetectable in this study, possibly because they are highly specific to certain seed compartments and barely detected in whole seeds [15]. The well-documented LEC1 homologous genes, a central regulator of seed development [17], were not in high abundance but showed decreased expression over the period. Other high-abundance regulators with demonstrated roles in Arabidopsis but were less explored in soybean seeds were also identified, such as PLASTID TRANSCRIPTIONALLY ACTIVE 9 (PTAC9), TANDEM CCCH ZINC FINGER PROTEIN 4 (TCZF4), circadian rhythm regulators including TIME FOR COFFEE (TIC), LATE ELONGATED HYPOCOTYL 1 (LHY1), CONSTANS-LIKE 4 (COL4). Determination of the function of the regulators and the cognate cis-regulatory elements may facilitate identification of seed regulatory networks that, in part, contribute to the regulation of genes temporally and spatially in seeds [17, 18, 54]. All highly expressed genes and the functional annotations were listed in Table S2.
3.4 Genes were temporally regulated over the seed development
To comprehensively identify genes involved in developmental processes during seed filling, we conducted a pairwise comparison of transcriptomic profiles among the different-stage seeds. We observed a gradient increase in the number of differentially expressed genes (DEGs) between the two neighboring time points as the seeds developed, with 523 in D25vsD20, 1,486 in D30vsD25, and 2,485 in D40vsD30 (Fig. 3). The numbers of DEGs increased when comparisons were made across non-adjacent time points (over 10 d interval), including 4,079 in D30 vs D20, 5,617 in D40 vs D25, and 7,670 in D40 vs D20. In total, the pairwise comparison identified 8,925 non-redundant DEGs that are required to program soybean seed development from early to full seed development (Table S2).
Clustering analysis was performed for all 8,925 DEGs based on expression and shown in a heat map (Fig. 4). The analysis classified the DEGs to seven groups (A to G) exhibiting different gene expression patterns. The difference in the temporal expression patterns suggested stage-specific roles of these grouped DEGs throughout seed development. Groups A and G represented the two largest clusters (2,182 and 4,945 DEGs, respectively) while exhibited contrast expression patterns, with the DEGs in cluster A displaying predominant and highest expression level at D40 and those in group G showing a continuous decrease in gene expression as seeds matured. Enrichment analysis for cluster A genes showed that D40-specific genes were enriched for cell communication (GO:0007154), transcription (GO:0006351), and the biological process enriched occurred in the cell periphery and plasma membrane (Fig. 4B). This stage represented the later stage of producing and storing reserves and the active period of the shift to maturation in which multiple programming might be initiated, such as cessation of organ differentiation and cell division, preparation for seed desiccation [55]. A representative gene with specifically high expression at D40 is ABI3 (Fig. 4C) [16], which is an essential regulator in ABA (abscisic acid) signaling primarily involved in seed maturation and dormancy in Arabidopsis [56]. Its high abundance during D20-D30 and dramatic increase toward D40 suggested its critical role during seed development, especially in the later stage of maturation-related processes. Indeed, ABI3 was recently demonstrated that it is a key transcription factor modulating the transition from the morphogenesis to the maturation phase by combinational interaction with other seed transcription factors [17]. Further study of the co-expressed genes or binding sites [17, 21] would provide an opportunity to understand how the soybean seeds are regulated. Identification of the key seed development regulators (such as ABI3) indicated the robustness of our analysis and suggested that other seed development-associated genes identified in our study deserve further attention. For example, a similar expression pattern was also observed in two SUSs (Sucrose synthase) which involved in cellulose and starch biosynthesis in seeds [57].
Genes classified in cluster B (810) were highly expressed at D40 that has a similar expression pattern as cluster A, except for those genes that were up-regulated beginning at D30. Enriched GO terms for cluster B are transmembrane transport (GO:0055085) and nutrient reservoir activity (GO:0045735), suggesting that those biological processes involved in nutrient transport such as transporters for sugar and amino acids and nutrient accumulation were active beginning at D30 and maintained at D40. The developmental period of D30-D40 is approximately coincident with the middle-maturation stage which represented the major period of accumulation for storages such as protein and lipids [7]. Expression pattern analysis showed that, despite varying expression patterns observed, those genes involved in the foregoing biological processes showed obvious upregulation during seed development and filling (Fig. 4C), such as cupin proteins, oleosins, amino acid permeases (APPs), FAD2, nitrate transporters, Seed Maturation Proteins (SMPs). The stark upregulation in expression is undoubtedly reflective of the critical roles of the related biological processes for storage reserve accumulation. Transcription factors involved in seed development were also upregulated in the cluster B, such as AUS/IAA transcription factor, YABBY, uncharacterized PLATZ. Interestingly, the genes involved in flowering (TFLs, Tof12, PHYA) showed an upregulation-downregulation shift during the course, suggesting their expanded roles during seed development but have yet been explored. This finding was partially supported by a recent study where Arabidopsis TLF1 not only functioned as a shoot identity gene but also served as a signaling molecule that is essential in determining seed size, and loss-of-function mutant tfl1-20 producing larger developing seeds than wild type by regulating endosperm cellularization [58]. The expression patterns for those DEGs during the seed development were illustrated in Fig. 4C.
In contrast, top GO terms enriched for the genes in cluster G were mainly related to photosynthesis (GO:0015979) and microtubule-based process (GO:0007017). Down-regulation of these gene sets mainly reflected the reduction in the photosynthetic process as seeds developed, as evident by the downregulation of the bulk of genes encoding varying processes of photosynthesis, such as Chlorophyll A-B binding family proteins, high chlorophyll fluorescent proteins, and light-harvesting complex photosystem II (Table S1). Other family genes exhibiting downregulation expression patterns were also identified, such as GA2OXs, motor proteins, LACS9s, stay green-like, LEC1-like genes (Table S1). The decrease in the expressions of these genes with diverse roles associated with phytohormones (such as GA) [59], chlorophyll breakdown during senescence [60], seed developmental processes [54], which might be corresponding to multiple processes of gradual cessation of cell division or commencement of maturation as seeds developed and matured. Clusters C-F comprises relative fewer numbers of DEGs that display relatively high expression levels in two of four selected time points, and no GO term was significantly enriched.
3.5 DEGs have tissue-specificity in expression and exhibit tissue-specific functionality
To better understand the change or the DEGs, we examined the spatial expression profiles of all the DEGs in different developmental tissues using the Harada-Goldberg soybean RNA-seq dataset (http://seedgenenetwork.net). This dataset contains 10 tissues of Williams 82, the same accession as used in our study, including 6 reproductive tissues (floral buds (FB), five-stage developing seeds including whole-mount globular (GLOB), heart (HRT), cotyledon (COT), early-maturation (EM), dry seed (DS)) and four vegetative tissues (leaves (LF), roots (RT), stems (STM), and seedlings (SDL)). This independent research is complementary to our study, and an integration of our result with spatial analysis using multiple tissues may provide additional insight into the mechanism.
The DEGs exhibited clear tissue-specific (TS) patterns based on the expression similarity. This analysis grouped the DEGs into seven large (TS1-TS7) groups each containing over 400 genes and the rest three groups (not shown) comprising less than 30 genes (Table S2). The seven large groups were indicated in Fig. 5A and further investigated. Enrichment analysis for the seven TS groups reveals that none of the top five enriched GO terms were in common, indicating the tissue-specific roles. Interestingly, genes that were specifically expressed in developing seeds were divided into three groups, TS1 grouped genes that are highly expressed in three early-stage developing seeds, GLOB, HRT, and COT, TS2 in the EM-stage seeds, and TS3 in the later-stage seeds, DS. Approximately 2,647 (29.7%) of the DEGs were clustered to the TS group 1 (TS1) consisting of genes primarily expressed in GLOB-, HRT-, and COT-stage seeds and these genes showed continuous expression either up-regulation or downregulation. The majority (79.1%, 2,094 of 2,647) of the TS1 fell into the G group (down-regulation group) (Fig. 5A, 5C) exhibiting downregulation as seeds grew. Genes (585, 6.6% of total) in TS2 were predominantly expressed in the EM-stage seeds. 970 (10.9%) of DEGs were exclusively expressed in DS in TS7. This obvious spatial difference in expression among the different-stage seeds, suggesting the distinct roles of the two gene sets in seeds. Indeed, TS1 genes mainly involved in processes related to cell cycle and DNA replication, which might account for the early seed compartment morphogenesis and growth (Fig. 5B), and TS2 (EM) appeared to be more specialized in metabolic processes such as fatty acid biosynthesis and metabolism of carboxylic acid, and organic acid. In contrast, DS-specific genes were involved in the response to environmental stimuli, such as temperature, heat, alcohol, and plant hormones that regulate plant growth such as abscisic acid, and over half (67.7%) of this gene set were upregulated, (Fig. 5C). For example, LEA14 (LATE EMBRYOGENESIS ABUNDANT 14) from TS7 involved in wounding and light stress as well as perhaps protection against desiccation [61, 62]. Thus, the DS-specific DEGs might be involved in preparing transcripts for dormancy and possibly seed germination once in a favorable condition. Recent studies indicated that expression for many of these seed-expressed genes were highly specified spatially in seeds, and the specificity represents the functional relevance. For example, a seed coat-specially expressed SWEET39 involved in sugar delivery from maternal plant to developing embryo [63], endosperm-specific TFL1 involve in endosperm cellularization by which affects seed size [58], and endosperm-exclusive SWEET15 involved in seed development by regulating sugar delivery into developing embryo [12]. These studies suggested that many of the seed-specific genes identified here (Table S2) can be prioritized for detailed investigations to gain more comprehensive understanding, such as YABBA, GA3OX1, NF-YB6, NF-YA9, FAD8, KAS I/III, SWEETs, OLEOSINs, AATs (amino acid transporters), and uncharacterized LEC1-LIKE (Glyma.07G268100, Glyma.17G005600) that serves as a key regulator of fatty acid biosynthesis in Arabidopsis [64].
A total of 1357 (15.2%) DEGs were clustered in TS3 that were mainly expressed in leaves and were involved in many aspects of photosynthesis, and over 80% of them were downregulated as seeds matured (Fig. 5C). This result indicated that this gene set maintained the photosynthetic roles and was involved in seed developmental processes. It is likely that expression for many of these photosynthesis-related genes in seeds might be influenced by mosaic interactions of different transcription factors such as LEC1-ABI3-bZIP67-AREB3 [17, 54]. Genes that were specially expressed in RT and STM respectively shared many DEGs in common and were clustered in TS5, in agreement with the tissue functionality that are both involved in nutrient translocation [65], these genes mainly involved in carbohydrate and sugar metabolism and nutrients/ion transport to meet the large demand for nutrient uptake and transport (Fig. 5B). TS6 is the smallest group (400) that clustered the genes highly expressed in SDL and involved in processes such as oxidation-reduction. These results indicated that genes that have expression specificity in non-seed tissues such as LF, FB, and RT also represent its tissue-represented function in developing seeds (Fig. 5B). These genes remained the expression in relatively low abundance in developing seeds, suggesting their indispensable roles to maintain tissue-representative function for seed development. For example, AAP2 from TS6 (stem and leaf) affect levels of nitrogen and carbon assimilation and seed yield by altering xylem-phloem transfer of amino acid transfer [66]; FRD3 (MATE efflux family protein) from TS6 functions in root xylem for efficient iron uptake from the xylem into leaf cells [67]. Therefore, identification of these genes with non-seed tissue-specificities in developing seed was reflective of necessity of basic functionalities of the non-tissue genes in maintaining needed processes to support seed development. Further investigation of them may provide deep insight into their supportive roles during seed development.
3.6 Co-expression network inferred seed regulatory network
We next extracted the co-expression connection of all differentially expressed genes from a global SoyNet database (http://www.inetbio.org/soynet/) that was constructed with transcriptomes from different seed compartments at different developmental stages along with other tissues [30, 54]. Only a small fraction of the DEGs can be reconstructed into modules. In total, 134 modules were identified with the node (genes) ranging from 2 to 151. After color assignment based on the foregoing TS groups, 11 modules consisting of relatively tissue specific patterns were identified (Fig. 5D). It is interesting that many genes from TS1 (GLOB, HRT, COT) and TS3 (LF), rather than other TS groups, were respectively clustered together as individual TS-specific modules. Two modules specific for TS1 were identified. TCX2 (TESMIN-LIKE CXC2) controls stem cell division by regulating stem cell-type-specific networks [68]. It was slightly deviated from the main module, suggesting its role in orchestrating the connected genes of this module for a coordinated division of different cell types for seed development, in agreement with the function of GLOB/HRT/COT where ontogenesis of different seed compartments occur during this stage (http://seedgenenetwork.net/). Similarly, TDM1 plays important role in meiosis termination [69] and it appears to be extensively associated with genes in the TS1-dominated modules. We identified that the uncharacterized TKL2 (TRANSKETOLASE 2) might be a key connector bridging two modules comprising LF and GLOB/HRT/COT genes respectively. Likely, the link between RBCS3B (RUBISCO SMALL SUBUNIT 3B) and BLH1 (EMBRYO SAC DEVELOPMENT ARREST 29) connected the two individual modules. These results suggest a functional relevance of these modules in seed development and these genes that were under-explored in soybean deserved more attention.