Hierarchical clustering of somatic embryo transcriptomes
In the present study, transcriptome datasets generated through microarray experiments were retrieved from the NCBI Gene Expression Omnibus (GEO) database (GSE48915) covering four somatic embryo developmental stages (with two replicates for each stage), referred to herein as Stage I (zygotic embryos), II (proliferating tissues at 7 days of induction), III (proliferating tissues at 14 days of induction) and IV (mature somatic embryos). The hierarchical clustering of samples (Fig. 1a) confirmed that the sample replicates of each stage have a higher degree of correlation with each other than with other developmental stages; sample outliers were not detected in the dataset. The clustering heatmap clearly distinguished four discrete clusters of related expression patterns corresponding to the stages of somatic embryo development (Fig. 1b). Further, Stage I showed a poor correlation with the other three stages. This suggests that Stage I may have a distinct expression profile as compared to other somatic embryo developmental stages.
Filtering of genes for the GCN construction and downstream analysis
As recommended by Langfelder and Horvath (2008) [32] genes were filtered by the variance for the GCN construction; filtering genes for variance greater than 0.25 quantile identified a total of 17,059 genes (Fig. 2a; see Additional file 1: Table S1). This included 445 EMBRYO-DEFECTIVE (EMB) genes [37], 10 SE marker genes [38] and 1,711 Arabidopsis transcription factors (TFs; 65.3%).
In addition, DEGs were identified by a pairwise ratio of expression between consecutive stages of development. A total of 2,244 genes were identified by threshold filtering based on log Fold Change (FC) (log2 FC ≥ 2.0 | log2 FC ≤ 2.0) and p-value < 0.05. 64 EMB genes [37], four SE marker genes [38] and 458 TFs were present within the DEGs identified (see Additional file 1: Table S2). Only Photosystem II Subunit Q gene (AT4G05180) was differentially expressed throughout SE (Fig. 2b).
Construction of GCN
The expression profiles of the filtered 17,059 genes were used to construct a scale-free gene expression network with a soft threshold of 15 (Fig. 3a). The dynamic hierarchical clustering approach integrated with the WGCNA pipeline distinguishes groups of genes with co-expression patterns and clusters them into network modules. In total, 26 distinct co-expression gene modules were detected with the module size ranging from 35 to 3,418 genes (Fig. 3b and c); each module was assigned with a unique colour. The module comprising most genes was the turquoise (3,418 genes) followed by the blue (2,973 genes) and brown (2,437 genes) (Fig. 3b). The expression profiles of co-expressed genes clustered in each module were summarized as ‘module eigengenes’ (MEs). Among the filtered genes, 13 genes that failed to fit within a distinct group were assigned to the grey module and removed from the downstream analysis. Module preservation analysis indicated high module preservation, confirming that the modules generated here can also be found in diverse independent datasets (Fig. 3b). Each module was exported and visualized using Cytoscape (Fig. 3d).
Identification of stage-related modules
The relationships between the gene modules and different somatic embryo developmental stages were determined by assessing the Pearson correlation coefficient (r) between the MEs and developmental stages. Module-trait correlation analyses revealed that multiple modules are related to SE (Fig. 4a). A total of 18 modules were significantly associated with the somatic embryo developmental stages (|r| > 0.8 and p-value ≤ 0.01; Fig. 4), and these modules were “stage-specific”, i.e., the module was significantly associated with only one particular developmental stage of SE; tan, turquoise, dark-orange and green to Stage I, grey60, magenta, brown, and light-yellow to Stage II, green-yellow, dark-gray, dark-green, orange, blue, light-green and light-cyan to Stage III, and pink, dark-turquoise, salmon and yellow to Stage IV. Gene significance, the correlation between modular gene expression and each stage, is shown in Fig. 4b.
Functional enrichment analysis of ‘stage-specific’ gene modules
Gene ontology (GO) enrichment analysis performed on ‘stage-specific’ modules showed that the genes in green and turquoise modules which exhibited a significant association with Stage-I were mainly enriched in the biological processes being involved in post-embryonic development, hormone-mediated signaling pathway, biosynthesis pathways (sterol and fatty acids), DNA methylation and transcription regulation. Genes in brown, light-yellow and magenta modules, which showed significant association with Stage II, were mainly enriched in the biological processes involved in root and shoot development, ATP synthesis, response to the metal ions and DNA replication whereas genes in blue and light-cyan modules, which showed significant association with Stage III, were enriched for the biological processes involved in transition post-embryonic and seed development, hormone- and sugar-mediated signaling pathways, cell differentiation, protein modification and RNA processing. Moreover, the yellow module, which showed a significant relationship to Stage IV, was mainly enriched in biological processes involved in ion transport, post-embryonic development, signal transduction, lipid localization, response to oxidative and water stress as well as response to phytohormones (abscisic acid, gibberellin, cytokinin and jasmonic acid).
Analysis of hub genes
Hub genes are nodes in a network often hypothesized to be functionally significant due to their high degree of intra-modular connectivity. A total of 260 genes (top 10 genes of each module with high connectivity) were identified as potential hub genes; the hub gene with the highest degree of connectivity in each module is given in Table 1 (the complete list of hub genes is given in Additional file 2: Table S3). GO enrichment analysis of the hub genes revealed that they are mainly enriched for biological processes such as metabolic processes (mRNA and cellular amino acid), oxidation-reduction, protein folding and post-embryonic development.
Among the hub genes, only 234 genes were functionally annotated; of these, 13 were TFs: AUXIN RESPONSE FACTOR 9 (ARF9), FLOWERING BHLH 4 (FBH4), BASIC HELIX-LOOP-HELIX 39 (BHLH39), BASIC LEUCINE-ZIPPER 44 (bZIP44), bZIP19, ZIM-LIKE 2 (ZML2), AT5G60820, AT4G01270, KANADI 3 (KAN3), HOMEODOMAIN GLABROUS 4 (HDG4), CELL DIVISION CYCLE 5 (CDC5), NAC DOMAIN CONTAINING PROTEIN 80 (NAC080) and SALT TOLERANCE (STO)). In addition, five genes encoding transposable elements (i.e., AT2G11560, AT3G33066, AT5G32430, AT3G42820 and AT4G28900) were identified.
In silico analysis of the promoter sequences (1000 bp upstream from the transcription start site) of the hub genes using the Multiple Em for Motif Elicitation (MEME) tool identified four significant motifs ranging in length from 15 to 29 bp (Table 2). Motifs 1, 2 and 3 were detected across 229, 245 and 121 hub genes, respectively. Further analysis of the predicted motifs using the GOMo (Gene Ontology for Motifs) tool provided in the MEME suite indicated that motifs 1 and 3 may be involved in the DNA endoreduplication, polarity specification of axial/abaxial axis and hormone-mediated signaling pathways; motif 1 and 3 seem to function in association to cytokinin and gibberellic acid, respectively.
Validation of hub genes
Comparison of hub genes and DEGs showed that 31 hub genes are differentially expressed in SE (the expression values of differentially expressed hub genes are given in Additional file 3: Table S4). Further, expression analysis of these genes using Arabidopsis eFP browser demonstrated that two hub genes, AT1G19540 and AT5G44380 exhibit a seed-specific pattern of expression (Fig. 5).
Moreover, analysis of the expression profiles of hub genes in the Arabidopsis somatic embryo transcriptome dataset (E-MTAB-2465) published by Wickramasuriya and Dunwell (2015) revealed that 62 hub genes are differentially expressed in somatic embryonic tissues compared to leaf tissues (|log2 FC| ≥ 2.0 and p-value < 0.05; Fig. 6). Of these, 15 genes were identified as DEGs in the present analysis. For instance, CYSTEINE-RICH TRANSMEMBRANE MODULE 7 (ATHCYSTM7/AT2G33520), HEPTAHELICAL TRANSMEMBRANE PROTEIN2 (AT4G30850), INDOLE-3-ACETIC ACID INDUCIBLE 30 (IAA30/AT3G62100), RPS9C, VASCULATURE COMPLEXITY AND CONNECTIVITY (AT2G32280), AT2G21820, AT2G38900 and AT5G43770 showed marked expression in somatic embryonic tissues as compared to leaf tissues. Expression analysis using the Arabidopsis eFP browser further showed that AT2G29300, AT2G21820, AT2G38900, AT5G43770, ATHCYSTM7 and AT1G19540 exhibit a seed-specific pattern of gene expression.
As expected few hub genes highly expressed in leaf tissues were repressed in somatic embryos indicating the importance of gene regulation in SE (Fig. 6); for instance, CELLULOSE SYNTHASE-LIKE B4 (AT2G32540), CHOLINE/ETHANOLAMINE KINASE 3 (AT4G09760), GLUTAMATE DECARBOXYLASE 2 (AT1G65960), ISOPROPYLMALATE ISOMERASE 2 (AT2G43100), PEROXIREDOXIN Q (PRXQ/AT3G26060), PHOTOSYNTHETIC NDH SUBCOMPLEX L 4 (PnsL4/AT4G39710), PLASTID RIBOSOMAL PROTEIN S20 (AT3G15190), STO (AT1G06040), SINAPOYLGLUCOSE 1 (SNG1/AT2G22990), THYLAKOID RHODANESE-LIKE (TROL/AT4G01050), TONOPLAST INTRINSIC PROTEIN 2 (TIP2/AT3G26520), AT3G50685, AT4G33666, AT5G16010 and AT5G54540 genes showed marked repression in somatic embryos compared to leaf tissues.
In summary, the present study identified a total of 78 hub genes as potential regulators of SE (Fig. 7), including genes showing marked over-expression as well as repression in SE. Of these, 41 genes have not been functionally annotated thus far. Analysis of the promoter sequences of these uncharacterized hub genes using the Plant cis-acting regulatory DNA elements (PLACE) database identified a total of 215 different plant cis-acting regulatory elements (CREs); ARR1AT, CAATBOX1, CACTFTPPCA1, DOFCOREZM, GATABOX, GT1CONSENSUS, POLLEN1LELAT52 and WRKY71OS observed in all 41 functionally uncharacterized potential hub genes. Moreover, several CREs related to embryogenesis were identified (Fig. 8). The functions of the predicted CREs are included in Table 3.
Distribution of embryogenesis-related genes across network modules
Further exploration of genes mapped to each network module found that 10 key regulators of SE including LEC1, FUSCA3 (FUS3) and ABSCISIC ACID INSENSITIVE 3 (ABI3) are present among the highly connected genes in the network (Table 4); SE-related marker genes, LEC2, SERK1, WUS, BBM and WUSCHEL RELATED HOMEOBOX 2 (WOX2)) showed low variance in the present dataset and thus were not included in the GCN analysis. We also observed that the majority of previously published EMB genes [37] are localized to the blue and turquoise modules, which showed significant association with Stage I and Stage III, respectively (Fig. 9; see Additional file 4: Table S5).
In addition, we observed that 1,711 Arabidopsis TFs are distributed across all the gene modules except in light-green and royal-blue modules, with the highest number of TFs present in the turquoise module (the complete list of TFs included in the GCN is given in Additional file 5: Table S6). Notably, AP2/EREBP (APETALA2/Ethylene-responsive element binding proteins), bHLH (basic helix–loop–helix), bZIP, C2H2 (Cys2-His2), HB (Homeobox), NAC (NAM, ATAF, and CUC), MYB (MYB-domain), C3H and WRKY TF families were highly represented (Fig. 10a). Of these, members of AP2/EREBP, bHLH, C2H2, HB, NAC, MYB and WRKY TF families were involved in early SE (Fig. 10b). Interestingly, TFs that are targets of several microRNAs (miRNAs) were also recovered from the GCN (Additional file 5: Table S7).
Notably, several genes encoding epigenetic regulators were localized in network modules (Fig. 11). This included 14 genes involved in DNA modification, 51 genes involved in histone modification, 34 genes involved in chromatin remodeling, 15 genes encoding polycomb-group proteins and 55 genes associated with RNA silencing (see Additional file 6: Table S8). Each of these genes directly interacted with numerous modular genes forming a complex network.