Time-series expression pro le analysis to identify important modules and biomarkers in early phase of acute lung injury based on WGCNA


 Background: The morbidity and mortality associated with ALI continue to be significant. Few medical therapies have demonstrated efficacy in curbing the progression of ALI or improving its outcomes. However, time-series expression data has enhanced our ability to query dynamical processes, and WGCNA and maSigPro have emerged as a promising approach for processing large datasets. Therefore, it is possible for us to explore the molecular mechanism in the progression of ALI.Methods: Downloaded time series gene expression dataset GSE2565 from the Gene Expression Omnibus (GEO), and normalized the dataset using the “sva” R package. maSigPro were used to screen Differential expressed genes (DEGs) and weighted gene co-expression network analysis (WGCNA) were performed to identify hub modules. Gene Ontology and pathway enrichment analyses of genes identified in the hub module were conducted by Metascape. Integration of module analysis and CytoHubba application for identifying hub genes. Finally, receiver operating characteristic (ROC) curve analysis was to measure the predictive accuracy of the hub genes.Results: In our study, 3005 DEGs were included in WGCNA and 8 modules were identified. Module-trait analysis presented that red module with the most negative correlation with 8hr mainly involved in the regulation of circadian rhythm; pink module with the most positive correlation with 8hr mainly involved in regulation of cell death. Five hub genes (Bnip3, Cdh11, Fam134b, Sult1a1 and Zbtb16) were identified followed by ROC curve analysis.Conclusion: Our analysis based on time-series expression data identified significant co-expression modules and pathways correlated with early phase of acute lung injury. The hub genes identified may contribute to provide new insights for the molecular mechanisms in acute lung injury.


Introduction
As the primary target of a diversity of internal and/or external environmental insults including microbe's infection, autoantibodies, toxic gasses, pollutants, gastric acids and so on, the lung often manifest itself the form of acute lung injury (ALI) after exposing to the above mentioned insults. Acute lung injury can evolve to a more severe condition known as acute respiratory distress syndrome (ARDS), which would result to severe dysregulation of function and lung damage [1][2]. Despite in-depth investigation in treatment methods, the overall mortality of ALI and ARDS , in the United States, is still a remarkable 38.5 % and 41.1 %, respectively and in Shanghai, the mortality rate of ARDS for patients >15 years is 70% [3].
Hence, therapeutic strategies to inhibit the disease progression or improve prognosis and their underlying molecular mechanisms need to be further studied.
The Progress of ARDS comprised of three phases, which is a continuum rather than a strict chronologic phase including exudative, proliferative and brotic. The exudative phase (day 1-7) is considered to play a fundamental role of initiating ALI. In the absence of recovery during the exudative phase, the condition in some patients may evolve to a brotic phase which is characterized with brosis and other irreversible pathological changes. So, timely source control should be put at the core in the treatment of ALI.
Understanding the pathogenesis of ALI is important for current and future treatments as it is linked to outcomes in ALI [3][4][5].
Earlier studies have suggested that various biological process and multiple molecular factors involved in the progression of pathogenesis of ALI/ARDS such as in ammatory response, oxidative stress, apoptosis, autophagy, CCN1 and HMGB1 were included [6][7][8][9][10][11][12]. Therefore, the pathogenesis of ALI/ARDS was affected by a complex network rather than of a single molecular factor and/or pathway because of the phenotypic complexity of ALI/ARDS engendered by different insults, which indicates that researches into more in-depth and comprehensive understanding of molecular mechanisms underlying ALI/ARDS are in demand [13]. Although several researches have individually focused on the role of the potential biomarkers for the process of ALI and identifying the association of genetic factors with susceptibility to ALI based on Differentially Expressed Genes (DEGs), which revealed that some particular genes have been signi cantly affected by the progression of acute lung injury such as MMP3 ,Timp1, Ly6i, Cxcl1 [14][15]. However, limitation to the context of a speci c gene could not fully reveal the mechanism in the relatively whole progression of ALI and some signi cant molecular regulators or pathways during disease progression might be neglected. In our current study, as for gene expression dataset GSE2565 [16], it is archives of change at the early stage(day1-3) in the progression of the acute lung injury over time. To understand the role of gene cluster affected signi cantly in the process of ALI, the dataset need to be explored, and patterns of change analyzed. Genes in Dataset GSE2565 belong to the big data domain and present time-related pattern, hence traditional methods are not in adequate ability to process them.
Weighted gene co-expression network analysis (WGCNA) has emerged as a promising approach for evaluating correlation between gene clusters and traits based on co-expression network. Further analysis can be performed to recognize hub modules and con rm key genes in the module, thus identifying potential candidate biomarkers or worthwhile targets to enhance the ability of rapid intervention and improve clinical outcomes [17][18][19].
In this study, the DEGs were identi ed from gene expression dataset GSE2565 by using maSigPro approach [20][21]. The WGCNA application was selected for processing DEGs and hub gene modules associated with the progression of ALI were identi ed, and pathway analysis on hub modules was also presented. Hub genes were distinguished from the hub modules after CytoHubba analysis based on Cytoscape [22]. To assess the predictive ability of the hub genes, receiver operating characteristic (ROC) curve analysis was preformed.

Materials And Methods
Acquisition of the GEO datasets Microarray data sets of acute lung injury were screened from database(http://www.ncbi.nlm.nih.gov/geo). GSE2565 Series on the GPL339 platform(Affymetrix Mouse Expression 430A Array) was extracted, which was submitted by Sciuito AM [16]. This acute lung injury associated with GSE2565 contains 48 phosgene-exposed samples and 56 air-exposed samples. To increase the stability of our analysis, biological duplicates were selected. So, 48 paired biological samples from Male CD-1 mice(at 0. 5,1,4,8,12,24,48, and 72 hr post-exposure) were obtained and analyzed by WGCNA (Additional le 1: Table S1).

Microarray data preprocessing
Series matrix les provided by the GEO website were used for further analysis after log2-transformation.
Probe sets were annotated with Ensembl gene IDs using the data tables of Annotation File downloaded from the GEO website. The R application, the ComBat function of sva package, was used for time-batch normalization to remove batch effects of the gene expression data from the 48 samples processed on 8 consecutive time points [23][24].
The differentially expressed genes analysis in time-course dataset MaSigPro software from bioconductor was utilized for identifying differentially expressed genes (DEGs) across different time points in mice ALI [20][21]. MaSigPro is an R package for the analysis of time course microarray experiments and it was used to nd genes with signi cant expression changes with a two step regression approach. The rst step, gene selection, performs the least-squares technique to select statistically signi cant genes between the control group and any other experimental group. The Second step, variable selection, is to identify the conditions for which genes shows statistically signi cant expression changes. In this work, the rst step for gene selection rather than the second step was only performed so that the loss of information was as low as possible. For current work, a four-element regression model was de ned and a false discovery rate (Q < 0.05) was applied to select signi cant genes.

Co-expression network construction
In this work, genes selected by masigpro approach was performed to construct a weighted gene coexpression network via the "WGCNA" R package in R Studio (version 3.6.1). An adjacency matrix of genes'similarity by pairwise Pearson correlation analysis was conducted. Using the 'pickSoftThreshold' function, a appropriate soft threshold of β = 8 was obtained for strengthening the matrix to a scale-free co-expression network. The topological overlap matrix (TOM) was transformed from the adjacency matrix, and genes were clustered into different modules which was detected by the dynamic tree cutting algorithm of WGCNA.

Identi cation of signi cant hub modules
Module eigengenes (MEs) calculated by principal component analysis(PCA) represented the major component of each gene module. the correlation between MEs and clinical traits was estimated by the module-trait relationship analysis, which allowed identifying modules related to external traits. Gene signi cance (GS) was de ned by the correlation between the gene and the external trait. Module membership (MM) was measured by the correlation between each gene expression and each ME. The module signi cance (MS) was calculated as the average absolute gene signi cance (GS) of the genes within the module, which can be used to identify the signi cant modules with clinical trait.

Functional Enrichment Analysis of the hub Module
To understand the biological signi cance of the hub module identi ed by WGCNA, the genes of hub module were mapped into Metascape (http://metascape.org) for Pathway enrichment analysis and Gene Ontology (GO) enrichment analysis. Terms with a P value <0.01, a minimum overlap of 3, and an minimum enrichment factor >1.5 were chosen to be cutoff value. In this work, the top 20 terms were chosen for visualization when more than 20 terms for GO or pathway annotations were identi ed [25].

Identi cation of Hub Genes with Module analysis
The function of "exportNetworkToCytoscape" was used to export a PPI network by edge and node list les generated from hub modules in a format suitable for importing to Cytoscape. Based on the MCC sores, the top 15 highest-scored genes identi ed by the MCC algorithm of Cytoscape were referred as candidate hub genes. Meanwhile, For WGCNA, Candidate hub genes were also determined by absolute MM ≥ 0.8 and absolute GS ≥ 0.2. As such, common genes shared in the PPI network and in the WGCNA were considered to be hub genes. Finally , the area under the curve (AUC) of the ROC was calculated via "pROC" package to evaluate the prediction accuracy of the candidate genes. Genes with AUC >0.70 in ROC analysis were referred as the real hub genes [26].

DEGs in ALI
Having been adjusted batch effects, gene expression values of 48 samples from GSE2565 were standardized and the results of boxplot analysis before and after normalization were presented in Figure  1. DEGs were identi ed by the MaSigPro approach based on the details described in the Methods section.
To this end, a total of 3005 DEGs were screened out for further analysis.

Construction of Weighted Gene Co-Expression Network
After evaluation for the microarray quality by sample clustering, no outliers were detected in the clusters and the 48 tissue samples were suitable for constructing a hierarchical clustering tree (dendrogram) (Figure 2). The soft-power threshold β caculated by the function "sft$powerEstimate" was set as 8 to ensure a scale-free network ( Figure 3). The result of power estimation was shown in Additional le 2: Table S2. As a result, 3005 genes were clustered in to 9 modules based on the TOM matrix. Among the 9 modules for the ALI data, genes in the grey module were not co-expressed, they were excluded before subsequent analysis [27].

Construction of Module-Trait Relationships and Detection of hub Modules
The modules associated with time point after ALI exposure were con rmed by the analysis of the moduletrait relationships (Figure 4). we found that 8hr (R = -0.58, P = 2e−5), 12hr (R = -0.40, P = 0.005), 48hr (R =0.37, P = 0.01) and 72hr (R =0.41, P = 0.004) were all signi cantly correlated with the red module based on the module-treat relationship analysis ( Figure 4) and a correlation in upward trend or downward trend between the MEs and the time point was observed. Especially, the correlation between module and trait exhibited that red module had the most signi cantly correlation with ALI of 8hr, and then gradually increased from 8hr to 72hr (Figure 4). The correlation of pink module, 8hr (R = 0.51, P = 0.005), 12hr (R = 0.15, P = 0.3), 48hr (R =-0.34, P = 0.02) and 72hr (R =-0.49, P = 4e− 04) , gradually decreased. Further more, both red module and pink module exhibited higher association with 8hr ( Figure 5). As a consequence, they were considered to be hub modules linked to the development of ALI for further analysis.

Pathway Enrichment Analysis of Genes in the hub Module
To further evaluate the affected biological functions of the genes of the hub module, GO and KEGG pathway analyses were performed and the results of The GO-BP terms and KEGG pathways of the red module were presented respectively in Figure 6 and Figure 7. Annotation information of gene in hub modules can be seen in Additional le Table S3 and Table S4. The analysis of the top three GO-BP terms showed that genes in the red module were mainly signi cantly enriched: circadian regulation of gene expression, response to hormone and regulation of cellular response to stress. The terms of KEGG pathway analysis presented in Figures 6b. Circadian rhythm, Propanoate metabolism and PPAR signaling pathway were mainly signi cantly enriched. For the pink module (Figures 7a and Figures 7b), we identi ed 17 GO terms and 2 KEGG pathways, which were mainly related with regulation of cytokine production, positive regulation of cell death and glycerophospholipid metabolism.

Module analysis for Screening hub genes
The PPI network of the genes in blue module and pink module was respectively constructed by using the function "exportNetworkToCytoscape" based on WGCNA. The Maximal Clique Centrality (MCC) algorithm of Cytoscape was performed to select candidate hub genes from the PPI network. Based on the MCC sores, the top 15 highest-scored genes were referred as the hub genes. The result of the candidate hub genes detected by MCC from blue and pin module was visualized respectively in Figure 8a and Figure 8b. In addition, using a gene signi cance (GS) > 0.2 and module membership (MM) > 0.8 in co-expression network, Genes selected from the given modules were referred as hub genes (Table 1 and Table 2). As a consequence, 11 genes shared both in PPI network and co-expression network were chosen as hub genes. Genes shared by both networks are marked in blue color. In addition, to explore the prediction of the candidate hub genes as biomarkers of ALI , ROC curve analysis was performed ( Figure 9). Finally, 5 genes with AUC values greater than 0.7 presented a high predictive accuracy for the development of ALI, wich included Bnip3, Cdh11, Fam134b Sult1a1 and Zbtb16. So, these ve genes were considered to be the real hub genes .

Discussion
In this work, key modules and genes involved in the development of ALI with different time points were identi ed by multiple bioinformatics method. Firstly, a total of 3005 signi cant genes were selected by maSigPro approach. Then, WGCNA was performed to explore the relationship between modules and traits and key modules that were signi cantly relevant to ALI were con rmed. Finally, hub genes were also con rmed after module analysis. masigpro as a methodology to deal with the analysis of gene expression changes over time manifested itself in a unique statistical advantage [20]. The maSigPro approach in our study was used to select statistically signi cant genes. Especially, only the rst step was performed for the present work when applying the maSigPro procedure in order to minimise the loss of gene information. In addition, WGCNA as an comprehensive method to identify co-expression modules based on the similarity of expression patterns of genes was utilized for identifying key modules and hub genes related to the development of ALI.
Unlike previous works, which had for the most part used a static or a combined static time perspective, this work made use of a dynamic or time-series gene expression data to establish which speci c genes can be affected by clinical traits and obtain the number of genes of ALI participation for biological process to the utmost extent. The earlier studies took a static approach in exploring the mechanisms and implications of temporally regulated gene expression of ALI, focusing on either a few gene from a certain time or on gene data set from a speci c time frame which was a new data set that integrated data sets at different points in time. The previous works analyzed the genes associated with acute lung injury based on different data sets with different points in time or a new data set that integrated data sets at different points in time. For instance, Chen et al. analyzed ALI related genes by combining dataset GSE2411 containing attribute at 4 hours post-intervention, GSE18341 at 2 hours and GSE17355 at 0,1,4 and 10 days and considered the 3 different datasets with different attribute as a whole. As we known, most biological processes are dynamic and gene expression presents a high degree of temporal and spatial speci city, time-course gene expression data may have the ability to fully explore the patterns of gene expression underlying disease states. It is generally believed that time series gene expression data with the same size, compared to static gene expression data contain more information of gene regulatory network is derived from the perspective of biology. Static gene expression experiment may fail to fully exhibit the temporal trends of gene expression resulting in loss of some co-expressed genes [28].
However, our study identi ed the key modules and genes that representing the developmental process of ALI based on an 8 time points expression experiment, where we can get the most signi cantly genes over time.
In current study, by using WGCNA, we con rmed two gene modules that highly correlated to the development of ALI and carried out enrichment analysis for each modules. Since module red and module pink both displayed signi cant correlation In Module-trait relationships analysis, meanwhile both modules also had greater MS, we treated both modules as hub modules. In our current study, red module and pink module indicated negatively or positively correlation with 8hr, respectively. GO and KEGG enrichment analyses displayed that the genes in the red module had different roles and both were signi cantly associated with regulation of circadian rhythm. Previous studies have indicated that cellular and molecular circadian rhythms are elicited by in ammation in the setting of acute lung injury, indicating its important role in the fate of progression related to ALI [29,30]. being related to a variety of biological functions and biological dysfunctions, circadian rhythm related molecules has emerged as a promising target for seeking to rectify biological disorder [31].
For pink module, genes in pink module was mainly related with regulation of cytokine production, positive regulation of cell death based on GO analysis. Regarding to KEGG analysis, genes of pink module revealed enrichment in Glycerophospholipid metabolism and HTLV-I infection. Phosphatidic acid (PA), one of the elements of glycerophospholipid metabolism, is involved in the regulation of mTOR pathway, which affect the process of protein synthesis, autophagy and mitochondrial metabolism [32,33]. Autophagy, a form of programmed cell death, presents dual roles in the pathophysiologic process of ALI for progression/inhibition of the disease according to different cell types, different insults and progression for the disease and so on [34]. As a result, more comprehensive and in-depth evaluation for the role of autophagy in ALI are in demand.
For our current work, we identi ed 5 susceptibility genes for ALI: Bnip3, Cdh11, Fam134b, Sult1a1 and Zbtb16. Bnip3 have the ability to trigger cell death by the pathway of necrosis, apoptosis and autophagy [35]. Bnip3 exerts a dual effect through autophagy or apoptosis, Homeostasis of mitochondria is achieved to maintain cell Cell survival through autophagy while apoptosis is mediated by the interaction between Bnip3 and mitochondrial fusion protein-optic atrophy1 to induce cell death [36]. Further researches also implied that Bnip3 expression level was signi cantly increased following acute lung injury [37]. Cdh11 refers to a type II classical cadherin and participates in the regulation of calcium dependent cell-cell adhesion. Cdh11 has been reported to be associated with brosis, in ammation, cancer and other pathologic processes [38]. Flow cytometry was used to disclosed that the overall proportion of CDH11+cardiac mesenchymal cells increased from days 3 to days 7 and remained unchanged over time in sham group. Moreover, the signi cant increase of the expression of Cdh11 is also observed as early as 3 days following injury which in alignment with the time course of in ammation resolution. This observation implicates Cells associated with in ammatory in ltration such as macrophages neutrophils, and monocytes -may result to Cdh11-mediated brotic remodeling in heart [39]. This nding, the expression of Cdh11 presents time dependency, indicates that targeting Cdh11 may provide opportunities for timely intervention to early in ammation and curbing the progression to pulmonary brosis.
Since its discovery in mammalian cells, the molecule encoded by Fam134b has been widely recognized as a molecular receptor for endoplasmic reticulum-speci c autophagy (ER-phagy). AAA demonstrated that Fam134b activation exerts an important role are in the regulation of ER-phagy. it as a selective autophagy may present a dual function in cell survival. when cells are subjected to diverse insults such as oxidative stress, chemical stimulation and calcium overload, ER-phagy may help ER restore homeostasis or induce cell death according to the timing and magnitude of stimuli [40,41]. The study of Melchiotti R also implicated that Fam134b is a potential molecular target in response to in ammation, and refers to cytokine secretion [42]. As noted above, being involved in common pathway, It is reasonable to speculate that Fam134b has a strong connection with ALI. However, it remains to be explored whether Fam134b play pivotal role in the progression of ALI.
Sulfotransferase 1a1 (Sult1a1) , known as drug-processing gene, is a phase II metabolic enzyme that exerts extensively functions of metabolism and detoxi cation of multitudinous drugs and chemicals [43]. Lianxia Guo provided the evidence that the expression and activity of Sult1a1 was manifested in circadian rhythmicity, which was directly regulated by the clock protein Bmal1 [44]. Further more, there is a growing body of evidence that drug-processing genes for the most part were expressed in circadian rhythms and rhythmic expressions of which present close association with time dependency of toxicity and tolerance of drug in body [45]. Thereby, elucidating the molecular mechanisms of Sult1a1 in acute lung injury assume considerable signi cance in working out the best timing for drug administration. Zbtb16 (zinc nger and BTB domain containing 16), also known as the promyelocytic leukemia zinc nger protein (PLZF), was reported to be primarily expressed in the apical membrane of bronchioles and involved in restraining Toll-like receptor-induced in ammation in the hyperglycemic states, therefore, Zbtb16 may serve as moluculer marker in lung tissues [46,47].
The current study has some limitations. The microarray data for analysis were extracted from open access databases with a small sample size, so further studies on time-course are needed to improve the reliability of the result. Although, the mechanism of lung injury induced by various insults in animal models share common biological pathway, the current analysis was limited to the elucidation of animal models exposed to photogas. Thereby, other types of time series studies related to acute lung injury need to be further investigated. To this end, more representative results can be achieved by the analysis for common genes screened out from integrated dataset. In addition, animal model can not fully generate all the pathophysiological feature of ALI in humans, and most of which could only represent partial characteristics of human ALI [48]. Therefore, further researches are required to validate these results.

Conclusion
According to our knowledge, this was the rst study that selected the WGCNA method to identify the completely activated gene set in the progression of ALI based on time-series expression data. Circadian rhythm and autophagy were identi ed as the essential components of pathophysiological mechanisms in the early phase for ALI. Five hub genes, Bnip3, Cdh11, Fam134b, Sult1a1 and Zbtb16, were identi ed followed by module analysis, PPI network, and ROC analysis, which may contribute to provide new insights for the molecular mechanisms in acute lung injury.

Declarations Acknowledgments
We acknowledge GEO database for providing their platforms and contributors for uploading their meaningful datasets.
Authors' contributions LJ and YYM conceived the work. LJ wrote the manuscript. FLL, LJ, and YYM participated in data analysis and interpretation. FLL revised the manuscript and participated in discussion. CZJ supervised the study.
All authors approved the submission of the manuscript.

Funding
The present work was supported by grants from the National Natural Science Cooperation Foundation of China (grant no. U1604188) Availability of data and materials The dataset analyzed in this study can be derived from public repositories: Genes marked in bold black are common genes shared in PPI network  Figure 1 Box plot analysis for unnormalised and normalised data.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. TableS1InformationOf48Samples.xlsx TableS2TheResultOfPowerEstimation.xlsx TableS3Geneanotationinredmodule.xlsx TableS4Geneanotationinpinkmodule.xlsx