Hub Genes of Stroke Identi ed by Weighted Gene Co-expression Network Analysis

Junhong Li First A liated Hospital of Guangxi University of Chinese Medicine https://orcid.org/0000-0002-21548695 Yang Zhai GuangXi University of Chinese Medicine Peng Wu GuangXi University of Chinese Medicine Yueqiang Hu GuangXi University of Chinese Medicine Wei Chen GuangXi University of Chinese Medicine Jinghui Zheng GuangXi University of Chinese Medicine Nong Tang (  nongtang0125@yeah.net ) https://orcid.org/0000-0002-1479-2438


Background
Stroke is the second leading cause of death across the globe, after ischaemic heart disease and is predicted to be among the leading causes of death globally in the next 15 years [1,2]. It accounts for 11.3% of all deaths each year with more than 85% of stoke-related deaths occurring in low-and middleincome countries [3,4]. Previous studies report an increase in global burden of stroke has been reported despite declining mortality rates over the past two decades [5]. Current epidemiological data shows thatapproximately 16.9 million people are affected by stroke each year [6].Further, epidemiological studies predict that by 2030, the number of stroke survivors will rise to 77 million [7]. Stroke is an archetypical complex diseasemediated by genetic and environmental factors, therefore, the proportion of subtypes of stroke vary with age, race and ethnic origin [4,5].Differences in subtypes among various target groups is the main therapeutic challenge in stroke management.
Microarray-based gene expression pro ling is widely used in biomedical researchespecially in chronic non-communicable diseases including neurological diseases [8],cardiovascular disease [9], diabetes [10] and cancer [11].In previous studies, most of microarray analysis methods focus on the comparison between normal and diseased conditions [12]. Identi cation of differential gene expression is a widely used analytical strategy for screening of potential biomarkers in diseased state.Several studies microarray-based gene expression have been carried out, however, single microarray dataset have high false positive rates.Therefore, integration of multiple datasets would signi cantly increase the reliability and sensitivity of ndings and reduce false positives.Gene co-expression network analysis, the recent systems biology approach is widely used for analysis of correlation patterns among microarray analysis [13][14][15].
Weighted gene co-expression network analysis (WGCNA) [16,17], a tool for exploring the system-level functionality of genes, is widely used in gene expression data analysis. It de nes gene co-expression by determining similarity in gene expression between expression pro les and identi ed groups of genes that are highly correlated across the samples [17]. WGCNA is useful in understanding various biological processes. It helpsunravel the interactions between genes in different modulesand hence can be used foridenti cation of candidate biomarkers or therapeutic targets [18,19]. In addition,WGCNA links microarray data directly to clinical traits thus revealing mechanisms of drug resistance [20].Further, WGCNAis used for identi cation of factors for predicting pathological stage and prognosis of disease [21].
The aim of this study was to construct co-expression modules using blood samples from stroke patients.
Differentially expressed genes (DEGs) between expression pro les were identi ed.Further, biological function andGene ontology (GO) enrichment analysis were carried out. Hub genes were identi ed and effects of biomarkers for stroke were con rmed though receiver operating characteristic (ROC) curve analysisand literature validation. Identi ed modules and hub genes contributes to understanding of the molecular mechanisms of stroke and are potential biomarkers for stroke gene therapy.

Microarray data
Gene expression pro les of GSE22255 [22], GSE16561 [23] and GSE58294 [24]were obtained from Gene Data preprocessing and differentially expressed genes analysis Data analysis work ow of this study is shown in sFig. 1. Analysis was performed using the R software (version 3.5.1). Raw data were processed using a Robust Multi-array Average (RMA) algorithm based on a precompiled C language function in limma package.Data preprocessing included background correcting, normalizing and calculation of expression levels [25]. Missing values in the raw data were imputed using the knn function in the impute package, and any probe absent from all CEL les was eliminated. Further, the probe-sets were annotated using annotate package. After annotation, we used the built-in match function in R to matchprobe-sets to their gene symbol. In cases where multiple probes matched a single gene, the probes with the highest interquartile range (IQR) were selected as described in previous studies [26]. An expression matrix with 23,495 genes was generated for subsequent analysis. The top 50% most variant genes (11,747 genes) [27] were considered to be DEGs and selected for WGCNA analysis.

WGCNA analysis
Co-expression network analysis of expression data from GSE22255 was conducted usinga convenient one-step network construction method in the WGCNA package [16] to nd modules associated with stroke. To ensure the reliability of expression data, samples were clustered for detectionof outliers. Further, thresholding power β based on the criterion of approximate scale-free topology was selected for constructing a weighted gene network. The soft threshold calculates adjacency ranging from 0 to 1, so that the constructed network conforms to the power-law distribution and re ects the real biological network state [28]. In addition, the scale-free gene network was constructed and genes with similar patterns of expression (modules) were identi ed using blockwiseModules function in the WGCNA package. This function uses a dynamic tree-cutting algorithm to divide the hierarchical clustering tree into branches [29].WGCNA approach takes the association between the two connected genes into account, and considers the topological overlap measure (TOM) which represents the overlap in shared adjacent genes. Based on the single block and block-wise module colors, we then calculated the module eigengene (ME) which represents the expression level for each module. Strength of interaction between clinical trait and ME in each modulewas calculated to identify the clinical signi cant module. Finally, gene signi cance (GS) in the module was de ned asaverage p-value of each gene whereas the module signi cance (MS) was represented as the average GS of all genes in agiven module [27].

Enrichment and modules network analysis
The ClueGO plugin in Cytoscape(version 3.7.0) (https://cytoscape.org/) [30] was used to performGene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of clinical signi cant modules [31]. Signi cant GO terms and KEGG pathways (P<0.05) were selected. Further, MCODE plugin in Cytoscape was used to construct modules network and sub-networkswere extracted for further analysis.

Hub gene identi cation and validation
Hub genesare considered as functionally signi cant. The hub gene in a module was identi ed by module connectivity when the absolute value of the Pearson's correlation of gene-module membership (MM) was larger than 0.9 [27,32]. In cases where the absolute value of the Pearson's correlation of gene-trait signi cance(TS) was more than 0.2 [27], a gene was considered as highly correlated with a certain clinical trait. Furthermore, genes with both |MM| 0.9 and |TS| 0.2 were regarded as "real" hub genes. In the veri cation set of GSE16561 and GSE58294,the receiver operating characteristic (ROC) curve analysis were performed to validate the expression of hub genes in stroke and control samples. Moreover, the area under curve (AUC) was calculated usingpROCR package [33]. Larger AUC value of a gene indicated that it can effectively distinguish stroke from the control samples, and the hub gene with AUC > 0.6 in the three datasets was de ned as a diagnostic e ciency gene [28,34].

Co-expression network construction
Samples of GSE22255 were clustered to detect outliers using hierarchical cluster analysis. One sample was removed, and 39 samples were retained (sFig. 2). Retained samples were clustered again and clinical traits were converted to color representation and visualized using heatmaps (sFig 3). In the current study, scale free topology for multiple soft thresholding powers were calculated, and the power of β = 12 (scale free R 2 = 0.905) was selected as an appropriate soft thresholding parameter for weighted co-expression network construction (sFig. 4). A total of 11,747 most variant genes were used for co-expression networkconstruction. Cluster dendrogram and network heatmap of the most variant genes were constructed based on a dissimilarity measure ( Figure 1).

Key modules identi cation
After k-means clustering, detecting, calculating and checking module eigengenes, a total of 24 modules were identi ed (s.Table1). Grey module was not included in any module, so the subsequent analysis was not performed on this module [28]. Lightgreen(73 genes), pink(317 genes), midnightblue(109 genes) and yellow(870 genes) modules were highly correlated with disease status ( Figure 2).These modules were selected as clinically signi cant modules for further analysis. ME was in agreement with the expression level for each module which showed that lightgreen, pink and midnightblue modules were downregulated, while the yellow module was upregulated. Moreover, lightgreen (correlation coe cient r=-0.41, P=0.01), pink(r=-0.33, P=0.04) and midnightblue (r = -0.4, P=0.01) modules were negatively correlated with disease status. On the contrary, yellow(r = 0.35, P=0.03) module were positively correlated with disease status (Figure 2). In addition, correlations between GS and the two modules (pink and yellow) were 0.21(P <0.001) and 0.16(P < 0.001), respectively, which indicated signi cant correlation with stroke ( Figure 3). Therefore, pink and yellow modules were used for subsequent analysis.

GO enrichment KEGG pathway analysis of yellow and pink modules
To explore the biological processes and pathways in the pink and yellow modules,enrichment analysis was performed. GO enrichment KEGG pathway analysis were conducted using ClueGO plugin in Cytoscape (detailed information is shown in Table1). Analysis showed that biological processes of the yellow module were mainly enriched in ion transport system dysfunction including potassium ion transport,cellular potassium ion transport and potassium ion transmembrane transport, which are implicated inneuron death. Biological processes of the pink module were mainly involved in regulation of neuron projection regeneration and repair of DNA damage such asnucleotide-excision repair, transcription-coupled nucleotide-excision repair which play an important role in prevention of cell death after stroke. In addition, pathways of the yellow module were enriched in neuroactive ligand-receptor interaction, arginine biosynthesis, and alanine, aspartate and glutamate metabolism. In the pink module, pathways were mainly enriched in nucleotide excision repair, retinol metabolism and B cell receptor signaling pathway.

Hub genes identi cation and validation
Based on the cut-off criteria (|MM| > 0.9 and |TS| > 0.2), 11 genes in the yellow module and 13 genes in the pink module were identi ed as hub genes (Table2). In GSE22255 dataset, expression levels of SAMHD1 and RUNX1-IT1 were signi cantly lower in stroke samples compared with control samples (sFig.

Discussion
Stroke remains a global burden to human health.Understanding molecular mechanisms implicated in stroke pathogenesis and prognosis is critical for development of precision medicine or personalized medicine. Although the pathogenesis of stroke is extremely complicated, recently, signi cant progress has been made in areas such as energy metabolism disorders, excitatory amino acid toxicity, penumbra depolarization and apoptosis, which are involved in stroke pathophysiological processes. However, molecular mechanisms of stroke have not been fully explored. Therefore, for effective management of strokestudies should explore the molecular mechanisms implicated in the development of stroke. In this study, we used gene expression datasets from GEO database to construct a weighted gene co-expression network.Further, we identi ed pathways and key genes that are potential biomarkers or therapeutic targets for stroke. In addition, whole genome data for stroke from GEO database and literature were used for validation of results.
WGCNA was performed to construct free-scale gene co-expression networks, and to explore gene coexpression modules associated with stroke. A total of 11,747 most variant genes were used for construction of co-expression network and 24 modules were identi ed. Pink module with 317 genes and yellow module with 870 genes had signi cant association with stroke.A total of 26 genes with high connectivity were screened out from the two modules. Among them, 8 hub genes were highly associated with stroke. Hub genes identi ed for the pink module were PRR11, NEDD9, Notch2, RUNX1-IT1 and ANP32A-IT1.On the other hand, hub genes identi ed for the yellow module were ASTN2, SAMHD1 and STIM1. Co-expression analysis showed that different modules were correlated with different functions. Notably, genes in the pink module are implicated in cell apoptosis and neuronal differentiation whereas genes in the yellow module are implicated in synaptic form and stroke recovery. PRR11 (proline-rich protein 11), which is involved in cell cycle regulation, is involved in regulation of protein-protein interaction and cell signal transduction via Wnt/β-catenin signaling pathway [35]. PRR11 consists of a binary nuclear localization signal, two proline enrichment regions and a zinc nger domain, which regulates gene transcription by binding to duplex DNA [36]. Previous studies report that PRR11 is implicated in tumor progression.However, the role and clinical application value of PRR11 in stroke is unknown [36,37]. Studies show that Notch signaling in cerebral ischemia plays a role in in ammation, oxidative stress, apoptosis, angiogenesis, synaptic plasticity and the function of blood-brain barrier [38][39][40]. Notably, Notch2 is upregulated with increased cell death shortly after cerebral ischemia injury in hippocampal areas [39]. Further, an increase in Notch2 levels in apoptotic cells was reported after oxygen and glucose deprivation treatment [41].In addition, Notch2 signaling is associated with the progression of atherosclerosis [42]. Moreover, Notch2 mediates quiescence in endothelial cells, whereas in ammatory cytokines trigger increased Notch2 activity promoting apoptosis. Understanding the role of Notch2 signaling in stroke may provide insights on development of new therapeutics [43]. The brain undergoes self-repair by producing new neurons and has the ability to compensate for the loss of function after stroke [44]. NEDD9 (Neuronal precursor cell-expressed, developmentally down-regulated gene) was initially identi ed in mouse central nervous system [45,46]. NEDD9 isa splicing variant of Cas-L and promotes neurite outgrowth through tyrosine phosphorylation and is absent in adult brain.Interestingly, NEDD9 is upregulated for neuronal differentiation after cerebral ischemia [46]. This implies that NEDD9 is involved in recovery of neurologic function after stroke, and upregulation of NEDD9 may widen the therapeutic time window for cerebral ischemia [46,47]. SAMHD1, originally identi ed from a dendritic cell, is a deoxyribonucleoside triphosphate triphosphohydrolase [48][49][50] and its mutations were recently linked to susceptibility to stroke. Recent studies have shown that SAMHD1 gene mutations might cause genetic predispositions that interact with other risk factors, resulting in increased vulnerability to stroke [51]. Furthermore, a stroke cohort study reported that SAMHD1 gene mutations may be implicated in stroke pathogenesis in general population [52]. In addition, a functional loss of SAMHD1 protein resulting from the missense mutations c.64C>T and c.841G>A were reported to play a role in stroke pathogenesis [53]. In patients with cerebrovascular diseases, lack of SAMHD1 protein expression was associated with decreased expression of IFNB1 and increased expression of IL8 [54]. ASTN2 (astrotactin-2) is a conserved perforin-like membrane protein expressed in the developing and adult brain and is primarily involved in neuronal development [55,56]. ASTN2 modulates a number of protein complexes in neurons that impact synaptic form and regulate synaptic adhesion activity [57]. ASTN2 binds to various interacting proteins, like ROCK2 and SLC12a5, which are implicated in vesicle tra cking and synaptic function. Ca 2+ entry is important for platelet activation and stroke. Glutamateinduced dysregulation of intracellular Ca 2+ homeostasis is a key mechanism in stroke pathogenesis [58]. STIM1 (stromal interaction molecule 1), a Ca 2+ sensor localized in endoplasmic reticulum, is involved in regulation of store-operated calcium entry [59]. Defective Ca 2+ entry mechanism in mutant platelets activate STIM1, resulting in impaired platelet aggregate formation [60]. This process offers protectionto mice from ischemic stroke [61]. Dong M et al reported that STIM1/Orai1 expression was associated with mortality and recurrence in ischemic stroke patients and was an independent predictor of the 3-month stroke recovery [62].
Intriguingly, three hub genes were identi ed in this study. Recent studies have shown that long non-coding RNA is considered a key regulator of pathogenesis of stroke [63]. Two key noncoding RNAs (RUNX1-IT1 and ANP32A-IT1) were identi ed as effective diagnostic genes in the present study. LncRNA RUNX1-IT1, and ANP32A-IT1 are the intronic transcript 1 from their respective genes [64,65]. RUNX1 IT1 plays a tumour suppressive role and plays aprotective role in colorectal cancer by regulating cell proliferation, migration, and apoptosis [65]. Although the roles of RUNX1-IT1 in different diseases have been explored [66], its biological roles in stroke remains unknown. ANP32A is implicated in neurons differentiation, brain development, and neuritogenesis. Modulating ANP32A signaling could help manage oxidative stress in brain [67] and restore cognitive function [68] with therapeutic implications for neurological disease. To our best knowledge, no study has reported therole of RUNX1-IT1 and ANP32A-IT1 in stroke.
Neuronal death in cerebral ischemic area of stroke is accompanied by unregulated genes expression.Therefore, the hub genes of the pink module can regulate cellular signal transduction involved in in ammation, oxidative stress, apoptosis, angiogenesis, and synaptic plasticity. On the other hand, hub genes of the yellow module may play a role in synaptic form, antiplatelet aggregate formation and shortterm prognosis after stroke. Therefore, we speculated that pink module and yellow module arekey modules in neuronal death, new synapse formation, and functional recovery of stroke. Moreover, noncoding genes subset in the hub genes were identi ed in the present study. Non-coding RNAs play a key regulatory role in pathogenesis of stroke, therefore, further studies involving non-coding RNAs populations and exploring the mechanism underlying their role in stroke should be carried out.
Co-expression analyses revealed mRNA regulation expression network in stroke. In the present study, we used WGCNA to construct gene co-expression networks and to explore the relationships between modules and clinical traits. Due to the high degree of consistency in the expression relationships of the module, genes of the same module might share a common biological role. Our results provide valuable information for basic and clinical research of stroke. Although there are similar limitations with most other data mining approaches [69], we adoptedthe indicated methods to reduce possible bias. In order to increase the credibility of WGCNA results, we used matched stroke and normal samples for analysis, andexternal data from GEO database to validate the results.

Conclusion
The hub genes PRR11, NEDD9, Notch2, RUNX1-IT1, ANP32A-IT1, ASTN2, SAMHD1 and STIM1 were found to be signi cantly correlated with stroke and were veri ed in other two datasets. These hub genes can be used as potential diagnostic and prognostic biomarkers of stroke, although further research should be carried out. Moreover, pink and yellow modules were considered as critical modules in neuronal death, new synapse formation, and functional recovery of stroke.

Declarations
Ethics approval and consent to participate: Not applicable.

Consent for publication:
Yes.
Availability of data and materials: Not applicable.    Module-trait relationships. Each row corresponds to a module eigengene, column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend.  Validation for Hub genes regulatory network. Eight hub genes in the pink and yellow modules were colored with lightcyan, and the target genes were colored with magenta.