Identication of Molecular Biomarkers Associated With Stroke Progression Using WGCNA and SVM-RFE

Background: Stroke is the second most common cause of death worldwide and the leading cause of long-term severe disability with neurological impairment worsening within hours after stroke onset and being especially involved with motor function. So far, there are no established and reliable biomarkers to prognose stroke. Early detection of biomarkers that can prognose stroke is of great importance for clinical intervention and prevention of clinical deterioration of stroke. Methods: TGSE119121 dataset was retrieved from the Gene Expression Integrated Database (Gene Expression Omnibus, GEO) and weighted gene co-expression network analysis (WGCNA) was conducted to identify the key modules that could regulate disease progression. Moreover, functional enrichment analysis was conducted to study the biological functions of the key module genes. The GSE16561 dataset was further analyzed by the Support Vector Machines coupled with Recursive Feature Elimination (SVM-RFE )algorithm to identify the top genes regulating disease progression. The hub genes revealed by WGCNA were associated with disease progression using the receiver operating characteristic curve (ROC) analysis. Subsequently, functional enrichment of the hub genes was performed by deploying gene set variation analysis (GSVA). The changes at gene level were transformed into the changes at pathway level to identify the biological function of each sample. Finally, the expression level of the hub gene in the rat infarction model of MCAO was measured using RT-qPCR for validation. Results: WGCNA analysis revealed four hub genes: DEGS1, HSDL2, ST8SIA4 and STK3. The result of GSVA showed that the hub genes were involved in stroke progression by regulating the p53 signal pathway, the PI3K signal pathway, and the inammatory response pathway. The results of RT-qPCR indicated that the expression of the four HUB genes was increased signicantly in the rat model of MCAO. Conclusion: Several genes, such as DEGS, HSDL2, ST8SIA4 and STK3, were identied and associated with the progression of the disease. Moreover, it was hypothesized that these genes may be involved in the progression stroke by regulating the P53 signal, the PI3K signal, and the inammatory response pathway, respectively. These genes have potential prognostic value and may serve as biomarkers for predicting stroke progression. The early identication of the patients at risk of progression is essential to prevent clinical deterioration and provide a reference for future research.


Introduction
Stroke is a serious threat to human health. Globally, it is the second most common cause of death and the main cause of long-term severe disability (Feigin et al.,2014). Stroke can be ischemic or hemorrhagic, but ischemic stroke is the main type, accounting for about 87% of all stroke cases (Radu et al.,2017). Ischemic stroke is a severe clinical syndrome with a high burden of neurological disability and mortality (Evers et al.,2004). Current treatments for ischemic stroke injuries have proven to be inadequate, partly because of the incomplete understanding of the cellular and molecular mechanisms involved in ischemic stroke. Early identi cation of patients at risk of progression is becoming increasingly important because of the high fatality rate and narrow treatment window for stroke. The identi cation of the relevant genes regulating the progression of stroke can promote the understanding of the underlined mechanisms of stroke, facilitate the prognosis of stroke, and even be the basis for the development of new treatments.
With the development of high-throughput sequencing and bioinformatics, the availability of sequencing data has enabled the discovery of key genes, networks, and pathways related to stroke. However, identifying these features remains challenging. Weighted gene co-expression network analysis (WGCNA) is a network-based screening method and a data exploration tool, which can divide genes with similar expression patterns into several biologically meaningful co-expression modules, analyze the relationship between gene modules and clinical features, and nally evaluate the importance of genes in featurerelated modules. The latter group of signi cant genes constitute candidate biomarkers and therapeutic drug targets (Zhang et al.,2005). WGCNA has been widely used in a variety of medical elds, such as oncology (Xiang et  Machines (SVM). The so-called recursive feature elimination is used to initially obtain a features ranking in its iteration. In every iteration, the score of each feature is calculated according to certain rules, and the feature with the lowest score is removed, with this process being repeated until all genes have been ranked. SVM-RFE belongs to the category of wrapper dimensionality reduction techniques since it uses the weight vector of the hyperplane produced by the SVM model to select and delete features. Then, the training dataset is used to train the SVM classi cation model, and all the steps are repeated until the features in the candidate list are completely deleted, and the last gene to be deleted is the most signi cant one for the classi er (Guyon et al.,2002).
In this study, gene expression pro le GSE119121 dataset was retrieved from the Gene Expression Omnibus database (GEO) and weighted gene co-expression network analysis (WGCNA) was performed to identify the key modules associated with disease progression, and to functionally annotate the key modules with genetic ontology (GO) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway terms.
Then, the GSE16561 dataset was analyzed with the SVM-RFE algorithm to identify the gene with the highest score regarding disease progression. The key genes of each module revealed WGCNA were used as inputs to the SVM-RFE method to obtain the best stable results for predicting phenotypes.

Weighted co-expression network construction and identi cation of key modules
There were totally 47 samples of transcriptional data in GSE119121, including control (n= 8), 1-hour ischemia (n = 8), 2-hours ischemia (n = 8), 3-hour ischemia (n = 8), 6-hour ischemia (n=7) and 24-hour ischemia (n= 8) groups, respectively. All samples were included in the co-expression network. The soft threshold β is determined by the function "sft$PowerEstimate" to reassure that the reconstructed network possesses the scale-free topology feature. The absolute value of the correlation coe cient was greater than 0.8, and the data selection threshold was set to 10 ( Figure 2) for the present manuscript's data.
Then, the dynamic hierarchical tree cutting algorithm was used to detect the co-expression modules, and the minimum height of the merged modules was set to 0.25. Finally, 14 modules are obtained for the detection of stroke-related gene modules by WGCNA ( Figure 3). Firstly, the samples were clustered to delete the contour samples ( Figure 4). Then, WGCNA was used to identify the genes that are highly coexpressed in each group. Modules are colored randomly. A total of 8 gene modules were identi ed ( Figure   5), showing the genetic signi cance of each tissue module, with the most signi cant being the brown (r=0.65 p=5e-05) and the black r =-0.92 p=2e-20 ones. Both these modules were highly correlated with stroke ( Figure 6). Therefore, the brown and black modules were selected for subsequent veri cation.

Pathway analysis for speci c modules
All genes in the Brown module were selected for GO and KEGG analysis. Our results ( gure 6) of GO and KEGG analysis demonstrated that the brown module ( gure 7) was mainly enriched in signaling mechanisms, such as innate immune system, regulation of defense response, production of cytokines, in ammatory response, cellular redox homeostasis, endoplasmic reticulum stress response and so on. Among them, the regulation of defense response, the production of cytokines, in ammatory response, and the endoplasmic reticulum stress response were all related to stroke, which is in agreement with previous ndings. It is thus speculated that the impact of the brown module on stroke progression was closely related to the above pathways. The interaction diagram of each module was shown in the PPI diagram. In addition , the pathway analysis results showed that the black module ( gure 8) was associated with many processes including defense response, receptor internalization, congenital immune response, negative regulation of genome replication, negative regulation of immune response, and positive regulation of DNA metabolic processes.

Feature selection for machine learning
Speci c stroke samples were extracted for analysis from the GSE16561 dataset of the GEO database, including the control group (n=24) and stroke patients (n=39) to further identify the key genes that regulate stroke progression by performing SVM-RFE feature screening and subsequent validation. The results showed that the SVM-RFE feature selection method could achieve high accuracy and low error rate ( gure 8). In addition, the top 100 genes were selected according to their score and the intersection with the genes of the black module and the brown module (annex 1). The results showed that 4 genes were prioritized, and these were DEGS1, HSDL2, ST8SIA4, and STK3 ( gure 9).
After further comparison of the differences of the four hub genes between normal patients and stroke patients, the results showed that the four hub genes were signi cantly increased in the stroke group (P < 0.01), suggesting that these four genes can act as risk factors for the development of stroke.

Detection e ciency of hub genes
the "pROC" package was used to analyze the receiver operating characteristic (ROC) curve to assess the predictive ability of the four hub genes on stroke progression. The results showed that the expression of the four genes could signi cantly distinguish the stroke progression, and the areas under the curve of the DEGS1, HSDL2, ST8SIA4, STK3 genes were 0.884, 0.911, 0.902, and 0.984, respectively ( gure 10). However, the speci c molecular mechanisms involved in the four genes should be further explored.

Pathway analysis of hub genes
The differential feature enrichment of four hub genes was conducted using the GSVA algorithm to study the correlation between DEGS1, HSDL2, ST8SIA4 and STK3, and the functional characteristics of stroke progression. The results showed that the overexpression of the DEGS1 group was mainly enriched in the pathways ANGIOGENESIS, IL2_STAT5_SIGNALING, TGF_BETA_SIGNALING, . It is observed that the GSVA results show that most of the hub genes were related to the p53 signal pathway, the PI3K signal pathway, and the in ammatory response pathway, indicating that stroke progression is associated with these pathways.

Validation of hub genes
As shown in gure 12, the results of real-time photoluminescence (RT-PCR) showed that the DEGS1, HSDL2, ST8SIA4 and STK3 genes are overexpressed in the stroke group in comparison to the control group with the most pronounced difference being the one of the HSDL2. The PCR results validated the main ndings of the discovery transcriptomics analysis. The four hub genes can be used as biomarker primers to in uence stroke progression (See attachment 2 for primers in Annex 2).

Discussion
Ischemic stroke is one of the most common cerebrovascular diseases, and its mortality rate is increasing year by year. Many stroke survivors are being left with various neurological defects, leading to different degrees of impaired quality of life, and this leads to a great burden on patients, their families, and society (Benjamin et al.,2018). The occurrence and development of ischemic stroke are closely related to genetic changes. The identi cation of genes that regulate the progress of stroke can promote the understanding of the underlined mechanisms of stroke, predict the progress of stroke, and contribute to the development of new treatments.
In this study, the GSE119121 dataset was retrieved from the GEO database, and 14 modules were obtained after the co-expression module analysis with the dynamic hierarchical tree cutting algorithm. 8 gene modules were identi ed by WGCNA among the groups of the overexpressed genes. The most prominent of these were the brown module and the black module. GO and KEGG analysis showed that the brown module was mainly enriched in signal mechanism such as innate immune system, regulation of defense response, production of cytokines, in ammatory response, endoplasmic reticulum stress response, and so on, with all of them being highly related to stroke. This is consistent with previous reports. In recent years, the role of in ammasome-mediated in ammatory pathways in various central The four hub genes were veri ed in the second data set, and all four hub genes were signi cantly increased in the stroke group (P < 0.01), suggesting that these four genes are risk factors for stroke development.
DEGS1 is a protein-coding gene. Karsaig has identi ed DEGS1 as a pathogenic gene involved in myelin degeneration in the central nervous system and peripheral nervous system (Kraveka et al.,2007). Previous studies have shown that sphingolipid is increasingly regarded as an important cellular mediator in tumor and in ammatory hypoxia . Sphingolipid is also an important part of the cell membrane, which is functionally related to basic processes, such as cell differentiation, neuronal signal transduction, and myelin formation. Sphingolipid imbalance is the chief culprit of various neurological diseases (Karsai et al.,2019). Defects in sphingolipid synthesis or degradation can lead to various nervous system diseases. The high expression of DEGS1 after cerebral ischemia promotes the synthesis of ceramides, which may affect the balance between sphingolipids and dihydrosphingolipids. DEGS1 enzyme participates in catalyzing biosynthesis of ceramide from the head and controls the process from dihydroceramide to ceramide, which is essential for maintaining the balance regulation between sphingolipid and dihydrosphingolipid (Rodriguez et al.,2014). Previous studies have shown that ceramide can be used as a predictor of ischemic stroke (Mohamud et al.,2019). Alsana believes that manipulating dihydroceramide levels may represent a new treatment strategy for cancer. This leads to the accumulation of dihydroceramide, which ultimately leads to the activation of endoplasmic reticulum stress and autophagy before cancer cells die (Alsana et al.,2020). The polyubiquitin of DEGS1 seems to change its function from promoting apoptosis to survival, which plays a vital role in the mechanism of stroke treatment. The con rmation of this validation could suggest that DEGS1 could also be used as a target for the treatment of stroke.
HSDL2 is peroxisome located in human, mouse, and rat cells, and a potential key regulator of lipid metabolism. HSDL2 can regulate lipid metabolism and participate in cell proliferation and apoptosis. Abnormal lipid changes are the pathological basis of stroke, HSDL2 may be involved in the occurrence and development of stroke by regulating lipid metabolism (Zhang et al.,2018). Recent evidence shows that HSDL2 is highly expressed in breast cancer (Dong et .,2016). The overexpression of HSDL2 promotes apoptosis, but its mechanism in cerebral infarction should be further studied.
STK3 is a pro-apoptotic kinase in inhibiting growth. This kinase promotes chromatin condensation during apoptosis and has been proved to be the substrate of Caspase-6 in biochemistry. Caspases 3, 6, and 7 are involved in the execution of apoptosis. Casp6 has become an important marker for Huntington's disease, Alzheimer's disease, and cerebral ischemia, and it is activated in the early stages of the disease. After brain injury, CASP6 induces neuronal death by triggering an apoptotic cascade, which in turn affects the recovery of nerve function, and can be used as a potential therapeutic target for neurological diseases (Girling et al.,2018).CASP6 also plays a key role in axonal degeneration, which further pinpoints the importance of this protease in the pathways of neurodegeneration (Riechers et al.,2016). ST8SIA4, a protein-coding gene, mainly synthesizes PolySia, which is a complex sugar that participates in many processes of nerve development, such as neuroblast migration, outward neurite growth, axon path nding, axonal bundle formation, and synaptic formation. PolySia is almost completely attached to the neural cell adhesion molecule (NCAM) in the brain. NCAM is an adhesive glycoprotein located on the cell surface of neurons and immune cells and is involved in many important biological functions, including cell adhesion and migration, synaptic formation, learning and memory consolidation, and social interaction. PolySia-modi ed NCAM is thought to be related to many speci c functions of the brain (Burgess et al.,2008). In addition, polySia can be used as a reservoir of neurological (biological) active molecules, such as neurotrophic proteins (e.g. BDNF) (Kanato et al.,2008), catecholamine neurotransmitters (e.g., dopamine) (Isomura et al.,2011), growth factors (e.g., broblast growth factor (FGF2, bFGF) (Ono et al.,2012)).
The differential feature enrichment of the four hub genes was further performed using the GSVA algorithm to study the correlation between the hub gene and the functional characteristics of stroke progression. The results showed that most of the hub genes are involved in the pathological process of stroke through the in ammatory response pathway, the PI3K signal pathway, and the P53 signal pathway, while they were also associated with the prognosis.
The in ammatory response pathway includes processes such as in ammatory cell in ltration, apoptosis, and oxidative stress. Several studies have shown that apoptosis is triggered by stroke. Caspase is a marker protein in the process of cell apoptosis, and important factors include caspase-3, caspase-8, and caspase-9. When subjected to stress, cytochrome C released by the mitochondria binds to procaspase-9 /Apaf-1 to activate and cleaved caspase-9 . The lytic caspase-9 further copes with other caspase members, initiating the caspase cascade, and then triggering apoptosis . Activated caspase-8 cooperatively cleaves and activates the caspase of downstream effector molecules caspase-1, caspase-3, caspase-6, and caspase-7 and ampli es the apoptosis signals (Buschhaus et al.,2018). Clinically, in the acute phase of cerebral infarction, certain effects, such as reducing secondary brain injury and promoting nerve regeneration, have been achieved by inhibiting the process of apoptosis.
The PI3K/Akt signaling pathway is involved in various cellular processes, and it has been proved that the activation of this pathway is related to the occurrence and development of angiogenesis. The negative regulation of angiogenesis promotes the genes of thrombosis, vascular permeability, and in ammation, protecting thus vascular function (Lugovaya et al.,2020). On the one hand, RelA, NF-κ B/ Rel family plays an important role in the response to in ammation and immunity, which may constitute the regulatory signal of PI3K-AKT, and then regulate the pathway of apoptosis (Gao et al.,2019). On the other hand, several studies have con rmed that drugs can improve stroke symptoms by regulating the PI3K/Akt pathway (Lugovaya et al.,2020). A recent study has also reported that PI3K/Akt regulates apoptosis and that the activation of the PI3K/Akt pathway after stroke plays a protective role in neuronal apoptosis .
Under the pathological condition of stroke, p53 plays an important role in the regulation of apoptosis and the cell cycle (Guo et al.,2020). The increase of cyclin chaperone D level will affect the process of cells entering the S phase under the regulation of p53 (Saito et al.,2005). The degradation of p53 hinders its role in the regulation of apoptosis (Gao et al.,2020). MDM2 is the ubiquitin ligase of p53 and plays a central role in regulating the stability of p53. Akt mediates the phosphorylation of MDM2 in Ser166 and Ser186 and increases its interaction with p300, so MDM2 mediates the ubiquitination and degradation of p53 (Liu et al.,2020). Phosphorylation of MDM2 also blocks its binding to p19ARF and increases the p53 degradation (Wang et al.,2020). It can be seen that many genes are involved in stroke through their interactions.

Conclusion
In this study, several potential biomarkers and related pathways of ischemic stroke were identi ed, based on the comprehensive weighted gene co-expression network and machine learning analyses with WGCNA and SVM-RFE, respectively, with this being a signi cant contribution to improve diagnosis and treatment of ischemic stroke.

Data collection
The Series Matrix File data GSE119121 was downloaded from the NCBI GEO public database, and included transcriptomics data of 47 samples from rats, including control (n=8), 1-hour ischemia (n=8), 2hour ischemia (n=8), 3-hour ischemia (n=8), 6-hour ischemia (n=7), and 24-hour ischemia (n=8) and samples. This dataset was used to reconstruct a co-expression network using WGCNA. The GSE16561 matrix le included a total of 63 samples, consisted of the control group (n=24) and stroke patients (n=39). This dataset was used for SVM-RFE feature screening and subsequent validation. The owchart of the overall analysis is shown in Figure 1.

Identi cation of key modules by WGCNA
By, Gene coexpression modules were reconstructed using the weighted gene co-expression network (WGCNA), and the correlation between gene networks and phenotypes was explored. Then, the core genes of the network were identi ed detecting the hubs of the modules. The WGCNA-R package was used to construct the co-expression network of all genes in the GSE119121 dataset screening the genes with the top 5000 variances with this algorithm and applying a soft threshold of 10. The weighted adjacency matrix was transformed into a topological overlap matrix to estimate the network connectivity degree, and the hierarchical clustering method was used to construct the clustering tree structure of the TOM matrix. Different branches of the cluster tree represented different gene modules, and different colors represented different modules. Genes were then classi ed using the weighted correlation coe cient, according to their expression patterns. In speci c, genes with similar patterns were classi ed into one module, and tens of thousands of genes were divided into multiple modules through gene expression patterns.

Functional enrichment analysis of gene modules
The Metascape web tool (www.metascape.org) was used to obtain the biological functions and signal pathways involved in the modules of interest in WGCNA (black and brown modules in this study having the highest correlation with phenotype). This analysis enabled the annotation and visualization of the genes in speci c modules using gene ontology and Kyoto gene genome encyclopedia pathway terms. Min overlap ≥ 3 and p ≤ 0.01were considered as thresholds to infer statistically signi cant ndings.

SVM-RFE
SVM-RFE is a machine learning method based on support vector machines (Huang et al.,2014, Guo.,2014), which searches for the optimal features by recursively deleting features from the feature vector generated by SVM. In this study, the R package "E1071" was used to perform SVM-RFE analysis, and the key genes regulating disease progression in the data set GSE16561 were found and used for subsequent veri cation.

Gene Set Variation Analysis
Gene Set Variation Analysis (GSVA) is a non-parametric and unsupervised method to evaluate transcriptome gene enrichment. GSVA converts gene level changes into pathway level changes by comprehensively scoring the gene sets of interest, and then annotates samples based on the biological functions. In this study, gene sets were downloaded from the Molecular signatures database (v7.0), and each gene set was scored comprehensively by the GSVA algorithm to evaluate the potential changes in the biological function of different samples.

Veri cation
Six SD rats were purchased from the Experimental Animal Center of Guangxi Medical University. The rats were randomly divided into two groups: the Middle Cerebral Artery Occlusion (MCAO) group (n = 3) and the control group (n = 3). The models of the focal brain ischemia-reperfusion rats were performed by the Middle Cerebral Artery Occlusion (MCAO) (Longa et al.,1989). The rats were anesthetized 24h later and then their brain tissue samples were collected. The total RNA was extracted by Trizol and reversely transcribed into cDNA using the YEASEN reverse transcription kit. RT-qPCR was performed with a qPCR kit. The relative expression was measured by the 2-delta CT method. Our research was approved by the Ethics Committee of Guangxi Medical University.

Declarations Author Contributions
LST was the rst author who performed most of the experiments and data analysis. ZY contributed equally to this manuscript. LST and ZY performed the animal experiments. WXG designed, performed, and supervised the study and wrote the paper. WXG, HQP and LCX provided suggestions and guidance for this study and revised the paper. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by the National Nature Science Foundation of China (no. 81700759).

Data Availability Statement
The original contributions presented in the study are included in the article/Supplementary Material and further inquiries can be directed to the corresponding authors.

Ethics Statement
The animal study was reviewed and approved by the Animal Ethics Committee of the Beijing Institute of Basic Medical Sciences.

Consent for publication
Not applicable.

Informed consent
For this type of study, formal consent was not required. Flowchart illustrates the bioinformatic analysis process performed on the retrieved stroke datasets. Functional correlation of modules to identify modules related to stroke progression. Each row corresponds to a module feature gene, and each column corresponds to a clinical feature. Each cell contains a corresponding correlation in the rst row and a P-value in the second row. The table is colorcoded by correlation according to the color legend. Module-character relationship. Each row corresponds to a module feature gene and the column corresponds to a feature. Each cell contains the corresponding correlation and p-value. The relationship between gene meaning and module. The correlation between modular traits and the meaning of gene expression.

Figure 9
The expression of four central genes in the second data set.

Figure 12
Validation of hub gene using qRT-PCR. The expression of four genes in brain tissue was detected by qRT-PCR and showed a signi cant increase of expression compared to controls. GAPDH was used as an internal control.