Five Genes as Diagnostic Markers of Ischemic Stroke and Their Correlation with Immune In ltration

Background: This study aimed to identify potential diagnostic markers of ischemic stroke (IS) and discuss the function of immune cell infiltration during the pathological process. Methods: We used IS datasets from the Gene Expression Omnibus database. Differentially expressed genes (DEGs) were identified, and functional correlation analysis was performed. We then screened and verified the diagnostic markers of IS. We evaluated the infiltration of immune cells in infarcts using CIBERSORT and analyzed the correlation between diagnostic markers and infiltrating immune cells. Results: A total of 366 DEGs were screened in this study. Genes encoding CTSG, F13A1, PABPC1, ECHDC2, BIRC2 and infiltrating monocytes, M0 macrophages, activated dendritic cells, and neutrophils (area under the curve [AUC] = 0.945) were identified as diagnostic markers of IS. Immune cell infiltration analysis suggested that memory B cells, regulatory T cells, M0 macrophages, CD8+ T cells, T cells, activated natural killer cells, monocytes, activated mast cells, and neutrophils were involved in the IS process. Analysis of correlations between expressed genes and infiltrating immune cells found that CTSG was positively associated with M0 macrophages, F13A1 was positively associated with monocytes, PABPC1 was positively associated with activated dendritic cells, eosinophils were negatively associated with neutrophils, ECHDC2 was negatively associated with monocytes, and BIRC2 was positively associated with eosinophils. Conclusion: five genes and four types of immune cells were identified as diagnostic markers of IS, and immune cell infiltration may play an important role in the progression of IS.


Introduction
Stroke is characterized by a high rate of disability (> 80% of patients) and is the second most common cause of mortality worldwide, after cancer [1,2]. As an acute cerebrovascular disease, stroke causes significant social and economic burdens annually [3].
To date, the diagnosis of IS is primarily based on scales, such as the National Institute of Health stroke scale (NIHSS), and imaging findings. However, early alterations in ischemic brain regions may not be detected using these technologies due to limited sensitivity [4,5]. Thus, there is an urgent need for a new diagnostic method that is either minimally invasive or noninvasive and has high accuracy and specificity for improving the early diagnosis of IS.
Several studies have emphasized that immune cells may play a significant role in ischemic regions during the occurrence and development of stroke [6,7,8,9,10,11]. Neutrophils are attracted by chemokines and reach the ischemic lesions first [12]. Subsequently, secondary injury is mediated by neutrophils cause via releasing matrix metalloproteinases (MMPs), proteases, reactive oxygen species (ROS) and other proinflammatory factors [6]. Furthermore, adherence of neutrophils to the endothelium causes the cerebral no-reflow phenomenon and the simultaneous production of MMPs, ROS, and proteases which damage vessels and ischemic tissues [7]. T lymphocytes, such as regulatory T (Treg) cells, CD8 + and CD4 + effector T cells and γδT cells, infiltrate the brain parenchyma at a later stage, and Tregs have shown to have a neuroprotective function [8,9]. As a primary component of brain tissue, glial cells also take a part in the post-IS immune reaction.
Several types of immune cells have a profound influence on the development of ischemic injury in brain tissue. Therefore, investigations on the infiltration of these cells and the expression matrices of specific genes can facilitate the early diagnosis of IS [8,9]. In this study, as an initial step toward identifying potential diagnostic biomarkers, we screened hub genes with differential expression between patients and healthy participants from an RNA sequencing (RNA-seq) dataset. For the first time in IS patients, we quantified 22 types of immune cells [13] and conducted the correlative analysis between these cells and hub gene expression. Subsequently, several types of immune cells that exhibited the strongest correlations with gene expression were chosen for the second step in the identification of potential diagnostic biomarkers. We built a diagnostic predictive model based on these diagnostic features to identify IS in an individual and conducted related enrichment and protein-protein interaction (PPI) analyses.
Raw data from both datasets were processed using the Affy software package containing Robust Multi-Array Average (RMA) algorithm [18], and gene expression profiling by an array was combined into one dataset using the sva on the R platform [19] to minimize batch effects. Two principal component analysis (PCA) plots were drawn to test batch effects and inter-sample correction using TBtools software [20].

Display of Differential Gene Expression and Enrichment Analysis
R package limma [21] was used to identify DEGs, between stroke patients and healthy people, via estimating gene expression differences based on an empirical Bayesian approach. The R package ggplot2 [22] was used to draw a volcano plot showing DEGs that met the significance criteria (|log2FC| > 1.3, adjusted p-value < 0.05).
Then Gene Ontology (GO) enrichment analysis was conducted, via using package clusterProfiler [23], as well as Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. The gene set enrichment and PPI analyses were conducted online using Metascape (http://metascape.org) [24].

Selection and Identification of Stroke Diagnostic Markers
Support vector machine-recursive feature elimination (SVM-RFE) [25] combined with Least absolute shrinkage and selection operator (LASSO) logistic regression [26] were used to single out variables that were significant for diagnosing stroke. The R packages e1071 [27] and glmnet [28] were used respectively to implement LASSO-logistic regression and the SVM-RFE algorithm. We selected variables from both the LASSO and SVM-RFE results as diagnostic markers.

Quantitative Analysis of Immune Cell Infiltration
CIBERSORT [13], a computational approach, was used to estimate the presence of all types of immune cells in merged stroke transcriptomes according to filtration criteria (p-value < 0.001). A PCA plot was drew, via using package ggplot2 [29], which revealed differences in the expression of genes related to infiltrating immune cells between stroke patients and healthy individuals, and a correlation plot was drew, via using the ggcor [30] package, which can visualize the co-expression of genes related to all types of immune cells. A violin plot was also drawn using the R package ggplot2 [29] to display the quantities of immune cell matrices between the two groups.

Correlations between Stroke Diagnostic Markers and Immune Cells
We used the R package ggcor [30] to identify correlations between all types of immune cells and stroke diagnostic biomarkers, and ggplot2 [29] was used to draw a lollipop chart to illustrate the results.

Batch Effect Minimization and Differential Gene Expression Analysis
First, we merged two GEO datasets into one matrix following background adjustment and normalization.
The effects of the likelihood of batch effects due to non-biological technical bias minimization were displayed in PCA cluster plots. Data from different GEO documents were inequivalent before normalization and batch effect removal ( Figure 1A). These data from diverse matrices are comparable depending on the sva package that can support subsequent analyses ( Figure 1B). After preliminary data processing, we identified 366 DEGs by comparing gene expression between stroke patients and healthy individuals, which are shown in the volcano diagram in Figure 1C.

Enrichment Analysis
The GO enrichment analysis results showed that the DEGs were mainly concentrated in the annotations of neutrophil degranulation, neutrophil activation involved in immune responses to biological process, the secretory granule membrane, the secretory granule lumen, the cytoplasmic vesicle lumen, the vesicle lumen of the cellular component, and phosphoric ester hydrolase activity of molecular function ( Figure   2A and B). The most significant functional pathways were all biological processes: myeloid cell differentiation, neutrophil activation involved in immune responses, neutrophil degranulation, response to lipopolysaccharide, and response to molecules of bacterial origin ( Figure 2C).
The KEGG analysis revealed that the DEGs were mainly associated with hematopoietic cell lineages, the interleukin 17 (IL−17) signaling pathway, the nuclear factor (NF)−kappa B signaling pathway, rheumatoid arthritis, and transcriptional misregulation in cancer ( Figure 3A, B, and C).
PPI network analysis was carried out via using MCODE algorithm that included in Metascape. The sub-network of PPI which ranked first was made up of 30 genes including CXCL1, PPBP, CXCL2, ELANE, CXCL8, CCR7, ITGAM, MCEMP and so on (Supplementary Figure 1). It supported that neutrophil activation, granulocyte activation and neutrophil degranulation is of great importance in stroke pathological process as functional pathways. Summarily, immune activation may participates in stroke pathological progress.

Identification of Potential Diagnostic Markers
The LASSO-logistic algorithm screened out 20 genes to be used for subsequent validation ( Figure 4A and B), seven genes, of which five were included in the former 20 genes, were identified via SVM-REF ( Figure 4C). For both methods, the training set was composed of samples from both the stroke patient and healthy individual groups with even-numbered GSM codes, whereas the testing set contained samples with odd-numbered codes.
The Venn diagram shows that the overlap of the results from the two methods contained five genes (cathepsin G [CTSG], F13A1, PABPC1, ECHDC2, and BIRC2), which were selected as our potential diagnostic markers ( Figure 4D).

Immune Cells Quantization and Correlation Analysis
A three-dimensional PCA diagram was used to visually reveal the different features of immune infiltration between IS patients and healthy ones. By using this method, we observed significantly different patterns of immune cells infiltration between the groups ( Figure 5A and Figure

Calculation of the Multivariate Fitted Prognostic Factor
The asymmetrical diagram of correlations between the 22 immune cell types and five potential prognostic genes showed that there were 5 kinds of immune cells which were obviously correlated with at least one gene ( Figure 6A). The 5 kinds of immune cells identified as diagnostic predictors were M0 macrophages, monocytes, neutrophils, eosinophils and activated dendritic cells. A multivariable combined receiver operating curve analysis of the testing set was conducted by combining the five prognostic genes with one or more immune cell types, and a fitted factor with the highest area under curve (AUC) value, specificity, and sensitivity was determined. When 4 kinds of immune cells (M0 macrophages neutrophils, activated dendritic cells, and monocytes) and 4 potential genes were fitted into one factor, an AUC value of 0.997 was calculated in the training set ( Figure 6B and C), and a maximum AUC of 0.945 was reached in the testing set ( Figure 6D and E).

Correlative Relationships between Potential Diagnositc Genes and Immune Cells
A correlation analysis revealed that expression of BICR2 was correlated with quantity of activated dendritic cells (r = 0.289, p = 0.026) and eosinophils (r = 0.337, p = 0.0009; Figure 7A  The color is closer to deep green, the correlative relationship is stronger and the color is closer to yellow, the correlation is weaker. The bigger the size of dot, the stronger the correlation, the smaller the size, the weaker the correlation.

Discussion
In this study, we used an RNA-seq expression matrix to construct a diagnostic prediction model for IS by making use of machine learning algorithms and analyzed the characteristics of associated immune cell infiltration. Following two rounds of screening, we found five potential diagnostic genes between people who have IS and healthy ones and four types of immune cells that contributed to the diagnostic model.

Results of subsequent enrichment and PPI analyses indicated that immune response processes and pathways related to neutrophils are involved in the pathophysiological alterations of IS.
Stroke is an acute cerebrovascular disease that is characterized by significant disability and causes considerable social and economic burdens annually [3]. Recently, several studies have implicated glial cells, an intrinsic cell population in brain tissue that includes microglia, astrocytes, and oligodendrocytes, in post-stroke immune alterations. Activated microglia play the same pathophysiological role as macrophages following ischemia and can induce neuronal injury by producing ROS via nicotinamide adenine dinucleotide phosphate oxidase [31]; these cells also produce cytokines [32] and MMPs [33,34].
Another possible mechanism of inflammation is the expression of CD14 on microglia and subsequent activation by toll-like receptor 4, which triggers microglial activation [36,37]; interferon produced via this process induces the expression of microglial genes that contribute to ischemia/reperfusion injury [38].
Surprisingly, microglia may also play a bidirectional role in ischemia/reperfusion by secreting antiinflammatory factors [39] and preventing the secretion of inflammatory factors related to astrocytes [40].
In the acute stage of IS, astrocytes participate in diverse functional processes, which include limiting brain tissue damage caused by ischemia, reducing the inflammatory response, maintaining nervous system homeostasis in the tissue around infarct region, and facilitating repair of the blood-brain barrier [41,42]. Furthermore, astrocytes are crucial for axonal regeneration and functional recovery during the chronic post-stroke stage. Through their interactions with other adaptive immune cells including NK cells and CD8 + T cells, astrocytes not only increase the infiltration of these immune cell populations but also enhance their function, which aggravates cerebral injury.
Another major part of innate immune system in the brain is the oligodendrocyte, which participates primarily in the cerebral infarction repair and significantly influences recovery. In contrast to the other two types of glial cells, the proliferation of oligodendrocytes cannot solely rely on itself without the existance of oligodendrocyte progenitor cells [43]. Research has found that various types of immune cells, such as macrophages, neutrophils, and T lymphocytes, are also of significance in the pathological alterations of IS [6,8,9,10,11], and this information has aided in the identification of new biomarkers (e.g., from RNA-seq data) for the early diagnosis of IS. In this study, we searched for genes that could act as key diagnostic biomarkers and investigated their functional features related to pathways in IS.
The results that majority of DEGs were enriched in neutrophil degranulation, neutrophil activation of immune response, myeloid cell differentiation, and T cell activation as a biological process are produced by GO enrichment analysis. The results that the NF−kappa B signaling pathway, IL−17 signaling pathway, hematopoietic cell lineage, rheumatoid arthritis, and transcriptional misregulation in cancer were significant pathways in IS are produced by KEGG enrichment analysis. The results of a PPI analysis using MCODE suggested that pathways involving neutrophils and B cell receptors were important [44]. These results confirm the importance of immune cell infiltration in IS.
We used both the LASSO-logistic model and SVM-RFE to screen potential biomarkers, which yielded in five genes: CTSG, F13A1, PABPC1, ECHDC2, and BIRC2. CTSG takes part in the development of dermatomyositic myoideum, which is possibly mediated by inflammatory cytokines [45].
CTSG, which is released by activated neutrophils, was shown to be positively correlated with the development of cardiovascular and cerebrovascular diseases by mediating platelet activation, as a serine protease [46]. CTSG cleaves of the glycoprotein (GP) Ib [47] and mediates the releasing of plasminogen activator inhibitor-1 [48], which promotes intravascular thrombosis. A previous animal experiment reported that CTSG-like antigen was highly expressed in vivo around cerebral wound margins and analogous phenomenon was also observed in cultured primary astrocytes. Astrocytes may take part in a series of anti-inflammatory biological processes by secreting CTSG-like antigens [49]. Meanwhile, a separate study reported that serine protease inhibitor alpha 1-antichymotrypsin, which is mainly released by astrocytes, can form complex with CTSG that will limit the deterioration of cerebrovascular diseases [50]. Another study found that CTSG can suppress the thrombin-induced highly-expression of inducible nitric oxide, via upregulating release of NO, which is the activation signal of astrocytes [51]. F13A1, which is also known as coagulation factor XIII subunit A, not only promotes thrombosis but also limits both thrombus growth by suppressing embolus growth [49] after coronary angioplasty and the risk of stroke in young female [52]. Another study discovered a novel variant of F13A1 in a child with hemorrhagic stroke [53]. PABPC1 contributes to various physiological processes, such as by facilitating the binding of the mechanistic target of the rapamycin-regulated RNA-binding protein La-related protein 1 to mRNA [54], thus participating in microRNA-mediated gene silencing [55], and by regulating immunoglobulin secretion in plasma cells [56]. One study reported that ECHDC2, increases susceptibility to ischemic injury, whereas resistance to ischemic/reperfusion injury is enhanced when ECHDC2 expression is knocked down. Mechanistically, ECHDC2 may increase the cellular levels of branched-chain amino acids without altering mitochondrial oxygen consumption [57]. Finally, in an animal experiment about stroke, fastigial nucleus stimulation was shown to lead to neuroprotection by targeting BIRC2 [58].
What is more is that BIRC2, induced by glial cell line-derived neurotrophic factor A, as well as BIRC3 and Gadd45b can suppress the process of neuron apoptosis, as reported in a previous in vitro experiment [59].
Therefore, four potential gene biomarkers-F13A1, ECHDC2, CTSG, and BIRC2-may have direct or indirect relationships with the pathological processes of stroke, whereas PABPC1 may affect both gene silencing and plasma cell activation.
We also evaluated specific immune cell patterns in IS and found that monocytes, activated NK cells , T cells, CD8 + T cells, inactive and activated mast cells and neutrophils were detected at significantly higher levels in stroke patients compared with healthy individuals, whereas memory B cells, Treg cells, and M0 macrophages were detected at a lower level in person with stroke compared with healthy individuals. After acute stroke, immune cells reach the ischemic region in an orderly and continuous process [60]. The quantity of infiltrating Treg cells, which are characterized by forkhead box protein P3 expression and secrete interleukin 10 to inhibit excessive immune responses [61], was lower in stroke group, whereas the quantities of infiltrating T cells, neutrophils, and activated NK cells [62], which damage ischemic tissues [63], were higher in stroke patients than in healthy individuals, indicating more severe inflammatory immune responses in stroke patients. Because the samples of the two GEO datasets were both collected from peripheral blood, the quantity of macrophages identified in the study was not reflective of the number of macrophages in the ischemic cerebral region. Furthermore, peripheral blood monocytes are recruited into the ischemic brain tissue via chemokines, whereas both microglia-derived and monocyte-derived macrophages are produced in cerebral ischemic tissues [60]. Consequently, our results revealed a high monocyte level accompanied by a low macrophage level.
The results of a correlative analysis between potential gene biomarkers and immune cells showed that eosinophils were positively correlated with PABPC1 and BIRC2 expression, activated dendritic cells were positively correlated with PABPC1 expression, M0 macrophages were positively correlated with CTSG expression, monocytes were positively correlated with F13A1 and negatively correlated with ECHDC2 expression, and neutrophils were negatively correlated with PABPC1 expression. ECHDC2 was expressed at low levels, whereas F13A1 was expressed at a higher levels compared with healthy group, which suggests that monocytes differentiate into both activated macrophages and dendritic cells. Moreover, CTSG may increase the number of macrophages, especially M0 macrophages and regional M1 macrophages in ischemic tissues, which would aggravate the inflammatory response. Further research is required to ascertain the complex immune cell infiltration process in cerebral ischemic regions, especially regarding the three subtypes of macrophages.
Several limitations do exist in our research. First of all, the datasets chosen from the GEO database were all obtained from peripheral blood samples; therefore, biological processes that occur in the brain, such as the differentiation of the three subtypes of macrophages, could not be investigated. RNA-seq data based on cerebral tissue samples should be jointly analyzed in future studies. Secondly, the feature fitted for the five genes did not achieve an optimal AUC, specificity, or accuracy, and the detailed mechanisms underlying the effects of these gene products on the pathological processes of IS via an immune response or other process remain unknown. Therefore, further basic clinical studies are essential to clarify the molecular biological mechanisms in the development of IS.
In conclusion, we identified five potential diagnostic genes for IS, namely, CTSG, F13A1, PABPC1, ECHDC2, and BIRC2. We also identified four types of immune cells as potential diagnostic biomarkers, namely M0 macrophages, neutrophils, monocytes, and activated dendritic cells, and the combination of these cell types with the five gene biomarkers improved the diagnostic efficiency. Further research on specific immune cell infiltration patterns and detailed inflammatory immune responses will facilitate the development of immunotherapies for IS. Moreover, further examination of the five potential gene biomarkers identified in this study may aid development of a new highly specific and sensitive diagnostic method for IS that is noninvasive and convenient.

Ethics approval and consent to participate
Not applicable (Original data of this study is from GEO which is a public functional genomics data repository).

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no conflict of interests.        The color is closer to deep green, the correlative relationship is stronger and the color is closer to yellow, the correlation is weaker. The bigger the size of dot, the stronger the correlation, the smaller the size, the weaker the correlation.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.