Identication of key genes related to macrophage inltration in gastric cancer based on WGCNA

Background: Gastric cancer (GC) is rampant around the world. Most of the GC cases are detected in advanced stages with poor prognosis. The identication of marker genes for early diagnosis is of great signicance. Studying the tumor environment is helpful to acknowledge the process of tumorigenesis, development, and metastasis. Methods: In GEO, 22 kinds of immune cell inltration were calculated by CIBERSORT. Macrophages were discovered remarkably inltrated higher in GC compared with normal tissues. WGCNA was utilized to construct the network and then identify key modules and genes related to macrophages in TCGA. Results: Finally, 18 hub genes were veried. In the PPI bar chart, the top 3 genes were chosen as hub genes involved in most pathways. On the TIMER and THPA websites, it is veried that the expression levels of CYBB, CD86 and C3AR1 genes in tumor tissues were higher than those in normal tissues. Conclusion: These genes may work as biomarkers or targets for accurate diagnosis and treatment of GC in the future. Our ndings may be a new strategy for the treatment of GC.

immunotherapy strategies. By understanding immune cells in ltration, we are capable of decrypting the proportion and functional potential of immune cells in tumor tissues.
With the development of bioinformatics, many algorithms and tools are utilized to explore TME [11].
Weighted gene co-expression network analysis (WGCNA) is used to nd co-expressed gene modules, and to explore the association between the gene networks and the phenotypes of interest, as well as the core genes in the network [12]. Cell type Identi cation by Estimating Relative Subsets of RNA Transcripts (CIBERSORT) deconvolutes the expression matrix of immune cell subtypes based on the principle of linear support vector regression. RNA-Seq data can be used to estimate immune cell in ltration in various cancers [13].
In this study, the WGCNA coexpression network in GC was constructed, and 18 key genes associated with macrophages were determined. Subsequently, the hub genes and signal pathways were found out in PPI.
Moreover, we veri ed that these hub genes were highly expressed in GC at the gene and protein levels.
Finally, immunohistochemistry (IHC) results of the collected clinical samples also showed that these genes were highly expressed in GC. In conclusion, CYBB, CD86, and C3AR1 were identi ed as the potential biomarkers in GC.

Materials And Methods
Data from the GEO cohort and preprocessing From the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/), we downloaded publicly available raw microarray expression data of GSE13911 to acquire the transcriptional data of GC.
Among the datasets of GC in GEO, GSE13911 has paired normal tissues, containing 31 normal samples and 38 tumor tissues. The chip standardization method was RMA (Robust Multichip Average algorithm), referring to the chip tutorial, the standardization process is mainly divided into 3 steps: Background correction (removing array auto-uorescence), Quantile normalization (making all intensity distributions identical), and Probeset summarization (calculating one representative value per probeset).
Quanti cation of tissue immune cells CIBERSORT, a deconvolution algorithm, can calculate the cell composition of complex tissues based on the standardized gene expression data, and change the method to energize the abundance of speci c cell types. The composition of immune cells in breast and liver cancer tissues was veri ed and successfully evaluated by ow cytometry [14]. It provides 22 kinds of common in ltrating cell expression data LM22 as a reference. Then, we employed the CIBERSORT in the R packages to evaluate the in ltration of immune cells [12]. CIBERSORT was constructed based on microarray data, thus we applied CBERSORT for quantitative analysis of immune cells in GEO data. Xcell (https://xcell.ucsf.edu/) is another powerful tool for quantifying the in ltration of immune cells in tissues [15]. It is more robust for RNA-seq data. For the TCGA data, Xcell was used for the quantitative analysis of immune cells.
Data preprocessing of GC in TCGA RNA-seq and corresponding clinical data of GC from The Cancer Genome Atlas (TCGA) database (https://genome-cancer.ucsc.edu/) was acquired. The RNA-seq data was in the form of HTSeq-FPKM (fragments per kilobase of transcript per million). 32 normal tissues and 381 tumor tissues were screened for further study.

Co-expression network construction
A weight co-expression network was constructed by using the R package WGCNA [12]. WGCNA is suitable for complex data analysis in multiple samples. It can calculate the expression connection between genes, identify gene modules with similar expression patterns, analyze the relationship between the gene set and the sample phenotype, and plot the regulation between the genes in the gene set, and appraisal key regulatory genes. Through this analysis, we were able to identify the co-expressed gene set, which is called modules. We also associated modules with phenotype data for further analysis and discovered potential marker genes.
The rst step was to lter the gene expression data. Missing values or genes with low expression were ltered. The samples with many missing values were also lted. WGCNA has a built-in test gene and sample function, and a basic lter could be performed through this function. After basic ltering, we proceeded to checked if there were samples of outliers and judged the clustering tree of the samples.
When constructing a co-expression network, the correlation coe cients between genes were multiplied to characterize their correlation. The power value was determined, which is the soft threshold or what we call the power value and the beta value.
Differentially expressed mRNAs between GC and non-tumorous tissues Limma package in R was used to screen out the differentially expressed mRNA. FDR-Filter = 0.01 and logFC-lter = 1 were used as the critical values. Differentially expressed genes (DEGs) were screened out for follow-up research.

Correlation analysis of immune cell in ltration and clinical information
The corresponding clinical information organized in TCGA was analyzed. The clinical information matrix was combined with the immune cell expression matrix. Xcell (https://xcell.ucsf.edu/) calculated the content of 22 kinds of immune cells to visualize the changes of macrophages in different clinical stages [15]. Survival analysis of immune cells and the evaluation of their effects on tumors were attached.

Functional enrichment analysis
Gene ontology (GO) enrichment analysis was realized to explore the molecular functions (MF), biological processes (BP), and cellular components (CC) related to DEGs. From the Kyoto Encyclopedia of Genes and Genomes (KEGG), numerous functions of hub genes were presented.
The validation in TIMER (Tumor IMmune Estimation Resource) TIMER (https://cistrome.shinyapps.io/timer/) uses RNA-Seq expression pro ling data to detect immune cell in ltration in tumor tissues [16]. It provides the in ltration of 6 kinds of immune cells (B cells, CD4 + T cells, CD8 + T cells, Neutrophils, Macrophages, and Dendritic cells). SCNA (somatic copy number alteration) module explores the relationship between somatic cell copy number variation and immune in ltration. TIMER uses GISTIC2.0 data to examine the effect of different gene copy states on immune in ltration compared with normal tissues. The SCNA module is grouped according to the CNA (copy number alteration) of a certain gene. This module explores the different levels of in ltration in 6 kinds of immune cells between the groups. The grouping of CNA is divided into ve situations: deep deletion (-2), arm-level deletion (-1), diploid / normal (0), arm-level gain (1) and high ampli cation (2).
Gene expression level in GEPIA (Gene Expression Pro ling Interactive Analysis) GEPIA (http://gepia.cancer-pku.cn/) generates a gene expression boxplot and stages plots based on userde ned inputs. The speci c gene expression of the normal tissues was compared with tumor tissues.
Based on the section of the pathological stages, the violin diagram was generated according to the pathological stages of the cases [17].

THPA (The Human Protein Atlas database)
The Human Protein Atlas project (https://www.proteinatlas.org/) contains the protein expression of normal cells, tissues, and cancers. More than 50% of all human protein-encoding genes in line with 11,200 unique proteins were included [18]. In this website, IHC was used to compare the expression of different proteins between human normal tissues and cancer tissues.

IHC
Para n-embedded GC tissues were collected from the First A liated Hospital of Xi'an Jiaotong University. IHC was performed using the EnVision immunohistochemistry kit(Dako, Denmark, k5007). Firstly, the slides were toasted at 60°C overnight. Secondly, the slides were depara nized with xylene and rehydrated in gradient ethanol baths. 0.01M citrate buffer was applied to retrieve the antigen. A 3% H 2 O 2 solution was dropped in methanol to block endogenous peroxidase. Then, slides were treated with 5% normal goat serum and incubated in the wet box at 37°C for 30 minutes to block non-speci c antigen.

Results
As shown in Fig. 1, the owchart presented the process. Firstly, we analyzed the GEO dataset GSE13911. Then, we found macrophage in ltration was different in GC and normal tissues. To verify this phenomenon, we downloaded related data from TCGA. In this section, we identi ed 3 hub genes relevant to macrophages by using WGCNA. Finally, the former results were veri ed by various websites and IHC.
GEO Data processing used to observe the co-expression (red) or co-inhibition (blue) of various cells in tumor tissues. We could summarize from the picture that neutrophils and activated dendritic cells had the highest correlation and the correlation coe cient was 0.65. In contrast, resting and activated mast cells were most negatively correlated and the coe cient was -0.59. Then, 31 pairs of samples could be matched in the GSE13911 data set for pairwise comparison. The violin plot was a variation of the box plot, which combines the distribution kernel density estimation curve with the box plot. The outermost shape showed the density of the location so the distribution of the data could be found. The middle white dot represented the median, the thick black bar (black box) represented the quartile range (25% quantile and 75% quantile), and the thin black line extending from it represented the 95% con dence interval. Follicular helper T cells and M0 macrophages, activated dendritic cells, and activated mast cells were high in the tumor group (Fig. 2D).
The difference was statistically signi cant. In the normal group, plasma cells and CD8 + T cells and resting mast cells were highly expressed. The most important nding was that M0 macrophages in the tumor were signi cantly higher than the normal group. Besides, the trends of M1 and M2 macrophages were also consistent with the general knowledge in immunology. Besides, the paired plot result was unanimous in the result of the violin plot but merely looked at a certain cell more directly (Additional le 1: Figure S1). The M0 macrophages were exclusively selected. In the 31 pairs of matched samples, it could be seen that except for individual cases, most of the M0 macrophage in ltration was higher in the tumor group than the normal group.
Immune in ltration of GC in TCGA Combined with TCGA, Xcell website was used to analyze the in ltration of immune cells. The changes of macrophages in different clinical stages were observed.
In KS multi-group test, the p-value indicated the overall situation. The signi cant discrepancy stated the difference between each group. There was an upward trend in Fig. 3, which meant the macrophage fraction was increasing as the clinical-grade increased.

Differential analysis of genes in TCGA samples
In TCGA, 32 samples were in the normal group and 381 samples were in the tumor group. The following parameters (FDR-Filter=0.01, logFC-lter=1) were used to analyze the differences between the genes in TCGA samples. A total of 6227 DEGs were picked out.
WGCNA co-expression network analysis WGCNA achieved the goal of quickly locking core genes by grouping genes (modules) and associating gene modules with phenotypes. Next, WGCNA was utilized to divide these genes into various modules. Macrophages and neutrophils were used as independent variables to calculate the modules related to macrophages. Finally, Module Membership (MM) > 0.8 and the Gene-Signi cance (GS ) > 0.5 were used to screen out 18 genes that were positively associated with macrophages.
The rst step of sample clustering in WGCNA was to screen out the samples with outlier expression and clusters those with a similar expression. It could be delineated from Fig. 4A that the TCGA-BR-6710-01A-11R-1884-13 sample was distinct from other samples. The ltering principle of the soft threshold is to make the network more scale-free. Filtered soft threshold, undirected networks with power less than 15 or directed networks with power less than 30 could not lead the scale-free network structure R^2 to reach to 0.8 or the average connectivity to drop to below 100. This may result from batch effects, sample heterogeneity, and complicated experimental conditions on expression, which needed to be removed.
The second step of the WGCNA gene clustering was to cluster genes with similar expression trends. By using a dynamic tree cut [19], a hierarchical clustering tree was formed. Its rationale was a module (pathway) based analysis. On the tree, each leaf represented a gene. The genes with similar expression data were tightly connected, formed a branch of the tree, and represented a gene module. After that, several modules were generated (Fig. 4B).
The third step was to choose a suitable cutting position. We calculated all integers from 1 to 20 as thresholds to test the optimal threshold. Among them, power Estimate was the best power value, t Indices saved the characteristics of the network corresponding to each power and k is the connection degree value. The average value of the connection was visualized and then generated Fig. 4C, D. When the y-axis was equal to 0.9, the intersection with the curve was exactly equal to 4 so we chose 4 as the power value.
The fourth step was to merge the clustered gene modules and make the modules with similar expressions into a large module. We used 4 as a beta value to build a gene module. In Fig. 5A, red indicated a positive correlation and green indicated a negative correlation, the darker the color, the stronger the correlation.
The key step was to associate the genes of different modules with the content of macrophages and pick out the modules that were positively correlated with the content of macrophages. The MEpink module in Fig. 5B was selected as most positively correlated with macrophages and the MEbrown module was negatively correlated with macrophages. This research focused on the modules that had a poor survival expectation and the MEpink module was positively correlated with the content of macrophages based on this module.
The sixth step could be realized in Fig. 5C that the MEpink module had a strong positive correlation with macrophages. Among them, there were 149 genes in the pink module. In case of MM > 0.8 and GS > 0.5, 18 genes most relevant to macrophage in ltration were selected for following research. Table 1 exhibited the function of 18 genes. A barplot was drawn (Additional le 2: Figure S2), and these genes were highly expressed in tumors. A total of 18 genes were all up-regulated in tumor tissues in Heat map (Additional le 3: Figure S3). From the correlation heat map (Fig. 5D), we could know the co-expression relation between these genes. All the values were greater than 0.5, indicating that these genes were correlated with each pair, which was also con rmed by WGCNA. After nding the modules associated with the phenotypes, the PPI bar chart (Fig. 5E) was obtained. Genes were at the core position, and the genes could be screened according to different conditions. Cytoscape software was used to form Fig. 5F. CD86, CYBB, and C3AR1 were identi ed as hub genes.
Verifying the tissue expression level in THPA and GEPIA database We obtained the transcription level of real hub genes from THPA. As Additional le 4: Figure S4A exhibited, the expression level of three hub genes was remarkably high in cancer tissues, which also supported our suppose. CYBB, C3AR1, and CD86 were highly correlated, and had high expression in GC (Additional le 4: Figure S4C-E). There was a rising trend in Additional le 4: Figure S4B, but taking out any of them had little effect on survival (p>0.05), indicating that, they might work together.
Verifying the protein expression level by IHC After IHC staining, it was mostly intuitive that CD86 was highly up-regulated in GC tissues (Fig. 6A). Fig.  6B showed a rise of CYBB expression in tumor tissues compared to adjacent non-tumorous tissues. The positive cell density of C3AR1 was also higher in GC tissues than that in normal tissues (Fig. 6C). These results were consistent with THPA.

Verifying immune in ltration with TIMER
Using the TIMER website, we found the expression level of CD86, CYBB, and C3AR1 had a positive correlation with CD8 + T Cells, CD4 + T Cells, Macrophages, Neutrophils, and Dendritic cells (Fig. 7B). The degree of macrophage in ltration signi cantly affected the prognosis. Hence it is worthy of further research and exploration. These genes affected macrophages and had an impact on survival (5-year survival rate, 10-year survival rate and long term survival rate ) (Fig. 7A). In order to explore the relationship between gene copy number variation and immune in ltration abundance, SCNA module was used to analyze the effect of different somatic copy number alterations of CD86, CYBB and C3AR1 on the immune cell in ltration in GC. It showed that these genes had a great in uence on immune cell in ltration (Fig. 7C).

Gene function analysis
From the GO enrichment pathway map, we acquainted these 18 genes that were mainly enriched in several pathways (Fig. 8A). These genes mainly regulated the production of tumor necrosis superfamily cytokines. Table 2 exhibited the details. In GC, another GO-Biological Process enrichment analysis of DEGs was displayed in Additional le 5: Table S1. It was delineated that these 18 genes were mainly enriched in the phagosome pathway in the KEGG enrichment pathway map (Fig. 8B). GO-Cellular Components, GO-Molecular Function, and KEGG pathway enrichment analysis of DEGs in GC samples were showed in Table 3.
ClueGO as a Cytoscape plugin that integrated gene ontology (GO) terminology and KEGG/BioCarta pathway were used to create a functionally ordered GO/pathway network [20]. After screening (p <0.05), C3AR1 participated in the regulation of mononuclear cell migration and monocyte chemotaxis. CD86 and CYBB involved in the regulation of cytokine biosynthetic process and interleukin-2 production (Fig. 8C).

Discussion
Although the incidence of GC continues to decline in Western countries, it is still a common tumor in developing countries. At present, surgery is still the main treatment for localized GC. However, most GC is diagnosed at advanced stages, the recurrence rate remains very high. Also, combined chemotherapy for patients with GC usually causes drug resistance. In recent years, immunotherapy has become one of the most promising cancer treatment strategies and had signi cant effects on several types of tumors [21][22][23]. Immunotherapy for GC also has promising prospects [24].
Tumor-associated macrophages (TAMs) are one of the most abundant immune cell populations in the TME [25][26][27]. TAMs might be a potential marker for poor prognosis. While the mechanisms of TAMs in GC is still unclear, further research is required .
In our study, we rst calculated the in ltration of various immune cells in GC tissues and matched normal tissues(GSE13911) by using CIBERSORT. We found that the M0 macrophage in ltrated in the tumor tissues was signi cantly higher than the normal tissues (p=0.001). Subsequently, we found that the degree of macrophage in ltration had a positive correlation with disease progression by analyzing the data of TCGA. We suspected that tumor progression was related to the degree of macrophage in ltration. To verify our hypothesis, the WGCNA algorithm was performed to nd 18 genes that were positively related to the degree of macrophage in ltration in GC. Furthermore, we identi ed three hub genes: CD86, CYBB, and C3AR1 based on the protein-protein interaction network for further studies. We used TIMER to explore the effect of various immune cell in ltration levels on the survival of GC patients. Results showed that the more TAM in ltration, the worse the short-term and long-term survival of patients. Moreover, we found that these three genes affected various tumor-in ltrating immune cells, especially tumor-in ltrating macrophages. Again, these hub genes were found to be highly expressed in tumor tissues than normal tissues by using the THPA database, which supported our results. As the tumor progressed, these hub genes expressions were con rmed to rise in GEPIA. Then, we collected some tumor tissues and adjacent tissue samples in the clinic. Through IHC, we also con rmed that these genes were highly expressed in tumor tissues. Finally, we conducted GO and KEGG enrichment analysis on these 18 genes and found these genes were related to the production of TNF and IL-2 and the formation of phagosomes.
Cytochrome B-245 Beta Chain (CYBB) is expressed in eosinophils, neutrophils, and B lymphocytes, et al [28], responding to many in ammatory cytokines and stimuli such as IFN-γ, LPS, and TNF-α [29,30]. It has been proposed as a primary component of the microbicidal oxidase system of phagocytes [31]. CYBB can lead to Immunode ciency 34 (IMD34) and Granulomatous Disease. The terminal of a respiratory chain needs it. Cellular pH can be regulated by CYBB. C3AR1 is activated by its natural ligand, C3a, which is a 77 amino acid split product converted by C3 protein and is traditionally considered to be mainly pro-in ammatory [32]. When the complement system is activated, an anaphylatoxin called Complement C3a Receptor 1(C3AR1) is released. After stimulation, chemotaxis, granule enzyme, and superoxide anion were produced [33]. It is widely expressed in a variety of differentiated hematopoietic cell lines, in the lung, spleen, ovary, placenta, small intestine, whole brain, heart, and endothelial cells. Also, it was mainly expressed in lymphoid tissues [34].
CD86 is a member of the immunoglobulin superfamily. It works as a ligand for two proteins-CD28 antigen and cytotoxic T-lymphocyte-associated protein 4(CTLA-4). It is widely known that the CD86 molecule (B7-2) belongs to the B-7 family. Together with CD80 molecule (B7-1), it is expressed on antigen-presenting cells(APCs, like dendritic cells, macrophages, and B-cells) [35]. Diseases related to CD86 include acute myocarditis and myocarditis. CD86, as a costimulatory molecule, participates in the formation and regulation of immune synapses [36]. Dai et al showed that CD86 expression in CLL was lower than that in normal B cells, but its role in CLL cell survival was not clear [37,38]. In general, C3AR1 is related to complement function. CYBB and TNF are pro-in ammatory genes. CYBB is related to interaction between neutrophils and macrophages. CD86 is a macrophage activation marker. Its high expression indicates that macrophages are activated.
The articles on identi cation of GC related prognostic genes by WGCNA have sprung up. For instance, one research explored the prognosis of GC, but not related to macrophages [39]. Another study aimed to identify potential vital miRNAs and modules associated with the progression of GC [40]. Our study revealed the important genes associated with macrophages in GC for the rst time.
However, our study had limitations that should be acknowledged. To begin with, the defect of this paper was that several other modules could be used to identify the hub genes, but we only chose one on consideration. Again, we did not differentiate the types of macrophages although M1 and M2 macrophages have opposite biological effects, and they were derived from M0 macrophages in different tumor microenvironments. But using the Xcell result, we did not differentiate the types of macrophages.
Last but not least, we analyzed the results based on bioinformatics methods and only predicted the relevant results. Therefore, further molecular biological experiments were required to con rm the function of the hub genes associated with GC.
In summary, the macrophage immune in ltration was higher in the GC tissue than its paired normal tissues. First of all, we combined immune in ltration with WGCNA. Subsequently, we applied the TCGA to seek the genes which could explain the source of this disparity. Combined with WGCNA, the key modules and genes associated with macrophages were identi ed. Lastly, three hub genes were discovered and validated, which can well predict the degree of macrophage in ltration in GC. These genes can be used as drug targets to reduce macrophage in ltration and improve survival rate in the future.

Additional Files
Additional le 1: Figure S1.   The owchart of the data processing. GSE13911 was used to analyze the immune cell pro les by CIBERSORT. Corresponding TCGA gene expression and clinical data were used to nd the genes related to tumor-in ltrating macrophages by WGCNA. Through PPI analysis, 3 hub genes were identi ed. The relationship between hub genes and tumor-in ltrating macrophages was explored by TIMER. The expression of hub genes in tumor and normal tissues was identi ed in GEPIA. THPA was utilized to verify hub genes expression between adjacent non-tumorous tumor and tumor tissues. This result was validated by IHC.