Weighted Gene Co-Expression Network Analysis Reveals a New Survival Model for Prognostic Prediction in Ewing Sarcoma

Background: Ewing sarcoma (ES) is a malignant bone or soft-tissue cancer that mainly arises in children and young adults. However, the prognosis of Ewing sarcoma remains very poor, and there is no effective prediction method. The aim of our study was to identify a prognostic model for ES patients based on prognosis-associated mRNA expression proles. Methods: The GSE17679 dataset was downloaded from the Gene Expression Omnibus (GEO) database. Differently expressed genes (DEGs) between ES and normal control were identied using R package “limma”. A weighted gene co-expression network analysis (WGCNA) was used to screen gene modules associated with recurrence/metastasis and survival status based on DEGs. Results: The prognostic model was constructed based on genes in MEbrown module, which was most associated with recurrence/metastasis and survival status, using Kaplan-Meier survival and lasso regression analysis. Sixteen genes were screened to construct the prognostic model. ES patients were grouped into high- and low-risk groups based on the median of risk score calculated for each of them. ES patients in high-risk group have worse survival than patients in low-risk group. The AUCs (Area under the ROC curve) for 1-year, 3-year, and 6-year overall survival were 0.903, 0.995, 0.953. Conclusions: Taken together, our research constructed a prognostic model which has excellent prediction performance for overall survival of ES patients.


Background
Ewing sarcoma (ES) is an aggressive and rare tumor developed usually in bone, but sometimes in soft tissues as well, which mainly occurred in children and young adults [1]. With the development of therapeutic methods and diagnostic techniques, the prognosis of ES patients has elevated markedly [2,3]. However, the 5-year survival rate remains less than 20% for patients with metastatic or recurrent tumors [4,5]. Therefore, the screening of gene modules associated with recurrence/metastasis and survival status and construction of an effective prognostic model which could contribute to predict the survival rate to help the treatment of ES patients is urgently required.
In recent years, high-throughput technologies have been widely used to identify essential driver genes in the processes of tumor [6-9]. Weighted gene co-expression network analysis (WGCNA) is a valuable tool for the analysis of the gene expression patterns in multiple samples, which could e ciently identify and screen co-expressed gene modules and key biomarkers [10]. The prognosis-related gene signature and prognostic model have shown a great advantage for tumor prognosis prediction [11]. However, there are few effective prognostic models in ES for now.
In our study, human ES expression and clinical data in GSE17679 were downloaded from Gene Expression Omnibus (GEO). The R package "Limma" was used to get differently expressed genes (DEGs) between tumor and normal tissues; the WGCNA tool was used to screen gene modules associated with recurrence/metastasis and survival status based on DEGs. We further performed enrichment analysis to reveal the biological functions and pathways of selected gene modules. In addition, a prognostic model was established based on prognosis-related gene signature, which was screened from the gene module associated with recurrence/metastasis and survival status, using Kaplan-Meier survival and lasso regression analysis. More importantly, we tested the prognostic model using ROC analysis, which showed a great prediction performance for overall survival of ES patients. The current study provided new possibilities for developing prognosis risk assessment models of ES patients.

Result Identi cation of DEGs in ES
We obtained the gene expression pro le and clinical information of ES and normal tissues from GSE17679. Differently expressed genes (DEGs) between ES and normal tissues were identi ed using R package "limma". A total of 4855 DEGs (2739 up-regulated genes and 2116 down-regulated genes) were screened based on the cut-off criteria of a adjust p-value < 0.01 and |log2FC (fold change) | > 1, in which 3779 protein-coding genes were chosen for subsequent analysis ( Figure 1A-B).

Construction and identi cation of key modules
Next, 3779 protein-coding DEGs were used to perform WGCNA using R package "WGCNA".

Functional analysis of MEbrown module
To further explore the potential function of genes in MEbrown module, we conducted the functional analysis using R package "clusterpro ler". The functional enrichment results revealed that these genes were mainly associated with chromosome segregation and nuclear division in BP (biological process) category. For CC (cellular component) category, the genes in MEbrown module were mainly distributed in chromosomal region and spindle region. In MF (molecular function) category, the genes in MEbrown module mainly enriched in single-stranded DNA binding and catalytic activity, acting on DNA ( Figure 4A).
In addition, KEGG pathway analysis indicated the enrichment of MEbrown module mainly in Cell cycle pathway ( Figure 4B).

Construction of the PPI network
To further explore the relationship between genes in MEbrown module, The STRING website was used to construct the PPI network. We identi ed three signi cant sub-modules using cluster analysis of PPI network in Cytoscape MCODE plug-in. These three sub-modules are respectively composed of 82, 24 and 11 genes ( Figure 5A-C).

Construction of the Prognostic Model for ES
We further performed Kaplan-Meier survival analysis and screened 49 genes with p < 0.00001 (Supplementary Table 1). The prognostic model was constructed based on the 49 genes using lasso regression ( Figure 6A

Discussion
Understanding the pathogenesis of ES is essential for identifying new biomarkers to develop new therapeutic strategies [12,13]. However, metastasis and recurrence are the main causes of poor prognosis of ES [14,15] Weighted gene co-expression network analysis A total of 88 ES samples with clinical data (including age, sex, survival time, survival status and recurrence/metastasis status) were included in the WGCNA using R package "WGCNA" [10]. The WGCNA was performed as previously described [17,18]. Brie y, the pearson correlation coe cient was used to construct the correlation matrix. The co-expression similarity matrix was constructed by the absolute value of correlation between transcriptional data. We used the formula Axy =|Cxy| β (Axy: the adjacency between gene x and gene y; Cxy: Pearson correlation between gene x and gene y) to establish the weighted adjacency matrix. The topological overlap matrix (TOM) adds the adjacent genes generated by related networks to calculate the corresponding differences. Based on the "hclust" algorithm, the TOM diversity measure was clustered by gene hierarchy, and the genes with highly collaborative changes were divided into a module. The minimum gene number of the gene tree was 40. The module was constructed by dynamic branch cutting method.

PPI network construction and module analysis
The protein-protein interaction network (PPI) analysis was performed by STRING database (https://stringdb.org/) using modules associated with survival status in WGCNA. The Network was visualized by cytoscape software (version:3.