A Prognostic Model for the Overall Survival of Patients with Ewing's Sarcoma

Background: Ewing's sarcoma (ES) is the second most common primary malignant bone tumor. Although the disease has been studied on a molecular basis, its prognosis has not improved. Therefore, the goal of this study is to screen effective biomarkers for predicting the prognosis and progression of ES. Methods: In this study, the gene expression prole of Ewing's sarcoma was downloaded from public databases. Candidate genes were screened by two methods of weighted gene co-expression network analysis (WGCNA) and differential genes expression analysis, and the Hub gene were determined by the protein–protein interaction (PPI) network. The univariate cox regression and least absolute shrinkage and the selection operator (LASSO) Cox regression were performed on the Hub gene to construct a prognostic model. The receiver operating characteristic (ROC) curve tests the predictive power of the prognostic model. Use the clinical data in the International Cancer Genome Consortium (ICGC) database for verication. Finally, Gene Set Enrichment Analysis (GSEA) was performed to obtain the biological role of prognostic genes in ES. Results: A total of 542 differentially expressed genes (cid:0) DEGs (cid:0) were obtained. Through WGCNA, the obtained turquoise module is signicantly related to ES. PPI network analysis of candidate genes identied 10 Hub genes. The univariate cox regression results showed that CACNB1, IDH2, ATP1B4, PRKAG3, STAC3 are risk factors affecting the prognosis of ES. A prognostic model was constructed using IDH2 and PRKAG3, and patients were divided into two risk groups. The survival time of patients in the high-risk group was signicantly shorter than that in the low-risk group. The ROC curve conrms that this model has certain accuracy. IGCG cohort verication yielded consistent results. GSEA results showed that IDH2 in the development of ES may promote the progression of the disease through G2M CHECKPOINT, GLYCOLYSIS, MITOTIC SPINDLE, MTORC1 SIGNALING, PEROXISOME, PI3K-AKT-MTOR SIGNALING pathways, while PRKAG3 through MYOGENESIS pathway. Conclusions: This study shows that IDH2 and PRKAG3 may play important roles in the progression of ES, and can be used as therapeutic targets and prognostic evaluation biomarkers.


Background
ES is an aggressive tumor originating from mesenchymal stem cells, with the highest incidence in adolescents and young adults [1][2][3]. In the past 30 years, chemotherapy, surgery, and/or radiotherapy have had signi cant effects on improving the survival of patients [1,4], but at this stage, even if intensive treatment has been taken to improve the prognosis of patients, the effect was still not signi cantly improved [5,6]. ES has a certain background in molecular research, and its therapeutic and prognostic effects are still worth expecting. Studies have shown that ES is characterized by balanced chromosomal translocation, which leads to the expression of fusion oncoproteins. The most common one is EWS/FLI to drive tumor occurrence and development [7,8]. However, studies have pointed out that targeted inhibitors of EWS/FLI are not clinically feasible [9]. Also, insulin-like growth factor 1 and its receptor play key roles in promoting the progression of Ewing's sarcoma [10,11]. However, the results of clinical trials of inhibitors of this pathway have been disappointing [12,13]. Nevertheless, the results still have important guiding signi cance for us to research the treatment of the disease on a molecular basis. Therefore, the identi cation of new genes and pathways related to ES occurrence and patient prognosis is crucial.
In recent years, sequencing technology and bioinformatics have shown indispensable roles in the study of disease molecular mechanisms and speci c biomarkers. WGCNA is a systematic biological method that can obtain gene function and gene association from the expression of the whole genome. The identi cation of genes that promote key roles in disease provides profound insights [14]. This method has been used to explore the molecular pathology of diseases, such as liver cancer, Laryngeal Squamous Cell Carcinoma, etc. [15,16]. However, there is no report on using the WGCNA method to study Ewing's sarcoma related genes and discover new prognostic markers.
In this study, we constructed a prognostic model in the GSE17679 cohort and veri ed it in the IGCG cohort. We further performed a functional enrichment analysis of IDH2 and PRKAG3 to explore the underlying mechanism. This study provides an explanation for the pathogenesis and potential molecular biological processes of ES, and the identi ed genes are expected to become potential biomarkers and targets for diagnosis and treatment.

Data source
In this study, gene expression data sets (GSE12102, GSE17679, and GSE34620) were obtained from the GEO (https://www.ncbi.nlm.nih.gov/geo/), and these data sets are all based on the GPL570 platform. The GSE12102 data set includes 37 cases of ES. The GSE17679 data set contains 88 cases of ES tissue and 18 cases of non-tumor tissue, and obtains the survival information of 88 patients. The GSE34620 data set contains 117 cases of ES tissue. The gene expression pro le and survival information of 57 patients with ES were obtained from the ICGC (https://icgc.org/). Data processing and differential expression analysis Use the affy package to read in the raw data of the three data sets for background processing, quality control, and normalization processing [17]. The probe is annotated through the platform annotation le.
Combine the three data sets, use the Sva package to remove the batch effects between the three data sets [18], and then use the limma package to analyze differentially expressed genes [19], the false discovery rate (FDR) <0.05 and |log 2 FC| > 3 is used as the cut-off criterion.
Weighted gene co-expression network analysis Select genes with mutation rate> 15% from the expression pro le data for analysis. Use WGCNA R software package to cluster analysis of samples. Calculate the Pearson correlation coe cient of any two genes, and determine the soft threshold ability β value to screen the co-expression module. To test whether the β value conforms to the scale-free network, the logarithm of nodes with k connectivity (log(k)) should be negatively related to the logarithm of the occurrence probability of a speci c node (log(P(k))), and the correlation coe cient should be greater than 0.85. The number of genes in the gene network module is set to at least 50.

ES candidate genes and their function analysis
Select the module with the highest correlation with the disease, take cor.geneModuleMembership> 0.8 (correlation between genes and certain clinical phenotypes) as the cut-off criterion, and the overlapping genes between the acquired genes and DEGs as ES Related candidate genes. In order to clarify the potential biological process of candidate genes, the candidate genes were analyzed by GO enrichment and KEGG pathways through clusterPro ler software package [20].

Construction of PPI and selection of Hub gene
Search tool used to search for interacting genes (STRING) Online tools can be used to predict the PPI and construct a PPI network of genes with a con dence level of ≥0.7 [21]. The Cytoscape software visualizes the PPI network [22], and the cytoHubba plug-in obtains the top 10 genes, which are regarded as Hub genes.

Prognosis analysis
In order to verify the prognostic risk of the hub gene, the hub gene was analyzed by univariate Cox regression, LASSO COX regression, multivariate Cox regression and Kaplan-Meier survival curve. Validation is done through the ICGC database. The above analysis is through the R language survival package [23], the survminer package (https://CRAN.R-project.org/package=survminer), and the glmnet package [24]. P<0.05 was considered statistically signi cant.

Gene Set Enrichment Analysis
In order to explore the biological processes involved in prognostic genes, according to the median value of each gene expression, it is divided into two groups for GSEA analysis [25], and the GSEA software (version 4.1) is used for analysis, MsigDB gene set "h.all.v7.2.symbols.gmt" as a reference [26], the number of random combinations is set to 1000, and p<0.05 is considered statistically signi cant.

Differentially expressed genes in ES
After preprocessing the data and removing the batch effect, the principal component analysis (PCA) method was used to display the results before and after processing ( Figure 1A, B). The results show that the distribution pattern between tumor samples and normal samples is different, which can be used for further analysis. Compared with normal tissues, tumor tissues have 119 up-regulated genes and 423 down-regulated genes. The volcano map ( Figure 1C) and heat map ( Figure 1D) show the DEGs. Construction of co-expression network and screening of candidate genes Cluster analysis included 18 normal tissues and 242 ES samples, did not exclude any samples ( Figure   2A). Choose β = 3 as the optimal soft threshold ( Figure 2B, C), and scale-free R2 = 0.88 ( Figure 2D, E) to ensure a scale-free network. 5 modules were identi ed, namely turquoise, blue, brown, green, and yellow modules ( Figure 2F). We mapped the correlation between the module and the clinic. Among these modules, the turquoise module has the highest correlation with ES (correlation coe cient =-0.97, p<0.001) ( Figure 2G). Select the genes with cor.geneModuleMembership> 0.8 in the turquoise module, there are 381 in total. Then take the intersection with the DEGs. There are 305 overlapping genes as candidate genes for ES ( Figure 2H).  The PPI network between candidate genes is established using the STRING database. The clusteringcoe cient algorithm of the CytoHubba plug-in is used to select the top 10 genes from the PPI network as Hub genes, including STAC3, ATP1B4, TAS2R38, SSTR4, SORBS1, TNNC1, FBXO27, CACNB1, IDH2, PRKAG3 (Figure 4). The GSE17679 cohort is used to build a prognostic model Using the clinical data of GSE17679, the univariate cox regression analysis was performed on 10 Hub genes ( Figure 5A), and 5 genes were obtained through the threshold of P<0.05. 5 genes were subjected to LASSO Cox regression analysis, and the penalty parameter lambda was selected by the cross-validation method to obtain 2 relatively independent characteristic genes for constructing a prognostic model ( Figure 5B, C). The results showed that high expression of IDH2 and high expression of PRKAG3 are associated with poor prognosis of ES. Kaplan-Meier curve also showed that the high expression of IDH2 (P<0.001) and PRKAG3 (P<0.004) were associated with poor prognosis (Figure 6A, B). Risk Score = (0.749 × IDH2) + (1.499 × PRKAG3), according to the median value of risk score, patients are divided into highrisk group and low-risk group. PCA analysis in the GSE17679 cohort showed that the distribution patterns of patients in different risk groups were different ( Figure 6C). The Kaplan-Meier curve shows that the higher-risk group has a worse prognosis than the low-risk group (log-rank p<0.001) ( Figure 6D). The ROC was used to curve to evaluate the performance of the risk score, and the area under the curve (AUC) for 3 years, 5 years, and 10 years are 0.811, 0.838, and 0.844, respectively ( Figure 6E).

Validation of the prognostic model in the ICGC cohort
The 57 samples from the International Cancer Genome Consortium (ICGC) database were veri ed, and the results showed that the high expression of IDH2 (P=0.005) and PRKAG3 (P<0.034) was associated with poor prognosis (Figure 7A, B). In order to test the prognosis model constructed by the GSE17679 cohort, patients in the ICGC cohort were divided into a high-risk group or a low-risk group after the median risk score calculated by the same formula as the GSE17679 cohort. Similar to the results obtained in the GSE17679 cohort, PCA analysis con rmed that the distribution patterns of patients in the high-risk group and the low-risk group were different ( Figure 7C). Similarly, the high-risk group had a shorter survival time than the low-risk group (log-rank p<0.019) ( Figure 8D). In addition, the area under the curve (AUC) of the prognostic model at 3, 5, and 10 years are 0.659, 0.662, and 0.607, respectively ( Figure 7E).  in the past few decades, the ability to treat the disease is still limited due to the lack of precise molecular targets for ES. Many studies have shown that genetic abnormal changes promote the occurrence and development of the disease, but the molecular mechanism is still unclear [27]. Therefore, further exploration of molecular markers and potential mechanisms related to the progression and prognosis of ES can help the diagnosis and treatment of the disease. In this study, 542 DEGs between tumor and normal tissues were identi ed based on 3 ES microarray data of Gene Expression Omnibus (GEO) database. The turquoise module was found to be signi cantly related to ES, and candidate genes were screened out. Construct a PPI network to detect the interaction between the proteins encoded by candidate genes, and select Hub genes. Univariate COX regression, LASSO COX regression, multivariate COX regression, and Kaplan-Meier curve were used to analyze the survival of the Hub gene. The results showed that the high expression levels of IDH2 and PRKAG3 in ES are closely related to the poor prognosis.
Isocitrate dehydrogenase 2 (IDH2) catalyzes the oxidative decarboxylation of isocitrate to α-ketoglutarate in the mitochondria. Recent advances in cancer genetics have determined that the IDH2 gene is mutated in glioma, acute myeloid leukemia and other cancers [28]. Current studies have shown that inhibitors of IDH2 gene mutations may be clinically effective [29]. However, IDH2 has very few mutations in ES [30,31]. Wild-type IDH2 is down-regulated in liver cancer and gastric cancer, and low-expression IDH2 has a low ve-year survival rate [32,33]. Studies have found that high expression of IDH2 is associated with poor prognosis in esophageal squamous cell carcinoma [34], colon cancer [35], lung cancer [36], and non-small cell lung cancer [37]. The clinical correlation between wild-type IDH2 and ES has not been fully studied. Our results show that IDH2 is associated with the development and poor prognosis of ES.
PRKAG3, a protein kinase, is activated by AMP, the non-catalytic subunit of γ3, which may play a role in regulating energy metabolism of skeletal muscle [38,39]. PRKAG3 and overall breast cancer risk are considered as potential targets for cancer treatment [40,41]. Another study showed that PRKAG3 is associated with the recurrence of triple-negative breast cancer in the Chinese population [42]. We believe that PRKAG3 promotes the development of ES and serves as a new molecular marker.
Our research has constructed a new prognostic model, which has been well validated in the GSE17679 cohort and the IGCG cohort. IDH2 and PRKAG3 may be potential prognostic molecular markers for poor survival of ES, and provide potential targets for pathogenesis and treatment. In addition, G2M CHECKPOINT, GLYCOLYSIS, MITOTIC SPINDLE, MTORC1 SIGNALING, PEROXISOME, PI3K-AKT-MTOR SIGNALING, etc. may be the key pathways of IDH2 regulation in ES. PRKAG3 may promote the development of ES through the MYOGENESIS pathway. In future research, further experimental veri cation is needed to prove the biological role of IDH2 and PRKAG3 in ES.

Conclusion
In summary, we have studied the gene expression pro les of Ewing's sarcoma in GEO and ICGC databases through bioinformatics methods. We provide a prognostic model that has a good predictive effect on the prognosis of patients with Ewing's sarcoma. The model includes two genes (IDH2 and PRKAG3), and using GESA to nd that these genes may play important roles through some signaling pathways. These results may serve as potential prognostic molecular biomarkers in Ewing's sarcoma, and may contribute to the molecular targeted therapy of Ewing's sarcoma.

Availability of data and materials
The datasets used and/or analyzed during the present study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.