The prediction of survival and analysis of microenvironment in patients with sarcoma

Sarcomas were rare, aggressive, and heterogeneous group of tumors. The degree of tumor microenvironment cells, inltrating immune cells and stromal cells in the tumor had an important impact on prognosis of sarcoma. The aim of this study was to identify the differentially expressed genes (DEGs) association with immune of sarcoma and potential prognostic immune biomarkers for predicting survival of sarcoma patients. The gene expression data and clinical data of sarcoma was downloaded from The Cancer Genome Atlas (TCGA) dataset. The immune scores and stromal scores were calculated by ESTIMATE algorithm. The limma package was used to identify the immune DEGs. ClusterProler package and STRING were further to analysis the immune DEGs. A prognostic signature was built based the immune gene and clinical data by univariate Cox regression analysis and multivariable Cox analysis. Finally, the prognostic signature was evaluated by functional assessment and the Gene Expression Omnibus (GEO) database.


Background
Sarcomas, derived from mesenchymal tissue, are malignant tumors comprising more than 50 different entities [1][2][3] with high histological heterogeneity within each subtype [4]. According to statistics reported, more than 10 thousand patients diagnosed with sarcoma and 3800 would be dead in sarcoma per year in the United States [5,6]. The patients in sarcoma is poor prognosis with a high rate of recurrence after initial resection. However, the pathological mechanism of sarcoma is still unclear, and we are still helpless in the treatment of it.
The tumor microenvironment (TME) is the cellular milieu consisting mesenchymal cells, endothelial cells, tumor in ltrating immune cells (TIIC), extracellular matrix(ECM) molecules and in ammatory mediators [7], which can provide the metabolites and controls factors of proliferation and diffusion for many tumors. Tumor microenvironment has also been reported to seriously affect gene expression in tumor tissues [8][9][10][11]. Immune cells and stromal cells are the two main types of non-tumor components in the tumor microenvironment, and are considered to have important signi cance for the diagnosis and prognostic evaluation of tumors [12]. It has been con rmed to be related to relapse and overall survival [13][14][15].
To expose the immune microenvironment of sarcoma and the underlying pathological mechanism, the ESTIMATE algorithm [10] was performed to calculate the immune score of sarcoma from the TCGA database(https://portal.gdc.cancer.gov/) [16,17], and limma package was used to analyzed immune DEGs based the immune score. With the data from IMMPORT website(https://www.immport.org/), we further screen out the immune genes related sarcoma. Finally, an immune gene prognostic signature was constructed with the immune genes.

Data collection
The level 3 gene expression pro le (level 3 data) and clinical information of sarcoma was downloaded from TCGA database. The gene expression RNAseq-HTSeq-Counts was from Illumina. Clinical data consisted age, gender, survival, outcome, and histological type, et al.

Data processing
Based on R software(version 4.0.0), the gene expression data were standardized and ESTIMATE algorithm [10] contributed to attained the immune scores and stromal scores. The differentially expressed genes(DEGs) was analyzed by limma package [18] with a level of llogFCl ≥1.5 and adjusted P value ≤0.05.

Enrichment analysis of DEGs
Gene Ontology(GO) is an ontology widely used in the eld of bioinformatics for annotating large scale genes and gene products [19]. Kyoto Encyclopedia of Genes and Genomes (KEGG) can be used to predict which pathways a particular gene is enriched [20]. ClusterPro ler package [21], an R package for comparing biological themes, completed analyzed the GO enrichment and KEGG pathway enrichment.

PPI network construction
We uploaded all the up regulated DEGs and down regulated DEGs to the online tool, STRING database (https://string-db,org/) and set the level at a high con dence (0.700) to construct the protein-protein interaction(PPI) network. The data of PPI networks was download from STRING and was inputted into Cytoscape software [22] (version 3.7.1) for reconstruction. The plug in Molecular Complex Detection (MCODE) analyzed the PPI sub networks, and the sub networks would be considered only with a level of MCODE scores 10 and number of nodes 10.
Analysis of immune genes and construction of prognostic model IMMPORT shared data enables searching and downloading shared biomedical research data(https://www.immport.org/), and we could obtain the immune related genes from it. Based on the immune related genes from IMMPORT and immune DEGs we analyzed, we obtained the immune genes of sarcoma. With the Survival package, a univariate Cox regression analysis was to examine the correlation of the expression level of immune genes and overall survival (OS) of patients with sarcoma. Genes with a P-value <0.05 were considered prognostic immune genes with expression levels signi cantly related to OS. And the prognostic immune genes selected by the "step" function were used for construction of the prognostic model. By tting the prognosis immune genes to the multivariate Cox regression analysis with OS as the dependent variable, the relative contribution of these prognosis immune genes in sarcoma survival prediction was evaluated. Through the linear combination of the expression levels of immune genes with the multivariate Cox regression coe cient(α) as the weight, an immune expression-based prognostic risk score model was constructed. And the risk score (RS) was calculated as follows: RS = expression of gene 1 × α 1 gene 1 +expression of gene 2 × α 2 gene 2 +expression of gene 3 × α 3 gene 3 +…expression of gene n × α n gene n . And the sarcoma patients could be divided into high risk group and low risk group by the median risk scores of the immune gene prognostic model.

Evaluation and veri cation of prognostic model
To test the reliability of the prognostic model we constructed, receiver operating characteristic (ROC) curve was generated by survivalROC package in the R software to evaluate it. The univariate Cox regression analysis and multivariate Cox regression analysis would further assess whether the prognostic model can independently assess the prognosis of patients. In addition, the prognostic model was performed in the sarcoma data from GEO Datasets (https://www.ncbi.nlm.nih.gov/geo/).

Statistical analysis
FDRs(False discovery rate, adjusted p value) in limma and clusterPro ler were adjusted for multiple testing with the Benjamini-Hochberg procedure to control FDR [23]. Univariate analysis was performed using the log-rank test. Coexpression relationships was performed by the Pearson correlation coe cient. P value <0.05 was considered statistically signi cant. And statistical analyses were completed by R software (version 4.0.0), SPSS22.0(IBM Corporation, Armonk, NY, USA), and GraphPad Prism 8.

Results
The association of clinical data with immune scores and stromal scores We downloaded 263 samples gene expression pro les and clinical data from TCGA dataset. Among them, males and females counted 119(45.25%) and 144(54.75%), respectively. Age covered from 20 to 90 years old(60.56±14.63). Leiomyosarcoma histologic subtype included 67(25.48%) cases of conventional leiomyosarcoma, 34(12.93%) cases of poorly differentiated or pleomorphic or epithelioid leiomyosarcoma, 4(1.52%) cases of well differentiated leiomyosarcoma, and 158(60.08%) cases were of unknown histological result. Positive and negative tumor margins were found in 74(28.14%) and 139(52.85%) cases, respectively, and 50(19.01%) cases were unknown. With the ESTIMATE algorithm, immune scores ranged from -2039.13 to 3295.15(433.88±1116.81), and stromal scores were distributed between -1275.33 to 2519.30(882.36±718.82). The average immune scores was higher than stromal scores obviously(P 0.001). From the result, the male had a higher immune score and stromal score than the female (Figure 1 A-B, p<0.05). However, there was no obvious association between subtype and immune or stromal scores ( Figure 1 C-D, p>0.05). According to the level of the score, we divided the patients into two groups, namely, high immune scores or stromal scores group and low immune scores or stromal scores group. The high immune or stromal scores group of patients had an older age than low immune or stromal scores, though there was no statistical signi cance (

Identi cation of signi cantly differentially expressed genes
To nd out the relation of gene expression and immune scores or stromal scores, we analyzed the differentially expressed genes between high immune scores group and low immune scores group, as well as between high stromal scores group and low stromal scores group by limma package in R software. The heatmaps of differentially expressed genes were conducted by heatmap package (Figure 2 A-

The functional analysis of DEGs
To reveal the correlation of immune with sarcoma, we had a deeper analyzed the function of DEGs compared by high immune scores and low immune scores, we analyzed the function of 732 up regulated genes and 567 down regulated genes with ClusterPro ler package in R software. GO term were divided into three aspects, including biology: biological process (BP), molecular function (MF), and cellular component (CC). And a total of 1193 GO terms were enriched by high regulated DEGs, including 1015 BPs, 76 CCs, and 102 MFs. GO result showed that the up regulated genes enriched in immune cell activation, proliferation, and adhesion (Figure 3 A). It was revealed immune process played a signi cant role in sarcoma. And the down regulated DEGs were enriched in 477 GO Terms, including 324 BPs, 105 CCs, and 48 MFs. Down regulated DEGs were related to various muscle contractions (Figure 3 B). KEGG results showed that up regulated DEGs enriched in eight pathways, including Cytokine-cytokine receptor interaction, Viral protein interaction with cytokine and cytokine receptor, Chemokine signaling pathway, Osteoclast differentiation, and Cell adhesion molecules (CAMs) (Figure 3 C). And down regulated DEGs enriched in ten pathways, including Calcium signaling pathway, cAMP signaling pathway, Adrenergic signaling in cardiomyocytes, Vascular smooth muscle contraction, and Cardiac muscle contraction (Figure 3 D).

The construction of PPI network
We constructed the PPI network of DEGs to better understand the intrinsic interaction between the identi ed DEGs. With the data constructed by STRING and reconstructed by Cytoscape, we got the PPI networks of up regulated DEGs and down regulated DEGs, respectively (Supplement 1 A-B). To further analyzed the PPI network with the plug-in MCODE, with a level of Score (Density*#Nodes) ≥10 and Nodes ≥10, we got 4 subnets and one subnet from up regulated DEGs PPI network and down regulated PPI network, respectively (Figure 4 A-D). In the subnet of down regulated DEGs PPI, GNG4 owned the highest degree 24 followed by ADCY5 and ADCY2(degree=17). According to the GO result, we knew up regulated DEGs were associated with immune strongly. And the subnet 1 of up regulated DEGs, we called the top one subnet, was the most prominent subnet for owning only 36 nodes but with the most edges and highest scores. Thus, we decided to focus on this subnet to analyzed deeper. In the top one subnet, FPR2 owned the highest degree of 72, followed by C3(degree=66), C3AR1(degree=66), CXCL8(degree=66), CCR5(degree=65), CCR7(degree=65), CCR2(degree=59), CXCL1(degree=59), CXCL10(degree=57), and CXCR3(degree=57). Among the top ten genes, the CCR gene family and the CXCL gene family account for three. And the CCR gene family (CCR1, CCR2, CCR4, CCR5, and CCR7) and the CXCL gene family (CXCL1, CXCL2, CXCL6, CXCL8, CXCL9, CXCL10, CXCL11, and CXCL13) account for ve and eight in the top one subnet, respectively. In addition, the CCL gene family (CCL4, CCL4L1, CCL5, CCL13, CCL19, and CCL20) was also signi cantly enriched in this subnetwork, accounting for 6. A matrix graph of Pearson's correlation analysis indicated that the expression of CCR family genes, some of CXCL family genes and CCL family genes were closely related to each other internally with a correlation coe cient ≥0.5( Figure 5 A-C). The numbers in each grid represent the correlations between the corresponding genes. In addition, we found that the high expression of CCR family genes indicated a better survival in a short time, though, there were no statistical signi cance in CCR1 and CCR4( Figure 6).

Construction of the immune genes' expression-based prognostic signature
Based on the immune scores, we analyzed the differentially expressed genes. To further search for immune related genes, we download the immune genes from IMMPORT, and took the intersection of DEGs we analyzed and immune related genes from IMMPORT. A total of 85 genes were the common genes, that was the immune genes related to sarcoma. With the survival package, a univariate Cox regression analysis was to examine the correlation of the expression level of immune genes and overall survival of patients with sarcoma. A total of ten genes were considered as prognostic immune genes and selected using the "step" function to screen for the optimal combination (Figure 7 A). The follow seven prognostic immune genes were used for the prognostic model construction: BTC, EDN3, GREM2, IL17B, PAK3, RAC3, and ZC3HAV1L (Table 1). Survival analysis indicated that the prognostic model could separate the sarcoma into high risk group and low risk group, and patient with a high risk scores had a worse survival than those with a low risk scores. (Figure 7 B-C). Wilcoxon rank sum test was used to test the difference between the seven prognostic immune genes in the high-risk group and the low-risk group, and we found that its expression in the two groups was signi cantly different (Figure 7 D).
Evaluation and veri cation of prognostic model ROC curve showed that the risk score performed well for the survival prediction of sarcoma, and the AUC of ROC curve was 0.765 (Figure 8 A). After the univariate Cox regression analysis and multivariate Cox regression analysis, we found the risk score was an independent prognostic factor in sarcoma (Figure 8  B). In addition, GSE17674 was downloaded from GEO Datasets veri ed the prognostic models. GSE17674, gene expression pro ling was from GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, contained forty-four Ewing sarcoma patient clinical data. Background correction, normalization and standardization of gene expression data were performed by RMA -Robust Multichip Average algorithm [24]. The risk scores of patients from GSE17674 were calculated and the Kaplan-Meier curves of event-free survival and overall survival were drawled (Figure 9). The patients with high risk scores had a worse event-free survival and overall survival, though, p>0.05 in the event-free survival.

Discussion
Sarcoma, a malignancy tumor with a high recurrence rate, brings us big challenges in treatment. Analyzing the tumor microenvironment by CIBERSORT algorithm to calculate the immune scores and stromal scores, there have been some literatures reported that there were a signi cant relationship between tumor microenvironment and tumor prognosis [25,26]. Though, Runzhi Huang el at [11]. have analyzed the relationship between sarcoma recurrence and tumor immune microenvironment with TCGA data, this is the rst report to predict the prognosis of sarcoma patients by using TCGA data to calculate the immune score to construct a prognostic model. In our study, we analyzed the immune scores and stromal scores. To best understand the relationship between immune and sarcoma, we analyzed the function of immune DEGs. The high expressed immune DEGs were enriched in immune-related biological process aspects such as T cell activation, regulation of T cell activation, leukocyte cell-cell adhesion, regulation of lymphocyte activation, and leukocyte proliferation, mononuclear cell proliferation, lymphocyte proliferation, regulation of leukocyte proliferation, regulation of mononuclear cell proliferation, regulation of leukocyte cell-cell adhesion, et al. They also enriched in immune-related pathways, including Cytokine-cytokine receptor interaction, Viral protein interaction with cytokine and cytokine receptor, Chemokine signaling pathway, Osteoclast differentiation, Cell adhesion molecules (CAMs), Phagosome, Complement and coagulation cascades, NF-kappa B signaling pathway, Arachidonic acid metabolism, PI3K-Akt signaling pathway.
In the rst subnet of high expressed immune DEGs analyzed by MCODE, CCR gene family, CXCL gene family, and CCL gene family were clearly clustered. They had a signi cant coexpressioin relationship, especially CCR genes family. And high expression of CCR family genes indicated a better survival in a short time. CCRs(C-C motif chemokine receptors), CXCLs(C-X-C motif chemokine ligands), and CCLs(C-C motif chemokine ligands) belongs to chemokine receptors or chemokine ligands. Chemokines, chemokine receptors, and chemokine ligands play a signi cant role in cell migration, immune surveillance, in ammation, and tumorigenesis [27][28][29]. Numerous studies reported that chemokines and chemokine receptors were association with the tumor growth [30] via numerous mechanisms [31]. Chemokine receptors can inhibit the proliferation of tumor cells. Such as CCR1 reduce the proliferation of human liver cancer cells [32], CCR5 inhibits breast cancer progression in a p53-dependent manner [33]. CCR7-mediated leukocyte migration is greatly signi cant in the normal immune response. CCR7 recruits natural T cells and activated dendritic cells to the lymph nodes and initiate an adaptive immune response [34]. CCL19 is one of CCR7 ligands [35,36], and can activates T-cells [37]. Inhibition of the CCR7 gene causes T cells to home to secondary lymphoid organs with signi cantly impaired function [38]. In addition, CXCL1 and CXCL2 have been shown to play a role in the growth of gastric cancer, adenocarcinoma, pancreatic cancer, and melanoma, lung cancer [39,40].
To further analyzed the relation of sarcoma and immune, we screened out the immune genes related to sarcoma by immune DEGs and immune genes from IMMPOTT, and performed the univariable analyzed and multivariable Cox analyses to construct a prognostic model for predicting the prognosis of patients of sarcoma. A total of seven immune genes were used to construct the prognostic model. BTC, GREM2, IL17B, and PAK3 were protective immune genes related to sarcoma, while EDN3, RAC3, and ZC3HAV1L were risk immune genes related to sarcoma. EDN3(endothelin 3) is an important paracrine factor for embryonic melanocyte proliferation, migration, and survival during fetal development [41]. Its involvement was con rmed in the development of malignant melanoma [42] and choriocarcinoma [43]. And Liu Y, et al. reported that EDN3 was highly produced by glioblastoma stem cells and was an essential mitogen for NCC development and migration [44]. RAC3(Rac family small GTPase 3), a member of the Rac family of small guanosine triphosphatases (GTPases), played a signi cant role in regulating tumor metastasis and the tumorigenic process [45,46]. Liu TQ demonstrated that knockdown the RAC3 obviously suppressed the proliferation and colony formation of lung cancer cells [47]. However, Wang G performed that high expression of RAC3 is associated with a longer survival of lung adenocarcinoma patients [48]. Until now, there are still very few reports on the ZC3HAV1L(zinc nger CCCH-type containing, antiviral 1 like) gene, especially the relationship between ZC3HAV1L and tumors. And we are the rst to report that ZC3HAV1L is a risk factor for sarcoma.

Conclusion
To sum up, we were the rst to analyzed the differential expressed genes based the different microenvironment in sarcoma and further revealed the mechanism of sarcoma, especially the role of CCRs, CXCLs, and CLLs in sarcoma. Combined with the genes from the IMMPORT website, we obtained the sarcoma-related immune genes, and constructed immune gene prognostic model to predict the survival of patients in sarcoma well.  The association between immune or stromal scores and clinical data in sarcoma Comparison of gene expression pro le with immune scores and stromal scores in sarcoma. The heatmaps(A and B), volcanos(C and D), and Venn diagrams(E and F) of differentially expressed genes between high and low immune scores or stromal scores group. Note: H, high scores. L, low scores. lfc, Log(fold chang). Imm, immune score. Str, stromal score. DEGs, differentially expressed genes.

Figure 6
The survival analysis of CCR genes in sarcoma.  The evaluation of prognostic model. A, ROC curve analysis for predicting survival in sarcoma patients according to the risk score. B, From left to right are univariate Cox regression analysis and multivariate Cox regression analysis for predicting survival in sarcoma patients according to the risk score. Note: his_subtype, Histological subtype.

Figure 9
Kaplan-Meier curves of OS and EFS for the low-and high-risk groups in GSE17674. Note: OS, Overall survival. EFS, Event-free survival.