Mining TCGA to Reveal Immunotherapy-related Genes for Soft Tissue Sarcoma


 Objective: By mining the TCGA database to look for immunotherapy targets of soft tissue sarcoma, and analyzed their biological behavior. Methods: The data of 265 samples were downloaded from the TCGA database to analyze the expression profile of soft tissue sarcomas. Research methods include immune and stromal scores, calculating DEGs, volcano maps and differential gene survival curves, gene enrichment analysis.Results: Kaplan-Meier survival curves showed that in the high immune score group, the total survival time was generally higher than that in the low immune score group. Analysis of the top ten terms resulted in the minimum P values for immune and inflammatory responses, plasma membrane, receptor activity, and chemokine activity. By plotting the K-M curves, we obtained 86 survivals related DEGs. Finally, the genes that can be used as independent risk factors for prognosis of soft tissue sarcoma were obtained by multivariate analysis of the DEGs. Conclusion: We believe that these genes are expected to be new targets for sarcoma immunotherapy and key genes for the analysis of prognosis of sarcoma.


Introduction
Soft tissue sarcoma is a malignant tumor whose etiology is still unknown. Its classi cation is variable, and based on its different sources, its histological and biological characteristics also differ. The treatment of soft tissue sarcoma depends on surgery, radiotherapy and chemotherapy, and it is necessary to select high-risk patients for preoperative radiotherapy and chemotherapy in clinical studies. However, the classi cation of soft tissue sarcomas is complex, and conventional chemotherapy regimens are not effective for all types of tumors (Hall, K.S. et al., 2020). Numerous studies (Francescutti,  The ESTIMATE algorithm based on the TCGA database can be used to evaluate the purity of the tumor. (Fan, T. et al., 2020;Jia, D. et al.,2018) Based on gene expression data, it involves the numbers of immune cells and stromal cells whose expression is the highest in the tumor microenvironment. Thus, we can directly study the genes that have the highest impact with the immune and stromal scores, and the value of these genes in predicting prognosis and survival rate of soft tissue sarcoma can be evaluated. (Li, F. et al., 2020;LAN, H. et al., 2017;Peille, A.-L. et al., 2013) In this study, the ESTIMATE algorithm was employed to analyze data from the sarcoma group in the TCGA database and collect the immune, stromal, and estimate scores. These three scores were combined to identify genes associated with the prognosis of soft tissue sarcomas. We also con rmed that a part of these differentially expressed genes (DEGs) can be used as independent risk factors for the prediction of prognosis of soft tissue sarcoma. Through a subsequent con rmation based on the Gene Expression Omnibus (GEO) database, we obtained 12 genes that are meaningful both in the TCGA and GEO database.

Methods
Immune and stromal scores: The data of a total of 265 samples were downloaded from the TCGA database to analyze the expression pro le of soft tissue sarcomas. The gene expression pro le of soft tissue sarcoma was roughly divided into seven subtypes according to Widemann's (Widemann, B. C. and Italiano, A., 2018) reported. The gene expression data of 265 cases downloaded from the TCGA database also. Gene expression data used to calculate immune and stromal scores were derived from TCGA, as were clinical data such as overall survival, age at rst diagnosis, and residual tumor data. The independent dataset used to verify differential genes was derived from the GEO database, GSE30929 dataset (SEP 02, 2011).
Methods for calculating DEGs: Data analysis using the "limma" package analysis, set Fold change =2 and p < 0.05.
Graphics: Volcano maps and differential gene survival curves were produced using the tool SangerBox.
Kaplan-Meier curves of immunological and stromal scores and sarcoma typing scores were obtained using the tool GraphPad. The Venn diagram was drawn with the Web tool "Venny". Construction of Protein-Protein Interaction Network (PPI): The overall PPI network was derived from the STRING database and analyzed by the tool "Cytoscape". Molecular Complex Detection (MCODE) was used to divide the network into three modules, and only the genes with more than 5 edges were selected as the key genes.

Immune and stromal scores
The range of estimate scores based on the ESTIMATE algorithm was from -2445.3107 to 13152.619.
Stromal scores ranged from -740.7589 to 6339.22. The range of immune scores was from -1890.07 to 7914.021. The stromal score of dedifferentiated liposarcoma was the highest among the eight subtypes (Among all seven subtypes, only two desmoid tumor samples were included; thus, they were not included for comparison.). This was followed by myxo brosarcoma, UPS, MPNST, and LMS. The synovial sarcoma subtype had the lowest stromal score (P < 0.0001). In addition, the immune scores were ranked from the highest to lowest as follows: UPS>myxo brosarcoma > dedifferentiated liposarcoma > MPNST > LMS > synovial sarcoma (P < 0.0001), indicating that immune and stromal scores are important for the classi cation of different soft tissue sarcoma subtypes.
The gene expression data were divided into two groups-high expression and low expression groupsbased on the median of the immune and stromal score. Kaplan-Meier survival curves were drawn, which showed that in the high immune score group, the total survival time was generally higher than that in the low immune score group (see Fig.1 A). The median of overall survival times in the high immune score group and the low immune score group were 2464 days and 1722 days, respectively (log-rank test P 0.05). Similarly, although the result was not statistically signi cant (the log-rank test P 0.05), the survival curve obviously indicated that the total survival time of the group with high stromal scores was higher than that of the group with low stromal scores, with a median total survival time of 2448 days for the high score group compared with 1722 days for the low score group (see Fig.1 B).
The sarcoma samples downloaded from the TCGA database were divided into seven typesdedifferentiated liposarcoma, desmoid tumor, UPS, LMS, MPNST, myxo brosarcoma, and synovial sarcoma-and a box plot (Fig.1 C,D) drawn to compare different types of immunity and stromal scores, and there were statistically signi cant(P<0.0001), indicating that the immune and stromal scores were signi cantly different for the classi cation of sarcoma.

Differentially expressed genes results
To compare the difference in gene expression between the high score and low score groups, we used the "limma" algorithm, with the setting, log (fc) 2 = 2 and FDR = 0.05, to calculate the DEGs for the low stromal, immune, and estimate score groups relative to the high group. Volcano plots were drawn to indicate differences in gene expression among cases with different immune, stromal, and estimate scores. In the charts (see Fig.2 A, B, C), green represents downregulated genes, and red represents upregulated genes. In the immune score group, 91 genes were upregulated and 479 genes were downregulated. In the stromal score group, 157 genes were upregulated and 281 genes were downregulated. In the estimate score group, 100 genes were upregulated and 482 were downregulated. A Venn diagram was drawn to obtain 258 genes with common differences in all three groups, among which 66 genes were upregulated and 192 genes were downregulated (see Fig.2 D,E).

Functional enrichment analysis of DEGs
A total of 18 cellular components (CC), 26 molecular functions (MF), and 95 biological processes (BP) were obtained by GO analysis of 258 DEGs. Analysis of the top ten terms resulted in the minimum P values for immune and in ammatory responses, plasma membrane, receptor activity, and chemokine activity. Similarly, the results of KEGG pathway analysis were also mainly related to immunity, among which the highest proportion included Staphylococcus aureus infection, accounting for 29.17%, Leishman disease, accounting for 25%, and viral protein interaction with cytokine and cytokine receptor, accounting for 12.5% (see Fig.3).

Relationship between DEGs and overall survival results
To explore the different roles of 258 DEGs in survival, we set the median as the boundary to draw a Kaplan-Meier survival curve for each of the DEGs. Eighty-six genes were statistically signi cant predictors of overall survival (log-rank P < 0.05) (see Fig.4).

Construction of a protein co-expression network results
A total of 15 key node genes were selected because they were the most closely related to other genes around.  Using clinical data from TCGA database, univariate regression analysis was performed to obtain "age at initial pathologic diagnosis "and "residual cancer" as prognostic factors for sarcoma. Multivariate regression analysis of the differential genes was carried out to obtain a gene list that could predict the prognosis of sarcoma, including risk genes (HR>1) and benign genes (HR<1).

Veri cation of DEGs in the GEO database results
To verify the DEGs obtained are also signi cant in other datasets, we used the 140 independent liposarcoma samples of GSE30929 in the GEO database to perform survival analysis by calculating the forward relapse-free survival (DFS). We obtained 12 genes that are meaningful both in the TCGA and GEO database:

Discussion
Soft tissue sarcoma is a kind of tumor with complicated classi cation and high malignancy, and its immunotherapy has been a hot topic. To identify relevant targets for immunotherapy for soft tissue sarcomas, we performed a series of data analyses using samples from the TCGA database. By calculating the immune and stromal scores, we con rmed that the immunological/stromal scores differed among the different types of soft tissue sarcomas. As shown in previous studies, UPS expressed more T-cell permeation-related genes, while the synovial sarcoma expressed signi cantly less. Therefore, UPS is a kind of soft tissue sarcoma suitable for immunotherapy. ( Pollack, et al.,2017; Toulmonde, M.,et al.,2020) And the overall survival time was longer in the high-immune group than in the low-immune group. Thus, the high expression of some immune-related genes was bene cial to prolong the overall survival time.
To further understand which genes affected immune scores mostly, we performed differential analysis and obtained differential genes. Our results showed that most genes were downregulated when the low score group was compared to the high score group. According to the Venn diagrams, we chose 258 genes which were differential expression in three scores. And these differential genes were analyzed by functional enrichment. Top ten results of GO analysis suggested that these DEGs had immune-related functions. And so it was in the KEGG analysis. As we know, the sarcoma tumor cells inhibit the immune response in the tumor microenvironment to promote its growth and metastasis. Therefore By plotting survival curves for individual genes, we disproved the obtained relationship between immune/stromal scores and overall survival time. AOAH, for example, is a downregulated gene; i.e., AOAH gene expression was low in the low score group compared with that in the high score group. The higher the expression of AOAH, the higher the corresponding immune/stromal score, and the longer the overall survival time. This corresponded to a longer overall survival in the high immune/stromal rating group.
There were totally 86 survival related DEGs. Similarly, we performed enrichment analyses on 86 DEGs; the outcomes included 51 biological processes, 9 cellular components, and 11 biological processes that were identi ed as signi cant. The top terms were immune and in ammatory responses, NADPH oxidase complex (Martner, A., Aydin, E. and Hellstrand, K.,2019), integral component of plasma membrane, superoxide-generating NADPH oxidase activator activity, and superoxide-generating NADPH oxidase activity.
In further study, we constructed a protein expression network of 86 genes, which was divided into four modules, among which the main three modules included 15 key gene nodes because they are most closely related to the surrounding genes, including LILRB2, AOAH, CCL5, NCF4, CYBB, ITGAM, CD2, IL2RG, CD163, VAV1, CXCR3, HCK, MS4A6A, LY86, IDO1. And we found that these genes all had immune-related functions by functional enrichment analysis. We con rmed that these DEGs were signi cant in immunotherapy. Among them, IDO1 has been proven to be related to the growth and immunosuppression Finally, we used an independent group of 140 samples from the GEO database to verify the roles of these genes in survival and con rmed that 12 genes were meaningful in this independent group. nine of the genes are involved in immunity and have been reported to be involved in the development or metastasis of different types of tumors; while, none of them has been reported to be associated with the development and progression of sarcomas. This means that these genes are likely to play a role in the treatment of sarcoma as immunotherapeutic targets. Compared with the sarcoma-related mutant genes previously revealed by gene sequencing, this study identi ed other genes and related pathways previously unrelated to the disease from the perspective of tumor microenvironment. (Barretina, J. et al. ,2010) We believe that these genes are expected to be new targets for sarcoma immunotherapy and key genes for the analysis of prognosis of sarcoma. This will address a key step in tumor immunotherapy, and effective tumor cell surface antigens can be developed so that immune agents can attack tumor cells directly. In conclusion, we believe that these genes are expected to be new targets for sarcoma immunotherapy and key genes for the analysis of prognosis of sarcoma. To provide gene targets for immunotherapy of soft tissue sarcoma.
Declarations doi: 1016/j.gene.2020.145105. Figure 1 Relationship between immune and stromal scores and the overall survival time. (A) People with a high immune score had a longer overall survival time, which was statistically signi cant. Similarly, as shown in the gure B, a high stromal score indicated a longer overall survival time, although this was not statistically signi cant.

Figure 2
In volcano map(A B C), green represents downregulated genes, red represents upregulated genes, and black represents genes with the same expression level in the high and low score groups. Venn diagrams (D E) show genes that are generally up-regulated or down-regulated in all three scores.

Figure 3
Page 16/16 Top 10 term of DEGs' GO analysis results: biological processes (A), cellular components (B), molecular functions (C) and KEGG pathways (D) are presented.

Figure 4
Kaplan-Meier survival curve of differential genes and overall survival, generated by the batch survival analysis tool of Sanger Box. A total of 86 differentially expressed genes with survival signi cance were obtained. Green represents low expression of the gene, and red represents high expression.

Figure 5
The PPI network of 86 different genes(A) was created to discover the relation; B C and D represent cluster 1, 2 and 3, respectively. Protein expression network of a total of 86 DEGs obtained using the web String tool divided into 3 clusters with 64 nodes with 326 edges by cystoscope analysis. The darker the color is, the greater the log (FC) value of the score of gene expression is, and the larger the circle is, the more associated it is with other proteins.