Construction of Immune-gene-pairs and Prognostic Model and its Immune Cell Inltrationin in Glioblastoma — A Study based on TCGA and GTEx

Background: Gliomas are the most common primary intracranial tumor, containing 81% of brain malignant carcinoma. Glioblastoma is a primary brain tumor with a very poor prognosis. Existing treatment methods have many limitations, so more in-depth researches are urgently needed. Results: To screen key genes and build models, we made full use of RNA sequencing (RNA-seq) data from TCGA and GTEx and perform immune inltration analysis in multiple immune-related databases. Finally, we screened key genes and build the model. Meanwhile, we explored immune cell inltration. KEGG and GO analysis prompted us COVID-2019 may have potential relation with GBM. Further selection of immune-related differential genes and gene-pairs were then explored. The prognosis model built by lasso regression was tested to have great effect. Key genes were all found to have quite diverse expression level compared with normal samples. Moreover, their co-expression analysis reminded us more attention could be paid to WNT and NOTCH signaling. Immune inltration suggests that most immune cells are differentially expressed in tumor samples, and the interaction reminded us B cells may play a critical role. Conclusion: This study is the rst to build immune-gene-pairs and used lasso regression to construct the model and screen critical pairs. Key genes were then selected and observed their role in GBM.


Introduction
Gliomas are the most common primary intracranial tumor, containing 81% of brain malignant carcinoma [1]. Among all types of this disease, glioblastoma (GBM) has the highest incidence rate and causes a great health burden [1]. It is believed to originate from neuroglial stem or progenitor cells and the most common type is isocitrate dehydrogenase (IDH)-wildtype tumor [2]. Although compared with certain cancers such as liver cancer or breast cancer, GBMs is relatively rare, its treatment prospects are not optimistic due to high strong invasive ability and high recurrence [3]. It remains as a di cultly cured cancer regardless of therapeutic options' improving, so more in-depth researches are urgently needed [4].
The results of lasso regression Next, we downloaded immune-related gene list from ImmPort and crossed it with differentially expressed genes to obtain immune-related differentially expressed genes which could be used to set up immunegene-pair. Lasso regression was performed based on survival data. Results was shown in Fig 2. Meanwhile, risk score of each gene-pair and the most signi cant pairs were screened out. As was shown in Table 1, nine immune-related gene-pairs were believed to play a critical role in this model and worthy of further investigation. Besides, forest plots of these gene-pairs were displayed in Fig 3. All of the gene pairs have a P value of smaller than 0.001 and PTX3|BMP2 has the highest hazard ratio of 11.886. Evaluation of the model Evaluation of our model was also conducted. Firstly, we determined the critical value of risk score by ROC curve. As was shown in Fig 4A, the critical value of risk score was 3.672 and area under curve was 0.886.
Next, we divided the patients into two groups according to risk score by setting cut-off value as 3.672 and carried out survival analysis. Kaplan-meier plot indicated that the survival between two groups were statistically signi cant (P<0.001, Fig 4B). Also, we explored the AUC curve considering one-year, two-year and three-year survival data and found that the AUC were all greater than 0.85 ( Fig 4C). Furthermore, in Fig 4D, there was a great divergence in the survival status on both sides of the cutoff value. Combining these results, our model was believed to be reliable.

Clinically related analysis
Clinicaly related analysis were processed by integrating risk score data with clinical data. In Fig 5A, ROC curve indicated that risk score and age were important factors with AUC values greater than 0.800. Relationship between these factors and risk score was shown in Fig 5B-D which suggest age and grade were closely related to risk score. Heatmap also indicated that age and grade were signi cant clinical factors ( Fig 5E). This conclusion could also be supported by multivariate independent prognostic analysis and single factor analysis. Age, grade and risk score were critical affecting factors according to Figure 5F-G.

Further exploration of key genes
As was depicted before, noteworthy immune gene-pairs were got and these genes were regarded as key genes, including AZGP1, HLA-DRB5, TNFRSF12A, KLRC2, BIRC5, PSMC2, NR2F1, TMSB15A, TMSB4XP8, PPP3CB, WFDC2, JAG2, RBP1, RAC3, FURIN, CX3CL1, TCF7L2, IKBKB, PTX3 and BMP2. Results of violin plots were shown in Fig 6. Interestingly, expression level of these key genes between diverse risk groups were signi cantly different. Next, we conducted co-expression analysis. By analyzing samples with mutation and CNA data (273 patients/samples), the most common genetic alteration type was missense mutation and other types were also shown in Fig 7A. As for pathway analysis, WNT and NOTCH signal caught our attention ( Fig 7B and Fig 7C), reminding us they may play unignorable role in the development process of GBM.
In ltration analysis of various types of immune cells

Discussion
This research aimed at constructing and assessing immune prognostic model and screened out key immune-gene-pairs and genes. Downloading and integrating RNA sequencing data from TCGA and GTEx, differentially expressed genes were screened out and performed KEGG and GO pathway analysis. Notably, In KEGG analysis, we found coronavirus disease (COVID-2019) was one of the enrichment pathways. No relevant studies before clari ed relation between GBM and COVID-2019. However, in the review of Hyun Jee Han et al, patients tended to present worse situation when exposed to the virus [22]. It prompted us they may need further exploration since COVID-2019 was so serious that scientists all over the world were investigating its related information.
Then, we download immune-gene list from ImmPort and selected immune-related differentially expressed genes, which were used to construct immune-gene-pairs so as to adjust batch difference. After that, lasso regression was utilized to build model and nine most related gene-pairs were selected, including AZGP1|BMP2, HLA-DRB5|TNFRSF12A, KLRC2|BIRC5, PSMC2|NR2F1, TMSB15A|BMP2, TCF7L2|IKBKB, TMSB4XP8|PPP3CB, WFDC2|JAG2, and PTX3|BMP2. These genes were referred to as key genes. Some genes were proved to be relative to GBM or other cancers. For example, L Persano et al suggest reducing BMP expression level may be a great way to increase GBM responsiveness to chemotherapy [23].
Expression of AZGP1 in prostate cancer was especially higher compared with adjacent normal tissues [24]. More studies are needed to clarify in-depth mechanism. Furthermore, whether these genes related to GBM also correlate with COVID-2019 remain to be explored. Assessment of the model was as important as building. Subsequently, through ROC curve we got the cutoff value -3.672 and found area under one-year, two-year and three-year curve were all more than 0.75, reminding us this model was reliable. Similarly, there are great difference between high and low groups. Some clinical factors such as age, grade and sick score were all proved to be critical. Jigisha P Thakkar et al concluded younger age was related to better prognosis [25]. Moreover, key genes' expression level were higher in tumor samples; their co-expression level suggest WNT and NOTCH may be critical pathways and missense mutation was the most common genetic alteration type. Also, Nishani Rajakulendran et al reported that in orthotopic Wnt model, inhibition of Wnt/β-catenin and Notch signaling could lead to lower clonogenic potential [26]. More and more evidences unfolded NOTCH signaling was highly active in GBM stem cells, thus contributing to high resistance [27]. The in uence of these genes and pathways in the prognosis of GBM patients and the underlying mechanism are worthy of exploration. In conclusion, we attempted to investigate GBM immune-related gene-pairs, key genes and screen potential biomarkers. Immune-model was built and evaluated, it had great predictive effect. Similarly, our research also has areas for improvement. Firstly, we lacked experimental projects and related data, all analysis was based on data downloaded from open databases. Combining the clinical project with this analysis will lead to more credible conclusions. Secondary, we did not pay more attention to exploration of signaling. Last but not least, the conclusions were based on retrospective data and more research is required to certi cate these ndings.

Conclusions
This study is the rst to build immune-gene-pairs and used lasso regression to contruct the model and screen critical pairs. Key genes were then selected and observed their role in GBM.

Gene information and bioinformatics analysis
We downloaded gene expression data from TCGA (work ow type: HTSeqCounts) and GTEx (data format: fpkm Identi cation of differentially immune-related expressed genes and immune-gene-pair. After getting gene transcription data, we used "edgeR" R package (R version:3.6.1) to lter differentially expressed genes between GBM samples and normal samples. ImmPort (http://www.immport.org) is an open database which includes 17 categories, immune-related gene list was downloaded from this web [10]. Our study extracted differentially immune-related genes and their expression level. Afterwards, immune-gene-pair was constructed and it was combined with survival data. By the way, gene-pair meant comparing adjacent genes' expression levels and obtain results.

KEGG and GO pathways analysis
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database which is utilized to explore genes' biological functions [11]. Gene Ontogy (GO) function analysis includes biological processes (BP), cellular components (CC) and molecular functions (MF), which could let us have a better understanding of biological processes [12]. By using "colorspace", "stringi", "ggplot2", "clusterPro ler" and "enrichplot" R package, we could perform GO functional annotations and KEGG pathway enrichment analysis of differential expressed genes we got before. Results could reveal pathways worthy of attention.

Lasso regression
Lasso regression is a method that is more and more widely used in the eld of bio-informatics. By constructing a penalty function, it compresses certain regression coe cients or making them zero, thereby establishing a more accurate model, especially for independent variables High-dimensional situation. In the model established in this study, survival time and status were taken as dependent variables, and immune gene pairs were taken as independent variables. We utilized the "glmnet" R language package (version 2.0-16) to t the logistic LASSO regression model [13,14]. As a result, risk score of each pair was calculated and summarized. Meanwhile, we obtained these immune-gene-pairs which have the most signi cant impact on prognosis. In addition, in order to verify the predictive effect of the model, our team drew ROC curve and got the cutoff value. Based on it, high and low risk groups were dived and survival curve was built. Combing clinical information with risk le, our study performed univariate and multivariate independent prognostic analysis. At the same time, we drew ROC curve, forest map and boxplot curve for visual observation. R language and several packages were applied during it, including "survival", "survivalROC", "limma" and "ggpubr".

Key genes
Through lasso regression, most related immune-gene-pairs were selected and they were believed to play important roles in GBM patients. Similarly, gene-pairs composed of these genes were believed to play an important role in GBM patients. These genes were so-called key genes in this study. In addition to explore the role of gene-pairs, individual or co-expression of key genes in GBM patients were worthy of studying. By the way, violin program was used to visually represent the individual role and "***" meant p-value < 0.01, since we compare certain genes' expression level between patients with different prognosis.

cBioPortal
The cBioPortal for Cancer Genomics (http://cbioportal.org) is an open and user-friendly dataset, which could explore and analyze multidimensional cancer genomics data [15]. By selecting GBM database and setting gene names, it is easy to interactively probe genetic alterations across samples, genes, and pathways. Due to its intuitiveness and simplicity of operation, the database has inspired many new discoveries in cancer bioinformatics. Through co-expression analysis, it suggests interesting pathways that maybe worth further exploration.
Immune databases xCell (http://xCell.ucsf.edu/) shows a novel method to present immune cells in ltration, which could explore 64 immune and stromal cell types [16]. Timer ((https://cistrome.shinyapps.io/timer/) comprehensively collects and analyze molecular characterization, investigating expression levels of six immune-interaction subtypes in 10,897 tumors from 32 cancer types [17]. QuanTIseq (http://icbi.at/quantiseq) is a fantastic way to quantify immune-in ltration, measuring the fractions of ten immune cell subtypes from a great number of RNA-sequencing data [18]. MCP-counter (Microenvironment Cell Populations-counter) focuses on exploring the in ltration of eight types of immune cells, thereby exploring the tumor microenvironment [19]. As a prospective cohort study, The European Prospective Investigation into Cancer and Nutrition (EPIC) aimed at investigating relationship between tumor and nutrition [20]. CIBERSORT (http://cibersort.stanford.edu/) could analyze RNA mixtures for exploring cellular biomarkers and therapeutic targets by sequencing samples from fresh, frozen and xed tissues [21].
Although results and ways of these immune databases are not all the same, systematic integration of the immune in ltration in multiple immune cells can provide us with more convincing evidence. R packages "limma", "scales", "ggplot2" and "ggtext" were utilized to extract speci c types of immune cell expression level and evaluate its correlation with risk score.