Identication and Validation of a Predictive Immune-Related Genes Signature for the Prognosis of Cholangiocarcinoma

Background: Immune-related genes (IRGs) play a crucial role in the initiation and progression of cholangiocarcinoma (CCA). However, immune signatures have rarely been used to predict prognosis of CCA. The aim of this study was to construct a novel model for CCA to predict survival based on IRGs expression data. Methods: The gene expression proles and clinical data of CCA patients from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) database were integrated to establish and validate prognostic IRG signatures. Differentially expressed immune-related genes were screened. Univariate and multivariate Cox analysis were performed to identify prognostic IRGs, and the risk model that predicts outcomes was constructed. Furthermore, receiver operating characteristic (ROC) and Kaplan-Meier curve were plotted to examine predictive accuracy of the model, and a nomogram was constructed based on IRGs signature, combining with other clinical characteristics. Finally, CIBERSORT was used to analyze the association of immune cells inltration with risk score. Results: We identied that 223 IRGs were signi ﬁ cantly dysregulated in patients with CCA, among which ve IRGs (AVPR1B, CST4, TDGF1, RAET1E and IL9R) were identied as robust indicators for overall survival (OS), and a prognostic model was built based on the IRGs signature. Meanwhile, patients with high risk had worse OS in training and validation cohort, and the area under the ROC was 0.898 and 0.846, respectively. Nomogram demonstrated that immune risk score contributed much more points than other clinicopathological variables, with a C-index of 0.819 (95% CI, 0.727-0.911). Finally, we found that IRGs signature was positively correlated with the proportion of CD8+ T cells, neurophils and T gamma delta, while negatively with that of CD4+ memory resting T cells. Conclusions: We established and validated an effective ve IRGs-based prediction model for CCA, which could accurately classify patients into groups with low and high risk of poor prognosis. Cox regression analyses, and utilized them to establish a novel prognostic IRGs model. The model showed a prominent performance by using TCGA-CHOL cohort. After adjusting to other clinical factors, the risk scores generated by the model were demonstrated to be an independent prognostic factor, and had better discriminating power for risk stratication. The reliability of the model was further validated based on GSE107943 cohort. Our results indicated that the model could act as an effective marker for survival prediction of CCA patients. Based on immune risk scores combining with other clinical factors (age, gender, AJCC stage, N stage, M stage and margin status), a nomogram of survival prediction was generated and exhibited satisfactory agreements between the predicted and actual values, which increased the predictive capacity. Thus, our study demonstrated the potential clinical application for survival prediction of CCA patients.


Background
Cholangiocarcinoma (CCA) remains a highly aggressive malignancy with extremely unsatisfactory prognosis, which ranks as the second most frequently primary liver cancer and its incidence is increasing worldwide [1,2]. Although surgical resection provides the only opportunity to cure for CCA patients without metastasis, the 5-year survival rate underwent radical resection remains less than 30% owing to metastasis and recurrence [3,4]. The bene ts from adjuvant therapy and targeted therapy appear to be very limited due to the genetic heterogeneity [5]. Hence, prognostic markers are of great signi cance to clinical decision making and may provide novel insights into underlying mechanisms of CCA. Unfortunately, there are no dependable and effective biomarkers to accurately estimate the prognosis of CCA patients.
An increasing amount of evidence has demonstrated that immune-related genes (IRGs) are closely related to the tumorigenesis and progression of CCA [6][7][8], and several IRGs have been reported to serve as promising biomarkers for prognostic evaluation to patients [9][10][11]. For example, the PD-L1 expression was signi cantly correlated with TNM stage and overall survival for CCA patients after resection [6,12].
However, most of prognostic indicators or models only focus on a single IRG, which is not sufficiently rigorous to clinical practice.
Checkpoint inhibitor-based immunotherapy have achieved notable success in the clinical treatment of tumor [13][14][15][16][17]. Although many studies have demonstrated that immunotherapy shows a promising safety and e cacy for patients with CCA, the response rate is disappointing [18][19][20][21]. Tumor immune microenvironment (TIME) composed of immune in ltrating cells plays an important role in tumor progression and immunotherapeutic response [22][23][24]. A recent study by Montal et al. on the molecular classi cation of CCA identi ed that immune class had a higher lymphocyte in ltration and overexpression of PD-1/PD-L1, and the patients of the class might become the ideal candidates to receive immunotherapy [25]. These evidences provide a new insight into the role of IRGs and immune cells on identifying high-e ciency biomarkers that predict patients' prognosis and the effectiveness of immunotherapy.
Given the role of immune mechanisms in the pathogenesis of CCA, we conducted a study based on immune genes to establish a predictive model for prognosis of CCA patients. Our study aimed to develop an IRGs signature model to help better predicting survival of CCA patients and reveal potential mechanisms.

Data collection and TF genes and IRGs pro le mining
We made use of data in the public domain. RNA-seq data from CCA and relevant noncancerous tissues were downloaded from The Cancer Genome Atlas-Cholangiocarcinoma dataset (TCGA-CHOL) (https:// cancergenome.nih.gov/) and Gene Expression Omnibus dataset (GSE107943) (https://www.ncbi.nlm.nih.gov/gds/) [26], the data were normalized by the limma package. The corresponding clinicopathological and survival information were downloaded from open-access resource, which included age, histologic grade, tumor node metastases (TNM) stage, margin status, OS time etc.
IRGs were obtained from the ImmPort database (https:// www.immport.org/home). The genes of cytokines, cytokine receptors, and those that were related to the signaling pathways of the T-cell receptor, B-cell antigen receptor, natural killer cell cytotoxicity, and antigen processing and presentation were selected. Transcription factor (TF) genes were obtained from the Cistrome Cancer database (http://cistrome.org/CistromeCancer).

Identi cation of differentially expressed TF and IRGs
Differentially expressed genes (DEGs) were detected using Wilcoxon test in limma package. Probe sets without corresponding gene symbols or genes with more than one probe set were removed or averaged. |log 2 fold change| >1 and adjusted P<0.05 were considered as statistically signi cant. The expression patterns of TF and IRGs were draw from all DEGs. Volcano plots and heatmaps were plotted respectively by gplots and pheatmap package in R software.

GO and KEGG pathway enrichment analysis
To understand the biological mechanisms underlying IRGs, Gene ontology (GO) annotation and Kyoto encyclopedia of genes and genomes (KEGG) pathway analyses were performed using DAVID (https://david-d.ncifcrf.gov/) and ClusterPro ler package in R. GO analysis consists of three categories: biological processes (BP), cellular components (CC) and molecular function (MF). The results were visualized using GO plot package and false discovery rate (FDR) <0.05 was considered as the threshold.
Development and validation of prognostic IRGs signature for CCA Univariate and multivariate Cox analysis were utilized to assess the relationship between IRGs expression and overall survival (OS). To identify robust independent risk prognostic immune-related genes (PIR-DEGs), the nal IRGs to build prognostic prediction model were based on the threshold of P <0.05. We also used GSE107943 as validation cohort to test reliability and stability of the model. To con rm reliability of the detected immune-related genes, a risk score was designed to identify a signature for prognostic analysis. We applied a linear model for prognosis risk score prediction: Risk score = βgene(1) × exprgene(1) + βgene(2) ×exprgene(2)+···+βgene(n) × exprgene(n) The risk score was calculated based on combination of the expression level of immune-related genes weighted by the regression coe cient (β). Exprgene refers to the expression of IRGs in the sample, and β is the regression coe cient generated from multivariate Cox analysis in training set. The median risk score was used to divide samples into low-risk and high-risk group. The survival outcomes of the two groups were compared by KM curve. The predictive accuracy of the prognostic model was assessed by ROC curve (survival ROC package). And area under the curve (AUC) showed the variation between the predicted and observed outcomes.

Construction of TF and IRGs network
Correlations test between TF and PIR-DEGs was investigated by Pearson correlation coe cient (r), and the cut-off criteria were set as correlation coe cient>0.4 and P< 0.05. The regulatory network was further visualized using Cytoscape version 3.7.2.
A nomogram was visualized using "rms" package in the R platform based on risk scores and clinicopathological parameters. The nomogram was subjected to 1000 bootstrap resamples for internal validation. Discrimination and calibration were used to assess its predictive accuracy. Calibration of the nomogram for 3-, and 5-year OS was performed by comparing the actual survival with the predicted survival. Further, the discriminatory ability of the nomogram was evaluated by calculating the concordance index (C-index). All statistical tests were two-sided and P<0.05 were considered statistically signi cant. Data compilations and descriptive statistics were performed using the SPSS24.0 (Chicago, IL, USA), along with rms package.

Estimation of tumor-in ltrating immune cells
To explore the relationships between risk scores derived from the prognostic model and immune cells in ltration, we employed CIBERSORT, a useful resource to comprehensively pro le tumor in ltrating immune cells. CIBERSORT can precisely estimate the proportion of 22 immune cells in tissues by utilizing a deconvolution algorithm [27]. We uploaded the normalized gene expression data with standard annotation les to the CIBERSORT web portal (http://cibersort.stanford.edu/), and the algorithm was determined by 1000 permutations and LM22 gene signature as described in previous literatures.
Correlation matrix for all 22 immune cell proportions were displayed by the corheatmap package.

Statistical analysis
Primary end point was OS, which was de ned as the interval between the date of diagnosis and that of the most recent follow-up or of death from any cause. Categorical data were analyzed using Chi-square test or Fisher's exact test, and continuous variables were analyzed using ANOVA or Kruskal-Wallis H test for variables with an abnormal distribution. Survival curves were plotted using Kaplan-Meier. Multivariate Cox proportional hazards regression model was used to identify independent predictors of OS. Unless otherwise noted, P<0.05 was considered to be signi cant.

Differential expressions of TF genes and IRGs in CCA
We used RNA-seq data and clinical follow-up information collected from 45 samples, including 36 CCA and 9 non-tumor tissues in TCGA. Wilcoxon rank test in R was performed to preprocess and analyze the gene differential expression of microarray data. Totally, 2657 DEGs were identified, which consist of 1203 signi cantly down-regulated DEGs and 1454 signi cantly up-regulated DEGs, for the subsequent bioinformatics analysis (Fig. 1a, d). To explore potential function and pathway of these DEGs identi ed above, GO and KEGG pathway enrichment analysis were performed ( Supplementary Fig. 1).
The mRNA levels of 2498 IRGs (ImmPort database) and 318 TF genes (Cistrome Cancer database) were examined. This analysis eventually revealed 223 immune-related genes (130 downregulated and 93 upregulated) and 32 TFs (13 downregulated and 19 upregulated) with the thresholds of |log 2 fold change| >1 and adjusted P<0.05, which were identi ed as differentially expressed in CCA tissues compared with normal tissues. Then, heatmaps and volcano plots were visualized to show the expression pattern of differentially expressed IRGs and TFs between CCA and non-tumor tissues (Fig. 1b, c, e, f).
Screening of prognostic immune-related genes and construction of TF regulatory network Considering the association IR-DEGs with OS, we further conducted a Cox regression hazards analysis to screen the IR-DEGs. To prevent over tting of the model, TCGA-CHOL dataset was utilized as training group. We found 11 prognostic IRGs (RAET1E, CST4, CCL24, CCK, CGB, GUCA2A, TDGF1, TDGF3, THPO, AVPR1B and IL9R) that were signi cantly related to the survival of CCA patients in the training group ( Fig.   2a) To further explore potential mechanisms behind deregulation of IR-DEGs expression in CCA, we analyzed the correlation between TF genes and IRGs expression. We investigated the correlations between the TF-DEGs and prognostic IR-DEGs by the method of Pearson correlation coe cient. Based on the threshold criteria, 24 TF genes were prominently associated with the PIR-DEGs (P<0.05). We utilized Cytoscape software to build and visualize the TF-based regulatory network, as showed in Fig. 2b. This TF regulatory network revealed relationships among these immune-related genes. Furthermore, GO annotation analyses of IR-DEGs led to the identi cation of top signi cantly GO terms, and it indicated that the differentially expressed immune-related genes were mainly involved in the regulation of immune effector process and cell killing (Fig. 2c).

Construction of the prognostic model in the training group
To investigate signi cance of risk genes in estimating prognosis, the 11 survival-related IRGs were further submitted to a multivariate Cox proportional hazards regression analysis (with forward selection and backward selection). Finally, ve candidate PIR-DEGs (RAET1E, CST4, TDGF1, AVPR1B and IL9R) which may serve as signi cant predictors of the prognosis were obtained for inclusion in the risk model. Among these genes, RAET1E, CST4 and TDGF1 were identi ed as high-risk genes (serving as risk factors), while AVPR1B and IL9R were identi ed as low-risk genes (serving as protective factors) ( Table 1). In the TCGA training cohort, the expression distribution of ve IRGs and Kaplan-Meier curves for OS were shown in Supplementary Fig. 2 According to median risk score, patients were divided into high-risk (n = 18) and low-risk group (n = 18) in the training cohort. We ranked the risk scores of patients and analyzed their distribution (Fig. 3a). The survival status of each patient was marked on dot plot, which showed that mortality rate in the high-risk group was signi cantly higher than that of the low-risk group (Fig. 3b). Heatmap revealed that the expression level of ve PIR-DEGs varied with risk scores in two groups (Fig. 3c). Among patients with higher risk scores in the training cohort, three risk genes (RAET1E, CST4 and TDGF1) were upregulated, while two protective genes (AVPR1B and IL9R) were downregulated. In patients with lower risk scores, these genes displayed an opposite expression pattern. Kaplan-Meier plot and ROC curve were used to evaluate the performance of ve PIR-DEGs signature in predicting the outcome of patients. The prognosis was significantly worse in high-risk group than low-risk group (P<0.001, Fig. 3d). The OS rates at three years and ve years for high-risk group in the training cohort were 19.9% and 9.9%, respectively, while the corresponding rates for low-risk group were 83.9% and 69.9%, respectively. Then we used ROC curves to examine predictive accuracy of the prognostic model at three years. The AUC value of ROC was 0.898 (Fig. 3e).
Validation of the prognostic model by GEO cohort The GSE107943 cohort including 30 CCA samples was used for validation of the prognostic model. The risk score distribution, survival status and risk gene expression in the GEO cohort are displayed in Fig. 4ac. In agreement with results of TCGA cohort, the Kaplan-Meier survival curves demonstrated that the OS was signi cantly poorer in high-risk group than low-risk group (P<0.05; Fig. 4d), and the survival rates at three and ve years in high-risk group were 30.9% and 0.0%, respectively, while the corresponding rates in low-risk group were 75.0% and 40.0%, respectively. The AUC for three years was 0.846 (Fig. 4e). Taking together, these results indicated a moderate sensitivity and speci city of the prognostic model. We also analyzed the expression distribution of ve IRGs and Kaplan-Meier curves for OS, the results were shown in Supplementary Fig. 3 (Fig. 5a). Meanwhile, the multivariate analysis revealed that risk score was an independently risk factor in the TCGA cohort (HR = 3.009, 95%CI = 1.540-5.880, P = 0.001) (Fig. 5b). These results demonstrated that the prognostic model showed a good performance to predict prognosis.
Nomogram and ROC curves for the prediction of prognosis in the training cohort.
To better estimate the prognosis of CCA patients, we established a nomogram that integrated prognostic risk score and clinical variables (age, gender, N stage, M stage, AJCC stage, margin status and risk score) (Fig. 5c). The nomogram could accurately predict OS, and demonstrated that risk score contributed much more risk points than other variables. C-index for OS was 0.819 (95% CI, 0.727-0.911), and calibration plot for probability of survival at 3-or 5-year showed satisfactory agreements between the predicted and actual values (Fig. 5d, e).

The correlation between the risk score and tumor-in ltrating immune cells (TIICs) in the TCGA cohort
To determine whether our model could re ect the status of the tumor immune microenvironment in CCA, we investigated the correlation between risk score and immune cell in ltration in the TCGA cohort. The tumor-in ltrating immune cells were estimated by CIBERSORT algorithm. We plotted bar plots to demonstrate the proportion of immune cells of CCA and non-tumor tissues. As shown in Fig. 6a-b, the 22 TIIC proportions between two groups were significantly different. Heatmap was used to show the correlation between immune cells subpopulations, which indicated the fractions of different immune cells were weakly to moderately correlated in tumor tissues in the TCGA cohort. The three immune cells showed strong and moderate positive correlation with T gamma delta cells, CD8+ T cells and resting memory CD4+ T cells (Fig. 6c).
To analyze whether infiltration of immune cells might contribute to risk score, we analyzed correlation between risk score and tumor-immune cell proportions. As the risk score increased, the content of CD8+ T cells (P<0.05), neurophils (P<0.05) and T gamma delta cells (P<0.001) in CCA tissues also increased, while the content ofresting memory CD4+ T cells decreased (P<0.05) (Fig. 6d).

Discussion
Activation of immune system has been widely proven to be a decisive factor during tumorigenesis and metastasis [28][29][30]. With the development of tumor immunotherapy in recent years, the function of the immune system in the cancer growth and metastasis has been addressed, in which immune cells inhibit or promote malignant cells by up-or down-regulating expression of IRGs [31,32]. Hence, IRGs expression may be an important predictor of prognosis and progression in CCA. Up to date, considerable efforts have been made to identify prognostic model based on differentially expressed genes, but the report on prognostic IRGs signature of CCA is lacking [33,34]. In our study, we identi ed a new prognostic IRGs signature by screening the differentially expressed genes, and established a dependable model to predict survival of CCA patients based on this immune signature.
In the current study, we analyzed expression of 2498 IRGs, and screened out 223 IR-DEGs from TCGA-CHOL, including 93 upregulated and 130 downregulated genes. Eleven IR-DEGs were detected to show signi cant correlation with prognosis. The results revealed that IRGs were vital contributors to prognosis of CCA patients. Our functional analysis suggested that the IR-DEGs were widely involved in the tumor immunological process. To excavate the underlying molecular mechanisms behind the aberrant expression of PIR-DEGs, a TF regulatory network was built to display the relationship between TFs and PIR-DEGs.
Then we obtained ve robust PIR-DEGs of interest (AVPR1B, CST4, TDGF1, RAET1E and IL9R) through multivariate Cox regression analyses, and utilized them to establish a novel prognostic IRGs model. The model showed a prominent performance by using TCGA-CHOL cohort. After adjusting to other clinical factors, the risk scores generated by the model were demonstrated to be an independent prognostic factor, and had better discriminating power for risk strati cation. The reliability of the model was further validated based on GSE107943 cohort. Our results indicated that the model could act as an effective marker for survival prediction of CCA patients. Based on immune risk scores combining with other clinical factors (age, gender, AJCC stage, N stage, M stage and margin status), a nomogram of survival prediction was generated and exhibited satisfactory agreements between the predicted and actual values, which increased the predictive capacity. Thus, our study demonstrated the potential clinical application for survival prediction of CCA patients.
Recently, some studies have demonstrated that immune in ltration was a crucial determinant of responsiveness to immunotherapy and prognosis prediction of CCA [35][36][37]. Our immune-related prognostic model displayed positive correlation with CD8+ T cells, neurophils and T gamma delta cells, while negative with CD4+ memory resting T cells. A study about ICC revealed that increased CD8+ tumor-in ltrating lymphocytes (TILs) in tumors was associated with better tumor differentiation [38]. The in ltration of tumor-associated neutrophils (TANs) is recognized to facilitate tumor progression in CCA [39,40], which is consistent with our ndings. Several studies con rmed that high levels of CD4+ TILs were associated with better prognosis [41,42]. In summary, the immune milieu in tumor microenvironment may account to the difference of survival outcome between low-and high-risk group.
There have been multiple studies for the value of IRG models to predict prognosis in various cancers, such as renal papillary cell carcinoma [43], clear cell renal cell carcinoma [44], hepatocellular carcinoma [45] and cervical cancer [46]. Compared with the previous studies, our study had some advantages. First, we explored the relationship between IRGs expression signature and prognosis in CCA, and identi ed some prognostic IRGs. Second, our prognostic model based on IRGs and the nomogram showed outstanding performance in survival prediction. Inevitably, some limitations should be addressed.
The included cohort in our study has limited sample size, which needs a larger number of CCA samples to verify the model accuracy. Next, the potential molecular mechanisms of IRGs impact on CCA are not fully elucidated, which require further investigation.

Conclusions
In summary, we established a novel ve IRGs-based prognostic model that accurately predicted the prognosis in CCA, and further con rmed the prognostic performance of this model using GEO database.
Our nding elucidated the association between immune risk score and immune cells in ltration, proving its key role in tumor immune microenvironment. These ndings provide new insight into the role of IRGs and tumor immune cells in ltration on CCA, and could be useful in the future as a theoretical foundation for guiding immunotherapy. Differential expression of immune-related genes and TF genes in CCA. Upper panel: Heatmaps of the differential expressed genes (DEGs) between CCA and non-tumor samples, including a All differential expressed genes, b IRGs, and c TF genes. DEGs are represented in rows, and samples are represented in columns. The expression value for each row was normalized by the z-score. The color from green to red represents the progression from low expression to high expression. Blue bar represents non-tumor, while red one represents tumor samples.  TFs. The blue triangles in the network represent TFs and the circles represented IRGs that red is upregulated and green is down-regulated. The color of lines represents regulatory relationship, which red is positive regulation and green is negative regulation. c GO annotation analyses of the IR-DEGs. Gene ontology terms which signi cantly enriched were showed.    T gamma delta cells.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.