Machine learning prediction models based on hub genes related to immune phenotypes in muscle-invasive bladder cancer treated with immune checkpoint blockade

Bladder cancer is one of the most frequent cancer in the world. Muscle-invasive bladder cancer (MIBC) is a more aggressive subtype with higher morbidity and mortality. ICB therapy has shown potential for treating MIBC, but is limited due to the lack of predictive biomarkers. 1601 MIBC transcriptomic proles were obtained from 10 datasets. Unsupervised clustering of immune phenotypes in MIBC were performed based on immune-related signature genes. We analyzed the characteristics including immune microenvironments, metabolic pathways, and survival rates in different phenotypes. Multi-omic analysis and WGCNA were performed to identify hub genes distinguishing phenotypes related to prognosis. A model was established and CART was employed to predict the response of patients treated with ICB.

subtypes, which show different responses to chemotherapy [4]. A recent study de nes a molecular classi cation of 6 subtypes, stratifying patients for prognosis or response to treatment [5]. However, these studies focus mainly on malignant cells. In addition, stromal cells and immune cells make up a tumor microenvironment, which is essential to predict the e cacy of ICB therapy. The cancer immunogram involving in ltrated cells and immune related signaling has been shown with the potential for personalized immunotherapy in bladder cancer [6]. The different immune phenotypes of MIBC were likely to be closely related to various responses to immunotherapy.
The major objective of our study was to construct prediction models to distinguish immune phenotypes in MIBC and to improve prognosis prediction of ICB therapy. At rst, we performed unsupervised clustering of immune phenotypes of MIBC patients/samples based on immune-related signature genes. We then explored the microenvironment including in ltrated immune cells and stromal cells to evaluate the correlated immune signaling and survival rate in each phenotype. We also analyzed the difference of metabolic pathways among various immune phenotypes in MIBC. The concept of immunometablism has become widespread because metabolism regulation has been found to be essential for immune cells activation [7,8]. Next, Multi-omics data analysis was applied to highlight molecular principles of differentially expressed genes (DEGs) in corresponding phenotypes. Besides, somatic copy-number alterations were shown to be associated with tumor immune phenotype [9]. We also utilized weighted gene coexpression network analysis (WGCNA) to identify hub genes to distinguish different immune phenotypes. At last, in order to establish models for predicting prognostic status, the hub genes underwent lasso regression analysis to eliminate collinearity in a cohort of patients receiving ICB therapy. In addition, classi cation and regression tree (CART) algorithm was employed to predict the response of patients treated with ICB.

Sample Collection
We conducted a systematic search of the electronic databases, including PubMed, GEO pro le, TCGA and EGA datasets and used 1601 MIBC transcriptomic pro les from 10 datasets. Details of datasets, including their respective normalizations, are given in Supplementary Table 1 13,832 common genes in this study were extracted from the normalized data for six different platforms, and the combat function in the sva package was applied to remove the batch effects of these datasets [19].
Unsupervised immune clusters in MIBC 785 immune genes for 28 immune cell types were obtained from Charoentong et al. [20] Angiogenesis marker genes and antigen MHC I and II presenting machinery expression signature involved in antigen presenting machinery (APM) were obtained from Şenbabaoğlu and Rody et al [21,22]. A signature of Anti-in ammatory and Pro in ammatory genes were obtained from Azizi et al [23]. IFN signature genes were obtained from Rody et al [22]. All immune signature genes are listed in Supplementary Table 2. The unsupervised clustering for 1601 tumor samples were performed with hierarchical clustering, Ward linkage, and Euclidean distance based on immune signature genes.
In ltration levels for immune cell types and activity levels for angiogenesis, APM, Anti-in ammatory, Pro in ammatory and IFN were quanti ed using the ssGSEA [24] implementation in R package gsva [25]. Normalized RNA-Seq datasets mentioned above were provided as input without further processing. The ssGSEA score was normalized to unity distribution, for which zero is the minimal and one is the maximal score for each gene panels. The output for each signature is a near-Gaussian list of decimals that can be used in visualization/statistical analysis without further processing.
ssGSEA analysis ranked the list of genes (Supplementary Table 3) to identify immune pathways [14,21,22] and metabolic pathways [23] enriched in genes with highest variability for different clusters [24]. Expression levels for checkpoints were also evaluated. Immune scores and stromal scores were calculated by applying the ESTIMATE algorithm to the downloaded database [26].

Somatic mutation analysis
Somatic mutation data were obtained from the publicly available TCGA database via the GDC data portal (https:// portal.gdc.cancer.gov/). From the four subtypes of data les, we selected the "Masked Somatic Mutation" data and processed it based on the VarScan software. We prepared the Mutation Annotation Format (MAF) of somatic variants and implemented the "maftools" [27] R package which provides a multiple of analysis modules to perform the visualization process.
DNA methylation analysis TCGA-HNSC DNA methylation data (Illumina Human Methylation 450 k) was downloaded from https://tcga.xenahubs.net/download/TCGA.BLCA.sampleMap/HumanMethylation450.gz. Methylation data were normalized with the β-mixture quantile normalization method (BMIQ) [29] was utilized to preprocess β-values. ChAMP was used to perform quality control, standardization, and calculation of methylation sites and regions [30]. By using the ChAMP package (parameters: adjusted P-value < 0.05), differentially methylated probes (DMPs) between cluster 1A and cluster 3B were identi ed with bumphunter method. [31] And we use 95% quantile as the cut off value for the candidate regions, which contain at leaste 7 CpG probe. The heatmap was created in R (v3.4.1) using the pheatmap package on the quantile normalised methylation (beta) values.

Differentially Expressed Genes Screening
According to unsupervised clustering, samples were divided into three groups: cluster 1, cluster 2, cluster 3. Then, the DESeq2 function in the R software package [32] was used to analyze the genetic differences between cluster 1A and cluster 3C for TCGA and IMvigor210 cohorts. Then, we screened the DEGs to obtain those with the adjusted P-value < 0.05 and |log2(Foldchange)|>1. Statistical analysis was carried out for each dataset, and the intersecting part of DEGs(2705 genes)was identi ed using the Venn diagram webtool.

Gene Ontology Annotations and Kyoto Encyclopedia of Genes and Genomes Pathway Analyses
Gene Ontology (GO) analysis was performed to show the unique biological signi cance based on differentially expressed genes. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database was carried out to nd out the important pathway. The ClusterPro ler packages [33] in R was applied to analysis and demonstrated GO annotation and KEGG pathway.

Gene Set Enrichment Analysis
Expression dataset from SRP049695 was conducted Pathway enrichment analysis of a ranked gene list using Gene Set Enrichment Analysis (GSEA) [32]. The following operations were carried out according to the protocol . (http://software.broadinstitute.org/gsea/)

Co-expression Network Construction
WGCNA package of R software was applied to uncover the correlation among genes. [34] The soft-thresholding power β was calculated in the construction of each module using the pickSoftThreshold function of the WGCNA. The power of β was set at 10 to ensure a scale-free network. The minimum number of module genes was set at 30. The hierarchical clustering dendrogram summarized the Gene modules with different colors. Heat map and topological overlap matrix (TOM) plot were used to visualize the module structure. The relationships between modules and clinic traits were assessed using the correlations between the module eigengene and the clinic trait, allowing the identi cation of hub modules that are correlated with external traits and searching for the most signi cant associations. For hub modules, the quantitative measure of module membership was de ned as the correlation of the module eigengene and the gene expression pro le; gene signi cance was de ned as the absolute value of the correlation between the gene and the clinic trait. We set the ModuleMembership > 0.8 and the Gene-Signi cance > 0.2 for candidate hub genes.

Selection and Validation of Hub Genes related to prognosis
We selected candidate hub genes in TCGA dataset and used the Search Tool for the Retrieval of Interacting Genes (STRING; https:// string-db.org/) database to construct PPI network. The con dence score was set at 0.95. [35] Gene network les exported from STRING were input into Cytoscape software. The plug-in cytoHubba of Cytoscape was used for screen out hub genes with the intersecting top 50 genes based on 12 kinds of Global-based methods. Hub Genes related to prognosis were screened to plot the Kaplan-Meier survival curve in ggplot2 of R software. Consistent with TCGA results, we enrolled survival data and expression level of hub genes related to prognosis from an independent cohort IMvigor210.

Survival analysis
We used cohort IMvigor210 to establish the prognostic model. The 5 hub genes underwent lasso regression analysis to eliminate collinearity between genes. After performing 1000 10-fold cross-validations, the lambda value with minimized error was selected as the optimum lambda parameter value. We used multivariate Cox proportional hazards regression analysis to nd key genes involved in the model. The disease risk score, which was used as predictors of prognostic status, was de ned by parameter β of the multivariate cox proportional hazards regression analysis and the expression of each selected gene. The risk score for each patients was calculated, categorized into high or low-risk. The predictive performance of this model at 5-year endpoint was evaluated with a time-dependent receiver operating characteristic (ROC) curve.

CART analysis
CART analysis was a nonparametric, supervised statistical learning technique combining variable values involving the 5 hub genes which best discriminated between classes, in our case stable disease (SD) and progressive disease (PD). The optimal combination of variables and possible cutoff values for classi cation was determined by an exhaustive search of all possibilities by the CART algorithm [36]. The Gini criterion was applied to minimize node impurity after splitting. we also performed 10-fold cross validation to assess the classi ed e ciency of the decision tree.

Molecular immune phenotypes in MIBC patients
Unsupervised clustering of 1601 MIBC patients using immune signature genes was performed. Three main clusters were identi ed as cluster 1, cluster 2, and cluster 3 (Fig. 1A). ssGSEA scores indicated that Cluster 3 was in ltrated with high levels of innate and adaptive immune cells. Cluster 2 was heterogeneously in ltrated with immune cells while cluster 1 was non-in ltrated. Cluster 3 showed signi cantly high expression of signatures of CD8 T cells, macrophages, Th1 cells, NK cells, and Tregs, followed by cluster 2 and cluster 1 (Fig. 1B). Immune-related pathways, involving interferon (IFN) signaling, TGF-beta signaling, Antigen presenting mechanism (APM), angiogenesis, epithelial-mesenchymal transition (EMT) were mostly enhanced in cluster 3, followed by cluster 2 and cluster 1. On the other hand, DNA damage response (DDR) and FGFR3 signaling were mostly enhanced in cluster 3, followed by cluster 2 and cluster 1. Cluster 3 was divided into three subclusters 3A, 3B and 3C and cluster 1 was divided into two subclusters 1A and 1B. Cluster 3C was the most "in amed" subcluster with high levels of in ltrated immune cells and activation of immune-related pathways. On the contrary, cluster 1A was associated with "non-in ammation" due to the low levels of in ltrated immune cells and inhibition of immune cells activation. We used RNA-based ESTIMATE algorithm to con rm that cluster 3C displayed the highest stromal score and immune score, which was the least pure subcluster. The analysis of ssGSEA scores in metabolic pathways showed that hypoxia regulated genes and glucose deprivation related genes were increased in cluster 3, with a potential to activate immune cells through immunometabolic mechanisms. Cluster 3 also had the best overall survival compared to the other two clusters (Fig. 1C). The expression levels of IFNG, TGFβ, immune checkpoints PD1 and PD-L1 were most enhanced in cluster 3, while the expression level of FGFR3 was most enhanced in cluster 1 (Fig. 1D).

Differences in somatic mutations related to the immune phenotype
After applying unsupervised clustering in TCGA cohort, mutation information of each gene in each sample was exhibited in waterfall plot in different subclusters, as cluster 1A, 1B, 2, 3A, 3B, and 3C ( Fig. 2A, 2B, 2C

DEGs between cluster 3C and cluster 1A
After applying unsupervised clustering in TCGA and IMvigor210 cohorts, DEGs between cluster 3C and cluster 1A were analyzed. 3790 signi cantly up-regulated DEGs and 4794 down-regulated DEGs were identi ed in TCGA dataset. In IMvigor210 cohort, 2950 up-regulated genes and 1067 down-regulated genes are identi ed. The intersection includes 2043 signi cantly up-regulated and 662 down-regulated genes (Fig. 3A). The volcano plots of TCGA cohort and IMvigor 210 cohort were shown (Fig. 3B). The GO analysis of upregulated DEGs in cluster 3C were enriched in immune cell activation involving T cell activation, leukocyte activation, phagocytosis, and leukocyte migration (Fig. 3C). KEGG analysis showed that the downregulated genes in cluster 3C were enriched in various metabolic pathways involving amino acid metabolism and fatty acid degradation (Fig. 3D).

Difference in genomic CNVs and methylation in DEGs between cluster 3C and cluster 1A
We then performed an integrated analysis of Multi-omics data on TCGA cohorts. To evaluate whether copy number variations (CNVs) affect transcription of affected genes, analysis of genomic alteration of DEGs indicated regions of copy-number gains (chromosomes 1,2,3,4,10,12,17,19,21) or loss (chromosomes 2,6,9,10,11,17) as characteristic features of cluster 1A compare with cluster 3C in the TCGA cohort (Fig. 4A). 159 upregulated DEGs in cluster 3C were coded by regions with a higher frequency for copy-number deletions in cluster 1A, while 62 upregulated DEGs in cluster 1A were encoded by regions with more frequent gains in the cluster (Fig. 4B). Global methylation data available for the TCGA cohort were analyzed. In total, differentially methylated probes of 349 genes were identi ed comparing cluster 1A and cluster 3C. Probes with signi cantly higher beta values in cluster 3C were located in the proximal promoter of 25 DEGs with higher expression in cluster 1A, while Probes with signi cantly higher beta values in cluster 1A were located in the proximal promoter of 16 DEGs with higher expression in cluster 3C (Fig. 4C).

WGCNA analysis
The intersected 2705 DEGs were used to perform subsequent WGCNA analysis. The power of β was set at 10 to ensure a scale-free network (Fig. 5C). Gene modules were calculated. The gray module identi ed the genes that can't be classi ed to other modules. 12 gene modules were identi ed by the hierarchical clustering dendrogram (Fig. 5A). The relationships between modules and clinic traits were assessed. Many modules are correlated with survival time. The correlation coe cients of the red, yellow, blue, and pink modules were greatest, at 0.28 (0.001), 0.28 (0.001), 0.25 (0.005), and 0.23 (0.009), respectively (Fig. 5B). MM and GS value in the upper quartile of genes in the module were considered as key genes of this module (The MM and GS cut-off of these modules were 0.8 and 0.2, respectively). The gene distribution of red, yellow, blue and pink modules were shown (Fig. 5D). A total of 415 hub genes were investigated from 4 modules, which were analyzed using matascape for pathway and process enrichment analysis. GO analysis indicate these genes are closely related to cytokine-mediated signaling, regulation of cell activation, regulation of cytokine production and leukocyte differentiation (Fig. 5E).

Identi cation and validation of hub genes
Protein-protein interactions (PPI) were applied to identify 8 genes as hub genes (LCK, HLA-E, IRF1, IL2RB, IFNG, CCL13, CXCR6, PSMB10) (Fig. 6A). To validate our nding of hub genes, we examined TCGA and IMvigor210 cohorts. The KM curves for genes with P < 0.05 in TCGA cohort were shown in supplementary Fig. 2. IFNG, CXCR6, IL2RB, LCK, and PSMB10 were found to be closely related to prognosis in both cohorts (Fig. 7A, 7B, 7C, 7D, and 7E). We constructed an immune in ltration interaction network based on the 5 hub genes using the STRING dataset. The results were imported into Cytoscape for visualization (Fig. 6B) In order to establish a model for predicting prognostic status, 5 genes were used to perform a lasso regression, with 10fold cross-validation with 1000 repeats. λ value with the smallest Partial Likelyhood Deviance was shown in 2 of the 5 genes had coe cients that were not zero. A prognostic score model involving LCK and PSMB10 was established.
Risk Score=(-0.1973 × LCK) + (-0.0091 × PSMB10) By predicting survival of patients at 5 years, the areas under the curve (AUC) of the ROC curves generated from the riskbased prediction model in the test data were 0.652 (Fig. 6F). The test data were obtained from IMvigor210 cohorts, consisting of patients receiving ICB therapy.
The decision tree illustrated decision rules, among all the 5 hub genes that were entered in the analysis, 4 were selected by the program for the classi cation tree (Fig. 8). The four parameters were LCK, IFNG, PSMB10, and CXCR6. LCK was the most essential determining factor, which was the rst-level split of two initial branches of the classi cation tree. The mean accuracy, sensitivity, and speci city for predicting SD/PD were 70.1%, 70.0% and 71.7%.

Discussion
Unsupervised clustering of 1601 MIBC patients using immune signature genes has divided our cohorts into three clusters, Cluster 1, 2, and 3. Cluster 3 was termed as most "in ammed" while cluster 1 was identi ed as "non-in ammation" according to the difference in in ltrated immune cells and immune-related signaling. Cluster 3C showed the highest expression level of CD8 T cells, which are primed and activated to cytotoxic T lymphocytes (CTLs). CTLs are the key immune cells for killing malignant cells. CD8 T cells priming requires IFN-γ secreted by Th1 cells and NK cells and TNF-α secreted by Macrophages [37]. The expression level of Th1, NK cells, and Macrophages are also the highest in cluster 3.
ssGESA score of immune-related signaling showed high level of IFN signaling and TGF-β signaling in cluster 3. IFN upregulates APM, which was the highest level in cluster 3. Upregulation of APM could presents more antigen and increases the number of CD8 + T cells. IFN-γ also plays an important role in the regulation of PD-L1 expression [38].
Tumors with high IFN-γ levels are more likely to respond to anti-PD-1 therapy [39]. IFN-γ was suggested to have a protective response limiting the process of bladder cancer [40]. TGF-β is the effector molecule of macrophages and it is associated with stromal activation, angiogenesis and epithelial-mesenchymal transition (EMT), which are induced in cluster 3, with protumorigenic effects [41]. Besides, cluster 3 displayed lowest expression in genes involved in DNA damage response (DDR), which increased tumor mutation burden, likely inducing immune cells activation and increasing response to ICB therapy [42]. Cluster 1 showed the highest level of FGFR3 signaling. FGFR is associated with a lack of Tcell in ammation [43], likely lead to "non-in ammation" phenotype. Metabolic regulation is essential for immune cells activation. Hypoxia regulated genes are signi cantly increased in cluster 3, which might lead to increased activation of fatty acid catabolism and CD8 + T cell activation [44]. Glucose deprivation related genes are also induced in cluster 3, increasing PD-1 expression on CD8 + T cells. Metabolically, Glycolysis is reduced while fatty acid metabolism is induced in immune cells [45]. However, direct evidence on metabolism regulate immune cells in tumor microenvironment needs further studies.
The analysis of somatic mutation, genomic CNVs and DNA methylation in our study didn't show direct relation to the 5 hub genes. The molecular mechanisms regulating the expression of the hub genes require further studies in the future.
In order to distinguish different immune subtypes in MIBC patients, we performed a WGCNA analysis, 5 hub genes, IFNG, CXCR6, IL2RB, LCK, and PSMB10, which correlated with survival time were identi ed by us. A prognostic score model involving LCK and PSMB10 was established to predict patients undergoing ICB therapy. In addition, the decision tree selected 4 genes including LCK, PSMB10, IFNG, and CXCR6 by the program. In our study, higher expression of CXCR6 was correlated with higher survival rate. However, CXCR6 has been shown to predict poor prognosis in gastric cancer, breast cancer, prostate cancer and bladder cancer [46][47][48][49]. On contrary, CXCR6 was also found to be necessary for promoting NKT and CD4 T cells to remove senescent hepatocytes to prevent hepatocarcinogenesis [50], which might indicate a similar mechanism in MIBC. Interleukin-2 (IL-2) was one of the rst FDA-approved immunotherapy drugs for melanoma and renal cell cancer. IL-2 receptor (IL-2R) complex is important to control immune response [51,52]. There are three IL-2R subunits: IL-2Rα, IL-2Rβ, and IL-2γ. The application of IL-2 treatment was limited because it induced proliferation of Tregs through IL-2 Rα, which is preferentially expressed on Tregs [53]. Increased Tregs are correlated with worse prognosis [54,55]. On contrary, CD8 T cells and NK cells express IL-2Rβ and IL-2γ, which lack IL-2 Rα. The activation of CD8 T cells and NK cells contribute to better prognosis. Strategies targeting on IL-2Rβ have been shown effective tumor control through activating CD8 T cells and NK cells [56,57]. Lymphocyte-speci c protein tyrosine kinase (LCK) was essential for TCR signaling, associated with Th1, Th2, and Th17 cell differentiation [58]. As a result, LCK has become a druggable target to improve the e cacy of chimeric antigen receptors and to potentiate T-cell response in immunotherapy [59]. Another study indicated that LCK was associated with bladder cancer prognosis, which is consistent with our results [60]. Proteasome subunit beta type-10 (PSMB10) is one of the immunoproteasome subunits. Interferon-γ induces the expression of the catalytic immunoproteasomes [61]. PSMB10 plays a pivotal role in antigen generation/processing, which is crucial for cytotoxic T lymphocytes to eliminate cancer cells [62,63]. Upregulation of immunoproteosomes in breast cancer shown abundance of tumor in ltrating lymphocytes and a good prognosis [64].

Conclusions
The 5 hub genes, IFNG, CXCR6, IL2RB, LCK, and PSMB10, and generated models showed the potential for predicting the prognosis for patients receiving ICB therapy. The molecular mechanisms regulating the expression of the hub genes require further studies in the future.

Consent for publication
Not applicable.

Availability of data and materials
The datasets generated used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
None.
Author's contributions Pei-Heng Li and Zhi-xin Chen designed the study and collected the data. Zhi-Xin Chen analyzed the data. Pei-Heng Li interpreted the results. Dong Wang, Zhi Zheng, and Zhi-Gang Ji corrected the results and interpretation of data. All authors read and approved the nal manuscript. Molecular immune subtypes in MIBC patients. A Unsupervised clustering of 1601 MIBC patients using ssGSEA scores from immune cell types and immune-related pathways. Hierarchical clustering was performed with Euclidean distance and Ward Linkage. Three main clusters were identi ed as cluster 1, cluster 2 and cluster 3. Of the three clusters, subcluster 3C was the most "in amed" while subcluster 1A was "non-in ammation". ssGSEA analysis of metabolic pathways in different clusters was also shown. B The level of in ltrated immune cells, CD8 T cells, Th1 cells, macrophages, NK cells, and Tregs in the three clusters. Cluster 3 was in ltrated with highest level of innate and adaptive immune cells, followed by cluster 2 and cluster 1. C Survival curves of the three clusters. Cluster 3 was correlated with the best overall survival, followed by cluster 2 and cluster 1 (p=0.005). D The expression levels of IFNG, TGFβ, FGFR3, immune checkpoints PD1 and PD-L1 in the three clusters.