3.1 Intersection of WGCNA and DGEs
A schematic flow diagram of the performed biomarker identification assay is shown in Fig. 1.
In our study, we initially selected 4725 differentially expressed genes based on the criteria of |logFC| > 1 and adj.p.val < 0.05, as depicted in Fig. 2A. Subsequently, we conducted WGCNA analysis and determined that the optimal threshold for constructing a scale-free network was 6, as illustrated in Fig. 2B. After determining the optimal threshold, we set the merging module threshold to 0.25 and generated the gene clustering diagram, as presented in Fig. 2C. Next, we integrated the phenotypic data and calculated the correlation coefficients and module significance p-values between the quantitative module eigenvectors and the phenotypes. These results were visualized as a heatmap representing the module-trait correlation coefficients, as shown in Fig. 2D. Based on a significance level of p < 0.05 and considering the correlation coefficient, we identified the yellow module as the key gene module most relevant to BLCA tumor tissue (p < 1e-200, corr = 0.76), as depicted in Fig. 2E. Finally, we obtained a set of 609 genes by intersecting the differentially expressed genes with the key genes from the yellow module, as displayed in Fig. 2F.
3.2 Enrichment Analysis
We performed enrichment analysis of GO and KEGG pathways using the clusterProfiler package. To obtain significant GO terms and KEGG pathways, we applied a threshold of qvalueCutoff = 0.05 and pvalueCutoff = 0.05, As shown in Fig. 3A, we list the top ten important GO terms for DEGs in biological processes (BP), cellular components (CC), and molecular functions (MF). For example, in BP (Fig. 3B), DEGs were significantly enriched in response to ameboidal-type cell migration, wound healing, cell-substrate adhesion, muscle contraction, muscle system process, tissue migration, regulation of cell-substrate adhesion, epithelial cell migration, epithelium migration and muscle tissue development. The GO words of the MF group (Fig. 3C), including extracellular matrix structural constituent, actin binding, extracellular matrix binding, DNA-binding transcription activator activity, RNA polymerase II-specific, DNA-binding transcription activator activity, actin filament binding, glycosaminoglycan binding, integrin binding, muscle alpha-actinin binding and transmembrane receptor protein kinase activity, were significantly enriched by DEGs. In the CC group (Fig. 3D), DEGs were mainly enriched in collagen-containing extracellular matrix, contractile fiber, myofibril, I band, sarcomere, Z disc, actin filament bundle, focal adhesion, cell-substrate junction and actomyosin.
Figure 3E-G shows the analysis of the KEGG pathway of DEGs. We observed that DEGs are mainly involved in Focal adhesion, MAPK signaling pathway, Proteoglycans in cancer, cGMP-PKG signaling pathway, Vascular smooth muscle contraction, Oxytocin signaling pathway, Cellular senescence, Human T-cell leukemia virus 1 infection, Regulation of actin cytoskeleton and ECM-receptor interaction.
3.3 Feature Selection
We composed a new gene expression dataset using 609 features from WGCNA analysis and differential analysis, dividing the dataset into a training set and a test set, where 75% is the training set and 25% is the testing set. We performed feature selection using SVM-RFECV and XGBoost methods. Using these methods, we selected 28 features with SVM-RFECV and 26 features with XGBoost. These features were chosen based on their importance in predicting the outcome of the cancer dataset. In Fig. 4A-C, we present the confusion matrix and classification reports, including Precision, Recall, and F1 score, for the SVM-RFECV model. Similarly, in Fig. 4D-F, we show the confusion matrix and classification reports for the XGBoost model. Precision represents the ratio of correctly observed positive results to all observed positive results, while Recall is the ratio of correctly observed positive results to the total results observed in the desired category. F1 score is a performance metric that combines both Precision and Recall, providing a measure of overall model performance. Values greater than 0.5 indicate relatively good categorization, while values less than 0.5 suggest categorization failure. As shown in Fig. 4, the models constructed for the cancer dataset all show successful classification results. The accuracy of the test dataset was calculated as 98.15% for the SVM-RFECV model and 100% for the XGBoost model. The accuracy is determined by comparing the predicted labels with the true labels in the test dataset. These specific accuracy values were obtained based on the model's performance in correctly classifying the test samples. Finally, we took the intersection of SVM-RFECV and XGBoost, as shown in Fig. 4H, and finally identified six genes, NR4A1, PAMR1, CFD, RAI2, ALG3 and HAAO.
Based on the intersection results mentioned above, we further incorporated ferroptosis-related genes into the analysis. We obtained 567 genes related to ferroptosis from the FerrDB database. Among these genes, NR4A1 was identified as the most relevant ferroptosis-related gene in both the cancer group and the normal group, as depicted in Fig. 4I.
3.4 Survival analysis and ROC analysis
We performed Kaplan-Meier survival analysis to assess the survival outcomes of patients based on different gene expression or high/low risk groups. To evaluate the impact of the identified NR4A1, we utilized the TCGAbiolinks package to download clinical data and employed the survminer package for survival analysis. The cut_point function was used to determine the optimal threshold for stratifying patients into high and low gene expression groups. Additionally, we obtained clinical data from the GEO database for further analysis, including TCGA-BLCA, GSE3507, and GSE37815 cohorts. Statistical analysis was performed to compare the overall survival (OS) rates between different expression groups23. At the same time, we downloaded the clinical data of GEO data from the GEO database and analyzed TCGA-BLCA, GSE3507 and GSE37815 respectively. Further analysis revealed a significant difference in the OS rates between the high and low expression groups in the TCGA-BLCA cohort (p = 0.0031), as shown in Fig. 5A. Similarly, in the GSE31507 cohort, there was a significant difference in the OS rates between the low and high expression groups (p = 0.00095), as depicted in Fig. 5B. Furthermore, in the GSE37815 cohort, the high expression group exhibited a significantly lower OS rate compared to the low expression group (p = 0.00021), as shown in Fig. 5C.
We evaluated the diagnostic performance of the identified gene by analyzing their AUC values using ROC curve analysis. Firstly, for the TCGA dataset, we compared the expression of NR4A1 between the cancer group and the normal group, as shown in Fig. 5D. Secondly, we assessed the sensitivity and specificity of these genes for diagnosing BLCA by generating ROC curves. The AUC value for NR4A1 was calculated as 0.9, indicating a high discriminatory power, as depicted in Fig. 5E. Additionally, for the GEO datasets, we processed the batch effect using the sva (R/Bioconductor) package and merged the datasets. Subsequently, we calculated the inter-group differences and generated ROC curves. The AUC value obtained for the GEO dataset was 0.697, as shown in Fig. 5(F-G). The AUC value represents the area under the ROC curve and is a measure of the overall diagnostic performance of a test. AUC values range from 0 to 1, where a value of 1 indicates a perfect discriminatory power, and a value of 0.5 suggests no discriminatory power (equivalent to random chance). In our analysis, the AUC value of 0.9 for NR4A1 in the TCGA dataset indicates a high accuracy in distinguishing between BLCA and normal samples. Similarly, the AUC value of 0.697 for the GEO dataset suggests a moderate discriminatory power. These results suggest that NR4A1 has potential as a diagnostic biomarker for BLCA.
Finally, we performed a differential analysis of NR4A1 expression levels in TCGA-BLCA, comparing it with stage, N, M, T, age, and sex. As shown in Fig. 6, we observed significant differences in stages, especially in stage I + II compared to stage III and VI respectively. The differences are striking. Furthermore, consistent with Fig. 5D, we observed a significant decrease in the expression of NR4A1 in the cancer group.
3.5 CeRNA network analysis
In order to gain insights into the mechanism of “NR4A1” in BLCA, we employed themultiMiR” (R/Bioconductor) package to identify the miRNAs that potentially regulate NR4A1. multiMiR incorporates eight different predicted miRNA-target gene interaction databases (diana_microt, elmmo, microcosm, miranda, mirdb, pictar, pita, and targetscan), which greatly facilitates research on disease pathogenesis, diagnosis, and treatment based on the regulatory relationship between miRNAs and target genes, as depicted in Fig. 7(A-B).
Based on the results obtained, we focused on the miRNAs that were predicted to target NR4A1 in at least six out of the eight databases and utilized the mirnet website to predict the target genes of these miRNAs. Subsequently, we constructed a ceRNA (competing endogenous RNA) network diagram, as depicted in Fig. 7C. This network diagram provides a visual representation of the interactions between miRNAs, NR4A1, and other target genes, shedding light on the potential regulatory mechanisms involved in BLCA. This network provides novel insights into the post-transcriptional regulation of NR4A1 and may help to reveal potential therapeutic targets for BLCA. The results are shown in the Supplementary Table.
3.6 Tumor mutation burden estimation
Due to the association between tumor mutation burden (TMB) and the response to immunotherapy and prognosis of cancer, we utilized the maftools (R/Bioconductor) tool to analyze and visualize somatic mutation data in tissues with high and low expression of NR4A1. The results of this analysis are presented in Fig. 8.
In Fig. 8C, we performed a differential mutation analysis using Fisher’s exact test on all genes present in the maf files of the high expression and low expression groups. Our analysis revealed that genes such as RYR2, POLN, and CNTNAP2 exhibited significant differences in mutation frequency between the two groups.
3.7 Immune analysis
In this study, our specific objective was to investigate the potential association between NR4A1 expression and the infiltration levels of immune cells in bladder cancer. Understanding this association can provide valuable insights into the role of NR4A1 in modulating immune responses within the tumor microenvironment, potentially leading to the development of novel therapeutic strategies for bladder cancer treatment.
We utilized the xCell, cibersort and estimate (R/Bioconductor) to analyze the differences between immune cells with high and low expression levels of NR4A1. As shown in Fig. 9A, significant differences in StromaScore and MicroenvironmentScore were observed in immune cells including adipocytes, chondrocytes, endothelial cells, fibroblasts, HSCs, endothelial cells, megakaryocytes, mesangial cells, and Pericytes. The stromal score and immune score are shown in Fig. 9B.
We employed cibersort to analyze and compare the differences in the abundance of 22 immune cell types between the NR4A1 high and low expression groups (as depicted in Fig. 9C). The results indicated that Macrophages M1 and Mast cells activated exhibited higher levels in the NR4A1 high expression group compared to the low expression group, and these differences were found to be statistically significant (P < 0.05). Additionally, T cells regulatory (Tregs) showed a significant increase in the NR4A1 low expression group. Figure 9D shows a heat map of high and low expression of NR4A1 in immunoassays. Therefore, we suggest that the NR4A1 may play a crucial role in immune cell regulation in BC.
3.8 pan-cancer analysis
To further analyze NR4A1, we conducted an analysis of NR4A1 expression in 23 different tumor types from the TCGA database, comparing cancer tissues with corresponding normal tissues. Our findings revealed that in 15 cancer types (BLCA, BRCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, STAD, THCA and UCEC), the expression level of NR4A1 was significantly increased in the corresponding normal tissues (p < 0.05), as depicted in the Fig. 10A.
Furthermore, when considering the overall significance, we observed that NR4A1 plays an important role in five cancers: ACC (p = 0.02), CESC (p = 0.0015), KICH (p = 0.039), KIRC (p = 0.0022), and TGCT (p = 0.029). These results indicate significant differences in NR4A1 expression among different pathological stages, as shown in the Fig. 10(B-F).
We also investigated the correlation between NR4A1 expression and the prognosis of patients with different cancer. Our analysis revealed significant associations between NR4A1 expression and the prognosis of 14 different cancer types. In the Fig. 11, it is evident that high NR4A1 expression is associated with poor prognosis in patients with 9 types of cancer (ACC (p = 0.026), COAD (p = 0.025), DLBC (p = 0.0092), ESCA (p = 0.00068), KIRP (p = 0.037), LUSC (p = 0.041), MESO (p = 0.018), OV (p = 0.015), THCA (p = 0.0027)). Conversely, low NR4A1 expression is associated with poor prognosis in patients with 5 types of cancer (BRCA (p = 0.017), KICH (p = 0.0059), KIRC (p = 0.011), LIHC (p = 0.0071), STAD (p = 0.032)).
In the immune infiltration analysis, we observed significant findings in the BLCA dataset regarding B cells naive and T cells regulatory (Tregs). B cells naive showed a positive correlation, while Tregs showed a negative correlation. These results are depicted in the Fig. 12A.
Regarding the immune infiltration of T cells regulatory (Tregs) and NR4A1 expression, we found a negative correlation in 18 cancer types (BRCA (p = 1.81e-10, r = -0.18), CESC (p = 3.75e-03, r = -0.16), GBM (p = 0.02, r = -0.17), HNSC (p = 2.39e-03, r = -0.13), KIRC (p = 2.15e-05, r = -0.17), LGG (p = 1.32e-04, r = -0.16), LIHC (p = 4.69e-13, r = -0.34), LUAD (p = 4.14e-06, r = -0.19), LUSC (p = 4.13e-03, r = -0.12), MESO (p = 3.59e-03, r = -0.31), OV (p = 6.42e-04, r = -0.16), PCPG (p = 9.87e-05, r = -0.28), PRAD (p = 2.73e-08, r = -0.23), SARC (p = 0.01, r = -0.15), THCA (p = 7.70e-6, r = -0.33), THYM (p = 2.71e-03, r = -0.27), and UCEC (p = 8.09e-06, r = -0.18)). However, in COAD (p = 0.04, r = 0.09), a positive correlation was observed. These correlations are also depicted in the Fig. 12(B-S).
To further investigate immune infiltration, we conducted separate analyses in 32 other cancer datasets. In 17 cancers (BRCA (p = 3.69e-11, r = 0.19), CESC (p = 3.25e-06, r = 0.26), ESCA (p = 4.36e-04, r = 0.25), GBM (p = 3.86e-08, r = 0.40), HNSC (p = 0.03, r = 0.09), KICH (p = 1.99e-03, r = 0.32), KIRC (p = 8.23e-14, r = 0.30), KIRP (p = 1.52e-10, r = 0.35), MESO (p = 1.61e-04, r = 0.39), OV (p = 2.33e-04, r = 0.18), PRAD (p = 1.29e-05, r = 0.18), READ (p = 0.01, r = 0.19), SARC (p = 3.03e-03, r = 0.18), STAD (p = 1.06e-03, r = 0.15), TGCT (p = 4.11e-04, r = 0.28), THYM (p = 5.82e-03, r = 0.25), and UCEC (p = 5.57e-07, r = 0.20)), we found a positive correlation between immune infiltration of B cells naive and NR4A1 expression. Conversely, in LAML (p = 0.02, r = -0.19), a negative correlation was observed. These correlations are shown in the Fig. 13.