Analysis of DNA-seq
The Maftools package in R programming was used to analyze MAF data from 490 patients. Figure 1 displays various plots, such as oncoplots, plotmafSummary, Transition, and Transversions reports, Plotting VAF (Variant Allele Frequencies), Drug-Gene Interactions, Somatic Interactions reports, and Oncogenic Signalling Pathways, to picture the MAF distribution between different groups. Based on Fig. 1A, missense mutations were the most frequent type of mutations. Furthermore, mutations in the TTN, TP53, and SPOP genes were present in 10–11% of patients. Many variants were involved in protein-homeostasis/ubiquitination, genome integrity, and chromatin-SWI/SNF complex (Fig. 1B). The Variant Allele Frequencies plot can estimate the clonal status of the top mutated genes, where clonal genes have an mean allele frequency of about 50% in pure samples.
TP53 was observed to have clonal status in tumor tissue in Fig. 1C. Somatic Interactions analysis showed exclusive or co-occurring events in Fig. 1D. Mutually exclusive events occur when mutations in one gene avoid the arising of mutations in another gene while co-occurring events happen when mutations in two or more genes arise together more commonly than would be anticipated by chance. Identifying mutually exclusive genes shows that these genes may take apart in the same pathway or process, and there could be a functional overlap between them. On the other side, recognizing genes that co-occur may suggest that they collaborate to aid tumour growth, or that their cumulative impact is necessary for cancer development. Figure 1E illustrates the interaction between genes and drugs that target Tyrosine kinase, serine-threonine-kinase, tumour suppressor, and other related processes. The involvement of mutated genes in Oca across different Oncogenic Signalling Pathways, including Wnt, RTK-RAS, Hippo, Notch, and others, is depicted in the Fig. 1F. Figure 1G provided an instance of genes that function as either tumour suppressors or oncogenes within the WNT pathway.
Identification of differentially expressed genes (DEGs) in PCa
To find dysregulated genes in PCa, TCGA data set containing 52 non-cancer samples and 498 tumour samples were used to obtain DEGs with DESEQ2 package. We identified a total of 609 DEGs which 358 of them were downregulated and 251 of them were upregulated. To further show the distribution of DEGs, a heatmap was plotted using heatmap package in R (supplementary Fig. 3) and a volcano plot was plotted using EnhancedVolcano package (Fig. 2A).
Determination of key DEGs in PCa by deep learning
Deep learning method was used to identify the most important DEGs of PC that were obtained before. We spilt the TCGA prostate cancer dataset with 549 samples with the 70/30 ratio into training set and test set. As depicted in table (1), 20 genes of 609 PC DEGs were identified as most significant genes in PCa. The SCNN1B gene was determined as most important gene in PCa with the variable importance of 1. SLC13A2 and SCGB1D1 with variable importance of 0.996 and 0.985 were the next important genes. To further visualize the results, heatmap of 20 identified key genes were plotted using R. 4.2.3(Fig. 2C). The performance of the model was confirmed with the AUC = 1, R2 = 0.96, RMSE = 0.054, MSE = 0.002, PR-AUC = 1, and Accuracy = 93.8%. We consider 20 identified genes as candidate genes for further analysis.
Evaluation of candidate genes in DNA-seq data
Identified genes by deep learning were assessed in DNA-seq data analysis results to evaluate them at the genomic level. For this purpose, we identified and visualized 26 different genetic variants of 20 candidate genes as depicted in the venn diagram (Fig. 2D). As represented in Fig. 2E, the most frequent mutation among candidate genes were missense mutations. SCNN1B and CRTAC1 were mutated in about 17% of patients. Notably, SCNN1B mutations have been identified in people with pseudohypoaldosteronism type 1 (PHA1) also, 16 mutations of SCNN1B is related with Liddle syndrome(14). Also, COSMIC data base showed that CRTAC1 represent point mutations in 5.05% of prostate tissue samples and copy number variation in 0.11 of samples(15). Moreover, SORBS1 and C11orf24 mutations were presented in 11% of patients in our study (Fig. 2F). Figure 2G shows druggable genes that target kinase, serine-threonine-kinase, cell surface, and other related processes.
Functional enrichment analysis of candidate genes
To further explore the function of candidate genes and different pathways they contribute, functional enrichment analysis was performed using ‘clusterProfiler’ package in R.
ClusterProfiler enable us to find shared function and pathways of genes through different biological ontologies such as gene ontology (GO) that annotates genes to biological process (BP), cellular component (CC), and molecular function (MF), also Kyoto Encyclopedia of Genes and Genomes (KEGG) that annotates genes to pathways, and Human Disease Ontology (DO) that annotates genes to diseases. As shown in Fig. 3, GO enrichment of 20 identified genes showed that terms like dicarboxylate symporter activity, transforming growth factor beta receptor activity, and kisseptin receptor binding were significantly enriched in molecular function (MF). In biological process (BP), most genes were involved in response to response to peptide hormone, response to peptide. Also, terms like myeloid cell homeostasis and erythrocyte homeostasis were significantly enriched (Fig. 3A-3C). Do enrichment results showed that candidate genes are involved in hypertension, ovarian dysfunction, hypergonadism, renal carcinoma, vascular diseases and other diseases that are shown in Fig. 3D .Also, KEGG pathway enrichment of key genes demonstrated that Hippo signalling pathway, Aldosterone-regulated sodium reabsorption, Vibrio cholera infection, and GnRH secretion pathways were significantly enriched which is indicative that identified genes are significantly involved in pathogenesis of prostate cancer (Fig. 3E).
STRING database was utilized to construct PPI network of 20 identified genes. Then the PPI network was submitted to the Cytoscape software to visualize the hub genes. Cytohubba app was used to obtain 10 top hub genes. Hub genes were imported into the STRING to reanalyse the network. The constructed network of hub genes is depicted in Fig. 4A.
Identification of prognostic biomarkers
Cox regression analysis was utilized to assess the effects of differentially expressed genes on survival of patients with PCa. Results showed meaningless relation between DEGs and survival of patients in general. Also, we applied Cox analysis to evaluate the impact of confounding factors such as age, race, Gleason-score, clinical -M, clinical -T, case-control, and serum PSA-value on survival of patients which outcomes showed no meaningful relation between these factors and survival of patients with prostate cancer.
We conducted Cox regression analysis for 609 DEGs. Then Kaplan-Meier analysis was used to identify potential diagnostic biomarkers. Results showed that six genes including ANKFY1, CRYBA4, MIR204, QRFP, SNX15, YWHAH are correlated with survival of patients with PCa and could function as prognostic biomarkers (supplementary Fig. 2). Moreover, Kaplan-Meier analysis were performed separately for all 20 key genes identified by deep learning to discover potential prognostic biomarkers. Results showed that, dysregulated expression of 12 genes including ASB12, BLOC1S1, CRTAC1, KCNQ1, KISS1, M2T2A, RNF207, SCGB1D1, SLC13A2, SORBS1, TGFBR3, and WSCD2 genes are correlated with poor survival of PCa and identified as potential prognostic biomarkers (supplementary Fig. 4). Then, total of 18 identified potential biomarkers were searched for any previous report;` six of them including KCNQ1, M2T2A, SORBS1, MIR204, RNF207, and QRFP were reported in different studies as prognostic biomarkers. Therefore, other identified genes were considered as novel prognostic biomarkers. Furthermore, we assessed the function of novel diagnostic biomarkers and their role in other diseases which are represented in Table 2. These results show critical role of identified biomarkers in PCa progression and pathogenesis.
Table 1
List of 20 identified genes by deep learning and their coefficients.
Target category | Variable name | Coefficients of each variable | Performance of model |
Important genes in prostate cancer Identified by Deep learning | SCNN1B | 1 | MSE: 0.002 R^2: 0.96 AUC: 1 Accuracy:93.8% RMSE:0.054 PR-AUC:1 |
SLC13A2 | 0.996 |
SCGB1D1 | 0.985 |
CRTAC1 | 0.967 |
MOB1B | 0.963 |
TGFBR3 | 0.942 |
RNF207 | 0.941 |
KERA | 0.941 |
SORBS1 | 0.931 |
KCNQ1 | 0.930 |
RASSF4 | 0.760 |
C11orf24 | 0.759 |
WSCD2 | 0.751 |
MZT2A | 0.744 |
BLOC1S1 | 0.741 |
ASB12 | 0.739 |
REEP4 | 0.739 |
CKAP2 | 0.730 |
CEP170P1 | 0.695 |
KISS1 | 0.676 |
Table 2
The identified novel prognostic biomarkers and their functions.
Gene name | Gene function | Function in other diseases |
ASB12 | couple suppressor of cytokine signalling (SOCS) proteins/ mediates the ubiquitination /expressed abundantly in testes(36) | Laryngotracheomalacia(35) |
BLOC1S1 | play distinct roles in endosome and lysosome biology | Liver regeneration disorder(35) prognostic function in adult acute myeloid leukemia(AML)(37) |
CRTAC1 | May involved in cell-cell or cell-matrix interactions. up regulated in mesenchymal stem cells | Biomarker of osteoarthritis severity and progression(34) Neurofibromatosis(21) |
MZT2A | Enhances NSCLC survival and invasion by increasing Akt phosphorylation | MZT2A plays an important role in the progress of some tumours/ prognostic value in non–small-cell lung cancer (NSCLC)(38, 39) |
SCGB1D1 | ortholog of prostatein, probably is transcriptionally regulated by steroid hormones, ability to bind androgens, other steroids, may bind and concentrate estramustine, a chemotherapeutic drug used for prostate cancer. | Ovarian carcinoma (33) |
SLC13A2 | sodium-coupled citrate transporter | Upregulated in colorectal cancer(40) |
WSCD2 | Probably is an integral component of membrane | Breast cancer(41) |
ANKFY1 | essential for retinal endothelial cell proliferation and migration | Genetic Steroid-Resistant Nephrotic Syndrome and Human Monocytic Ehrlichiosis(42) |
CRYBA4 | structural components of the vertebrate eye lens | Prognostic gene in renal cell carcinoma(43) |
SNX15 | Involved in intracellular trafficking | Genetic marker in cholangiocarcinoma / Gastric cancer prognosis (44, 45) |
YWHAH | mediate signal transduction by binding to phosphoserine-containing proteins | Oncogene in several types of cancers/involved in cell growth, apoptosis, cell growth, migration, and invasion (46) |
ROC analyses were performed to unravel the potency of identified prognostic biomarkers and to assess their potential in diagnosis of PCa. Results demonstrated that the CRTAC1(AUC = 0.89, SE = 0.78, SP = 0.9), SCGB1D1 (AUC = 0.85, SE = 0.85, SP = 0.71), and TGFBR1 (AUC = 0.89, SE = 0.81, SP = 0.84) are potential diagnostic biomarkers. The ROC curves and related Kaplan-Meier of these genes are depicted in Fig. 4B and Fig. 4C. Also, the combination of CRTAC1 and MOB1B with coefficients of -1.29 and − 1.22 (AUC = 0.9, SE = 0.93, and SP = 0.75) have diagnostic capability. Combination of CRTAC1 and SCGB1D1 with coefficients of -1.22 and − 1.69, AUC = 0.91, SE = 0.94, and SP = 0.76 also, combination of C11orf24 and SORBS1 with coefficients of -0.34 and − 2.25, AUC = 0.9, SE = 0.93, and SP = 0.75 identified as potential diagnostic biomarkers. Moreover, combination of three genes including CRTAC1, SCGB1D1, and SCNN1B were determined as potential biomarker for diagnosis of PCa. The coefficients of genes were − 1.27, -2.03. 0.34 with AUC = 0.92, SE = 0.9, SP = 0.8. Also, combination of CRTAC1, MOB1B, SCGB1D1, and SCNN1B with coefficients of -1.24, -0.27, -1.89, and 0.42, AUC = 0.92, SE = 0.91, SP = 0.8 have diagnostic ability in PCa (Supplementary table 2).
Correlation analysis of clinical/demographic characteristics and genes with PCa
Correlation matrix analysis results showed no significant association between identified genes and clinical/demographic data. Also, correlation analysis of clinical data and PCa showed no meaningful association between them (Fig. 9).
Potential value of CRTAC1 and SCGB1D1 as prognostic biomarkers
ROC analysis data showed that among 18 identified prognostic biomarkers three genes including CRTAC1, TGFBR3, and SCGB1D1 are valuable prognostic biomarkers also potential diagnostic markers with highest AUC, sensitivity, and specificity (Fig. 4B). Surprisingly, they have a higher sensitivity and AUC than most of those reported for current diagnostic biomarkers such as(PSA: AUC = 0.58, SE = 0.09–0.33, PCA3: AUC = 0.73, SE = 0.69, PHI: AUC = 0.815, SE = 0.77)(16, 17). Especially, one of the controversions about these PCa biomarkers is lack of specificity as PSA has specificity of 0.17–0.19, PHI has specificity of 0.36–0.42 for diagnosis of PCa(18, 19). However, both of our novel potential biomarkers including CRTAC1 and SCGB1D1 have specificity of 0.9 and 0.71, respectively. Therefore, we selected these genes for further evaluation. We evaluated identified biomarkers in relevant DNA-seq data and visualized the variants in detail.
As indicated in Fig. 5A, CRTAC1 is mutated in 17% of patients and had 5 variants including two missense mutations, two silent mutations and one nonsense mutation. Also, we assessed TGFRB3 in DNA-seq data which showed one missense variant for this gene (Fig. 5B). Although, SCGB1D1 did not represent any variant in the current PCa DNA-seq dataset. Then, a literature review was performed to assess any previous report of identified biomarkers. TGFRB3 was reported by a group of researchers as a potential prognostic biomarker which validate our data(20). Also, we showed perfect sensitivity and specificity of TGFBR3 as a potential diagnostic biomarker. CRTAC1 and SCGB1D1 considered as two novel candidate biomarkers for more validation.
Regard to function of identified genes and related pathways, SCGB1D1 is a member of the lipophilin protein subfamily, also as an ortholog of prostatein, the main secretory glycoprotein of the rat ventral prostate gland. Probably, this gene is transcriptionally regulated by steroid hormones which bind to androgens and other steroids. SCGB1D1 may bind and concentrate estramustine, a chemotherapeutic drug commonly used for prostate cancer. Also, In GO ontology, protein binding molecular function (MF) was identified and extracellular matrix was showed for cellular component (CC). CRTAC1 gene encodes an extracellular matrix protein that is up regulated in mesenchymal stem cells used as a marker to distinguish chondrocytes from mesenchymal stem cells. Possibly, this protein may be involved in cell-cell or cell-matrix interactions.
In GO ontology, molecular function (MF) related to CRTAC1 is calcium ion binding and protein binding. Also, in biological process(BP) term, this gene is involved in axonal fasciculation, also CRTAC1 is a extracellular protein and located in extracellular exosome in cellular component(CC) term(21). These results show the potential value of these biomarkers in prostate cancer prognosis and diagnosis especially, as they are extracellular exosome proteins and may be present in body fluids.
Validation of CRTAC1 and SCGB1D1 in other PCa datasets
We further evaluated the expression of SCGB1D1 and CRTAC1 as novel biomarkers in other PCa datasets using PCaDB database(22). Downregulation of CRTAC1 in primary PCa was confirmed in other datasets including Cambridge, CIT, GSE59745, GSE32269, GSE8218, GSE29079, ERP00077, SRP119917, GSE30521, GSE133626, GSE80609, GSE22260, GSE114740, GSE32571. Bar plots of validated expression of CRTAC1 is shown in Fig. 6. Downregulation of SCGB1D1 in primary PCa was validated in other datasets such as GSE59745, GSE35988-GPL6848, GSE32269, GSE30521. However, In GSE2443 dataset it was shown that the expression of SCGB1D1 is significantly higher in androgen dependent primary PCa in comparison with androgen independent primary PCa which justify the upregulation of SCGB1D1 in some datasets such as GSE79021. Also, upregulation of SCGB1D1 was observed in metastatic PCa in GSE3325.