Molecular subtyping and prognostic assessment of prostate cancer based on consensus genes

Prostate cancer (PCa) is one of the most common malignancies in men all over the world. We performed molecular subtyping and prognostic assessment based on consensus genes in patients with PCa. Five cohorts containing 1,046 PCa patients with RNA expression proles and recorded clinical follow-up information were included. Univariate, multivariate Cox regression analysis and the Least Absolute Shrinkage and Selection Operator (LASSO) Cox regression was used to select prognostic genes and establish the signature. Immunohistochemistry staining, cell proliferation, migration and invasion assay were used to assess the biological function of key genes. 39 intersecting consensus prognostic genes from ve independent cohorts were identied (P < 0.05). Subsequently, an eleven-consensus-gene-based classier was established. Besides, multivariate Cox regression analyses proved that the classier served as an independent indicator of recurrence-free survival in three of ve cohorts. Combined receiver operating character (ROC) achieved synthesized effects by combining the classier with clinicopathological features in four of ve cohorts. SRD5A2 inhibits the cell proliferation, while ITGA11 promote cell migration and invasion, of which might through PI3K/AKT signaling pathway. To conclude, we establish and validate an eleven-consensus-gene-based classier, which adds prognostic value to the currently available staging system.

may be useful for disease classi cation independent of the available prognostic factors, severing as the clinical implementation. Although the gene expression classi er could be constructed and validated in the publicly released dataset, or even a single-center study, there remains a gap invalidating the identi ed classi er in large cohorts.
We aimed to assess the usage of consensus recurrence-free survival (RFS) associated genes derived from ve independent cohorts to generate the molecular subtyping with different clinical outcome for PCa patients.

| Data preparation and processing
We searched the Gene Expression Omnibus (GEO: http://www.ncbi.nlm.nih.gov/geo/) to enroll the eligible datasets which meet the following criteria: 1) PCa cases with available expression data; and 2) with available clinicopathological features, particularly for RFS status and time. Then, the gene expression pro les were generated from four eligible GEO datasets [GSE116918, GSE70769, GSE70768, and GSE21032/Memorial Sloan Kettering Cancer Center (MSKCC)], as well as the gene expression pro le from The Cancer Genome Atlas (TCGA) database (https://cancergenome.nih.gov). The matched clinicopathological data were also downloaded along with the expression pro les. Patients who lack the data of the pathology T stage were excluded.

| Univariate Cox regression analysis and consensus gene selection
Univariate Cox regression analysis was conducted to screen out the consentaneous RFS-related genes observed from the ve datasets. Subsequently, we intersected these RFS-related candidates that appeared in all ve datasets, with a cut-off P-value of less than 0.05. The following analyses were performed based on these selected genes.

| Classi er establishment and validation
According to the results provided by univariate Cox regression analysis, we employed LASSO Cox regression to select more stable candidates. The classi er was established referring to the expression and co-e cient of each candidate based on the MSKCC cohort. We calculated the risk score for each patient along with the below formula: Median value of the risk score set as the cut-off value in each cohort, and for these patients with lower risk scores than the median value were assigned into the low-risk subgroup, while others belonged to the high-risk subgroup. The risk score of patients in the other four external validation cohorts was also calculated by this risk formula and then dichotomized these patients into two different risk subgroups by the median risk score in each cohort.

| Survival and receiver operating characteristic (ROC) analyses
Survival analyses were executed using the "survminer" package, with the RFS as the endpoint. Furthermore, the area under the ROC curve (AUC) was employed to assess the predictive value of the formula. Furthermore, subgroup analyses were executed to test the accuracy of the classi er in different clinicopathological subgroups, such as different Gleason score ( < = 7 vs. > 7), pathological tumor stage (T1 + T2 vs. T3 + T4), and age ( < = 60 vs. > 60).

| Assay of cell proliferation, migration, and invasion
To evaluate the impact of SRD5A2 and ITGA11 on prostate cancer cells, we employed MTT assay and colony formation assay to assess the alteration of cell proliferation, while Transwell-based invasion and migration assay was used to evaluate the ability of cell migration and invasion.
For MTT assay, 5000 cells seeded in per well of 24-well plates, and then collected the results by add 50 μL prepped 5 mg/mL MTT reagent to 450 μL refreshed medium (with a concentration of 0.5 mg/mL) and incubated at 37°C for 1.5 hours. On the sixth day, add the DMSO solution to all plates (0, 2 days, 4 days, and 6 days) to dissolve the formazan crystals and then read the optical density value at 570 nm to display the cell viability. For colony formation, 800 cells were seeded per well and grew for 12 days. The cells in plates will be xed by 4% paraformaldehyde for 20 minutes and then the 0.05% crystal violet was used to stain these xed cells for another 20 minutes.
For migration assay, Transwell Permeable Supports (Corning Inc., Maine, USA) was used. A total of 1 × 10 5 cells was seeded in the upper chamber for each well, and 500 μL fresh medium with 10% FBS added into the lower chamber, and then incubated at 37°C and 5% CO 2 incubator for 24 hours. The cells migrated to the bottom of the membranes were permeabilized by methanol and stained with 0.01% Crystal Violet. The steps of invasion assay are similar to migration assay, which also used Permeable Supports, but with extra Matrigel (Biocoat, Corning, New York, USA) diluted and coated at the upper chambers and incubated for 36 hours. The cell numbers were calculated by counting three random elds.

| Functional prediction
Increasing evidence indicated that highly co-expressed genes potentially have similar biological functions 16-18 , and we used Pearson correlation analysis to nd out the highly co-expressed genes of ITGA11 on the ve cohorts, respectively. After overlapping these co-expressed genes, we performed the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis to subclassify their functions based on the "clusterPro ler" R package 19 . Besides, we also used Cytoscape (v3.5.1, San Diego, La Jolla, California, USA) to visualize the functional network. Two external gene sets, HALLMARK PI3K/AKT SIGNALING and REACTOME PI3K/AKT ACTIVATION, were employed to assess the association between ITGA11 expression and the activation PI3K signaling pathway. The single sample gene set enrichment analysis (ssGSEA) 20,21 , implemented in the GSVA R package, was applied to calculate the normalized enrichment score (NES) of the above 2 gene sets. Next day, after incubating with HRP-conjugated secondary antibodies for one hour, the membranes were visualized using the ECL system (Pierce; Thermo Fisher Scienti c, Inc., USA).

| Statistics
All the statis Comparisons of continuous data between two subtypes were performed by Student's t-test and Mann-Whitney U test for normal and non-normal distribution data. Correlations between staining intensity subgroups and clinicopathological subgroups were evaluated by Fisher's exact test. Spearman's correlation analysis was utilized to explore the correlation between continuous variables. For all statistical analyses, a two-tailed P value less than 0.05 was considered statistically signi cant.

| Construction of the eleven-consensus-gene-based classi er
In the current study, we enrolled a total of 1,046 PCa patients from the MSKCC (n = 140), TCGA-PRAD (n = 489), GSE116918 (n = 223), GSE70769 (n = 85), GSE70768 (n = 109) datasets. The clinicopathological features of all the PCa patients were listed in Table 1. We overlapped the signi cant genes (P < 0.05) generated by univariate Cox regression analyses in ve independent cohorts, and found that there were 39 consensus candidates, and signi cantly associated with the RFS of PCa patients in all the ve datasets ( Fig. 1A, and Table S1). The expression landscape of these 39 genes in the MSKCC cohort showed in Fig  The median risk score was set as the cut-off value in each cohort, and these patients with lower risk scores were assigned into the low-risk subgroup, while others were classi ed into the high-risk subgroup (Table S2). Further, we correlated the expression of the eleven genes with clinicopathological features, and the results indicated that UBE2J1, SRD5A2, OGN, MYBPC1, KIF13B, and DPP4 were negatively correlated with Gleason score, PSA level, and pathological tumor stage, while opposite results were obtained for STMN1, NOX4, ITGA11, COL1A1, and CDKN3 genes (Fig. 1E).

| Analysis results based on consensus genes and prognostic assessment in ve cohorts
Referring to the subgroups, in the training MSKCC cohort, 70 patients were divided into the high-risk group, and other 70 patients belonged to the low-risk group ( Fig. 2A), and patients in the high-risk subgroup showed an unfavorable prognosis (log-rank P < 0.001, Fig. 2F), with the AUC values of 0.908 in 1-year, 0.898 in 3-year, and 0.857 in 5-year (Fig. 2K). The predicting classi er also applied well in four external datasets. The K-M curves showed the similar RFS outcomes in GSE116918 (log-rank, P < 0.001), GSE70768 (log-rank, P = 0.049), GSE70769 (log-rank, P < 0.001) and TCGA-PRAD cohorts (log-rank, P < 0.001) (Fig. 2B- To further investigate the clinical application value of the eleven-consensus-gene-based classi er in different clinicopathological subgroups, K-M analyses were performed. The signature precisely subclassi ed the high-and low-risk groups of PCa patients into different subgroups with adequate samples but failed in some conditions, potentially attributing to the small sample size (Fig. 3, Figure S1). For example, in the MSKCC cohort, which comprised 140 PCa cases, and the results suggested that the classi er signi cantly discriminated the high-and low-risk subgroups in different age ( > = 60 vs. < 60 years old), PSA level (> 10 vs. <= 10 ng/dl), pathological tumor stage (T3 + T4 vs. T1 + T2), and Gleason score (> 7 vs. <= 7) subgroups (All, log-rank P < 0.05). For the GSE70769 cohort, which comprised 85 PCa cases, we also revealed the classi er signi cantly discriminated the high-and low-risk subgroups of the different PSA (> 10 vs. <= 10), tumor stage (T3 + T4 vs. T1 + T2) subgroups (log-rank P < 0.05), Gleason score (> 7 vs. <= 7) subgroups, and surgical margin (negative vs. positive) (All, log-rank P < 0.05). Further well-designed studies are warranted to verify our ndings.

| Analysis results of multivariate Cox regression, and Combined ROC
To determine the independence of the eleven-consensus-gene-based classi er in each cohort, we performed multivariate Cox regression analyses. Our results showed that for the cohorts, whose recurrence rate > 15%, the classi er was serving as an independent indicator for RFS (  Fig. 4A-C, and Figure S3A-B).
In addition, we performed Combined ROC analyses by combining the classi er with critical clinicopathological features in each cohort. We found that the Combined obtained an increased predictive

| IHC validation the protein of SRD5A2 and ITGA11
To address and con rm the associations of SRD5A2 and ITGA11 protein levels with clinicopathological features, we used IHC assay on prostate cancer tissue array, which contains tumor tissues from 42 patients. The standard de nition of the SRD5A2 protein level was mentioned in the Methods section. Then, we calculated the staining density of each tissue. Tissues with score equal to or higher than 3 were regarded as positive, while those with score less than 3 were negative.
We observed the decreased protein expression of SRD5A2 in advanced tumors stage (Gleason < = 7 vs. Gleason > 7, P = 0.031, Stage I + II vs. Stage III + IV, P = 0.148, Fig. 5A). Moreover, with the help of Fisher's extract test, we investigated the different distribution of SRD5A2 expression (strong positive, weak positive, or negative) in different clinicopathological subgroups. We revealed the negative association of SRD5A2 and tumor progression, re ecting by Gleason score (P = 0.013), and pathological tumor stage (P = 0.047) ( Table 2). As to ITGA11, we observed an elevated protein expression in advanced stage than early-stage (Gleason < = 7 vs. Gleason > 7, P = 0.067, Stage I + II vs. Stage III + IV, P = 0.014, Fig. 5B).
Fisher's extract test of the categorical variables illustrated similar results. These patients whose Gleason score higher than 7 or were in tumor stage III and IV showed strong positive staining of ITGA11 (Gleason score, P = 0.049, Pathological tumor stage, P = 0.022, Table 3). All these results indicated that SRD5A2 protein is negatively associated with the progression of PCa, while ITGA11 protein is positively associated with the advanced stage.  III-IV 6 (100.00%) 0 (0.00%) *, P < 0.05

| Knockdown SRD5A2 and ITGA11 impact prostate cancer cell behaviors
After knocking down the expression of SRD5A2 ( Figure S3), we found the cell proliferation was signi cantly increased determined by MTT assay and colony formation assays in C4-2 and PC-3 cells (all P < 0.05, Fig. 6A-B). Since the functional role of SRD5A2 in regulating PCa cell migration and invasion has been investigated by Suruchi Aggarwal et al. 22 , we only focused on the proliferation effects here. On the contrary, we found that silencing ITGA11 expression decreased the cell migration and invasion in C4-2 and PC-3 cells, instead of cell proliferation (all P < 0.05, Fig. 6C-D, and Fig. 7D). These results con rm the tumor-suppressive role and oncogenic role of SRD5A2 and ITGA11, respectively.
3.6 | Exploring the underlying mechanisms of how ITGA11 regulating PCa progression Many studies have already illustrated the mechanisms of how the selected 11 genes in uence the progression of PCa, while few studies have focused on the role of ITGA11. We conducted Pearson correlation analyses to nd out the highly co-expressed genes in each cohort and overlapped these genes derived from all the ve cohorts. Then, KEGG analysis was used to enrich the signi cant signaling pathways. We found that ITGA11 might be involved in the activity regulation of calcium signaling, Rap1 signaling, Ras signaling, and PI3K/Akt signaling pathways (Fig. 7A). Moreover, we collected two gene sets which could re ect the activation status of PI3K/AKT signaling and calculated the NES score of each patient by ssGSEA. We observed that the expression of ITGA11 is positively correlated with the NES score of HALLMARK PI3K/AKT/mTOR signaling gene set (P < 0.05, r = 0.43, Fig. 7B), and REACTOME PI3K/AKT activation gene set (P < 0.05, r = 0.34, Fig. 7C). Besides, we validated the impaction of ITGA11 expression on the activation of PI3K/AKT signaling in vitro. After knocking down the expression of ITGA11, we found that the Ser473 phosphorylated-AKT1/2/3 was signi cantly decreased instead of total AKT1/2/3, indicating that ITGA11 promotes the migration and invasion of PCa potentially by activating PI3K/AKT signaling (Fig. 7D).

| Discussion
The interpatient heterogeneity in PCa is well recognized [23][24][25] . However, the molecular strati cation of PCa based on predicting biomarkers to guide treatment selection has not yet been applying in the clinic.
In our study, we analyzed ve datasets derived from GEO (four) and TCGA databases. We rstly employed univariate Cox regression analyses and identi ed 39 candidate genes that are closely related to the RFS of PCa patients in all the ve datasets. The RFS predicting classi er was established by the LASSO Cox regression analysis based MSKCC dataset. The classi er showed satisfying molecular subtyping accuracy determined by the log-rank, K-M, and ROC analyses in both the training and four external validation cohorts. Further, the multivariate analyses suggested that the classi er served as an independent indicator of RFS in a set of cohorts. Notably, the Combined ROC, which synthesized the classi er with clinicopathological features, added prognostic values to the currently available staging system. As studies indicated that highly co-expressed genes potentially have similar biological functions 16-18 , therefore, we conducted Pearson correlation analyses to nd out the highly co-expressed genes in each cohort and overlapped these genes derived from all the ve cohorts, then employed the KEGG pathway analyses to reveal the underlying mechanisms of these critical candidates in uencing tumor progression. For the eleven candidates 22,26−38 , few studies focused on the role of ITGA11 on PCa progression. Thus, we predicted the biological function of ITGA11 through a bioinformatic way indicated above, and the results indicated that ITGA11 might be involved in the activity regulation of calcium signaling, Rap1 signaling, Ras signaling, and PI3K-Akt signaling pathways. Consistently, studies also reported that SRD5A2 regulates cell migration and invasion by indirectly regulating ERK/MAPK pathway 22 . We further investigated the functional role of these two candidates in regulating cancer cell fates, as well as the protein expression in clinical samples. Our results con rm the tumor-suppressive role of SRD5A2 and the oncogenic role of ITGA11 in PCa. Nextly, we validated the pathway enrichment results based on co-expressed genes by western blot assay. Our results suggested that silencing ITGA11 suppresses the activity of AKT signaling, indicating that ITGA11 promotes PCa cell progression potentially through activating AKT signaling. Overall, we successfully establish a solid prognosis predicting system.
In recent days, several prognostic classi ers were developed to predict the outcome of PCa patients, based on the clinical features, gene genetics, or epigenetics. Zhao et al. 39 reported that the PD-L2 is a prognostic biomarker for PCa based on patients, and they also reported that the in ltration of T cells and Macrophages are increased in the poor outcome group, which is also consistent with our work that M2 macrophages link with the unfavorable prognosis, while the immunocytes and clinical feature-based Combined could distinguish the different ending of recurrence (AUC = 0.91) 40 . Bhargava et al. 41 illustrated an African-American speci cally automated stromal classi er, which has the potential to substantially improve the accuracy of prognosis and risk strati cation. Yang et al. 42 established a 28hypoxia-related-gene prognostic classi er for localized PCa, which could predict the biochemical recurrence and metastasis events. Gleason score 4 + 3 is associated with a 3-fold increased risk of incidence of metastasis at diagnosis than Gleason score 3 + 4 though the overall incidence is low 43,44 . In our study, we investigated the value of this classi er in distinguishing patients with Gleason score 3 + 4 and 4 + 3 subgroups in MSKCC, TCGA-PRAD, GSE70768, and GSE70769 datasets. We observed that patients with Gleason score 4 + 3 have a higher risk score as compare with Gleason score 3 + 4 patients.
Besides, the risk score distinguished the 3 + 4/4 + 3 subgroups with good accuracy, a result consistent with Fisher's Extract test. We found that more patients with the high-risk belonged to the Gleason score 4 + 3 subgroup. Another limitation of these studies was that their ndings were not validated in two more independent cohorts, and the potential mechanisms of how these markers in uence tumor progression were not predicted or investigated. Herein, we established an eleven-consensus-gene-based classi er and validated its usage in ve independent cohorts. We also predicted the potential mechanisms through bioinformatic ways. Thus, our ndings are stable and convincing.
The advantages of the current study are summarized and presented as follows. Firstly, we identi ed a 39consensus-gene-cluster from ve independent cohorts, and with the help of the LASSO Cox regression analysis, we chose the most suitable eleven candidates to establish the RFS predicting classi er. The classi er showed satisfying molecular prognostic subtyping accuracy determined by the log-rank, K-M, and ROC analyses in all ve cohorts. Second, we proved that the classi er serves as an independent indicator of RFS in a set of cohorts and adds prognostic value to the currently available staging system.
Thirdly, we predicted the potential mechanisms of how these critical candidates in uence the progression of PCa, which would bene t the development of targeted drugs. However, lack of survival analysis in our samples and lack of systematic functional studies to prove the function and mechanisms of these consensus genes are the major limitations of the current study.
Our study successfully subtypes PCa patients with different prognostic outcomes based on an elevenconsensus-gene-based classi er containing 1,046 cases. The classi er adds prognostic value to the currently available staging system. These ndings propel the development of personalized medicine for PCa patients shortly.    Knockdown of SRD5A2 and ITGA11 altering the cell proliferation, migration, and invasion. Knockdown SRD5A2 promoted cell proliferation in both PC-3 and C4-2 cell lines, assessed by MTT assay (a) and colony formation assay (b). Knockdown ITGA11 inhibited the cell migration in PC-3 and C4-2 cell lines (c), as well as the cell invasion (d). *, P < 0.05, **, P < 0.01, ***, P < 0.001.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. TableS1.doc TableS2.xlsx