Data Collection
Transcriptomic data and associated clinical information for prostate cancer samples, downloaded from the TCGA database, encompass a total of 554 samples. This dataset includes samples from 502 patients diagnosed with prostate cancer and 52 samples from a healthy control group. This valuable resource facilitates the investigation of the molecular mechanisms underlying prostate cancer and contributes to the advancement of clinical treatment strategies. The dataset GSE137829 was downloaded from the GEO database, which includes single-cell transcriptome data from six prostate cancer samples.
Genetic analysis of male hormones
In this study, we have selected a set of genes related to male hormones from the Genecard database, including CGA, CYP17A1, HSD17B12, HSD17B3, HSD3B1, HSD3B2, LHB, POMC, SRD5A1, SRD5A2, and SRD5A3. The objective of our study is to conduct a cluster analysis utilizing the expression data of these specific genes as the fundamental basis for our investigation. Initially, we will use the normalizeBetweenArrays function from the limma package in R to normalize the raw data, and save the processed data as a file. This step ensures that our analysis results will not be affected by the scale of the data or the measurement units. Subsequently, we will obtain the expression levels of these genes and use the ConsensusClusterPlus package to perform cluster analysis. In this process, we will set the maximum number of clusters to 9, the number of repetitions to 50, and the sampling ratios for items and features to 0.8 and 1, respectively. We choose “PAM” as the clustering algorithm and “euclidean” as the distance metric. Finally, we will choose 2 as the final number of clusters.
Clinical correlation analysis
In the realm of clinical research on prostate cancer, Progression-Free Survival (PFS) is deemed a pivotal assessment indicator, quantifying the duration from the initiation of random patient assignment to the first instance of tumor progression or patient mortality. To precisely evaluate the disparities in PFS between the C1 and C2 cohorts, we utilized the Kaplan-Meier survival curve as our assessment instrument. Moreover, we stratified the target genes based on expression levels into high and low groups, and constructed corresponding survival curves. This was done to assess the potential influence of these genes on patient prognosis.
Tumor mutational burden analysis
In our study, we initially obtained the gene mutation information of prostate cancer patients from the TCGA database and utilized the maftools package for processing and analysis, visualizing genes with high mutation frequency. Subsequently, we created a waterfall plot for both Group C1 and Group C2, illustrating the mutation situation of the 20 most common genes in each sample. In addition, we also generated an interaction plot, demonstrating the mutation relationship among the 20 most frequently mutated genes.
Immunoinfiltration analysis
Initially, the prepared transcriptome data from groups C1 and C2 are processed through the estimate function to analyze the estimation of tumor purity. The scores are then visualized using a violin plot. Subsequently, the CIBERSORT function is executed to analyze the proportions of 22 characteristic genes of immune cells, thereby evaluating the proportion of various immune cells in each sample. Upon completion of the analysis, visualization tools can be employed to display the results of the proportions of different immune cell types in different samples through a box plot.
Drug susceptibility analysis
We utilize the pRRophetic R package to conduct a drug sensitivity analysis between two groups of samples, and construct a statistical model to predict the response of tumor biological specimens to chemotherapy drugs. Initially, functions within the pRRophetic package will be employed to estimate the half-maximal inhibitory concentration (IC50) of the drug. During this process, we set a filter condition of p < 0.001 to ensure the statistical significance of the results.
Gene set enrichment analysis
Transcriptome data files for groups C1 and C2 have been prepared and pathway enrichment analysis has been conducted using the GSEA4.3.2 software. During the analysis, the c2.cp.kegg_medicus.v2023.2.Hs.symbols gene set database was selected as a reference. To ensure the reliability of the results, we set the screening criteria: NES score greater than 1.5, and FDR q-val less than 0.05.
Analysis of differentially expressed genes (DEGs)
In the analysis of DEGs between the two groups of samples, the Limma software package was utilized to identify genes with significant differences. By setting a threshold of an absolute log fold change (|LogFC|) greater than 0.8 and a P-value less than 0.05, genes with statistically significant differences were able to be filtered out.
Weighted gene co-expression network analysis
In the process of conducting a WGCNA analysis between groups C1 and C2, the initial step involves the reading and organization of data, with a focus on the top 25% of genes exhibiting the greatest fluctuation for the analysis. This is followed by the clustering of samples and the generation of a sample clustering diagram. Any samples below the cut line in the sample tree are then removed. Upon the preparation of clinical data, a heatmap of the sample clustering is created. The optimal power value is selected and a scatter plot of power values is generated. The data is then transformed into an adjacency matrix and a TOM matrix is calculated. Gene clustering is performed and a gene clustering diagram is generated. Dynamic modules are identified and module clustering is conducted. A heatmap of module genes is created, and the correlation between the module and trait data is calculated, followed by the creation of a correlation heatmap. Finally, the correlation and its P-value between gene module membership and gene trait significance are calculated.
Go and KEGG enrichment analysis
We employ the clusterProfiler package to perform Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Initially, we conduct DEGs and WGCNA, and then use the intersection of the two as the gene list. Subsequently, we utilize functions within the clusterProfiler package to identify the enrichment of these genes in GO and KEGG pathways. Finally, we visualize the results using bar and bubble charts.
Protein-protein interaction analysis
The intersection genes obtained from the above analysis are input into the STRING database for Protein-Protein Interaction (PPI) analysis. When uploading the gene list, we select the “Multiple proteins” option for batch upload. Upon completion of the analysis, the results are exported and opened in Cytoscape software for further analysis. Within Cytoscape, we utilize built-in network centrality algorithms, such as MCC, Degree, and Closeness, to identify and select the top 10 hub genes in the network.
Machine learning
In the process of key gene selection, we initially prepared the expression data of 21 genes obtained from PPI analysis and the PFS clinical data of prostate cancer patients. Subsequently, we employed three machine learning methods: Random Forest (RF), Support Vector Machine-Recursive Feature Elimination (SVM-RFE), and Least Absolute Shrinkage and Selection Operator (Lasso). In the RF analysis, we trained the model and obtained the importance score for each gene, selecting the gene with the highest score. In the SVM-RFE analysis, we selected the feature number with the highest accuracy and the smallest error. In the Lasso analysis, we trained the model and obtained the coefficient for each gene, selecting the gene with a non-zero coefficient as the key gene. Finally, we compared the key genes obtained from the RF, SVM-RFE, and Lasso analyses, selecting the genes that were chosen in all methods as the final key genes.
Immune checkpoint correlation analysis
Following the normalization of the transcriptomic data from groups C1 and C2, we utilized the t-test statistical method to conduct a differential expression analysis, aiming to investigate the significant disparities in immune checkpoint genes between the two groups. These disparities were intuitively visualized using a box plot. Furthermore, we examined the Pearson correlation between the differentially expressed genes identified by machine learning and the immune checkpoint genes. The outcomes were presented via a correlation heatmap, unveiling potential associations between them.
The Human Protein Atlas analysis
We utilized the immunohistochemical staining image data from the HPA database to evaluate the distribution differences of proteins in normal and cancerous tissues. Specifically, we conducted immunohistochemical analysis on the expression levels of BIRC5, CENPA, and MMP11 proteins in normal prostate tissue, low-grade prostate cancer, and high-grade prostate cancer tissues.
Single-cell analysis
In the processing of the GSE137829 dataset, quality control is initially performed on the single-cell transcriptome data from the six included prostate cancer samples, adhering to the criteria of nFeature_RNA > 500, percent.mt < 20%, and nCount_RNA > 1000. Subsequently, the Harmony package is utilized for data integration and dimensionality reduction to eliminate batch effects. Cluster analysis is conducted to identify distinct cell populations, and these populations are annotated via marker genes to ascertain their biological characteristics. Visualization of the enriched cell populations is carried out for specific genes, namely BIRC5, CENPA, and MMP11. Thereafter, these cell populations undergo Gene Set Enrichment Analysis (GSEA) to understand the potential roles of these genes in prostate cancer.
Virtual screening
Initially, we identified the RIRC5 protein’s crystal structure, designated as 4a0i, from the Protein Data Bank (PDB) to serve as our research target. Following this, we constructed a simulation box centered at the spatial location of the ligand. Post protonation of the receptor, we executed a molecular docking study using Autodock Vina software, involving it and 1662 natural products that had been subjected to energy minimization. Ultimately, we filtered out potential therapeutic drugs based on the Binding Affinity values of each molecule.