The quality control and preprocessing of single-cell RNA-seq
The general workflow of this study is displayed in Fig. 1. The “nFeature_RNA”, “nCount_RNA”, “percent.mt” of the single cell RNA-seq data were displayed in Fig. S1A. There is no significant correlation between percent.mt and nCount_RNA while the correlation between nFeature_RNA and nCount_RNA is positive (Fig. S1B). The most variable genes between tumour cells were calculated and displayed in a volcano plot (Fig. S1C). The top 1500 variable genes were displayed with the red dots and the top 10 significant DEGs was labeled (Fig. S1C). After PCA analysis, dimension of all tumour cells were reduced and displayed in Fig. S1D. 11 different colors represent tumours from different patients. The top 30 characteristic genes of the first four PCs were visualized in the Fig. S1E. We calculated p-values of every PC (dims: 1-15, Fig. S1F) and selected 1-10 PCs in the downstream analyses.
Cell clustering analysis
We performed tSNE to further identified the marker genes of each cell cluster which provided nice visualization for distinguishing different cell clusters (Fig. 2A). Subsequently we annotated cell types by exploiting the cell markers and labeled them in Fig. 2B. Through Pseudotime trajectory analysis, we displayed the developmental process between the cell subpopulations of single cells and showed how clusters were defined by expression of genes at the beginning of cell fate progressions. All cells were ordered along pseudo-time to establish a common pseudotime axis (Fig. 2C, D). We also annotated cell clusters and types by exploiting the cell markers and labeled them in Fig. 2E, F. Then we compare different trajectories and displayed all significant DEGs between in STable 1.
GO and KEGG mainly indicated immune and inflammatory response
GO and KEGG pathway enrichment analyses of DEGs indicated that the genes were mainly associated with the immune- and inflammation-related activities such as antigen processing and presentation, and virus infection (Fig. 3A-F).
Identification of 157 DEGs between normal and tumour tissues
We compared the 339 DEGs among trajectories in 111 normal and 1053 tumour tissues from the TCGA database and further identified 157 DEGs. They were displayed in a heatmap (Fig. S2).
Identification of a 20-gene signature in the TCGA cohort
After univariate Cox regression analysis, 29 genes that met the criteria of p<0.05 were retained for further analysis. Among them, 10 genes were associated with increased risk with HRs >1, while the other 19 genes were protective genes with HRs <1 (Fig. 4). By performing Cox regression analysis, a 20-gene signature was constructed. The risk score was calculated using the data in Table 1.
Table 1
The genes involved in the signature and their coefficients.
No.
|
Gene name
|
Coef
|
No.
|
Gene name
|
Coef
|
1
|
LIMCH1
|
0.231177284382533
|
11
|
PSMB8
|
-0.666443350442645
|
2
|
ABRACL
|
0.306190466953879
|
12
|
CCR7
|
0.177570702490536
|
3
|
MDK
|
-0.160390820289339
|
13
|
BCL2A1
|
-0.362345417414752
|
4
|
SDC1
|
0.198276495436628
|
14
|
GMFG
|
0.378359128956913
|
5
|
CLEC3A
|
0.0785617360450329
|
15
|
HSPB8
|
0.119812852617582
|
6
|
PMAIP1
|
-0.118183133869402
|
16
|
HSPH1
|
0.26174024222534
|
7
|
NUDT19
|
0.251669376270532
|
17
|
MORF4L2
|
0.339675654431019
|
8
|
RGS2
|
-0.136596848127705
|
18
|
TFF1
|
-0.100258283031602
|
9
|
PSMB9
|
0.621798003179615
|
19
|
RAC2
|
-0.30540190431927
|
10
|
IGHG1
|
-0.0747069021989824
|
20
|
CXCL13
|
-0.121340461203996
|
Validation of the risk signature
Patients from the TCGA dataset were stratified into low- and high-risk groups based on the median. A notable difference in OS was detected between the low- and high-risk groups (Fig. 5A). Time-dependent receiver operating characteristic (ROC) analysis was applied to evaluate the sensitivity and specificity of the prognostic model, and the area under the ROC curve (AUC) was 0.769 for 1-year survival, 0.795 for 3-year survival, and 0.756 for 5-year survival (Fig. 5C).
External validation of the risk signature
Based on the median risk score from the TCGA cohort, 620 patients from the GEO dataset were divided into low- and high-risk groups. A notable difference in OS time was detected between the low- and high-risk groups (Fig. 5B). The area under the ROC curve (AUC) was 0.756 for 1-year survival, 0.719 for 3-year survival, and 0.660 for 5-year survival (Fig. 5D).
The risk model was an independent prognostic factor
Univariate Cox regression analysis indicated that the risk score was an independent factor capable of predicting poor survival for the TCGA cohort(HR=1.387, 95% CI: 1.301–1.478, Fig. 5E). The multivariate analysis also revealed that, after adjusting for other confounding factors, the risk score was a prognostic factor (HR=1.340, 95% CI: 1.245–1.442, Fig. 5F) for patients with BRCA in TCGA cohort.
The relationships between genes in our signature and particular cell types
We further investigated the relationships between the expression of the genes in our signature and particular clusters (Fig. 6A) and cell subpopulations (Fig. 6B). Most of the genes are highly expressed in cluster 0, 2, 7 and Epithelial_cells, T_cells, DC.
Identification of prognosis-related genes in the signature
Eight genes were related to prognosis independently in the signature and we plotted Kaplan–Meier survival curves of the eight genes (Fig. S3). Seven genes are positively related to the prognosis while only gene(SDC1) is negatively.
The mutational status of high-risk group is generally lower
We identified and displayed the top 20 genes most frequently mutate in the TCGA cohort. The mutational status of high-risk group(Fig. 7A) is generally lower than low-risk group(Fig. 7B).
GSEA of the riskscore
The significant pathways associated with high riskscore are mainly enriched in metabolism activities such as ascorbate_and_aldarate_metabolism, drug_metabolism_cytochrome_p450 (the top 5 significant pathways are shown in Fig. 8A. The significant pathways associated with low riskscore are mainly enriched in immune passways such as allograft_rejection, primary_immunodeficiency (the top 5 significant pathways are shown in Fig. 8B.
Correlations between riskscore and immune checkpoints, immune cells, immune-related activities
The correlations between riskscore and immune checkpoints were displayed in Fig. 8C. The correlations between riskscore and immune activities were displayed in Fig. 8D. The red circle indicate positive correlations and the blue circle indicate negative correlations (*p<0.05).
The TIDE score is significantly different between groups
The immune escape score (TIDE) is higher in high-riskscore group compared with low-riskscore group (***p<0.001, Fig. 8E).