ScRNA-seq profiling demonstrated gene expression patterns during disease development of MM
From dataset GSE118900, a total of 23398 transcripts in 597 individual cells isolated from 15 MM patients at different stages were downloaded and pre-processed for scRNA-sequencing analysis. The average expression value of 54 duplicate genes was calculated. After filtering the total number of genes that are expressed in a single cell and the percentage of mitochondrial reads (Fig. 1a-c), 597 cells with 16568 expressed genes were identified. An unbiased PCA of the 597 cells was performed using highly variable genes to examine the global transcriptome patterns in the scRNA-seq data. The 16 statistically significant principal components in the PCA were reduced to two dimensions using t-SNE (Fig. 1d). The 597 cells were divided into 7 distinct clusters, each consisting of cells from MM patients at different stages (Fig. 1e). The expression patterns of markers in individual cells of each cluster were presented in Fig. 1f. A summary of clinical information, fluorescence in situ hybridization (FISH) results and clusters were shown in Table 1.
As suggested in Table 1, a majority of the MM cells from the 3 MGUS patients were clustered into cluster 1 (92%), but there were fewer cells in clusters 0, 2, or 5 (2%, 2% and 4%, respectively). most of cells from patients (RRMM1, RRMM2 and SMM0) with t(4;14) translocation were clustered into clusters 2, 4, and 5 (99%, 100% and 88%, respectively). All the cells from patients (NDMM7 and RRMM4) with t(11;14) translocation were clustered into clusters 1 and 6, respectively. All cells from the samples (SMM2, SMM3, SMM4, NDMM3, NDMM5, NDMM6, and NDMM8) were mainly distributed in clusters 0, 1, and 3.
The development of human MM cells
To determine the relationship between these cell clusters and states, the differentiation trajectory and pseudo-time analysis was investigated using the Monocle2 R package based on the identified marker genes from each cluster (Fig. 2a-c). The MM cells could be divided into early, middle, and late stages. Based on the ordering of pseudotime, the MM cells appear to start principally from clusters 1 and 6 (state 1, most of the cells are from MGUS, NDMM or RRMM patients with cytogenetically favorable t(11;14) translocation), and moved towards clusters 0 and 3 (state 2, 3, 5, 6 and 7, all the cells from SMM or NDMM patients were without t(11;14) or t(4;14) translocation) and finally to clusters 5, 2, and 4 (state 4, most of the cells are from SMM or RRMM patients with cytogeneticly high risk t(4;14) translocation) (Fig. 2a-c). These results suggested that the MM cells were ordered in pseudotime that was consistent with the actual developmental stages. The t-SNE coloured by the states showed that cells at late stage (state 4) and early/middle stage (other states) can be distinguished clearly (Fig. 2d). Venn diagram indicated that there were 294 (34.7%) different marker genes between late stage and early/middle stage (Fig. 2e). To better understand the key genes that drive the ordinal construction of the manifold, representative genes of MM cells across the clusters were examined. As shown in Fig. 2f-g, the expression levels of CD38 were significantly increased and that of CD19 were markedly decreased, which in turn were consistent with the immunephenotype of malignant plasma cells. The common molecular cytogenetic abnormalities detected by FISH, including CKS1B, CDKN2C, TP53 and IgH partner genes (CCND1, FGFR3, MAF and MAFB), are related with the prognosis of patients.The expression levels of these genes were also detected. It is noteworthy that the MM cells during development showed a significantly increased expression levels of CCND1 in the early stage cells (cluster 1 and 6), but strikingly decreased expression levels in the late stage cells (cluster 2, 4, and 5). Furthermore, the expression level of FGFR3 was decreased in both early and middle stage cells (cluster 0, 1, 3 and 6), but was slightly or markedly increased in the late stage cells (cluster 2, 4 and 5).
Functional enrichment analysis of markers of MM cells in the late stage
To understand the functional insights into the markers of clusters 2, 4, and 5 in the late stage as identified by pseudotemporal analysis, an enrichment analysis using the clusterProfiler in R was performed. The top 10 GO biology processes (BP) and the top 9 KEGG terms are shown in Fig. 3. The results showed that the markers of clusters 2, 4, and 5 demonstrated significant enrichment in protein localization to endoplasmic reticulum, protein targeting, and RNA catabolic process. The significantly enriched KEGG pathways of the markers belonging to clusters 2, 4, and 5 showed protein processing in endoplasmic reticulum, ribosomes, lysosomes, and phagosomes. Taken together, these markers were mainly involved in protein processing, which might be associated with the characteristics of excessive accumulation of abnormal proteins in myeloma cells.
Identification and validation of gene signature associated with MM progression
GSE24080 was used as the training set to explore the prognostic value of 463 marker genes from MM cells in the late stage (clusters 2, 4, and 5) using univariate Cox regression analysis. All 90 genes that showed significant association with OS in MM patients were screened for LASSO regression. Eventually, the score formula comprised of 20 optimal genes that was developed by LASSO: risk score = 0.302*(expression level of B4GALT3) + 0.085*(expression level of EDEM3) + 0.083*(expression level of MTX1) + 0.061*(expression level of STK17B) + 0.047*(expression level of GGH) + 0.046*(expression level of YBX1) + 0.04*(expression level of ITM2A) + 0.035*(expression level of COPA) + 0.031*(expression level of LGALS1) − 0.002*(expression level of DDX3Y) -0.01*(expression level of ITM2C) − 0.034*(expression level of MAP3K14) − 0.043*(expression level of TAPBPL) − 0.073*(expression level of JUNB) − 0.075*(expression level of CSGALNACT1) − 0.083*(expression level of PLEK) − 0.094*(expression level of NUCB2) − 0.107*(expression level of PECAM1) − 0.11*(expression level of ISCU) − 0.17*(expression level of PPCDC). The risk score of each sample in the training set and the optimal cut-off point of risk score (-0.34) as determined by ROC analysis were calculated and was defined as the threshold. A total of 559 patients were divided into high-risk group (n = 94) and low-risk group (n = 465). Kaplan-Meier analysis demonstrated a significant difference in the survival rates between high-risk and low-risk groups (HR = 3.759 with 95% CI 2.746–5.145, Log-rank test P < 0.001, Fig. 4a). The median OS of high-risk group was 34.0 months, while that of low-risk group was not reached. Furthermore, the AUC obtained from the time-dependent ROC analysis for predicting OS was 0.751 at 3 years, demonstrating the performance of this signature predicting survival of MM patients better than those of the previously published signatures (EMC92, UAMS70, UAMS17, and ISS stages)(Fig. 4b). It was worth noting that AUC of our 20-gene signature based on microarray dataset was also higher than that of RNAseq_signature recently developed from RNA sequencing dataset MMRF-CoMMpass (AUC = 0.653)(Fig. 4b). The distribution of risk score, survival status and the heatmap of gene expression in patients from GSE24080 dataset were shown in Fig. 4c-e.
The power of the 20-gene prognostic model in predicting OS in patients with MM was validated as an independent dataset GSE9782. As shown in Fig. 5, patients with relapsed MM in GSE9782 dataset were divided into high-risk group (n = 94) and low-risk group (n = 170) by using the 20-gene signature as the same risk score formula and threshold, which was similar in the training set. Patients with high-risk exhibited poorer OS than those with low-risk (HR = 2.612 with 95% CI 1.894–3.603, P < 0.0001, Fig. 5a). The median OS of high-risk group was 11.3 months, while that of low-risk group was 22.8 months. As the platform GPL96 did not have all the probes of EMC92, UAMS70 or UAMS17, the performance of our 20-gene signature was unable to be compared with those of the published gene signatures. However, time-dependent ROC analysis still showed that the 20-gene signature achieved an AUC value of 0.743 at 2 years of OS, which was better than that of the ISS and RNAseq_signature(AUC = 0.627 and AUC = 0.646), (Fig. 5b). The distribution of risk score, survival status and the heatmap of gene expression in patients belonging to the dataset GSE9782 were shown in Fig. 5c-e.
Pathway analysis by GSEA
Next, the GSEA was used to investigate the possible pathway in which the 20 genes might be involved. The expression profiles of high-risk and low-risk MM patients classified by 20-gene signature in the GSE24080 dataset were compared. The GSEA results revealed that the signaling pathways of hallmark E2F Targets, MYC Targets, G2M Checkpoint, Unfolded protein response, and DNA Repair showed significant activation in the high-risk group, while the hallmark KRAS signaling, inflammatory response, and TNFA signaling via NFKB were suppressed (Fig. 6).
The 20-gene signature is independent of conventional clinical factors
The univariate and multivariate Cox regression analyses were performed to assess the independent predictive value of the 20-gene signature in the GSE24080 and GSE9782 datasets. In the training dataset GSE24080, the 20-gene signature and the clinical covariates of age, C-reactive protein (CRP), serum β2-microglobulin (β2M), serum creatinine (sCr), serum lactate dehydrogenase (LDH), serum albumin (ALB), hemoglobin (HGB) and bone marrow plasma cells (BMPC) demonstrated some prognostic value with the univariate Cox regression analysis (Table 2). The results revealed that HR calculated by the 20-gene risk fraction model (HR = 4.337, 95% CI = 3.206–5.868, P = 1.83e-21) was higher than any single clinical covariate, demonstrating its higher prediction efficiency. After adjusting by sex and IgA_isotype, multivariate Cox regression analysis showed that the 20-gene prognostic model maintained a significant correlation with OS (HR = 3.532, 95% CI = 2.625–4.87, P = 1.38e-14), indicating that the prognostic value of the 20-gene signature was an independent conventional prognostic factor for predicting patients with NDMM (Table 2). In the validation dataset GSE9782, both univariate and multivariate Cox regression analyses again showed 20-gene signature with highest HR as an independent prognostic factor in patients with relapsed MM (Table 2).
Clinical implications of genes associated with MM progression
Using the APEX trial dataset (GSE9782) with available treatment information, this 20-gene signature succeeded in robustly discriminating between high- and low-risk patients in the bortezomib (PS-341) treatment group (HR = 2.884, 95% CI = 1.994–4.172, P = 1.89e-8; Fig. 7a) but not in the dexamethasone (DEX) treatment group (HR = 1.956, 95% CI = 0.983–3.891, P = 0.051; Fig. 7b).
Validation of the hub genes
QRT-PCR was used to further validate the expression levels of 20 hub genes (B4GALT3, EDEM3, MTX1, STK17B, GGH, YBX1, ITM2A, COPA, LGALS1, DDX3Y, ITM2C, MAP3K14, TAPBPL, JUNB, CSGALNACT1, PLEK, NUCB2, PECAM1, ISCU and PPCDC) in the primary plasma cells from 15 patients without t(4;14) and 5 patients with t(4;14). As shown in Fig. 8, except for the three genes (STK17B, JUNB, and PLEK), the expression levels of the other hub genes were up-regulated in the primary plasma cells from MM patients with t(4;14) compared to those without t(4;14), which was consistent with the markers at the late stage predictedby above mentioned bioinformatics analysis. These results confirmed that the 20-gene signature is a potential biomarker for MM.