2.1 quality control, dimensionality reduction, cell clustering and annotation of breast cancer single cell sequencing data
In this study, two sets of data (GSE75688 and GSE118389) were downloaded. GSE75688 contains single-cell RNA sequencing data of primary and metastatic breast cancer, a total of 563 cells. Among them, 12 samples were applied bulk sequencing. Except for non-single cell sequencing data, 549 single cell data were retained. Among the 549 data, 441 cells are primary breast cancer and 108 cells are metastatic breast cancer. GSE118389 is the scRNA sequencing data of triple-negative breast cancer, which contains 1534 cells. In the quality control of single-cell transcriptome data, the number of feature genes greater than 200 and less than 2500, the proportion of mitochondrial (percent.mt) less than 10% are used as a threshold for data screening and filtering. The first threshold is set to eliminate the empty oil droplets. To avoid the small number of RNA, data less than 200 is eliminated. The second threshold is set to eliminate more than two cells into one oil droplet. Subsequently, the data is analyzed by PCA and UMAP dimensionality reduction, and the results are visualized in the form of heat map, JackstrawPlot, and ElbowPlot (Fig. 1).
Next, the Find Integration Anchors function is used to integrate two single-cell transcriptome sequencing data from two data sets (non-merge, because merge is only A data merge, which cannot remove batch effect), which minimizes the error caused by different batches of experiments, and is used to construct the final S4 object. Subsequently, the Scale Data was used for data centralization and standardization, Seurat was continued to be used for PCA and UMAP dimensionality reduction analysis. According to references and debugging effects, the clustering analysis were performed with a threshold of resolution = 0.2, and 14 clusters were obtained. The cell clusters were annotated according to the marker genes in the Cellmarker, PanglaoDB database and references. The results showed that 14 clusters were endothelial cells、DCs、basal cells、acinar cells、T cells、NK cells、B cells、macrophage cells、Fibroblast cells、neutrophils、epithelial cells、neurons cells、HPCs、ductal cells (Fig. 2A). The Find Variable Features function is used to find the genes with the largest differences among the difference cells clusters. The results show that HP, KCNJ, SCGB2, CPB1 and SPP are the first five significantly different genes (Fig. 2B). Meanwhile, expression of common cell marker CD3D、MS4A1、AIF1、LUM、S100A8、KRT14、TFF3、CD34、CLDN1 in 14 clusters was analyzed, Violin map is used to show the results of marker genes in each cluster (Fig. 2C).
2.2 differential gene expression and multi factor interaction analysis of immune cells in breast cancer
In order to analyze the differentially expressed genes in each cluster, the Find All Markers function is used to calculate the expression of differential genes in each cluster, and the Do Heatmap function is used to plot the distribution of differential genes in different cell types (Fig. 3A). According to the conventional genes of cell annotation in the literature, the expression of marker genes of 6 types of immune cells including NK, DCs, macrophages, B cells, T cells and neutrophils was analyzed (Fig. 3B). Subsequently, the frequency of cells in each cluster is counted and used to explore the enrichment ratio of immune cell populations. The results show that NK cells are significantly aggregated in triple-negative breast cancer, the proportion of macrophages is significantly increased in primary breast cancer, and B cells, T cells and neutrophils may play important role in metastatic breast cancer (Fig. 3C-D). Excitingly, it has been reported that T cells and neutrophils are involved in metastasis of breast cancer[14]. Next, the MSigDB database is used to perform functional annotation analysis of cell types, which is conducive to revealing the functional status of immune cells. The analysis results showed that the functions of specifically expressed genes in T cells, Macrophage cells, B cells, DC cells, and Neutrophils were significantly enriched in 20, 10, 7, 6, 3, and 3 terms, respectively (Fig. 3E).
On the basis of clarifying the effect of differential gene expression in breast cancer samples on immune cells. The cellphoneDB software is used to analyze the ligand-receptor relationship between cells. In the output of the results of the ligand receptor (Supplementary Table 1 result ligand receptor), the heatmap plot function is used to analyze the interaction between immune cells. The results showed that macrophages and DCs cells were significantly active and interacted with a variety of cells (Fig. 4A). In order to further analyze the interaction between ligand and receptor, the TRRUST database was applied, and the hypergeometric test method was used to analyze the interaction between differential genes, immune cell marker genes, and ligand-receptor. The multi-factor interaction network between immune cells was constructed and visualized with Cytoscape 3.7.2 (Fig. 4B).
2.3 WGCNA Analysis reveals key modules associated with breast cancer progression and patient survival
Based on the genes in the immune cell multi-factor interaction network, the grouping information of single-cell data (primary, metastatic, and triple-negative breast cancer) and expression matrix, the co-expression network was constructed and WGCNA analysis was performed. The average-linkage hierarchical clustering method is used for gene cluster analysis. According to the standard of hybrid dynamic shearing tree, the minimum number of genes (soft threshold) of each gene network module is set. The results indicate that power = 18 is used as the threshold for subsequent analysis (Fig. 5A). After the eigengenes are calculated, the modules are subjected to cluster analysis, and finally 6 modules are obtained (Fig. 5B). According to different groups, age, gender, race, N, stage, T, radiotherapy, overall survival status, and overall survival time were used as indicators to screen the prognostic module with the highest correlation with breast cancer survival. The Pearson correlation coefficient between the ME of each module and the sample feature is calculated (the higher the module, the more important it is). The results showed that the ME blue module was most correlated with the overall survival time of triple-negative breast cancer and the difference was the most significant (R = 0.13, P = 3e-05) (Fig. 5C). The contained genes are the main components that representing the function and characteristics of the module. The MEblue module contains 144 genes (Fig. 5D). These results indicate that ME blue may be a prognostic-related module in triple-negative breast cancer, and plays an important role in predicting the development of the disease and the overall survival of the patient. The GO enrichment analysis of the modular genes on the Metascape website shows that the ME blue module is mainly enriched in terms related to cell differentiation, movement, and proliferation, such as developmental process, locomotion, cell proliferation, multicellular organismal process, etc. (Fig. 5E). In order to further understand the mechanism of the ME blue module involved in the occurrence and progression of triple-negative breast cancer, the KOBAS website was used to perform KEGG enrichment analysis on the genes in the blue module. The results showed that the disordered genes in the ME blue module are mainly involved in pathway in cancer, transcriptional mis-regulation in cancer, PI3K-AKT signaling pathway, Ras signaling pathway, MAPK signaling pathway, cytokine-cytokine receptor interaction, AMPK signaling pathway (Fig. 5F).
2.4 Univariate regression analysis combined with KM survival analysis to identify prognostic genes in ME-blue modules
The coxph function in the survival package was used to analyze the relationship between the genes in the blue module and the overall survival (OS) of 1194 samples in the TCGA data. 144 genes in the blue module were analyzed by univariate regression, and 130 genes with significant regression differences were obtained with P < 0.05 as the threshold (Supplementary Table 2 Univariate Cox) were obtained (Fig. 6A). Next, these 130 genes were subjected to lasso dimensionality reduction analysis, and the number of output genes was still 130 (Fig. 6B). On the other hand, the Kaplan-Meier method was used to perform an overall survival analysis of 130 genes in the blue module, and 24 genes were found to be associated with prognosis of breast cancer (P < 0.05, Supplementary Table 3 KM-survival. gene.0.05). This means that in the key module, there are 24 genes that are significantly different in regression analysis and have significant prognostic properties in survival analysis (Fig. 6A). Two genes (ABCC9, NPR1) were randomly selected for survival curve display (Fig. 6C). Next, the correlation coefficients and the univariate regression analysis results of 24 genes were visualized (Fig. 6D and E).
2.5 Multivariate regression analysis constructed a risk model of breast cancer prognosis-related genes
In order to identify prognostic markers of breast cancer, the 24 prognostic-related genes in the blue module combined with the clinical factors of TCGA (age, lymph node metastasis status (N0 vs Nn (excluding Nx)), T stage, radiotherapy or not, race, breast Cancer stage) were subjected to multivariate regression analysis. The results suggest that the expression of ECM2, PCDH12, EPAS1, CD93, DLL4, and ARHGEF15 are significantly different in age, N (lymph node metastasis status), and radiotherapy or not (Fig. 7A). Subsequently, the sample data were scored and grouped according to the median value of prognosis related gene expression. The Kaplan-Meier survival analysis of high risk and low risk group showed that the prognosis of high-risk group was poor (Fig. 7B, P < 0.001). The results of time-dependent ROC (receiver operating characteristic) analysis found that the above-mentioned prognostic genes showed good predictive effects on the 1-year, 3-year, and 5-year survival of breast cancer (AUC were all greater than 0.75) (Fig. 7C). Among them, the AUC of 3-year survival model was 0.801, which confirmed good accuracy of the prediction model (Generally, AUC > = 0.7 is considered to be a good predictor).
2.6 Validation of a risk model constructed by genes related to the prognosis of breast cancer
In this part of the analysis, breast cancer data from ICGC was used to verify the model constructed by the prognostic-related genes in the ME blue module. 1542 sample data with overall survival time and overall survival status are used to verify the accuracy and validity of the above model. The results indicate that the prognosis of high-risk group in the model is poor (P < 0.001, Fig. 8A). The time-dependent ROC results show that the AUC values for 1, 3 and 5 years are 0.614 (Fig. 8B), 0.634 (Fig. 8C), and 0.632 (Fig. 8D), respectively. In addition, In addition, according to our prognostic model, age, N lymph node metastasis and radiotherapy showed significant differences in breast cancer samples with different risk scores (high-risk and low-risk) (Fig. 8E-G). Moreover, the verification results of ROC also confirmed that the AUC value of age, N lymph node metastasis was high, which indicated good accuracy of ROC (Fig. 8H-J). These results indicate that our model has practical application value in the prognosis and survival of breast cancer.