Bioinformatics Analysis of Pivotal Module and Biomarkers Related to the Prognosis of Breast Cancer Based on Single-cell Transcriptome Data

Rui Liu The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Xin Yang Yunnan Provincial People's Hospital: First People's Hospital of Yunnan Yuhang Quan The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Yiyin Tang The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Yafang Lai Guiyang Railway Health Supervision Institute of China Railway Chengdu Bureau Group Limited Company Chunxiang Li The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Dechun Yang The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Maohua Wang The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital Anhao Wu (  529691000@qq.com ) The Third A liated Hospital of Kunming Medical University: Yunnan Cancer Hospital


Introduction
Breast cancer ranks rst in the incidence of female cancer, and shows an increasing trend year by year [1].
Globally, about 2.1 million cases of breast cancer were newly diagnosed in women in 2018, accounting for nearly a quarter of cancer cases in women [2]. Due to advances in early diagnosis and comprehensive treatment strategies, the prognosis of breast cancer patients has been greatly improved in recent years, but about 30% of breast cancer patients still develop metastases after diagnosis and treatment. year overall survival rate for patients with non-metastatic breast cancer was greater than 80%; however, the survival rate for patients with metastatic breast cancer was less than 30% [3,4].
Although the expression of estrogen receptor (ER), progesterone receptor (PR) and ERBB2 receptor (HER2) laid the foundation for the classi cation of breast cancer, and breast cancer is divided into at least ve molecular subtypes ( ie Luminal A, Luminal B, Her2-enriched, Basal-like and Normal-like) based on gene expression. However, as research progresses, the genomic/transcriptome level of breast cancer typing continues to increase [5]. These studies con rm the heterogeneity of breast cancer. More importantly, heterogeneity is also reported to be one of the leading causes of breast cancer treatment failure, recurrence, and patient death [6]. Tumor heterogeneity not only leads to differences in survival and prognosis of different patients, but also causes different biologic characteristics of cancer cells and different responses to chemotherapy drugs [7].
Although the heterogeneity of breast cancer has been found and con rmed, the implementation of individualized treatment of breast cancer also needs to take into account the existence of different molecular subtypes of breast cancer and different cell subsets within the same tumor tissue [8]. However, the biological relationships between different clonal subsets and between clones and microenvironment in breast cancer tissues are still unclear. Traditional gene sequencing methods can only detect population cells and cannot re ect genetic characteristics at the single-cell level. Single-cell sequencing technology is helpful to study tumor heterogeneity from the differences at the single-cell level and facilitate the comparison of the differences between different subtypes of the same tumor [9]. In this study, single-cell sequencing data were used to identify the inter-tumor and intra-tumoral heterogeneity of breast cancer samples, and a multi-factor interaction network of receptor-ligand-transcription factors in breast cancer was constructed by identifying the characteristic genes of immune cell subtypes and combining with known immune cell marker genes. WGCNA was used to identify prognostic signature, and a prognostic model was constructed for evaluation and veri cation.

Data sources and processing
Two sets of data, GSE75688 and GSE118389, were downloaded from GEO, among which GSE75688 contained single cell sequencing data (sRNA-seq) of primary breast cancer and metastatic breast cancer, and GSE118389 was sRNA-seq data of triple negative breast cancer. The Create Seurat Object function is used to process the Seurat object. After the two sets of data were analyzed by PCA, the Find Integration Anchors function is used to integrate the two sets of data in the S4 object. The data was nally divided into three groups: primary breast cancer, metastatic breast cancer, and triple negative breast cancer.

PCA dimension reduction, cell clustering and annotation
The Seurat package in R is used to preprocess the integrated data. After PCA dimension reduction, JackstrawPlot and ElbowPlot are used to show the overall situation of the data. According to experience and debugging, 0.2 is selected as the threshold value for cell clustering, and 14 cell clusters are obtained.
Subsequently, marker genes in Cell marker and Panglao DB databases and genes reported in the literature were used to annotate the cell clusters [10,11]. The Find All Markers function in the Seurat package was used for differential analysis of single-cell data. According to the expression levels of top5 genes and marker genes of immune cell reported in the literature, dotplot and violin plots were used to display marker genes in each cluster. In addition, dotplot were used to show the proportion of immune cells in different groups according to the frequencies of individual cells in the Primary BC group, Metastatic BC group and TNBC group.

Construction of ligand-receptor network and joint analysis of transcription factors
In this section, the cellphoneDB software (https://github.com/Teichlab/cellphonedb) is adopted. After downloading the software to build a good environment, according to the code cellphonedb method statistical_analysis meta.txt counts.txt, the interaction between ligands and receptors in each group of data is analyzed [12]. Among them, the meta.txt le is the barcode and corresponding annotated cells; counts.txt is the barcode and gene expression matrix. In the results, P values of cell types and enriched interaction ligands and receptors were shown. With P<0.05 as the threshold, cellphonedb plot heatmap_plot meta.txt p values. In order to further analyze the transcription factor regulation of ligands and receptors, the database TRRUST (https://www.grnpedia.org/trrust/) was applied, and the hypergeometric test method are used to trace the transcription factors of target ligand and receptor genes, and Cytoscape 3.7.2 is used in the visual display of results.

Weighted gene co-expression network analysis
WGCNA is a systems biology approach to construct scale-free networks using gene expression data. WGCNA package of R was used to construct a weighted co-expression network based on the expression pro le data of the multifactorial network genes [13]. The prognostic modules were screened according to the characteristics of overall survival time and overall survival status. GO enrichment analysis was performed on each module by Metascape to analyze the biological functions of each module.On the other hand, the website KOBAS is used to perform KEGG Pathway enrichment analysis of co-expressed genes in modules related to the prognosis of breast cancer.

Univariate regression analysis, KM-survival prognostic analysis and multivariate regression analysis
Univariate Cox regression analysis was used to screen for genes signi cantly expressed in key modules. Then, the genes with signi cant differences in the univariate regression analysis were analyzed by Lasso dimensionality reduction, and the output results were used as candidate genes. On the other hand, K-M survival analysis was used to identify prognostic related genes, and P<0.05 was used as a threshold to screen genes with signi cant prognostic effects. The candidate genes and prognostic genes were intersected and visualized by Venn diagram. In this study, 24 genes associated with prognosis of breast cancer were identi ed by univariate Cox regression analysis. Forest maps were used to show the results of univariate cox regression for 24 genes. The survminer and survival packages are used to perform multivariate regression analysis. Age, lymphatic node metastatic status (N0 vs NN (excluding NX)), T stage, radiotherapy, race, breast cancer stage, and overall survival were combined with 24 genes for multivariate regression analysis to identify pivotal genes that were signi cantly associated with breast cancer prognosis.
The breast cancer samples were divided into high-risk and low-risk groups based on the median expression value of the screened key genes associated with breast cancer progression. Time-dependent ROC results showed that the AUC of the combined Signature 3-year model group was 0.801 for the analysis of immune in ltration levels in the low-risk group. The time-dependent ROC is used to re ect the accuracy and precision of the prediction model. It is generally believed that the model with AUC>=0.7 can be used to predict the prognostic outcome at a speci c time.

Results
2.1 quality control, dimensionality reduction, cell clustering and annotation of breast cancer single cell sequencing data In this study, two sets of data (GSE75688 and GSE118389) were downloaded. GSE75688 contains singlecell RNA sequencing data of primary and metastatic breast cancer, a total of 563 cells. Among them, 12 samples were applied bulk sequencing. Except for non-single cell sequencing data, 549 single cell data were retained. Among the 549 data, 441 cells are primary breast cancer and 108 cells are metastatic breast cancer. GSE118389 is the scRNA sequencing data of triple-negative breast cancer, which contains 1534 cells. In the quality control of single-cell transcriptome data, the number of feature genes greater than 200 and less than 2500, the proportion of mitochondrial (percent.mt) less than 10% are used as a threshold for data screening and ltering. The rst threshold is set to eliminate the empty oil droplets. To avoid the small number of RNA, data less than 200 is eliminated. The second threshold is set to eliminate more than two cells into one oil droplet. Subsequently, the data is analyzed by PCA and UMAP dimensionality reduction, and the results are visualized in the form of heat map, JackstrawPlot, and ElbowPlot ( Fig. 1).
Next, the Find Integration Anchors function is used to integrate two single-cell transcriptome sequencing data from two data sets (non-merge, because merge is only A data merge, which cannot remove batch effect), which minimizes the error caused by different batches of experiments, and is used to construct the nal S4 object. Subsequently, the Scale Data was used for data centralization and standardization, Seurat was continued to be used for PCA and UMAP dimensionality reduction analysis. According to references and debugging effects, the clustering analysis were performed with a threshold of resolution = 0.2, and 14 clusters were obtained. The cell clusters were annotated according to the marker genes in the Cellmarker, PanglaoDB database and references. The results showed that 14 clusters were endothelial Fig. 2A). The Find Variable Features function is used to nd the genes with the largest differences among the difference cells clusters. The results show that HP, KCNJ, SCGB2, CPB1 and SPP are the rst ve signi cantly different genes (Fig. 2B). Meanwhile, expression of common cell marker CD3D MS4A1 AIF1 LUM S100A8 KRT14 TFF3 CD34 CLDN1 in 14 clusters was analyzed, Violin map is used to show the results of marker genes in each cluster (Fig. 2C).

differential gene expression and multi factor interaction analysis of immune cells in breast cancer
In order to analyze the differentially expressed genes in each cluster, the Find All Markers function is used to calculate the expression of differential genes in each cluster, and the Do Heatmap function is used to plot the distribution of differential genes in different cell types (Fig. 3A). According to the conventional genes of cell annotation in the literature, the expression of marker genes of 6 types of immune cells including NK, DCs, macrophages, B cells, T cells and neutrophils was analyzed (Fig. 3B). Subsequently, the frequency of cells in each cluster is counted and used to explore the enrichment ratio of immune cell populations. The results show that NK cells are signi cantly aggregated in triple-negative breast cancer, the proportion of macrophages is signi cantly increased in primary breast cancer, and B cells, T cells and neutrophils may play important role in metastatic breast cancer (Fig. 3C-D). Excitingly, it has been reported that T cells and neutrophils are involved in metastasis of breast cancer [14]. Next, the MSigDB database is used to perform functional annotation analysis of cell types, which is conducive to revealing the functional status of immune cells. The analysis results showed that the functions of speci cally expressed genes in T cells, Macrophage cells, B cells, DC cells, and Neutrophils were signi cantly enriched in 20, 10, 7, 6, 3, and 3 terms, respectively (Fig. 3E).
On the basis of clarifying the effect of differential gene expression in breast cancer samples on immune cells. The cellphoneDB software is used to analyze the ligand-receptor relationship between cells. In the output of the results of the ligand receptor (Supplementary Table 1 result ligand receptor), the heatmap plot function is used to analyze the interaction between immune cells. The results showed that macrophages and DCs cells were signi cantly active and interacted with a variety of cells (Fig. 4A). In order to further analyze the interaction between ligand and receptor, the TRRUST database was applied, and the hypergeometric test method was used to analyze the interaction between differential genes, immune cell marker genes, and ligand-receptor. The multi-factor interaction network between immune cells was constructed and visualized with Cytoscape 3.7.2 (Fig. 4B).

WGCNA Analysis reveals key modules associated with breast cancer progression and patient survival
Based on the genes in the immune cell multi-factor interaction network, the grouping information of single-cell data (primary, metastatic, and triple-negative breast cancer) and expression matrix, the coexpression network was constructed and WGCNA analysis was performed. The average-linkage hierarchical clustering method is used for gene cluster analysis. According to the standard of hybrid dynamic shearing tree, the minimum number of genes (soft threshold) of each gene network module is set. The results indicate that power = 18 is used as the threshold for subsequent analysis (Fig. 5A). After the eigengenes are calculated, the modules are subjected to cluster analysis, and nally 6 modules are obtained (Fig. 5B). According to different groups, age, gender, race, N, stage, T, radiotherapy, overall survival status, and overall survival time were used as indicators to screen the prognostic module with the highest correlation with breast cancer survival. The Pearson correlation coe cient between the ME of each module and the sample feature is calculated (the higher the module, the more important it is). The results showed that the ME blue module was most correlated with the overall survival time of triplenegative breast cancer and the difference was the most signi cant (R = 0.13, P = 3e-05) (Fig. 5C). The contained genes are the main components that representing the function and characteristics of the module. The MEblue module contains 144 genes (Fig. 5D). These results indicate that ME blue may be a prognostic-related module in triple-negative breast cancer, and plays an important role in predicting the development of the disease and the overall survival of the patient. The GO enrichment analysis of the modular genes on the Metascape website shows that the ME blue module is mainly enriched in terms related to cell differentiation, movement, and proliferation, such as developmental process, locomotion, cell proliferation, multicellular organismal process, etc. (Fig. 5E). In order to further understand the mechanism of the ME blue module involved in the occurrence and progression of triple-negative breast cancer, the KOBAS website was used to perform KEGG enrichment analysis on the genes in the blue module. The results showed that the disordered genes in the ME blue module are mainly involved in pathway in cancer, transcriptional mis-regulation in cancer, PI3K-AKT signaling pathway, Ras signaling pathway, MAPK signaling pathway, cytokine-cytokine receptor interaction, AMPK signaling pathway (Fig. 5F).

Univariate regression analysis combined with KM survival analysis to identify prognostic genes in ME-blue modules
The coxph function in the survival package was used to analyze the relationship between the genes in the blue module and the overall survival (OS) of 1194 samples in the TCGA data. 144 genes in the blue module were analyzed by univariate regression, and 130 genes with signi cant regression differences were obtained with P < 0.05 as the threshold (Supplementary Table 2 Univariate Cox) were obtained (Fig. 6A). Next, these 130 genes were subjected to lasso dimensionality reduction analysis, and the number of output genes was still 130 (Fig. 6B). On the other hand, the Kaplan-Meier method was used to perform an overall survival analysis of 130 genes in the blue module, and 24 genes were found to be associated with prognosis of breast cancer (P < 0.05, Supplementary Table 3  This means that in the key module, there are 24 genes that are signi cantly different in regression analysis and have signi cant prognostic properties in survival analysis (Fig. 6A). Two genes (ABCC9, NPR1) were randomly selected for survival curve display (Fig. 6C). Next, the correlation coe cients and the univariate regression analysis results of 24 genes were visualized ( Fig. 6D and E).

Multivariate regression analysis constructed a risk model of breast cancer prognosis-related genes
In order to identify prognostic markers of breast cancer, the 24 prognostic-related genes in the blue module combined with the clinical factors of TCGA (age, lymph node metastasis status (N0 vs Nn (excluding Nx)), T stage, radiotherapy or not, race, breast Cancer stage) were subjected to multivariate regression analysis. The results suggest that the expression of ECM2, PCDH12, EPAS1, CD93, DLL4, and ARHGEF15 are signi cantly different in age, N (lymph node metastasis status), and radiotherapy or not (Fig. 7A). Subsequently, the sample data were scored and grouped according to the median value of prognosis related gene expression. The Kaplan-Meier survival analysis of high risk and low risk group showed that the prognosis of high-risk group was poor (Fig. 7B, P < 0.001). The results of time-dependent ROC (receiver operating characteristic) analysis found that the above-mentioned prognostic genes showed good predictive effects on the 1-year, 3-year, and 5-year survival of breast cancer (AUC were all greater than 0.75) (Fig. 7C). Among them, the AUC of 3-year survival model was 0.801, which con rmed good accuracy of the prediction model (Generally, AUC > = 0.7 is considered to be a good predictor).

Validation of a risk model constructed by genes related to the prognosis of breast cancer
In this part of the analysis, breast cancer data from ICGC was used to verify the model constructed by the prognostic-related genes in the ME blue module. 1542 sample data with overall survival time and overall survival status are used to verify the accuracy and validity of the above model. The results indicate that the prognosis of high-risk group in the model is poor (P < 0.001, Fig. 8A). The time-dependent ROC results show that the AUC values for 1, 3 and 5 years are 0.614 (Fig. 8B), 0.634 (Fig. 8C), and 0.632 (Fig. 8D), respectively. In addition, In addition, according to our prognostic model, age, N lymph node metastasis and radiotherapy showed signi cant differences in breast cancer samples with different risk scores (highrisk and low-risk) (Fig. 8E-G). Moreover, the veri cation results of ROC also con rmed that the AUC value of age, N lymph node metastasis was high, which indicated good accuracy of ROC (Fig. 8H-J). These results indicate that our model has practical application value in the prognosis and survival of breast cancer.

Discussion
The heterogeneity of breast cancer is the main reason for treatment failure and recurrence. In recent years, the development of single-cell sequencing technology has given us a deeper understanding of the heterogeneity of breast cancer. Cell types and speci c gene expression characteristics in tumor tissues can be accurately distinguished by single-cell transcriptome. In fact, not only breast cancer cells show signi cant heterogeneity, but also non cancer cells are the main content of heterogeneity in breast cancer [15,16]. Non-cancer cells in breast cancer include broblasts, adipocytes, endothelial cells and various immune cells [17]. Among non-cancer cells, the role of immune cells is particularly signi cant. The progression of breast cancer is characterized by increased immune cell in ltration in tumor parenchyma and stroma, including CD4 + and CD8 + granzyme B + cytotoxic T cells, B cells, macrophages and dendritic cells [18]. In addition, tumor-in ltrating lymphocytes have been reported as a prognostic indicator of breast cancer chemotherapy response and patient survival [19]. In this study, NK cells were found to be signi cantly aggregated in triple-negative breast cancer, and the number of macrophages was signi cantly increased in primary breast cancer, while the proportion of B cells, T cells, and neutrophils was raised in metastatic breast cancer. Although the high total number of NK cells re ects a good survival rate, the in ltration and activation of NK cells vary greatly with different types of breast cancer. The heterogeneity of NK cells and their actual role in the microenvironment of breast cancer need to be further elucidated [20]. Tumor-associated macrophages (TAM) are the main component of breast cancer microenvironment [21].The increased density of macrophages in breast cancer tissues is related to the poor prognosis of patients. Macrophages are not only involved in the immune escape of breast cancer, but also involved in the angiogenesis of tractable tumors [22]. In addition, B cells, T cells and neutrophils have all been reported to participate in the immune escape and metastasis of breast cancer [14,23,24]. These results indicate that our analysis results are correct and reasonable in terms of cell clusters and immune cell in ltration.
After the cells are clustered and annotated, a multi-factor interaction network of ligand-receptor combined with transcription factors is constructed and used to discover modules that are signi cantly related to the prognosis of breast cancer. The results showed that the blue module had the highest correlation with the overall survival time of breast cancer (P = 3e-05). The functions of the blue module are mainly enriched in terms related to cell developmental, locomotion, and proliferation.The decrease of cell development is related to the poor differentiation level and cell stem characteristics of breast cancer; the enhancement of cancer cell motility is related to tumor invasion and metastasis, the disorder of cell proliferation is the basis of tumor tumorigenesis and progression. The enrichment results of KEGG showed that the main signaling pathways for differential gene enrichment in the blue module include pathway in cancer, transcriptional mis-regulation in cancer, PI3K-AKT signaling pathway, Ras signaling pathway, MAPK signaling pathway, cytokine-cytokine receptor interaction, MAPK signaling pathway, etc. PI3K-AKT is overactivated in most breast cancers and promotes the excessive proliferation of cancer cells through the mTOR complex [25]. For example, the loss of expression of the negative regulatory proteins PTEN and INPP4B (tumor suppressor genes) of the PI3K-AKT pathway is associated with the occurrence and progression of triple-negative breast cancer, and the loss of PTEN expression is found in more than half of TNBC patients [26]. In the Ras signaling pathway, activated Ras promotes cell cycle and cell proliferation by recruiting Raf1 protein to initiate a kinase cascade to activate MAPK (ERK1/2) and transcription factors Fos and c-Jun [27,28]. In addition, activation of the Ras-MAPK pathway has been reported to promote TNBC immune escape [29]. p38MAPK signal was found to promote the invasion and metastasis of breast cancer by enhancing the epithelial-mesenchymal transition of cancer cells [30]. AMPK and its downstream mTOR are involved in regulating the material and energy metabolism of cancer cells 30903363. For example, AMPK-mediated lipid metabolism reprogramming promotes breast cancer cell proliferation and migration [31]. These results indicate that the cellular functions and signal pathways enriched in the blue module play a key role in the occurrence and progression of breast cancer. The validation results of the breast cancer prognosis model constructed by multivariate regression risk analysis showed that PCDH12, SLIT3, ACVRL1 and DLL4 genes are signi cantly different in high-risk group and low-risk breast cancer, which can be used as risk factors for breast cancer prognosis.
In recent reports, the high expression of PCDH12was found to be associated with the high pathological grade of papillary renal cell carcinoma [32]. As a new type of tumor suppressor gene, SLIT3 has been reported to play a role in breast, liver, lung, and colon cancer, and the promoter methylation of SLIT3 has been reported to be associated with tumor occurrence and progression [33,34]. ACVRL1 (activin receptor like protein 1) encodes ALK1, which is a member of transforming growth factor -β receptor family and is associated with angiogenesis [35]. ACVRL1 expression can be used as a prognostic marker for metastatic colorectal cancer patients receiving chemotherapy and bevacizumab [36]. DLL4, as one of the main components of the Notch pathway, is reported to be highly expressed in breast cancer and associated with advanced stage and distant metastasis of the patient [37]. these studies con rm the correctness of PCDH12 SLIT3 ACVRL1 and DLL4 genes as risk factors for breast cancer prognosis.

Conclusion
MeBlue is a prognostic module in triple negative breast cancer. The expressions of PCDH12 SLIT3 ACVRL1 and DLL4 are not only related to the type and proportion of immune cells, but also contribute to the prognosis of breast cancer.

Declarations
Ethics approval and consent to participate Clinical and animal experiments are not involved in this study. Therefore, ethical and participatory consent elements are not included.

Consent for publication
This study did not involve the collection of patient samples and data.

Availability of data and materials
The datas that support our ndings of this study are openly available in GEO public database. The singlecell transcriptome data used in this study were obtained from GSE75688 and GSE118389 datasets. The URL of the GSE75688 dataset is: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75688 The URL for the GSE118389 dataset is https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE118389 LCX and YDC completed the statistical analysis and visualization of the results.
WAH and WMH discussed the results, revised the manuscript, and provided nancial support for this study.  Single-cell clusters of breast cancer and the expression of marker genes in each cluster. A: Visualization of cell UMAP clustering results. B: Visualization of the genes with the highest differential change and non-differential genes. C: The violin chart shows the expression of common cell marker genes in the corresponding cell clusters.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SupplementaryTable1resultligandreceptor.xlsx