A six-mRNA signature-based model for the prognosis prediction of breast cancer

Abstract


Background
According to the latest statistics from A Cancer Journal for Clinicians, the three most common malignant tumors among women are breast cancer (BC), uterine cancer and colorectal cancer 1 . BC as a complex heterogeneous disease is considered to be the main cause of cancer death, and about 30% of BC is heritable [2][3][4] . Although systematic treatment of BC has made great progress, the increasing incidence and the high speci c mortality caused by tumor recurrence and metastasis still pose a serious threat to the life health of women worldwide 5,6 . The invasion and metastasis of BC can be in uenced by tumor microenvironment (TME) 7 . The matrix containing broblasts and immune cells in TME is often altered as the tumor progresses, and the molecular markers involved in this process are also important 8 . It is a good starting point to study the interaction between BC and these TME-related molecular markers.
In recent years, many studies have shown that prognostic markers can predict possible outcomes associated with cancer progression, recurrence, or death 9 . At the same time, multi-gene signatures for prognosis prediction have also been studied, providing a reliable prognostic tool for individualized treatment of BC patients 10,11 . Renhua Li et al. have identi ed a signature-based model consisting of 16 master regulatory genes that can be used to identify the molecular subtypes of BC and is signi cantly associated with the prognosis and survival rate of BC 12 . Therefore, construction of effective prognostic gene signatures will bring more hope for BC patients.
In this study, immune and stromal cells of BC samples were scored and used to analyze the in uence of TME on the prognosis of BC. Then, differentially expressed genes (DEGs) were screened out. DEGs is one of the research hotspots of BC. Maria Libera Ascierto et al. have identi ed that CD96, CD1d, CRTAM, LFA-1 and DNAM1 as DEGs participate in the interaction between natural killer cells and tumor cells, and their expression is increased in BC patients with good prognosis 13 . Bao et al. suggest that the interaction pathway of DEGs and extracellular matrix-receptor (ECM-receptor) may perform an important role in BC 14 .
Studies have also discovered that some BC-related DEGs can be used as prognostic biomarkers. For example, Rodrigues-Ferreira et al. have found that EB1 and ATIP 3 can be used as biomarkers to improve the prognosis of BC 15 . These studies suggest that it is feasible to mine TME-related DEGs as prognostic markers of BC to improve the survival status.
Recent advances in bioinformatics and the availability of RNA-seq transcriptome data in public databases have contributed a lot to mine biomarkers for diagnosis and prognosis 16 . Machine learning analysis can be used for survival calculation, risk classi cation and prognosis prediction of a variety of cancers including BC 17,18 . With the application of machine learning analysis, the accuracy of tumor outcome prediction has increased by 15%~20% 19 . In this study, we used machine learning analysis, including univariate Cox regression, LASSO regression and multivariate Cox regression analysis, to construct a gene signature-based model for prognosis prediction of BC patients based on TMEassociated DEGs. 6 feature genes related to tumor immunity were identi ed and evaluated for their value as prognostic indicators. Our ndings provide a new prognostic gene signature and independent prognostic factors for BC, which have certain clinical signi cance.

Screening of TME-related DEGs in BC
Based on the FPKM data of TCGA-BRCA mRNA, the ESTIMATE prediction software was used to score the immune and stromal cells in each sample (Table S1). Survival analysis was performed based on the immune score and stromal score combined with the clinical information of downloaded TCGA-BRCA samples (Table S2). It was found that the survival time of patients with high immune cell in ltration was signi cantly longer than that of patients with low immune cell in ltration, indicating that the in ltration degree of immune cells remarkably affected the prognosis of BC, while the in ltration degree of stromal cells had no signi cant impact on the prognosis (Fig. 1A). Thereafter, all patients were grouped into highscore and low-score groups according to the median value of the immune score. Wilcox test was used for differential analysis and a total of 829 TME-related DEGs (Table S3) were obtained, including 726 upregulated genes and 103 down-regulated genes (Fig. 1B). Furthermore, the R package "clusterPro ler" was used to carry out GO and KEGG pathway enrichment analysis on these obtained DEGs, and it was observed that most DEGs were related to the proliferation and activation of immune cells (Fig. 1C). These results suggested that the TME-related DEGs in uenced the prognosis of BC by affecting tumor immunity.

Construction of a gene signature-based model for BC prognosis prediction
Univariate Cox, LASSO and multivariate Cox regression analysis were performed on the obtained TMEassociated DEGs to construct a multivariate Cox risk regression model with variable gene expression. Firstly, univariate Cox regression analysis was carried out on the 829 DEGs to screen out survival-related DEGs, among which 154 DEGs with signi cantly statistical signi cance (p < 0.01) were selected (Table   S4). LASSO regression analysis was used to establish a prediction model based on the 154 survivalrelated DGEs. In order to avoid over tting of the prediction model, 10-fold cross validation method was used to choose tuning parameter (lambda) as 1000, and genes with higher correlation were removed for reducing model complexity. Finally, 13 relatively independent feature genes were acquired for subsequent analysis ( Fig. 2A-B). Afterwards, multivariate Cox regression analysis was performed on the 13 feature genes, and a multivariate Cox risk regression model was constructed based on the survival time and survival status of patients, so as to screen the independent factors that dramatically in uenced the prognosis of BC patients. The formula of the risk regression model was as follow: , where is the coe cient, and is the z-score-transformed relative expression value of each selected gene. 6 relatively independent feature genes (PXDNL, FABP7, STX11, PIGR, SHISAL2A, WDR17) were identi ed to have the potential as independent prognostic factors associated with BC immunity (Fig. 2C).

Veri cation Of The Prognostic Gene Signature
In order to verify the accuracy of the prognostic gene signature containing 6 feature genes, ROC analysis was used to evaluate the predictive e ciency of the model. The ROC curves showed that the AUC value was 0.728, indicating that the model had a high accuracy (Fig. 3A). In addition, the samples were divided into high-risk group and low-risk group based on the median value of the sample risk score, and survival analysis was conducted on the two groups. The results exhibited that the survival time of patients in the high-risk group was signi cantly shorter than that in the low-risk group (Fig. 3B). In the meantime, higher risk scores were associated with shorter survival and higher mortality (Fig. 3C). These results veri ed that the 6-gene signature model performed well in predicting BC survival and prognosis with a good accuracy.
2.4 The 6 feature genes are independent prognostic factors affecting BC immunity In order to analyze whether the 6 feature genes in the 6gene signature model could be independent prognostic factors of BC, we conducted survival analysis on these feature genes, and found that these 6 feature genes all had a signi cant impact on the prognosis of patients (Fig. 4A). FABP7, PIGR, SHISAL2A, STX11 and WDR17 were favorable prognostic factors, and their high expression was associated with better prognosis. PXDNL was an adverse prognostic factor, and patients with low expression of PXDNL had better prognosis. Moreover, basic clinical information (age, gender, stage, T_stage, M_stage and N_stage) was combined to conduct univariate and multivariate Cox regression analysis on these 6 feature genes. It was observed that the 6 feature genes were signi cantly associated with the prognosis of BC patients, especially showed an intimate correlation with age and stage, as the prognostic model or independent factors (p < 0.001), (Fig. 4B). These results suggested that the 6 feature genes associated with tumor immunity could be used as independent prognostic factors for BC.

Discussion
With the development of systemic chemotherapy and radiotherapy in clinic, the survival of BC patients has been greatly improved. However, survival rates of patients with metastasis and recurrence are still unsatis ed. Therefore, identifying new biomarkers to predict the prognosis of BC patients is helpful for the customization of personalized treatment. High-throughput RNA-seq contributes to exploring various biomarkers for diagnosis and prognosis prediction of many cancers, including BC, along with the development of technology [20][21][22][23] .
In this study, we rstly found that the high in ltration degree of immune cells in TME was bene cial to the prognosis of BC patients. Further, machine learning analysis was used to systematically analyze the potential role of tumor immunity-related DEGs as prognostic predictors, construct a prognostic gene signature-based model and evaluate the value of the feature genes as prognostic markers of BC. In order to solve the "curse of dimensionality" (the combination of small sample and a large number of genes), which is common in high-throughput biological data, we analyzed the expression data using LASSO regression model and cross-validation method. LASSO regression is adept at dealing with highdimensional regression variables by reducing all regression coe cients to zero to force many regression variables to be completely zero, thus, LASSO regression model has high stability and prediction accuracy 24,25 . The BC survival-related DEGs with a better predictive performance were identi ed by LASSO regression model and cross validation in this study, and a 6-mRNA signature model was constructed to predict the prognosis of BC.
There are several important ndings during the analysis. Firstly, we found that most DEGs in uenced the prognosis of BC patients by affecting the proliferation and activation of immune cells in TME. Secondly, 154 survival-related DEGs were screened using univariate Cox analysis and most of them were risk factors that may cause cancer (97/154). Finally, we constructed a prognostic gene signature model by using multivariate Cox analysis, and screened 6 BC immunity-related feature genes (PXDNL, FABP7, STX11, PIGR, SHISAL2A, WDR17) which could be used as independent prognostic factors. PXDNL  31 . Through these prior studies, we found that PXDNL, FABP7 and PIGR of the 6 feature genes were closely related to the occurrence and development of BC and they could serve as prognostic markers of BC. While STX11, SHISAL2A and WDR17 had not been reported in BC, indicating that these three genes were newly discovered, and the survival analysis results of our study indicated that these three genes may be favorable factors for the prognosis of BC patients.

Conclusions
There are some limitations that cannot be neglected in this study. Our results have not been experimentally veri ed, but we will study further to make the results more scienti cally rigorous. In conclusion, a new 6-gene prognostic signature was discovered and it is a promising independent predictor. The 6 feature genes can function as important biomarkers for BC clinical treatment. At the same time, our study also provides a new insight to explore the molecular mechanism of TME and immune cells that in uence BC progress.

Data acquisition
RNA-seq data from 1,109 patients and clinical data from 1097 patients were obtained from the TCGA-BRCA in the public database TCGA (https://www.cancer.gov/).

Data Processing
Based on the Fragments per Kilobase Million (FPKM) data of TCGA-BRCA mRNA, immune and stromal cells of each sample were scored by ESTIMATE prediction software 32 to predict tumor purity and analyze the in uence of TME on the prognosis of BC. The samples were divided into high-score group and lowscore group based on the median value of the immune score. Differential analysis was conducted by using Wilcox test to obtain DEGs affecting tumor immunity.

Enrichment Analysis
GO and KEGG pathway enrichment analysis were performed on DEGs affecting tumor immunity by using R-package "ClusterPro ler". On the basis of GO analysis, gene functional annotation was conducted from three aspects: biological process (BP), cell component (CC) and molecular function (MF), with FDR < 0.05 as the threshold.

Construction Of A Prognostic Gene Signature-based Model
Univariate Cox regression analysis was used to screen the DEGs associated with BC survival, and p value less than 0.01 was set as the standard to screen the candidate feature genes for subsequent analysis. LASSO regression analysis was used for further screening. It is an effective method for analysis of highdimensional predictors and has been widely used in Cox proportional risk regression model for survival analysis 33,34 . 10-fold cross-validation method was used to select the tuning parameter (lambda) to determine the penalty degree and screen out relatively independent feature genes. Furthermore, multivariate Cox regression analysis was performed and feature genes with p < 0.05 were used to construct the prognostic signature.

Validation Of The Prognostic Model And Feature Genes
Patients were divided into high-risk group and low-risk group according to the median risk score, and Kaplan-Meier (K-M) survival analysis was performed in both groups. In addition, receiver operating characteristic (ROC) analysis was used to evaluate the predictive e ciency of the model and the area under the curve (AUC) was calculated. Meanwhile, survival analysis was carried out on the selected prognostic feature genes to evaluate their value as prognostic and survival markers. All of these processes were performed by R software (version 3.5.1). Authors' contributions YH analyzed and interpreted the patient data regarding the hematological disease and the transplant. XW and SXW performed the histological examination of the kidney, and was a major contributor in writing the manuscript. All authors read and approved the nal manuscript." Figure 1 Identi cation of the DEGs associated with TME in BC (A) Survival analysis of BC patients in the highscore and low-score groups (patients were grouped based on the median value of the immune score, stromal score and ESTIMATE score); (B) Heat map of DEGs in high and low immune score groups; (C) The results of GO and KEGG enrichment analysis of the screened DEGs.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.