Development a Prognostic Model Integrating lncRNA/ mRNA novel Biomarkers Identified by Bioinformatics Analysis and Experiments in Breast Cancer


 Purpose: The purpose of this study is to established a prognosis model based on the expression profiles of lncRNAs and mRNAs for breast cancers.Methods: Single Variable Cox Proportional Risk Regression analysis and difference analysis were applied to screen survival-related and differently expressed lncRNAs and mRNAs between tumor and normal tissues from TCGA data. GO and KEGG analysis were applied for top 30 survival-related genes. LncRNA/mRNA co-expressed network was constructed based on correlation analysis. LASSO analysis and Multivariate Stepwise Cox Regression analysis were applied to establish the prognosis model. RT-PCR experiments were applied to verify the correctness of the analysis results. Relative components of the TME in breast cancers with high and low risk groups were analysed by xCell and Cox proportional risk regression analysis. The ceRNA network was constructed by calculating the Pearson correlation coefficient (PCC) for miRNA-mRNA and miRNA-lncRNA using paired miRNA, mRNA, and lncRNA expression profile data.Results:Venn diagrams showed that there were 60 genes and 54 lncRNAs that were differently expressed and related with survival. Through lncRNA/mRNA co-expressed network construction, 19 lncRNA and 16 mRNA hub genes were gained. The genes were then included in LASSO and multivariate Cox proportional hazard regression analysis, and finally, 3 lncRNAs (LINC01497, LINC02766, LINC02528) and 2 mRNAs (C20orf85, CST1) were selected as prognosis predictive genes. According to the median risk score of the 5 candidates, patients were divided into high-risk group and low-risk group. The results of RT-PCR were consistent with the analysis results. The proportions of Adipocytes, Endothelial cells, HSCs, Fibroblasts were significantly lower in low risk score tissues compared with the high risk score tissues, while the proportions of M1 macrophages, MSCs, Th2 cells were significantly higher. A lncRNA-miRNA-mRNA ceRNA network containing 3 lncRNAs, 2 mRNAs, and 158 miRNAs was finally constructed, preliminarily revealed a proper mechanism of the 5 molecules playing important roles in breast cancer progression and prognosis prediction.Conclusion: We found that LINC01497, LINC02766, LINC02528 and C20orf85, CST1 may serve as a powerful prognostic tool to optimize the prognosis evaluation system of breast cancer.


Introduction
Breast cancer is the most common cancer and the most frequent cause of cancer death among women worldwide. The incidence is rising in most countries and is projected to rise further over the next 20 years despite current efforts to prevent and treatment of the disease [1,2]. Breast cancer is a complex, heterogeneous disease classi ed into hormone-receptor-positive, human epidermal growth factor receptor-2 overexpressing (HER2+) and triple-negative breast cancer (TNBC) based on histological features [3]. However, the great heterogeneity of breast cancer makes it impossible to rmly predict the prognosis of each subtype patients according to the conventional prognostic factors currently employed.
Considering that most oncological treatments have short-and long-term toxic effects, new methods capable of offering a more precise prognosis need to be developed. Especially nd out the subset of patients with early-stage breast cancer at high risk for recurrence who have the urgent need for additional therapeutic schemes [4]. However, such clinical and pathological criteria are of relative e cacy and have potential limitations in identifying the risk of disease relapse. Therefore, it is necessary to explore effective prognostic biomarkers to help optimize the prognosis evaluation system of breast cancer. In recent years, a new prognostic molecular classi cation has been developed based on the grouping of tumors according to their gene expression pro les. The individualisation of the diagnosis of patients with breast cancer based on molecular and gene expression studies is bringing about a veritable revolution in our understanding of the biology of the disease and will lead to the application of more speci c treatments, thereby improving patient survival with lesser toxicity and increased economic savings.
Long non-coding RNA (lncRNA) genes are an important population of non-coding RNAs with key roles in tumorigenesis process. Evidences suggest that they can be classi ed as tumor suppressor genes or oncogenes according to their functions and expression pattern in tumoral tissues. Their physiological and pathological functions have been shown to be exerted via their interactions with microRNAs (miRNAs), mRNAs, proteins and genomic DNA [5]. Their important roles in the regulation of cancer-related pathways in addition to deregulation of their expression in a number of cancers have suggested that they can be used as markers for cancer detection and prognosis, as well as targets for cancer treatment [6].
Deregulation of a number of lncRNAs has been detected in breast cancer samples and cell lines [7]. In addition to miRNAs, lncRNAs have been shown to regulate the plasticity of CSCs and been suggested as possible targets for anti-CSC treatments, such as HOTAIR, XIST, MALAT and H19 [8,9]. Besides, the role of lncRNAs in the epithelial-to-mesenchymal transition (EMT) programs has also been partly elucidated [10].
But the role of lncRNAs in predicting the outcome of breast cancer has not been elucidated. Currently, emerging bioinformatics resources are assisting these types of analyses. Large-scale public data with gene expression and clinical information, complete biological databases, and sophisticated highthroughput data analysis methods together provide opportunities for identifying broader prognostic features in breast cancer biology. Here, the expression patterns of lncRNAs in breast cancer, as well as their signi cance in prognosis are discussed.
In order to explore prognosis-related lncRNAs and genes in breast cancer, we performed a multiperspectives, multi-dimensional analysis of a large number of breast cancer samples using massive bioinformatics and machine learning methods, such as differential analysis, weighted gene co-expression network analysis, univariate COX analysis, LASSO analysis, functional analysis, Cox proportional risk regression analysis etc. We applied difference analysis and single factor Cox proportional risk regression analysis to screen differently expressed and related to prognosis lncRNAs and genes from 1104 tumoral and 113 normal breast tissues in The Atlas Cancer Genome (TCGA) breast cancer dataset. Then, the intersection genes gained by the collection analysis were applied to the Least Absolute Shrinkage and Selection Operator (LASSO) and the Multivariate stepwise Cox regression analysis to establish the prognosis model. Finally, ve biomarkers (3 lncRNAs and 2 mRNAs) of breast cancer were identi ed that were thought to be probably important prognostic features in breast cancer. According to the median risk score, patients were divided into high-risk group and low-risk group, and survival analysis was conducted to evaluate the prognostic value of risk score. To verify the accuracy of the above bioinformatics analysis, RT-PCR experiments were applied. Besides, we applied xCell analysis for TME components comparison between high-risk and low-risk groups. Finally, in order to initially clarify the mutual regulation between the ve biomarkers, we construct a ceRNA network. Isolation and identi cation of differentially expressed lncRNAs and genes not only helped to discover the function of genes, but also helped to reveal the pathogenesis of the disease.

Data acquisition
RNA-sequencing data, updated clinical data, and sample information of breast invasive cancer (BRCA) cohort were downloaded from the TCGA data portal (https://portal.gdc. cancer.gov/); A total of 1,276 specimens, consisting of 1,104 cancer samples and 162 normal samples, were obtained from TCGA. Use R (version 3.6.0) software to standardize and process data. According to the gene encoding proteins, the genes are divided into coding genes and non-coding genes, and corresponding counts are extracted for subsequent difference analysis.

Differentially expressed genes (DEGs) screening
We used the R package (R-3.6.3 https://www.r-project.org/) of "DESeq2" to analyze the differences between mRNA and lncRNA from the normal tissue samples and breast cancer tissues in the expressing data. The signi cance analysis with false discovery rate (FDR) < 0.05 and |log2 fold change (FC)| ≥ 2 was applied to select lncRNAs and mRNAs for further analysis.

Survival-related genes generation
The association between lncRNAs and mRNAs and patients' overall survival was analyzed in the TCGA cohort. R packages "survival"and "survminer" were applied for the survival analysis. 1,104 tumor samples out of 1,276 cases had a detailed survival time record, with time span from0 to 23.6 years, which were used for survival analysis. A univariate Cox proportional regression model was used to calculate the association between the expression of each lncRNA, mRNA and OS. P value < 0.05 & HR1.5(HR > 1.5 or HR < 0.67) were used to analyze the effect of lncRNA or mRNA on the prognosis of breast cancer. R package "ggplot2" is used for the volcanic diagram. Kaplan-Meier method was used to plot the survival curve, and log rank as the statistical signi cance test; p < 0.05 was considered signi cant.

GO and KEGG enrichment analysis
To search the key genes and pathways that were associated with breast cancer survival, the DEGs (differently expressed genes) were uploaded to the Database for Annotation, Visualization and Integrated Discovery (DAVID) to screen enriched biological themes, particularly GO terms and KEGG pathways [28,29]. P < 0.05 was set as the cut-off criterion. R package "Cluster Pro ler" was used to carry out Gene Ontology (GO) enrichment analysis including biological process(BP), cell components(CC) and molecular functions(MF) for the differentially expressed survival-related genes. The same tool is also used for the enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis.

Construction of the lncRNA/mRNA co-expression network
To construct the lncRNA/mRNA co-expression network, we calculated the Pearson correlation coe cient and R value to evaluate lncRNA-mRNA correlation. The network construction procedure includes: (1) Preprocess data: the same mRNAs with different transcripts taking the median value represent the gene expression values, without special treatment of lncRNAs expression value. (2) Screen data: remove the subset of data according to the lists showing the differential expression of lncRNAs and mRNAs. (3) Calculate the Pearson correlation coe cient and use R value to calculate the correlation coe cient between lncRNAs and mRNAs. (4) Screen by Pearson correlation coe cient: select the Pearson correlation coe cient ≥ 0.99 or ≤-0.99 as the meaningful value and draw (correlation>0.5) the lncRNA/mRNA co-expression network by using the cytoscape (3.5.1) software program.

Candidate genes selection
Hub genes, highly interconnected with nodes in a module, have been considered functionally signi cant. In our study, an interesting module was chosen, and hub genes were de ned by module connectivity. We de ned genes with the node connectivity > 2 (total nodes) as the hub nodes in lncRNA/mRNA coexpression network and were chosen as the candidates to be further analyzed and validated.

Feature Selection by LASSO analysis
LASSO is a machine learning algorithm in which both variable selection and regularization occur simultaneously. This model uses a penalty to shrink regression coe cients toward zero, a number of variables will be eliminated because their coe cients will shrink to exactly zero [11]. According to the collection analysis results by software venny, a batch of genes in modules that were closely related to the prognosis of breast cancer was obtained. In the present study, the survival-related lncRNAs and mRNAs identi ed were included in the LASSO regression analysis by using the R package "glmnet", and the penalty parameter "lambda" was selected to choose the best model based on leave-one-out crossvalidation, which is more suitable than tenfold cross-validation for a smaller number of samples. Finally, we extracted variables with nonzero coe cients and their corresponding coe cients.

Multi Cox proportional regression analysis and stepwise Cox
Combined with the overall survival rate of breast cancer patients in TCGA, the R package "survival" was used to perform multivariate COX regression analysis and stepwise COX multivariate regression analysis on the lncRNAs and mRNAs selected by LASSO to obtain lncRNAs and mRNAs (p value < 0.05). To make our model more optimized and practical, Muti Cox proportional regression analysis and a stepwise Cox proportional hazards regression model was used. Finally, a risk score formula was calculated by taking into account of the expression of optimized lncRNAs and mRNAs and correlation estimated Cox regression coe cients: Risk Score = 0.43203*LINC01497+0.69806*LINC02766-0.54019*LINC02528-0.25365*C20orf85-0.078*CST1. Patients with breast cancer were classi ed into the high-or low-risk group by ranking the given risk score. Differences in overall survival (OS) between the high-and low-risk group were assessed using Kaplan-Meier method and two-tailed log-rank test. A Cox proportional hazards regression model was used to identify independent prognostic factors. P < 0.05 was set as signi cant difference in all statistical methods.

Quantitation of the 5 biomarkers
Total RNA was extracted from the frozen tissues of patients using TRIzol reagent (Invitrogen, USA) following the manufacturer's protocol. Puri ed total RNA was reverse transcribed using the PrimeScript RT reagent kit (Takara, Japan). Quantitative real-time PCR (qPCR) of the 3 lncRNAs was performed using the Power SYBR Green PCR Master Mix (lnRcute lncRNA qPCR detection kit) according to the manufacturer's guidelines. The primers for the 3 lncRNA primers were as follows: LINC01497:forward: AAATCAAGGTGTTGGCTGGGCTAC, reverse: GTGTTGCTGGCTCCGAAGATGG; LINC02766: forward: AGGAAGTAGGCGGTGTGGAGTG, reverse: GACAGAGTGGGCGGGAGGAG; LINC02528, forward: ACCTACCGAGAGACCTCCAAACAG, reverse: ACCCCTCTTCATCTGGGCATCTG. Quantitative real-time RT-PCR of the 2 mRNAs was performed by speci c gene primers using Thermal Cycler Dice Real Time (Takara Bio Inc.) according to the protocol. The primers for the 2 genes primers were as follows: C20orf85, forward: GCCATGCCAGGGAAAGGAAGAG, reverse: GAAGGGTGACGCTGATGACTTGAG; CST1, forward: AGGAGGAGGATAGGATAATCCC, reverse: TCTTTGGTGGCCTTGTTATACT. The results were normalized to β-actin and then calculated with the ΔCT method, the primers were as follows: ACTB, forward: CCTGGCACCCAGCACAAT, reverse: GGGCCGGACTCGTCATAC. All data were expressed as the mean ± standard error of the measurement from at least three experiments. TME Cell type enrichment abundance analysis between high-risk and low-risk groups Components of the TME in high-risk and low-risk breast cancer samples from the TCGA cohort were assessed by applying xCell web tool (http://xcell.ucsf.edu/).The cell type enrichment score was calculated based on the TME gene expression data, the xCell tool provides 64 cell types, including immunocytes, stromal cells, stem cells, and other cells.

Construction of a ceRNA Network
MiRanda is an information resource for experimentally validated miRNA-target interactions that provides the expression pro le of a miRNA and its target gene http://www.microrna.org/microrna/home.do . The lncRNA-miRNA interactions were predicted with the LncBase Predicted v.2 (http://carolina.imis.athenainnovation.gr/diana_tools/web/index.php?r=lncbasev2%2Findex-predicted. The Pearson correlation coe cient (PCC) for miRNA-mRNA and miRNA-lncRNA was calculated using paired miRNA, mRNA, and lncRNA expression pro le data. A candidate pair of lncRNA-miRNA-mRNA was constructed based on the "ceRNA hypothesis" as follows: (i) mRNAs and lncRNAs share the same miRNAs; (ii) mRNAs and lncRNAs suggest a positive correlation when the PCC is higher than 0.3 and P value < 0.05; (iii) mRNAs and lncRNAs show a negative regulation with miRNA with PCC < 0 and P value < 0.05; and (iv) the miRNAs are aberrantly expressed in breast cancer. By integrating the lncRNA-mRNA, lncRNA-miRNA, and mRNA-miRNA pairs, the lncRNA-miRNA-mRNA ceRNA network was constructed and visualized using Cytoscape.

Statistical analysis
Student's t test was used for continuous variables, while categorical variables were compared by χ2 test. The logrank test was performed for comparing Kaplan-Meier curves between groups. The differential proportions of the TME cells in the TCGA cohort were evaluated by the Wilcoxon rank-sum test. The speci city and sensitivity of survival prediction according to the determined risk score were obtained by time-dependent receiver operating characteristic (ROC) curves, with AUC values quanti ed with the survival ROC package (https://cran.r-project.org/web/packages/survivalROC/index.html). We considered p <0.05 to be statistically signi cant.

Results associated
After the rst quality check by WGCNA R package, 7 tumor samples and 49 normal low-quality samples were removed from subsequent analysis, the remaining samples needed to be integrated and processed. Statistical analysis software R was used for preprocessing the differential expression analysis of microarray data. The data from the statistical results included the 753 differentially expressed genes,1315 differentially expressed lncRNAs. The volcano plot is used to show the signi cantly different expressions of the two sets of samples. The p-value and fold change values obtained by accurate T-test statistical analysis were used to draw the volcano map between the two groups (Figure1A, FDR < 0.05 and |log2 fold change (FC)| ≥ 2). In the volcano plot, one of the coordinates shows the negative log of pvalues computed by the T-test, and the other shows the converted log of p-values compared to the two conditions. Similarly, we can also nd that compared with the normal sample, the breast cancer samples down-regulated more in the differentially expressed genes and lncRNAs. Single variable Cox proportional risk regression analysis was performed to screen genes signi cantly associated with overall survival (OS) in the TCGA breast cancer dataset. A total of 874 genes and 3262 lncRNAs were signi cantly associated with prognosis. The set analysis software was used to take the intersection of the two (Venn diagram) to obtain 60 genes and 54 lncRNAs, which were not only differentially expressed between normal and breast cancer tissues, but also had prognostic differences in cancer tissues (Figure1B). Only the intersected genes and lncRNAs were selected for the subsequent bioinformatics.
In order to inquire about the potential signal pathways related to prognosis related genes in breast cancer, we screened the top30 prognosis related DEGs and analyzed them with GO and KEGG by R-packet "limma", and the screening criteria were | lgfc | > 2, and adj. P < 0.05. GO analysis results showed that prognosis related DEGs can be enriched in basic biological processes(BP), including neurological system process, sensory perception; in cellular component, extracellular space and chromosome were enriched; in molecular function, receptor binding and protein dimerization activity were enriched. KEGG analysis showed that the top30 prognosis related DEGs were mainly enriched in systemic lupus erythematosus, cytokine-cytokine receptor interaction, salivary secretion and chemokine signaling pathway and cardiac muscle contraction pathways ( Figure 1C).
Although accumulating studies have attempted to reveal the functional signi cance of lncRNAs, the biological roles of most lncRNAs are still unknown. Biological processes and cellular regulation networks are very complex, involving the interactions of various molecules, such as proteins, RNAs, and DNAs. We constructed an lncRNA/ mRNA co-expression network based on the correlation coe cient of lncRNAs and mRNAs and investigated the potential interaction between mRNAs and lncRNAs. The co-expression network was composed of 27 differentially expressed lncRNAs, 32 differentially expressed mRNAs and 11 network nodes ( Figure 1D). The network showed that several lncRNAs (AC104211.2, RBMS3-AS3, AC002401.3, AC063919.1 and BOK-AS1) correlated with a great number of targeted mRNAs, and vice versa. This co-expression network also indicated that one lncRNA (AC104211.2) could target 15 network nodes and one coding gene (LVRN) could target 17 network nodes. In addition, the second ranked lncRNA (RBMS3-AS3) and mRNA (KLHL33), could both target 15 network node. Taken together, these results suggested the closer inter-regulation of lncRNAs and mRNAs in breast cancer. Through the co-expression network we obtained 31

Preliminary Identi cation of Optimal Prognostic Biomarkers
LASSO logistic regression were used to screen the characteristic genes by appling the glmnet package. The 31 genes entered the LASSO regression analysis (Figure 2.A, B) and then further analyzed by Stepwise Multivariate Cox Regression analysis to establish a prognostic model for patients with breast cancer on the basis of lncRNA and gene expression levels. Finally, 3 lncRNAs and 2 genes (LINC01497, LINC02766, LINC02528, C20orf85, CST1) related to the prognosis of breast cancer were obtained. The risk scores for patients were calculated on the basis of the relevant RNA expression level and risk coe cient of each gene, and patients were categorized as low or high risk according to the optimal cutoff ( Figure 2C). The AUC of risk score was 0.634, which proved that the Cox model had an ability to predict prognosis and was an independent prognostic indicator ( Figure 2D). Scatter plot was drawn to show gene expression pro les in high-risk and low-risk groups (Figure2f). Kaplan-Meier curves of OS in all patients with breast cancers based on 3 lncRNAs and 2 genes expression showed that the survival time of patients with low-risk score was signi cantly longer than that of patients with high-risk score ( Figure 2E).

The 5 biomarkers were independent prognostic indicators
In an attempt to con rm the independent prognostic impact of individual lncRNA or gene, we performed univariate COX regression of the 5 screened variables, the results showed that the 5 biomarkers coul be independent prognostic indicator ( Figure 3A). Calculated by multivariate Cox regression model, hazard ratio with 95% con dence interval (95% CI) for independent 3 lncRNAs and 2 genes signature were shown in forest plot. The forest map results showed that the genes with HR>1 (ENSG00000237560(LINC01497), ENSG00000229484(LINC02766)) are considered to be dangerous genes, while the genes with HR<1(ENSG00000124237(C20orf85), ENSG00000170373 (CST1), ENSG00000226004(LINC02528)) to be a protective gene ( Figure 3B).
Composition differences of TME between high and low risk breast cancer groups and their associations with prognosis Of all breast cancer samples, high and low risk groups were eligible based on xCell analysis. The results showed that there were 44 cell types differently expressed between high and low risk groups, and the immune score and stroma score were also different between the two groups ( Figure 4A, p<0.05). The two most differently expressed components of the TME in breast cancer tissues were Epithelial cells and HSCs. Speci cally, the proportions of Adipocytes, Endothelial cells, HSCs, Fibroblasts were signi cantly lower in low risk score tissues compared with the high risk score tissues, while the proportions of M1 macrophages, MSCs, Th2 cells were signi cantly higher ( Figure 4B). Correlations among the components ranged from weak to moderate. Obviously, Astrocytes showed highly positive correlations with CD8+ Tcm and MSCs, Myocytes showed highly positive correlations with Mast cells, Macrophages M2 and Osteoblast ( Figure 4C). Then single variable Cox proportional risk regression analysis was performed to screen cells types signi cantly associated with overall survival (OS). The results showed that there were 13 differently expressed cell types between high and low risk groups were associated with survival, of which, aDC, CLP, Melanocytes, NKT, pDC, Tgd.cells were protection factors, while Endothelial cells, Hepatocytes, ly Endothelial cells, mv. Endothelial cells, Neurons, Neutrophils, Preadipocytes were associated with poor prognosis (Figure 5), indicating that the TME compositions have high sensitivity and speci city to predict the prognosis of breast cancer patients.

Validation of the 5 indicators expressions Using RT-PCR
To verify the reliability of the aberrant lncRNAs and mRNAs found in the TCGA database, a RT-PCR assay was used to detect the expression levels of the 3 lncRNAs and 2 mRNAs in 25 paired tumor tissues and paracancer tissues from patients with breast cancer. The results indicated that the levels of LINC01497 and LINC02766 were decreased in the tumor tissues, and the LINC02528,c20orf85 and CST1were upregulated( Figure 6A,n=10,tumor vs normal, Student's t test,*p<0.05). The results were consistent with the expressions in TGCA BRCA data.

Construction of the breast cancer-speci c ceRNA Network
The biological roles of most lncRNAs are still unknown, accumulating studies have attempted to reveal the functional signi cance of lncRNAs on gene by modulating miRNAs. To establish the ceRNA network, we obtained the miRNAs that might interact with both the 3lncRNAs and 2 genes. Based on the results of the online prediction, by integrating them, a ceRNA network containing 3 lncRNAs, 2 mRNAs, and 158 miRNAs was nally constructed ( Figure 6B). A total of 286 pairs of miRNA-lncRNA interactions containing 147 miRNAs and 3 lncRNAs, 70 pairs of miRNA-mRNA interactions containing 52 miRNAs and 2 mRNAs were contained in the ceRNA network. The degree of nodes in the ceRNA network was calculated and red color presents a highest degree, and purple presents a lowest degree.

Discussion
Worldwide, the incidence of breast cancer is increasing and high risk of recurrence and the high mortality raise concerns to women health. Great efforts are put by clinicians and researchers and progressions are seen in early detection, diagnosis, and treatments of breast cancer over the years which suggests a bene t from the combination of early detection and more effective comprehensive treatment of local disease with surgery, radiation therapy, endocrine therapy, and systemic treatment with chemotherapy and so on [2]. Nevertheless, early recurrence, distant metastasis and drug resistance are still commonly seen, which hold threads to the prognosis of breast cancer patients and mount challenges for clinicians [12][13][14]. Even though patients with the same molecular subtype of breast cancer receive the same therapy, they have different outcomes [15,16]. Hence, further researches were urgently needed to unravel the molecular mechanism underlying and discovering valuable prognostic biomarkers for breast cancer survival and thus to provide guidance for individualized treatment.
LncRNAs have been shown to be involved in mammary gland development, as well as breast cancer evolution [17,18]. More and more studies have con rmed that abnormal lncRNA expression is related to the progress of breast cancer and can be used as markers for cancer detection and prognosis, as well as targets for breast cancer treatment [19][20][21]. Compared with protein-coding genes, lncRNAs have signi cant advantages as diagnostic and prognostic biomarkers, the association between lncRNAs signature and breast cancer patients' survival has been assessed in various studies [22]. Many lncRNAs, such as lncRNA MALAT1,whose levels were found inversely correlate with breast cancer progression and metastatic ability, was related with the breast cancer patients' survival [23,24].Others such as HOTAIR, XIST,H19 U79277,AK024118, BC040204 and AK000974 have been shown to be aberrantly expressed in breast cancer and cell lines and have been recognized predictive of breast cancer patient survival [20,21]. Besides, a recent study has also identi ed 3 lncRNA genes (LINC00324, PTPRGAS1 and SNHG17) associated with tumor histologic grade, as well as clinical outcomes [25]. However, accurate prognostic markers are still lacking. In this study, we attempted to use lncRNA data, mRNA data and clinical followup data from the tumor genome atlas to constructed a prognosis-speci c lncRNAs/mRNAs co-expression network based on correlation analysis to identify breast cancer prognostic factors and provide potential therapeutic targets for treatment.
In this study, breast cancer-speci c genes and lncRNAs were identi ed through bioinformatics analysis of tens of thousands of candidate RNAs, and by applying single variable Cox proportional risk regression analysis to select the genes that were related with the survival. Then the ones that were both abnormally expressed and related with survival were obtained. Through lncRNA/mRNA co-expressed network construction, 15 lncRNAs and 16 mRNA hub genes were gained. The genes entered to LASSO and multivariate Cox proportional hazard regression analysis, nally, 3 lncRNAs (LINC01497, LINC02766, LINC02528) and 2 mRNAs (C20orf85, CST1) were selected as prognosis predictive genes. According to the median risk score of the 5 genes, patients were divided into high-risk group and low-risk group, patients with low-risk score was signi cantly longer than that of patients with high-risk score. The AUC of risk score was signi cantly large, which proved that the Cox model had better ability to predict prognosis. Besides, through univariate COX regression analysis of the 5 screened variables, we con rmed that the 5 screened variables can be independent prognostic indicator. The three lncRNAs have not been reported in previous studies so far and it was the rst time that they were identi ed as prognostic indicators. Cystatin SN (CST1),one of the 5 screened prognostic indicators, is a speci c inhibitor of cysteine proteases with the activity of protein degradation and was reported associated with a diversity of diseases and facilitates the development and progression of cancer cells [26]. Various research have inferred that CST1 acts as an imperative role in tumorigenesis and tumor suppressors [27,28]. For instance, upregulation of CST1 promotes gastric carcinoma cells proliferation and inhibits cathepsin activity [29]. High CST1 expression was also reported by previous studies to be linked to poor survival in colorectal cancer [30], pancreatic cancer [31], and non-small cell lung cancer patients [32]. In this study, our results showed that CST1 was upregulated in breast cancer tissues, which was consistent with previous reports. However, its high expression was related with a longer survival time. In order to explore the reason, we analyzed the CST1 expression in TCGA BRCA subclasses data and found CST1 was upregulated most in luminal subtype and second in Her-2 positive subtype, while has a least increase in TNBC subtype. This implied that CST1 could serve as a feasible prognostic factor that related with tumor types and may be a new biomarker in breast cancer. C20orf85, also known as LLC1, was rstly found present in the normal lung epithelium but absent or downregulated in most primary nonsmall cell lung cancers and lung cancer cell lines, but was not studied before in breast cancers. Interestingly, this is the rst time to identify c20orf85 could act as a prognostic marker for breast cancer. Besides, to con rm the accuracy of the analysis results, we did RT-PCR experiments and the results were consistent with the bioinformatics analysis results, indicating our method to be reliable. However, what's the correlation between the 3 lncRNAs and 2 mRNAs as they were nally identi ed prognostic indicators. The studies on the participation of lncRNAs in gene regulation through the ceRNA mechanism, as in through the lncRNA/miRNA/mRNA axis, in the pathogenesis, progression, and metastasis accumulated over the past few years [33]. The competing endogenous RNA (ceRNA) hypothesis was rst proposed by Salmena and colleagues, lncRNAs can be enhancers, scaffolds, "sponges" that bind several miRNAs, or even precursors of some miRNAs to mutually regulate their expression or functions [34]. So a dysregulated lncRNA-associated ceRNA network was constructed based on the 3 lncRNA, and the 2 mRNA and the miRNA expression pro les by using an integrative computational method, and nally construct a ceRNA network, which initially reveals a possible regulatory mechanism between lncRNAs and mRNAs.
Increasing evidence demonstrated the importance of the tumor microenvironment (TME) a mixture that consists of mesenchymal cells, tumor-in ltrating immune cells (TIICs), endothelial cells, extracellular matrix molecules and in ammatory mediators in the tumor development. Structural components of the TME are mainly resident stromal cells and recruited immune cells, while there was compelling evidence for the role of stromal cell contributing to tumor angiogenesis and extracellular matrix remodeling, but perhaps it is still not fully understood [35]. Collaborative interactions between cancer cells and their supporting cells contributed to the malignant phenotypes of cancer, such as immortal proliferation, resisting apoptosis, and evading immune surveillance, therefore, the TME signi cantly in uences therapeutic response and clinical outcome in cancer patients [36]. In the TME, TIICs constitute the major type of non-tumor components reported to be valuable for prognostic assessment in OS [37]. Prognostic or predictive biomarkers, associated with tumor immune microenvironment, may have great prospects in guiding patient management, identifying new immune-related molecular markers, establishing personalized risk assessment of breast cancer [38]. Thus, improving immunotherapy e cacy in breast cancer by systematically assessing the TME's immune properties and determining components of TME distribution in high and low risk group patients is of prime importance. Cell type Identi cation by xCell was used to assess the levels of components of TME in large amounts of heterogeneous samples [14]. Cell has been successfully applied for identifying components of TME cells landscapes and their associations with prognosis in neuroblastoma and triple-negative breast cancer [39,40]. Large-scale public data with gene expression and clinical information, complete biological databases, and sophisticated high-throughput data analysis methods together provide opportunities for identifying broader prognostic features in breast cancer biology. In this study, to understand the dynamic modulation of the immune and stromal components in TME in high and low risk group patients, we applied xCell computational methods to compare the TME components in high-risk and low-risk groups. The results showed that the proportions of Adipocytes, Endothelial cells, HSCs, Fibroblasts were signi cantly lower in low risk score tissues compared with the high risk score tissues, while the proportions of M1 macrophages, MSCs, Th2 cells were signi cantly higher. Subsequently, optimal TME cells were selected as prognostic indictors by univariate Cox regression analyses. It has been demonstrated that 13 types: Neutrophils, NKT, pDC, ly Endothelial cells, Melanocytes, Endothelial cells, Tgd cells, Hepatocytes, mv Endothelial cells, Neurons, Preadipocytes, CLP, aDC were differently expressed both between tumor and normal groups and high risk and low risk groups, suggesting the proportions of TME were associated with the outcome in breast cancer patients. It adds strong evidence that tumor-associated immune genes may become potential targets for cancer therapy.

Conclusions
Using bioinformatics analysis both in lncRNA and mRNA levels, we focused on survival-related genes of breast cancer based on large data of real samples, and screened ve potential prognostic targets. Although the prognostic markers identi ed in our current study may still need further researches and validations, they may provide different insights into prognostic monitoring for breast cancer.