Exploring Long Non-coding RNA Potential Markers for Prognosis Risk based on Double Hormone Receptor Positive Breast Cancer CeRNA Network

Purpose: The aim of this study is to search for key long non-coding RNA (lncRNAs) able to inuence the prognosis of double hormone receptor positive (ER+PR+) breast cancer by constructing a competing endogenous RNA (ceRNA) topology network of RNA-seq data and miRNA-seq data. Materials and methods: Based on The Cancer Genome Atlas (TCGA), we downloaded RNA-seq data and miRNA-seq data of ER+PR+ breast cancer and obtained lncRNA-miRNA relationship pairs by comparing with the data in moRcode database. MiRDB, miRTarBase and TargetScan databases were used to predict target genes, and thereby gaining miRNA-mRNA relationship pairs. The ceRNA network was constructed using Cytoscape, then functional enrichment analysis was performed. Cox survival analysis was carried out to explore the potential prognostic lncRNAs. Results: A ceRNA topology network was constructed and six lncRNAs, PSORS1C3, AC010536.2, IQCF5-AS1, AL021707.1, FGD5-AS1, and PART1, were found to be mostly associated with the survival of ER+PR+ breast cancer. Conclusion: Six lncRNAs with prognostic signicance were identied in ER+PR+ breast cancer, providing new ideas for the diagnose and treatment of this subtype.


Introduction
Breast cancer (BC) is one of the most common malignant tumors in women and consists of highly heterogeneous subtypes (1)(2)(3). According to statistics, the incidence rate is 8-12% among all body malignant tumors. Its incidence is considered much related to heredity and is higher among women aged 40-60 and perimenopause (4,5). Breast cancer is de ned by distinct steroid hormone receptors (HR): estrogen receptors (ER) and progesterone receptors (PR) status. HR status has been shown to be a signi cant prognostic factor and predictive marker in response to endocrine therapy during breast cancer treatment(6, 7). The double hormone receptor positive (ER + PR+) breast cancer accounts for the vast majority of breast cancer and is known to have a favorable prognosis and good response to endocrine therapy compared with the double negative phenotype (8,9). With the continuous development of science and technology, great progress has been made in the diagnosis and treatment of breast cancer, the survival rate of patients has also been greatly improved (10,11). However, there are still a small number of ER + PR + breast cancer patients who are less sensitive to endocrine therapy leading to a poor prognosis, the reason still remains unknown. Recently, more and more researches have focused on the study of tumor biomarkers. Biomarkers with predictive and diagnostic signi cance have provided new clinical ideas for diagnosing and treating this breast cancer phenotype (12)(13)(14).
For the past few years, the abnormal expression of protein products during transcription are attracting widespread attention, long non-coding RNA (lncRNA) has gradually become one of the hot issues in cancer research (15). LncRNA is one of the two categories of noncoding RNAs and is longer than 200 nucleotides in length(16). Another category is microRNA (miRNA) with small or short RNA length that can induce mRNA degradation or inhibit translation to silence target gene expression (17). With the rapid development of whole genome and transcriptome sequencing technologies, increasing evidence demonstrates that lncRNAs play signi cant regulatory roles in various biological processes, including cancer progression (18)(19)(20). According to recent studies, lncRNAs form a competitive endogenous RNA (ceRNA) through sponge adsorption of miRNA to regulate mRNA expression (21). The ceRNA hypothesis believes that lncRNAs and other transcripts can competitively bind miRNAs to prevent them from binding to mRNA thereby regulate mRNA expression (22).
At present, the mechanism of ceRNA interactions in double hormone receptor positive breast cancer is not clear yet. Exploring the interaction relationship is of great signi cance for elucidating the mechanism of breast cancer development.

Data download and preprocessing
We downloaded breast cancer RNA-seq data and miRNA-seq data from the TCGA database. The RNA-seq data contained 113 normal samples and 1102 cancer samples. Likewise, the miRNA-seq data contained 104 normal samples and 1096 cancer samples. Among these samples, we ltered out 12 male samples and then selected out data with both RNA-seq and miRNA-sEq. Both estrogen and progesterone receptors should be positive in cancer samples, thus we got the nal 753 samples, including 102 normal samples and 651 cancer samples. Then we extracted the mRNA expression pro le and the lncRNA expression pro le from the matrix les obtained from the RNA-seq data. Since many miRNAs express 0 in most samples and these miRNAs may affect our follow-up relationship construction, we removed miRNAs with a ratio of 0 to 75% in all samples. Therefore, we nally obtained three matrix les of mRNA, lncRNA and miRNA expression pro les and ready for downstream analysis.

Forecast follow-up relationship
MiRcode is a database website to predict miRNA targets based on a complete GENCODE annotation transcriptome of human. Through this database prediction, we got the lncRNA-miRNA relationship les. MiRDB is a miRNA target gene prediction database based on high-throughput sequencing experiments.
MiRTarBae is a miRNA targets database validated by experiments. TargetScanS predicts miRNA target genes by searching for conserved 8mer and 7mer sites that match the miRNA seed sequence. These three databases are currently more used to predict miRNA target genes. We used miRDB, miRTarBase and TargetScan these three databases to predict the target genes of our miRNAs, and de ned the genes predicted concurrently by these three databases as our nal target genes. In this way, we got the miRNA-mRNA relationship les.

Screening relationship
Based on the ceRNA hypothesis, lncRNAs can competitively bind miRNAs to prevent them from binding to mRNA in the model of ceRNA. Therefore, the expression level of lncRNA that can theoretically exert ceRNA function is inversely related to the expression level of miRNA. The mRNA expression level in the ceRNA relationship is also negatively correlated with the miRNA expression. We calculated the Pearson correlation coe cient of lncRNA and miRNA in lncRNA-miRNA relationship pairs, and screened out the pairs with signi cantly negative correlation (P < 0.05). Similarly, we calculated the Pearson correlation coe cient and its P value of miRNA and mRNA in miRNA-mRNA relationship pairs, and then screened out signi cant negative correlation (P < 0.05) relationship pairs. These pairs acted as preselected ceRNA regulatory relationship pairs.

Building a ceRNA network
The ceRNA network is a lncRNAs and mRNAs network using miRNA as a medium. In this network, each miRNA must be linked to lncRNA and mRNA simultaneously. That is to say, in our relationship les, only the lncRNA-miRNA relationship les or the miRNA-mRNA relationship les exist. Thus, we further ltered out the incompatible relationships between the two relationship pairs, combined with the screened lncRNA-miRNA and miRNA-mRNA pairs, visualized the topological network with Cytoscape software, and nally obtained a topological network diagram of ceRNAs.

Analysis of mRNA enrichment in ceRNA networks
To observe the function of the constructed ceRNA network, we selected all the mRNAs in the network and used the R package clusterPro ler 8 to perform functional enrichment analysis.

CeRNA network topology analysis and stability analysis
The network topology quality is analyzed by using the plug-in NetworkAnalyzer tool kit in Cytoscape. NetworkAnalyzer network analysis mainly includes graphs of network diameter, number of connections, average aggregation coe cient, and so on. In this study, we calculated the number of connections, the length of the paths, and the proximity of the nodes.

Prognostic analysis of ceRNA modules
We downloaded the clinical information of the above samples from the TCGA website and extracted the survival data of each sample. Then, combined with the expression data obtained before, Cox survival analysis of lncRNA nodes in the network was carried out to explore the potential lncRNA prognostic markers.

LncRNA prognostic model construction
We used the lncRNA with prognostic value to further construct a prognostic model, and observed the impact of its classi cation on the prognosis.

Relationship construction and screening
We have a total of 14370 expression pro les of lncRNAs. Using the miRcode database, we obtained 7098 lncRNA-miRNA pairs containing 1449 lncRNAs and 35 miRNAs. Also, we had 2253 expression pro les of miRNAs and removed samples with a ratio of 0 to 75%, 875 miRNAs were left. Using miRDB, miRTarBase and TargetScan databases, we performed target gene prediction on our miRNAs and got 19571 relationship pairs. Among the three databases, 5311 genes and 780 miRNAs were predicted as target genes at the same time.
Based on the ceRNA hypothesis, lncRNAs can competitively bind miRNAs to prevent them from binding to mRNA in the model of ceRNA. Therefore, the expression level of lncRNA that can theoretically exert ceRNA function is inversely related to the expression level of miRNA. The mRNA expression level in the ceRNA relationship is also negatively correlated with the miRNA expression. We calculated the Pearson correlation coe cients of lncRNAs and miRNAs in lncRNA-miRNA relationship pairs, and screened out 240 pairs with signi cantly negative correlations (P < 0.05). Similarly, we calculated the Pearson correlation coe cients and their P-values of miRNAs and mRNAs in miRNA-mRNA relationship pairs, and screened out 2883 relationship pairs with signi cant negative correlations (P < 0.05).
The ceRNA network is a lncRNAs and mRNAs network using miRNA as a medium. In this network, each miRNA must be linked to lncRNA and mRNA simultaneously. That is to say, in our relationship les, only the lncRNA-miRNA relationship les or the miRNA-mRNA relationship les exist. We further ltered out the incompliant relationships in the two relationship pairs, combined with ltered lncRNA-miRNA and miRNA-mRNA pairs, nally we got 227 lncRNA-miRNA relationship pairs and 171 miRNA-mRNA relationship pairs. Some of the relationships are shown in Table 1-2.

CeRNA network construction and visualization
Combined with the 227 pairs of lncRNA-miRNA and 171 pairs of mRNA-miRNA obtained above, we used Cytoscape software to visualize the network and, and nally got a ceRNA network diagram. There were 398 edges and 326 nodes in this network, including 146 lncRNAs, 20 miRNAs, and 160 mRNAs (Fig. 1).

Topology analysis and stability analysis of ceRNA networks
For the topology analysis of the constructed ceRNA network, Fig. 2A showed the distribution of node degrees. The degrees of most nodes are very low, and only a small part of the nodes is high, which suggests that those nodes with higher degrees act as hubs for the entire network.
The closeness (Centrality, CC) of a node can calculate the number of connection steps between a given node and other nodes. A more concentrated node will get a higher score, so CC means the shortest path. As shown in Fig. 2B, nodes with the relatively lower number of connections are relatively concentrated, and some points with higher degrees of connectivity are relatively scattered.
The path can re ect the combination of all nodes in the network. Figure 2C showed the shortest path length distribution of the network. It can be seen in the gure that the distribution of the path length is more concentrated, with fewer extremes at both ends, an upper limit of 10, and a lower limit of 1, indicating that most of the nodes in the network can be connected by short path correlation. Figure 2D showed the degree distribution density graph of nodes. We nd that as node degrees increase, the density of nodes decreases sharply, suggesting that most of the nodes in the network show isolation.
In the process of disease, it may be due to changes of several key nodes, and then interact with neighboring nodes to induce coexpression, and nally affect the downstream biological processes.

MRNA enrichment analysis (GO and KEGG)
GO (Gene Ontology) analysis shows that most of the enriched biological processes (BP) are all associated with the cell cycle, suggesting that changes in the cell cycle and regulation are involved in the development of breast cancer (Fig. 3A).
MAPK signaling pathway ranked rst in the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analysis. Studies have shown that MAPK signaling pathway mainly increases the expression of cyclin D1, promotes the transition from cell cycle G1 to S phase and accelerates cell division and proliferation (23,24). It can also inhibit the expression of cell cycle arrest proteins, release cell cycle arrest and promote cell division, which is consistent with the results of our GO analysis. In addition, bladder cancer and pancreatic cancer also appeared in other items, suggesting that mRNAs are closely related to cancer (Fig. 3B).

Network node survival analysis and potential markers of lncRNA
We performed Cox survival analysis on 146 lncRNAs in the network, and nally screened 6 lncRNAs with signi cant prognostic signi cance. The result was shown in Table 3. Next, we carried out univariate regression analysis on 6 prognostically differentiated lncRNAs. A forest map was shown in Fig. 4, the high expression of 6 prognostically signi cant lncRNAs was associated with unfavorable prognosis. The prognosis model of lncRNA Further, we used these 6 lncRNA expression pro les for multivariate Cox regression to construct a prognostic model. Risk Score = − 2.67×PSORSIC3 − 0.71×AC010536.2 + 0.01×FGD5-AS1 − 1.46×IQCF5-AS1 + 0.18×PART1 − 0.89×AL021707.1. We substituted each lncRNA expression pro le into the model to calculate the risk coe cient of each sample, and classi ed the samples according to the median risk factor. It can be seen from the gure that the number of deaths in high-risk group was signi cantly higher than that in the low-risk group, and the higher the risk coe cient, the lower the expression of the lncRNAs in PSORS1C3, AC010536.2, IQCF5-AS1, AL021707.1, FGD5-AS1, and PART1 (Fig. 5).
To observe the prognostic differences between the high-risk and low-risk samples of the model, we performed survival analysis and plotted K-M curves for the high-risk group and the low-risk group (Fig. 6A). There are signi cant prognostic differences between the two groups of samples. To further observe the heterogeneity and stability of the model, we analyzed the ROC curve of the model (Fig. 6B).
The prognostic model constructed by the 6 lncRNAs has a high AUC area and can signi cantly distinguish the risk of prognosis, suggesting that the 6 lncRNAs can be served as good prognostic markers.

Discussion
Recently, ceRNA hypothesis has been demonstrated to represent a novel model of post-transcriptional regulation working through miRNA competition (25). In the ceRNA network, ceRNA can inhibit multiple miRNAs and regulate the network by changing the type and number of miRNA binding sites, involving various signaling pathways and biological phenomena (22,25). Further study on the mechanism of ceRNA will help to deepen the understanding of the pathogenesis of the disease, and has a good application prospect in disease treatment.
With the increasing attention to lncRNA genes and related researches, more and more new lncRNAs have been found to be regulators in ceRNA gene expression network. Although a large number of studies have shown that the expression of lncRNAs plays an important role in the development and progression of tumor, including cell growth, apoptosis, cell migration and invasion, as well as cancer cell stemness, the precise molecular mechanism is still not well studied(26-30). Generally, lncRNAs can be identi ed as both carcinogenic regulators and tumor suppressors, depending on the complicated context of tumor (31)(32)(33). One mechanism of lncRNAs has been reported to be a competitive endogenous RNA (ceRNA). It has been reported that lncRNAs can regulate miRNAs expression by sponging miRNAs and further mediate their downstream genes (34)(35)(36). For example, lncRNA GACAT3 can predict poor prognosis and promote cell proliferation by miR-497/CCND2 axis in breast cancer (37). LncRNA H19 and HOTAIR sponge miR-200 and miR-20a-5p respectively to promote breast cancer development (38,39). A functional analysis of the ceRNA networks in four subtypes of breast cancer (basal-like, HER2+, luminal A and luminal B) shows that three common lncRNAs, NEAT1, OPI5-AS1 and AC008124.1, play speci c roles in each subtype by competing with diverse mRNAs (40).
This article analyzes the RNA-seq and miRNA-seq data of double hormone receptor positive breast cancer in TCGA public database, constructs a ceRNA network, analyzes the network topology, and analyzes the survival of the intrinsic nodes in the network. Finally, six lncRNA markers that may in uence the prognosis, PSORS1C3, AC010536.2, IQCF5-AS1, AL021707.1, FGD5-AS1, and PART1, are found. Several analyses have revealed FGD5-AS1 as a potential regulator of clear cell renal cell carcinoma progression (41), and another integrated analysis of lncRNA-associated ceRNA networks in periodontitis found that FGD5-AS1 targets both miR-125a-3p and miR-142-3p (42). PART1 expression can effectively predict the early recurrence of hepatocellular carcinoma after surgical resection (43). PART1 is considered as a new target for the treatment of prostate cancer by inhibiting toll-like receptor pathways to promote cell proliferation and apoptosis (44). PART1 may also be a promising biomarker in stage I-III non-small cell lung cancer since its expression is associated with poor prognosis and tumor recurrence (45). Additionally, the expression of PART1 is related to OS in oral squamous cell carcinoma(46). The rest lncRNAs are not well studied yet.
In conclusion, our research strategy using a combination of bioinformatics and multiple databases provides a new methodological reference for the study of tumor mechanism. By targeting the ceRNA regulatory network, lncRNA and other key molecules have been identi ed as novel prognostic markers that will help to select appropriate therapies to improve the survival outcomes of ER + PR + breast cancer patients. However, there are still some shortcomings in this article. Due to the mismatch in the amount of data between normal samples and cancer samples in TCGA database, our results need to be further veri ed by extended sample and cellular and molecular experiments.

Competing Interests
The authors declare that there are no con icts of interest.

Availability of data
The data were downloaded from TCGA database.

Author Contributions
Qi Huang, Literature research, Data analysis, Manuscript preparation. Jingjing Wu, Data extraction. Yao Peng, Manuscript editing. Na Li, Study design.

Ethical approval
The studies involving human participants were reviewed and approved by all data are from public database on the internet.  The relationship between the expression of 6 lncRNAs and risk scores. The horizontal axis represents the sample's risk coe cient and the score increases from left to right. a: The lncRNA risk score distribution of ER+PR+ breast cancer samples in TCGA database. b: Survival status and time of all ER+PR+ breast cancer patients across lncRNA risk score. c: Heatmap and hierarchical clustering of lncRNAs from the prognostic risk model are generated.