Construction of a Novel 3-lncRNA Risk Score System for the Prognostic Prediction of Triple-Negative Breast Cancer

Background (cid:0) Triple-negative breast cancer (TNBC) is an essential type of breast cancer (BC). Compared with other molecular subtypes of BC, TNBC has the features of fast tumor increase, quick recurrence and natural metastasis. It is more urgent to establish a comprehensive evaluation system containing multiple biomarkers than single parameter. Methods (cid:0) We conduct a bioinformatics analysis on 13 BC expression datasets from the Gene Expression Omnibus (GEO), which covered 2950 samples. We took 3484 genes with a more signicant difference between TNBC and normal-like candidate genes for weighted correlation network analysis (WGCNA). A total of 54 genes were chosen as hub genes with great connectivity with the TNBC signicant module. Based on The Cancer Genome Atlas (TCGA) data, we identify the best prognostic three lncRNA. Multivariate Cox regression was used to construct a 3-lncRNA risk score model. We evaluated prognostic capacity using time-dependent subject operating characteristics (ROC) and Kaplan-Meier (KM) survival analysis. The predictive power of the model was demonstrated by the time-dependent ROC spline and Kaplan-Meier spline. At the same time, it also shows good predictive ability in the validation set. Ultimately, Functional enrichment analysis of hub genes and three lncRNAs were offered to advise the possible biological pathways. Results (cid:0) The construct LNC00337, DEPCE-AS1, DDX11-AS1 multi-factor risk scoring model was meaningfully associated with the prognosis of TNBC patients. Through survival analysis, the risk score eciently divided the patients into high-risk groups with poor overall survival. The time-dependent ROC curve revealed that the model presented robust in predicting survival over the rst 3 years. The validity of the model in the validation set is also veried. Finally, functional enrichment analysis proposed some biological pathways that may be correlated to the tumor. Conclusions (cid:0) In our study, we established a lncRNA-based model to prognosticate the prediction of TNBC, which might afford a strong prognosis estimate tool to help therapy policy-making in the clinic. interleukin-7-mediated signaling pathway, PD-L1 expression and PD-1 checkpoint pathway in cancer and PI3K-Akt signaling pathway. The GSEA enrichment results revealed that the genes of blue modules were enriched in transporter activity, Fischer dream targets, transmembrane transport and transmembrane transport activity.


Introduction:
Breast cancer (BC) is a frequently diagnosed malignancy and the leading cause of cancer death among females worldwide [1]. Triple-negative breast cancer (TNBC) refers to breast cancer that does not express the genes for estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (Her2/neu), constitutes 15-20% of all breast cancers [2,3]. TNBC in young women with large tumors, high lymphatic metastasis rate, and high clinical stage. Thus, there is a momentous crucial in establishing a strong prognosis estimate tool for predicting TNBC and making conservative judgments based on clinical features. Biomarkers used to predict the occurrence of cancer range from clinical features, endogenous substances and histopathological characteristics of tumors to speci c mutated genes [4]. For example, the tumor node metastasis (TNM) classi cation system is most widely used to estimate prognosis and guide treatment in patients with cancer [5]. In addition, studies have shown that some single biomarkers have been closely associated with the occurrence and development of TNBC, such as STAT3 [6], miR-105/93-3p [7], and CYPOR [8]. However, due to the features fast tumor increase, quick recurrence and natural metastasis [9], it is challenging to predict TNBC patients' survival with a single parameter. Therefore, establishing a comprehensive prognostic assessment mode, including multiple biomarkers, is the best way to increase prediction power.
In this work, we constructed a model to predict the OS of TNBC patients based on a variety of prognosticrelated lncRNAs. We reused publicly available breast cancer datasets from the Gene Expression Omnibus (GEO), and differentially expressed genes (DEGs) were screened. These DEGs were then adopted to screen gene modules highly coordinated and closely related to TNBC through weighted correlation network analysis (WGCNA). Three seldom reported hub genes, LINC00337, DEPDC1-AS1, and DDX11-AS1, were selected to construct the prognosis scoring model. The risk score was calculated by multiplying the multivariate Cox factor by the gene expression. Training sets and validation sets were conducted to validate the risk scoring model. The risk scoring model was estimated by time-dependent ROC analysis. Finally, the potential biological pathways of hub genes and 3 lncRNAs were identi ed by functional enrichment analysis. The 13 BC expression datasets were downloaded from GEO, including GSE20685, GSE21653, GSE12276,   GSE42568, GSE58812, GSE102484, GSE27830, GSE45827, GSE76124, GSE65194, GSE43365, GSE36771 and GSE31448. The datasets were reviewed manually to ful ll the following criteria, including, ( ) Studies involved in selection samples >100 only, ( ) Studies without any drug treatment. The datasets were based on the GPL570 platform.

Data resource
All the available BC mRNA datasets were searched for the TCGA data portal and obtained using the R package TCGAbiolinks [10]. Clinical information was downloaded and extracted to molecular subtypes of breast cancer and statistical analyses. No approval from the ethics committee was needed because all the information was required from the TCGA database (https://portal.gdc.cancer.gov/).

Construct feature DEGs
The 13 series have good consistency after background correction, standardized, and summarization by RMA arithmetic. We also annotated the gene biotypes of these microarray probes by the R package biomaRt [11]. All breast samples were classi ed into different molecular characteristics. The PAM50 classi er was used to dived BC samples of Basal-like, Luminal A, Luminal B, Her2, and Normal-like molecular characteristics by the R package geneFu [12]. Among them, Basal-like is also called TNBC. The limma package was used to analyze the DEGs in the other four subtypes, respectively, compared to the normal-like subtypes of BC [13]. DEGs were obtained according to the criteria: adjusted top10,000. The feature DEGs with speci c differences between Basal-like and Normal-like were obtained by removing the same gene probes as the other three DEGs sets.
WGCNA co-expression network construction WGCNA analyzed these feature DEGs, and pair-wise correlation matrices were constructed [14]. Using average cascading hierarchical clustering, we classi ed the genes with high absolute correlation according TOM-based difference measure and the gene modules with the minimum size of 25. A heat map is drawn to show the correlation and independence of each module. Then, using correlation analysis, the most closely related modules with TNBC were selected for further analysis.

Identi cation and validation of Hub Genes
Hub genes usually refer to characteristic genes that are highly associated with other genes in the module. The module connectivity of each gene in relevant modules was characterized by the absolute value of module membership (MM) after the module was associated with the BC subtype. We calculated the fabs of gene signi cance (GS), which described the Pearson's correlation between a given gene and the molecular characteristics of BC. The selection criteria for hub genes were based on absolute GS> 0.2, absolute MM > 0.8. We will also use TCGA datasets to identify the expression levels of selected optimal prognostic-relate genes of TNBC.

Construction and estimation of prognosis scoring model based on Hub Genes
We t a multivariate Cox PHR model to the datasets with overall survival (OS) time to construct a lncRNA signature [15]. The best cut-off for TNBC samples strati cation by this signature. Subsequently, the risk score coe cient was calculated based on the regression coe cient and expression level of prognostic lncRNA and was used as a survival risk measurement standard. The multivariate Cox regression model as following: Here, the coef represents the Cox PHR coe cient of lncRNA, and the expr represents lncRNA expression level.
We t the lncRNA risk score model to the merged dataset of GSE42586 and GSE58812 with OS time to construct a lncRNA signature. According to the optimal cut-off value, TNBC samples in the data set were classi ed into high-risk and low-risk groups. Recurrence-free survival was compared between the highand low-risk groups by KM survival curve analysis and log-rank test with the Survival package in R [16]. The result was considered to gure a statistically signi cant difference in P <0.05. Additionally, tROC analysis was executed the predictive precision of multi-lncRNA prognostic signature and risk score.
Validation of the multi-lncRNA prognostic signature To verify the predictive power and applicability of multi-lncRNA prognostic markers in TNBC, we collected TNBC-related TCGA (n=96) data sets as external validation. In the validation set, the coe cients of the three genes were used to calculate the risk score for each patient. Patients were divided into the high-risk group and low-risk group according to the median risk score of the training set. The prognostic characteristics of multiple lncRNAs were veri ed by KM survival analysis using a log-rank test and tROC analysis.

Functional annotation analysis
We carried out the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis of signi cant modules by the R package clusterPro ler [17]. We did Gene Set Enrichment Analysis (GSEA) analysis of signi cant modules through the R packages clusterPro ler. To identify potential functions as signi cant modules, input annotation the complete Molecular Signatures Database (MSigDB) of mapping, the cut-off level of adjusted P < 0.001.

Acquisition of BC expression datasets through array reannotation
The detailed work ow is shown in Fig. 1. In this study, the 13 expression pro les data sets were included representing a total of 2950 samples of BC (we deleted 17 normal samples from GSE42586 and 25 normal or cell lines symbols from GSE65194) were obtained (Additional le 1: Table S1). There were two datasets available with survival information, including GSE42586 and GSE58812, with the OS. All data sets were normalized using RMA arithmetic. Getting speci c DEGs of TNBC The other four, respectively, compared to normal-like subtypes, were selected and underwent DEGs analysis. Those candidate genes were set as background lists to get an intersection list of DEGs using Venn diagrams. The 3,468 DEGs with speci c differences between Basal-like and Normal-like were obtained by removed the same gene probes come from the other three DEGs sets, which included 70 upregulated and 39 downregulated genes (Fig. 2a, Fig. 2b, Additional le 2: Table S2). The expression levels of all the genes were demonstrated, the top 100 DEGs (Head 50 and tail 50 genes) were rendered that were well clustered different subtypes of BC in the heatmap (Fig. 2c).

WGCNA co-expression network construction
Here, the weighted gene co-expression module was constructed from the 3484 DEGs, which included all TNBC subtypes samples of BC using the WGCNA algorithm. According to the scale-free topology criterion, we predicted a plot identifying scale-free topology in simulated expression data when the power value of β = 16 (scale-free R 2 = 0.95).
A total of 5 co-expression modules were identi ed (module size, ≥ 25; cut height, ≥ 0.99) by the average linkage hierarchical clustering based on TOM dissimilarity measure (1-TOM) stability, and every module was indicated in different colors (Fig. 3a). The four enriched modules (excluding grey modules that no genes had been assigned to any of the modules) have better connectivity (Fig. 3b). We delineate the eigengene network using a hierarchical cluster tree and a heatmap plot. A co-expression network of the associations between the subtype of BC and these modules was constructed (Fig. 3c). It is worth noting that we chose the blue module with the most signi cant correlation with TNBC as the relevant module and carried out the next analysis. There was a signi cant correlation between gene signi cance (GS) and module membership (MM), indicating that critical genes of the blue module were also highly associated with TNBC. (Fig. 3d).

Construction and estimation of prognosis scoring model based on 3 lncRNAs
We speculated that the identi ed complex 3-lncRNA signatures forcefully contribute to the prognosis of TNBC. Accordingly, we used this complex 3-lncRNA signature as an alone predictive mark to predict the risk of sample survival probability. We t the lncRNA risk score model to the merged dataset of GSE42586 and GSE58812 with OS time to construct a lncRNA signature, the following formula: Risk Score = 0.2882*expr (LINC00337)-1.1816*expr (DEPDC1-AS1)-0.3307*expr (DDX11-AS1).
To assess the robustness of the 3-lncRNA signature in predicting the risk of tumor recurrence for TNBC samples, the predictive power of the 3-lncRNA signature was then tested in the merged dataset. Next, the risk score of each patient was calculated, and patients were classi ed into a high-risk group (n = 46) and a low-risk group (n = 57) by the mid-cut point (Fig. 5a). The survival status of all patients and the heat maps of the three prognostic genes in the merged data set are shown in Fig. 5b and Fig. 5c, respectively.
The KM survival curve showed that the high-risk group's OS was worse than that of the low-risk group (Fig. 5d). Besides, in the time-dependent ROC analysis, the prognostic characteristics of the three lncRNAs showed a more substantial area under the curve (AUC) value (Fig. 5e), indicating that the multi-lncRNA model had greater predictive ability in 1-year and 2-year OS.

Validation of the multi-lncRNA prognostic signature
To verify the predictive power and applicability of three-lncRNA prognostic markers in TNBC, we collected TNBC-related TCGA (n = 96) data sets as external validation. In the validation set, the three lncRNA coe cients were used to calculate the risk score for each patient. Consistent with the results in the training set, The KM curves of the external test sets were consistent with the results of the training set, and the high-risk group had a worse prognosis than the low-risk group (Fig. 6a). Time-dependent ROC analysis pointed out that AUC for 1-year, 2-year, and 5-year OS of the external validation set were 0.706, 0.828, and 0.731 (Fig. 6b). In summary, the prognostic characteristics of three-lncRNA showed an excellent performance in predicting OS in TNBC patients.

Functional annotation analysis
To further study the function of the blue modules containing three-lncRNA was highly correlated with TNBC, Gene Ontology (GO) term, KEGG pathway, and GSEA analysis were performed.
Biological processes analysis showed enrichment of GO terms associated with chromatin silencing at rDNA, protein heterotetramerzation, interleukin-7-mediated signaling pathway and response to interleukin, etc. (Fig. 6c and Additional le 3: Table S3). According to KEGG pathway analysis, our results indicated that these genes were mainly involved in PD-L1 expression and PD-1 checkpoint pathway in cancer, Transcriptional misregulation in cancer, Cytosolic DNA-sensing pathway, and PI3K-Akt signaling pathway ( Fig. 6d and Additional le 4: Table S4).
In the GSEA enrichment results (Fig. 7 and Additional le 5: Table S5), Results revealed that the genes of blue modules were enriched in transporter activity, Fischer dream targets, transmembrane transport and transmembrane transport activity. To sum up, these functional analysis results suggested that the blue module containing three-lncRNA was associated with the development and progress of TNBC.

Discussion
Breast cancer is the head cause of cancer death in women and easy to recrudesce. Among all the breast cancer subtypes, TNBC is more prevalent in young women, with large tumors, high lymphatic metastasis rate, and high clinical stage. Although BC treatment has developed during the last years, the knowledge to treat TNBC is still restricted due to lack expressed in ER, PR, and Her2. Although some studies have shown that some single biomarkers are strictly related to the occurrence and development of TNBC, it is complicated to predict the survival of patients with a single parameter. Therefore, establishing a comprehensive prognostic evaluation system, including multiple biomarkers, is the best way to improve prediction accuracy.
Although some BC related biomarkers have been proposed via WGCNA [18][19][20], it is rare to use WGCNA to further predict the potential biological targets of TNBC, especially in lncRNA. Tian et al. [21]only used coexpression analysis of lncRNAs with mRNAs to identify lncRNAs in TNBC. Dong et al. [22] identi ed the essential genes and pathways related to TNBC by DEGs and Protein-Protein Interaction (PPI) networks. These correlation network analyses only indicate whether multiple genes are related to each other. Moreover, the WGCNA can look for co-expressed gene modules and explore the correlation between gene networks and phenotypes of interest and the network's hub genes.
In this study, we tried to select more samples of breast cancer data from the GPL570 platform, a total of 2950 samples. Before the network analysis, we took 3484 genes with a signi cant difference between TNBC and normal-like candidate genes. After WGCNA analysis, we selected the blue module most related to TNBC, including 84 genes that were selected as hub genes with high connectivity. Additionally, based on the TCGA data, the expression levels of these three lncRNA genes were signi cantly higher in the TNBC tissues, including the LINC00337, DEPDC1-AS1, and DDX11-AS1 (p < 10 − 15 ).
As far as we know, no study has used screening methods like ours to screen for three novel DEGS, all of which were positive prognostic lncRNAs. DDX11-AS1 (DDX11 Antisense RNA 1) is an RNA Gene that named "cohesion regulator non-coding RNA," or CONCR [23]. DDX11-AS1 affected the gene expressions involved in HCC proliferation, differentiation, and cell cycle [24], which may be a novel oncogene in hepatocarcinogenesis by repressing LATS2 [25]. DDX11-AS1 is transcriptionally activated by MYC and is up-regulated in multiple cancer types, especially breast cancer and lung cancer [26]. The expression of DDX11-AS1 is cell cycle-regulated, and it is required for cell-cycle progression and DNA replication. There was a similar report in another article. The authors identi ed eight independent prognostic markers that independently predicted the survival of the samples in multiple cancers, including DDX11-AS1. These independent prognostic markers modulate cell cycle progression and cell proliferation [27]. Recent studies report that LINC00337 acts as the oncogene to promote gastric cancer cell proliferation through epigenetically repressing p21 mediated by EZH2 [28]. LINC00337 may up-regulate the expression of PBK and KIF23 through competitive binding of has-mir-373 and has-mir-519d. The competitive binding of hasmir-373 and has-mir-372 can up-regulate the expression of SLC7A11. The interaction between these RNAs may have an essential regulatory role in the immune in ltration in lung adenocarcinoma [29]. However, alterations of DEPDC1-AS1 in TNBC have been scarcely reported.
Moreover, a novel three-lncRNA-based prognostic model combining three-lncRNA signature was established and validated to improve survival prediction for TNBC patients. The patients in high-risk groups showed signi cantly poorer prognosis than the patients in the low-risk group. The multivariate Cox regression analysis showed that the three-lncRNA prognosis signature could be an independent factor in evaluating the prognosis. Internal and external validation was also conducted to con rm its predictive value. Further, Time-dependent ROC analysis of three-lncRNA showed favorable discrimination in both test set and validation set. Based on the GO and KEGG pathway analyses, the three-lncRNA may play crucial roles in chromatin silencing at rDNA, protein heterotetramerzation, interleukin-7-mediated signaling pathway, PD-L1 expression and PD-1 checkpoint pathway in cancer and PI3K-Akt signaling pathway. The GSEA enrichment results revealed that the genes of blue modules were enriched in transporter activity, Fischer dream targets, transmembrane transport and transmembrane transport activity.

Conclusion
In summary, based on the comprehensive analysis of publicly available TNBC data in GEO and TCGA, we identi ed a novel, robust three-lncRNA-based prognostic model combining a three-lncRNA signature to predict 7-year OS in TNBC patients. Furthermore, the three-lncRNA signature can effectively identify lowrisk patients from the high-risk group in TNBC patients. In other words, the three lncRNAs could be potential biomarkers in TNBC. The relevant lncRNA-based model could predict the speci c survival rate and promote the choice of speci c treatment selections of TNBC patients. The work ow of the selection process for TNBC survival-related 3-lncRNA signature.