Acquisition of BC expression datasets through array re-annotation
The detailed workflow is shown in Fig. 1. In this study, the 13 expression profiles data sets were included representing a total of 2950 samples of BC (we deleted 17 normal samples from GSE42586 and 25 normal or cell lines symbols from GSE65194) were obtained (Additional file 1: Table S1). There were two datasets available with survival information, including GSE42586 and GSE58812, with the OS. All data sets were normalized using RMA arithmetic.
A total of 23,520 microarray probes were matched using microarray annotation information (HG-U133A annotation). We are also getting 18115 re-annotation probes with gene biotypes. To stratify all 2950 BC samples into different molecular characteristics, including Basal‐like (826 samples), Luminal A (904 samples), Luminal B (668 samples), Her2 (435 samples) and Normal‐like (117 samples) intrinsic subtypes by PAM50 classifier. We collected 1217 patients with BC-related mRNA expression data and clinical data from the TCGA database. Based on the clinical data, the normal and TNBC groups were found to contain 113 and 96 samples, respectively.
Getting specific DEGs of TNBC
The other four, respectively, compared to normal-like subtypes, were selected and underwent DEGs analysis. Those candidate genes were set as background lists to get an intersection list of DEGs using Venn diagrams. The 3,468 DEGs with specific differences between Basal-like and Normal‐like were obtained by removed the same gene probes come from the other three DEGs sets, which included 70 upregulated and 39 downregulated genes (Fig. 2a, Fig. 2b, Additional file 2: Table S2). The expression levels of all the genes were demonstrated, the top 100 DEGs (Head 50 and tail 50 genes) were rendered that were well clustered different subtypes of BC in the heatmap (Fig. 2c).
WGCNA co-expression network construction
Here, the weighted gene co-expression module was constructed from the 3484 DEGs, which included all TNBC subtypes samples of BC using the WGCNA algorithm. According to the scale-free topology criterion, we predicted a plot identifying scale-free topology in simulated expression data when the power value of β = 16 (scale-free R2 = 0.95).
A total of 5 co-expression modules were identified (module size, ≥ 25; cut height, ≥ 0.99) by the average linkage hierarchical clustering based on TOM dissimilarity measure (1-TOM) stability, and every module was indicated in different colors (Fig. 3a). The four enriched modules (excluding grey modules that no genes had been assigned to any of the modules) have better connectivity (Fig. 3b). We delineate the eigengene network using a hierarchical cluster tree and a heatmap plot. A co-expression network of the associations between the subtype of BC and these modules was constructed (Fig. 3c). It is worth noting that we chose the blue module with the most significant correlation with TNBC as the relevant module and carried out the next analysis. There was a significant correlation between gene significance (GS) and module membership (MM), indicating that critical genes of the blue module were also highly associated with TNBC. (Fig. 3d).
Identification and validation of Hub Genes
We screened a total of 54 hub genes with high connectivity in the blue module based on absolute GS > 0.2, absolute MM > 0.8, including 5 lncRNA genes and 49 protein-coding genes (Fig. 4a). Among them, LINC00337, OSTM1-AS1, DEPDC1-AS1, DDX11-AS1, and LINC00608, had high functional significance in the hub genes. Additionally, based on the TCGA data, the expression levels of these four lncRNA genes were significantly higher in the TNBC tissues, including LINC00337, OSTM1-AS1, DEPDC1-AS1, and DDX11-AS1 (p < 0.005) (Fig. 4b, c, d, and e), especially the LINC00337, DEPDC1-AS1 and DDX11-AS1 (p < 10− 15). Surprisingly, there was no significant difference in the expression level of LINC00608 between healthy and TNBC tissues (Fig. 4f). The expression of LINC00337, DEPDC1-AS1, and DDX11-AS1 was significantly up-regulated in the TNBC tissues (p < 10− 15). Therefore, the above three lncRNA were selected for further experimental verification.
Construction and estimation of prognosis scoring model based on 3 lncRNAs
We speculated that the identified complex 3-lncRNA signatures forcefully contribute to the prognosis of TNBC. Accordingly, we used this complex 3-lncRNA signature as an alone predictive mark to predict the risk of sample survival probability. We fit the lncRNA risk score model to the merged dataset of GSE42586 and GSE58812 with OS time to construct a lncRNA signature, the following formula: Risk Score = 0.2882*expr (LINC00337)-1.1816*expr (DEPDC1-AS1)-0.3307*expr (DDX11-AS1).
To assess the robustness of the 3-lncRNA signature in predicting the risk of tumor recurrence for TNBC samples, the predictive power of the 3-lncRNA signature was then tested in the merged dataset. Next, the risk score of each patient was calculated, and patients were classified into a high-risk group (n = 46) and a low-risk group (n = 57) by the mid-cut point (Fig. 5a). The survival status of all patients and the heat maps of the three prognostic genes in the merged data set are shown in Fig. 5b and Fig. 5c, respectively. The KM survival curve showed that the high-risk group's OS was worse than that of the low-risk group (Fig. 5d). Besides, in the time-dependent ROC analysis, the prognostic characteristics of the three lncRNAs showed a more substantial area under the curve (AUC) value (Fig. 5e), indicating that the multi-lncRNA model had greater predictive ability in 1-year and 2-year OS.
Validation of the multi-lncRNA prognostic signature
To verify the predictive power and applicability of three-lncRNA prognostic markers in TNBC, we collected TNBC-related TCGA (n = 96) data sets as external validation. In the validation set, the three lncRNA coefficients were used to calculate the risk score for each patient. Consistent with the results in the training set, The KM curves of the external test sets were consistent with the results of the training set, and the high-risk group had a worse prognosis than the low-risk group (Fig. 6a). Time-dependent ROC analysis pointed out that AUC for 1-year, 2-year, and 5-year OS of the external validation set were 0.706, 0.828, and 0.731(Fig. 6b). In summary, the prognostic characteristics of three-lncRNA showed an excellent performance in predicting OS in TNBC patients.
Functional annotation analysis
To further study the function of the blue modules containing three-lncRNA was highly correlated with TNBC, Gene Ontology (GO) term, KEGG pathway, and GSEA analysis were performed.
Biological processes analysis showed enrichment of GO terms associated with chromatin silencing at rDNA, protein heterotetramerzation, interleukin-7-mediated signaling pathway and response to interleukin, etc. (Fig. 6c and Additional file 3: Table S3). According to KEGG pathway analysis, our results indicated that these genes were mainly involved in PD-L1 expression and PD-1 checkpoint pathway in cancer, Transcriptional misregulation in cancer, Cytosolic DNA-sensing pathway, and PI3K-Akt signaling pathway (Fig. 6d and Additional file 4: Table S4).
In the GSEA enrichment results (Fig. 7 and Additional file 5: Table S5), Results revealed that the genes of blue modules were enriched in transporter activity, Fischer dream targets, transmembrane transport and transmembrane transport activity. To sum up, these functional analysis results suggested that the blue module containing three-lncRNA was associated with the development and progress of TNBC.