Identification and functional enrichment of DEGs
Taking BC non-bone metastasis samples as a control group, 1858 DEGs in the training set including 992 up-regulated and 866 down-regulated genes were generated according to the selection criteria. The volcano plot and heat map of the DEGs were presented in Figure 1A, and Figure 1B, respectively.
To have a biological understanding of these DEGs, they were subjected to the DAVID database for GO annotation and KEGG pathway enrichment analysis. The top enriched GO terms in BPs were signal transduction, positive regulation of transcription from RNA polymerase II promoter, and immune response, and those in CCs were cytoplasm, cytosol, and extracellular exosome (Figure 2A-2B). The major MFs were protein binding, Poly(A) RNA binding, and identical protein binding (Figure 2C). In the KEGG pathway enrichment analysis, these genes were mainly involved in the MAPK signaling pathway, proteoglycans in cancer, and focal adhesion (Figure 2D).
WGCNA
We incorporated the expression profile of integrated DEGs with clinical traits of the BC samples to construct a gene co-expression network. Clinical characteristics including sample group, PFS time, OS time, OS status, and PFS status were clustered with expression matrix (Figure 3A). Then, we chose the optimal β=6 to ensure that network was scale-free (β was a soft-thresholding parameter that could emphasize strong correlations between genes and penalize weak correlations). After choosing the power of 2, the adjacency was transformed into a topological overlap matrix (TOM), which could measure the network connectivity of a gene defined as the sum of its adjacency with all other genes for the network gene ration, and the corresponding dissimilarity (1-TOM) was calculated (Figure 3B). Based on TOM, the average linkage hierarchical clustering was conducted to cluster genes by setting the minimum number of genes for each gene network module to 30. To further analyze the module, we calculated the eigengenes of each module and merged the modules by setting a height of 0.25. Finally, a total of 4 modules were acquired (Figure 3C-3D). The genes in the grey module could not be incorporated into any other module. Next, Pearson’s correlation coefficients of the module eigengene of each module and the sample characteristics were calculated. The blue module with 76 genes was closely related to bone metastasis and survival status (Figure 3E). Thus, the genes in the blue module were chosen for further analysis.
Construction of the gene expression signature-based nomogram (GESBN) model
Univariate Cox regression and Kaplan-Meier plotter analyses were carried out on 140 patients in the GSE124647 to evaluate the association of 76 gene expression profiles in the blue module with patient OS and PFS. In univariate Cox regression analysis, significant genes related to OS were SLC44A1, SLC24A3, PDGFC, ITGBL1, and GJA1 (Figure 4A) (all P <0.05). Eleven genes including MDFI, IGFBP6, GJA1, ANXA2, SLC24A3, TGFBI, CELA2A, CELA2B, CLEC11A, PPEF2, and SLC44A1 were notably linked to PFS (Figure 4B) (all P <0.05). However, only four genes related to OS, and six genes related to PFS with statistical differences were extracted in Kaplan-Meier plotter analysis (Figure 5 and Figure 6) (all P <0.05). Taken together, GAJ1, SLC24A3, ITGBL1, and SLC44A1 were defined as potential prognostic genes for OS. GJA1, IGFBP6, MDFI, ITGFBI, ANXA2, and SLC24A3 were potential genes correlated with PFS. These prognostic genes were then subjected to the construction of nomogram models based on OS and PFS (Figure 7A-7B). For OS, SLC44A1 had the greatest contribution, which could reach 100 points, while MDFI contributed most to PFS. Therefore, SLC44A1 and MDFI were considered as hub genes. To ensure the accuracy of the GESBN models, the calibration curves were drawn. The calibration curves showed good agreement between prediction and observation in the probability of 1-, 3- and 5- year survival (Figure 7C-7D), indicating that the accuracy of the nomogram model was reliable.
Evaluation of the GESBN models
The GESBN score was calculated for each patient in the training set. Patients were ranked based on their risk scores and assigned into two groups as high-risk and low-risk of bone metastases. Using the survival package in R for survival analysis, the results showed that the OS rate of patients in the high-risk group was low, and the difference between the two groups was statistically significant (Figure 8A) (P <0.001). Similarly, an unfavorable PFS was observed in the high-risk group patients (Figure 8B) (P <0.001), suggesting that two nomogram models could predict survival well. Further, the time-independent ROC curves were drawn using the pROC package in R. In terms of OS, the AUCs of the 1-, 3-, and 5-year survival rates were 0.68, 0.62, and 0.72, respectively (Figure 8C). For PFS, the AUCs of the 1-, 3-, and 5-year survival rates were 0.73, 0.88, 0.94, respectively (Figure 8D), indicating that nomograms had a good predictive ability.
The expression levels of genes in nomogram models
Due to the predictive ability of nomogram models for both OS and PFS, we explored the expression levels of these key genes. In the training dataset of GSE124647, the expression levels of all the prognosis-related genes were significantly different between control and bone-metastasis groups (Figure 9A) (all P <0.05). In the validation dataset of GSE14020, GJA1, IGFBP6, ITGBL1, SLC44A1, and TGFBI expressions in the bone-metastasis group were different from those in the control group. The differences were statistically significant (Figure 9B) (P<0.05).
Validation of the prognostic value of hub genes
Based on the nomogram result, SLC44A1 and MDFI were the hub genes. Kaplan-Meier plotter was performed to verify the effect of SLC44A1 and MDFI on OS, PFS, and DMFS in BC. Patients in the high SLC44A1 expression group tended to have favorable OS, PFS, and DMFS (Figure 10A-10C) (P <0.05). Although MDFI expression was not significantly linked to OS and DMFS of the BC patients (Figure 10D, 10F) (P >0.05), its high expression predicted worse PFS (Figure 10E) (P <0.01). These results indicated that SLC44A1 and MDFI might be potential biomarkers for BC.