Screening of DEGs
A flow chart of the study design is shown in Figure 1. The R package “limma” was used to screen DEGs between tumor and normal tissues in GSE184717 and GSE183947, where a total of 1967 (1494 upregulated and 473 downregulated) and 351 (180 upregulated and 171downregulated) DEGs were obtained, respectively. The distribution of each gene were visualized by volcano plots (Figure 2A,B). The resulting list of DEGs was the intersection between the above datasets and a total of 62 overlapping DEGs were obtained(Figure 2C).
TABLE 1: The relationship between the siganture and patient characteristics.
Two genes Were Screened Out as Potential Predictive Biomarkers
The clinicopathologic characteristics are summarized in Table 1. To identify DEGs that associated with metastasis, univariate Cox regression analysis was performed.A total of nine candidate genes(CD74,TSPAN7,COL11A1,FLG,MMP11,CHRDL1,FNDC1,MELK,PITX1)with an adjusted P value < 0.05 were screened out(Figure 3A). Detailed information for each gene was listed in Table 2.
TABLE 2.Details of the candidate genes
Gene Symbol
|
Gene ID
|
Full Name
|
Cytogenetic locations
|
Protein Function
|
CD74
|
972
|
CD74 molecule
|
5q33.1
|
CD markers; Disease related genes; Cancer-related genes:Candidate cancer biomarkers
|
TSPAN7
|
7102
|
tetraspanin 7
|
Xp11.4
|
Transporters:Accessory Factors Involved in Transport; CD markers; Disease related genes; Potential drug targets
|
COL11A1
|
1301
|
collagen type XI alpha 1 chain
|
1p21.1
|
Predicted intracellular proteins; Disease related genes; Predicted secreted proteins; Cancer-related genes:Candidate cancer biomarkers
|
FLG
|
2312
|
filaggrin
|
1q21.3
|
Cancer-related genes:Mutated cancer genes; Predicted intracellular proteins; Disease related genes
|
MMP11
|
4320
|
matrix metallopeptidase 11
|
22q11.23
|
Peptidases:Metallopeptidases; Cancer-related genes:Candidate cancer biomarkers; FDA approved drug targets:Small molecule drugs; Predicted secreted proteins; Enzymes
|
CHRDL1
|
91851
|
chordin like 1
|
Xq23
|
Disease related genes; Predicted secreted proteins
|
FNDC1
|
84624
|
fibronectin type III domain containing 1
|
6q25.3
|
Predicted intracellular proteins; Predicted secreted proteins
|
MELK
|
9833
|
maternal embryonic leucine zipper kinase
|
9p13.2
|
Disease related genes; ENZYME proteins:Transferases; Predicted intracellular proteins; Potential drug targets; Kinases:CAMK Ser/Thr protein kinases; Enzymes
|
PITX1
|
5307
|
paired like homeodomain 1
|
5q31.1
|
Predicted intracellular proteins; Disease related genes; Transcription factors:Helix-turn-helix domains
|
we further analyze the differential gene expression between breast cancer tissues(n = 1085) and normal tissues(n = 291) using GEPIA. As shown in Figure 4,Eight genes were found to be differentially expressed in breast cancer tissues, including six upregulated genes(CD74,MMP11,MELK,COL11A1,PITX1,FNDC1) and two downregulated genes(TSPAN7,CHRDL1).
Interestingly, breast tissues with low CD74 expression were related to poorer DMFS although CD74 commonly upregulated in breast cancer patients.
Additionally, bc-GenExMiner was utilized to validate the relationship between predictive biomarkers and DMFS. The expression levels of seven genes,including TSPAN7(HR:0.75; 95%CI:0.69-0.83;P < 0.0001),CD74(HR:0.88;95%CI:0.81-0.97;P = 0.0094),MMP11(HR:1.33; 95%CI:1.21-1.46;P < 0.0001),MELK(HR:1.87;95%CI:1.70-2.06; P < 0.0001),COL11A1(HR:1.13; 95%CI:1.03-1.24; P = 0.0108),CHRDL1(HR:0.81; 95%CI:0.73-0.89; P < 0.0001),PITX1(HR:1.17; 95%CI:1.07-1.29; P = 0.0010), were found able to estimate DMFS(Figure 5).
Finally, CD74 (HR:0.61;95%CI:0.4-0.92;P = 0.019)and TSPAN7(HR:0.51;95%CI:0.29-0.9;P = 0.021)were screened out to develop a signature according to the results of multivariate Cox analysis (Figure 3B).
Development and Validation of the Signature
We calculated the signature based on the expression of CD74 and TSPAN7 as follows:
Signature = (0.7697433 X expression of CD74)+(0.6633031 X expression of TSPAN7)
Of note, the expression of CD74 and TSPAN7 was negatively correlated with DMFS, suggesting that patients with low scores have a higher probability of distant metastasis.
To be better applied in clinical diagnosis, a constant cutoff value was determined by the median of the training cohort (15.488). Survival analysis, using the Kaplan-Meier method, indicated that the low score group portends a worse DMFS(HR:0.63;95%CI:0.49-0.82; P = 0.01; Figure 6A). Subsequently, survival analysis was performed twice with the same results in the internal validation cohort (HR:0.58;95%CI:0.36-0.96; P = 0.023; Figure 6B), and the external validation cohort (HR:0.67;95%CI:0.47-0.95; P = 0.0065; Figure 6C), respectively. To individually show the differences between low score and high score groups, we visualized the scores, distant metastasis, and gene expression profiles in the three cohorts(Figure 6D,E,F).The above results indicated that the signature could predict the risk of metastasis development as an independent risk feature.
Gene Set Enrichment Analysis
GSEA was performed to analyze the causes of breast cancer metastasis risk difference between high and low score groups. Figure 7A showed upregulated genes in the high score group were significantly enriched in immune related signal pathways, such as T cell receptor signaling and chemokine signaling pathway, suggesting that our signature may be an immune infection related signature(IIRS).
Immune Infiltration Analysis
we further utilized TIMER to analyze possible correlation between CD74, TSPAN7 expression and levels of immune infiltration in breast cancer (Figure 7B). Both CD74 and TSPAN7 are associated with multiple immune cell infiltration, especially T cells. To identify the relationship between our cohort and immune infiltration, we calculate the fraction of immune cell subsets using quanTIseq.As shown in Figure 7C, The high score group had a higher fraction of B cell(P < 0.001),M2Macrophage(P < 0.001),Neutrophil(P = 0.009),CD8+T cell(P < 0.001),Tregs(P < 0.001). On the contrary, the fraction of NK cell(P = 0.012) and uncharacterized cell was found to increase significantly in the low score group. It is worth noting that the analysis showed that there was the most significant difference in CD8+T cells between groups.
In view of myeloid cells play an important role in the recruitment and concentration of CD8+ T cells 24 .we attempted to evaluate the correlation between myeloid cell levels and the signature in breast cancer patients. CD33 is widely used as a marker for tumor infiltrating myeloid cells 25,26 ,so we collected FFPEs from 28 consecutive pairs of breast cancer patients and performed IHC for CD74 and CD33, respectively.Interestingly, CD74 protein mainly existed in some CD33+ cells(Figure 8A), and were positively correlated with CD33 levels in three cohorts(Figure 8B), GEPIA(Figure 8C) and IHC(Figure 8D), respectively. In addition, there is a significant positive correlation between CD74 and CD8 in breast cancer(Figure 8E), which means that CD74+ myeloid cells will not affect the recruitment of CD8+T cells.
Since TSPAN7 is significantly downregulated in breast cancer tissues, we tried to identify the cause of low expression of TSPAN7 by detecting the level of methylation. we hypothesized that TSPAN7 methylation occurs in breast cancer cells and treated three TSPAN7-silenced cells (MCF-7, ZR-75-1, MDA-MB-231) with demethylation agent 5-Aza, respectively.TSPAN7 expression was subsequently restored in all treated cells(MDA-MB-231,ZR-75-1,MCF-7)( Figure 8F). Importantly, we found that the level of TSPAN7 was significantly correlated with the level of CD74 (p=8.3e-11), which has been proved to exist mainly in immune cells(Figure 8G). Therefore, breast cancer cells undergo TSPAN7 methylation, which is related to the infiltration of CD74 positive immune cells.
Construction and Assessment of Predictive Nomogram
In order to increase clinical utility,a predictive nomogram was constructed(FIGURE 9A) based on the risk feature verified by univariate and multivariate Cox analysis, including the IIRS, age, T stage, and N stage (Figure 3C, D). Interestingly, we find that younger patients are at higher risk of metastasis, which are consistent with previous findings 27 . Subsequently, calibrate curves were plotted to identify the consistency between ideal outcome and actual observation in prediction of 3-,5- 10-year distant metastasis-free survival times.The calibration curves showed good performance in training cohort (Figure 9B), internal validation cohort (Figure 9C), and external validation cohort (Figure 9D), especially for long term survival rate(10-year DMFS).
To assess the predictive performance of this model, the nomogram was internally validated by 1,000 bootstrap resamples in three cohorts. Pleasantly, and surprisingly, the nomogram yielded a C-index of 0.742 (95% CI, 0.715 to 0.769) for training cohort ,0.801 (95% CI, 0.754 to 0. 0.848) for internal validation cohort and 0.695(95% CI, 0.643 to 0. 0.747) for external validation cohort. Therefore, we demonstrated that the nomogram could predict the DMFS of BC patients effectively.
We subsequently performed a decision curve analysis to evaluate the clinical net benefit in predicting the probability of 10- year DMFS.As shown in Figure 9E, if the threshold probability of a patient or doctor is < 48% or >63%, using the nomogram with IIRS to predict DMFS adds more benefit than the other without IIRS.