Construction and validation of a risk prediction model for clinical axillary lymph node metastasis in T1–2 breast cancer

The current diagnostic technologies for assessing the axillary lymph node metastasis (ALNM) status accurately in breast cancer (BC) remain unsatisfactory. Here, we developed a diagnostic model for evaluating the ALNM status using a combination of mRNAs and the T stage of the primary tumor as a novel biomarker. We collected relevant information on T1–2 BC from public databases. An ALNM prediction model was developed by logistic regression based on the screened signatures and then internally and externally validated. Calibration curves and the area under the curve (AUC) were employed as performance metrics. The prognostic value and tumor immune infiltration of the model were also determined. An optimal diagnostic model was created using a combination of 11 mRNAs and T stage of the primary tumor and showed high discrimination, with AUCs of 0.828 and 0.746 in the training sets. AUCs of 0.671 and 0.783 were achieved in the internal validation cohorts. The mean external AUC value was 0.686 and ranged between 0.644 and 0.742. Moreover, the new model has good specificity in T1 and hormone receptor-negative/human epidermal growth factor receptor 2- negative (HR−/HER2−) BC and good sensitivity in T2 BC. In addition, the risk of ALNM and 11 mRNAs were correlated with the infiltration of M2 macrophages, as well as the prognosis of BC. This novel prediction model is a useful tool to identify the risk of ALNM in T1–2 BC patients, particularly given that it can be used to adjust surgical options in the future.


Scientific Reports
| (2022) 12:687 | https://doi.org/10.1038/s41598-021-04495-y www.nature.com/scientificreports/ In this study, we constructed a model to predict ALNM in T1-2 BC by analyzing public databases, and we also analyzed the association between the risk of ALNM and patient survival and immune cell infiltration. Therefore, we hope that this study will provide an effective and new method to predict lymphatic metastasis accurately in BC.

Materials and methods
Public datasets. Transcriptome profiling data (including lncRNA and mRNA data) normalized to fragments per kilobase million (FPKM) and relevant clinical information on BC from The Cancer Genome Atlas (TCGA) were downloaded. According to the specific inclusion and exclusion criteria, 465 samples in the TCGA database and 716 samples in the Gene Expression Omnibus (GEO) database were selected. All 465 patients in the TCGA were randomly divided into two independent datasets at a ratio of 7:3 based on a computer-generated random number (training dataset: 326 patients; validation dataset: 139 patients). GSE9893 served as another training dataset. The clinical characteristics of all patients are shown in Table 1 and Supplementary Table 1.
The inclusion criteria were as follows: (a) female BC patients with pathological stage T1-T2 disease; and (b) patients with a pathological diagnosis of invasive ductal carcinoma or invasive lobular carcinoma. The exclusion criteria were as follows: (a) patients with incomplete clinicopathological information, such as TX stage (the primary tumor could not be assessed), NX stage (regional lymph node involvement could not be assessed), and MX stage (the metastatic status could not be assessed) in the TNM staging system, and those with an uncertain estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) status; (b) patients who had received preoperative adjuvant therapy; and (c) patients with distant metastasis.
Identification of differentially expressed genes (DEGs). The "limma" package in R was utilized to select DEGs between lymph node (LN)-positive and LN-negative patients in the TCGA/GSE9893 datasets. We used the false discovery rate (FDR) < 0.05 and |log 2 FC (fold change)| > 1 as the thresholds for identifying the DEGs. A volcano diagram and a heatmap were generated using the "pheatmap" R package.
Feature selection. We employed least absolute shrinkage and selection operator (LASSO) regression to further select the most diagnostically predictive features in the training datasets. The lymph node status served as the dependent variable, and a minimum λ was used for feature selection. Then, we used univariate logistic www.nature.com/scientificreports/ regression to filter the diagnostic features selected by the LASSO regression analysis. The LASSO regression analysis was performed with the "glmnet" package in R.
Prediction model construction and performance assessment. Using a multivariate logistic regression algorithm, we constructed a risk prediction model in the training dataset. A nomogram was formulated based on the results of the multivariable analyses by integrating multiple prediction indicators. Correspondingly, the coefficient of each feature in the model and the predicted index of each patient in the training cohort were calculated. The goodness of fit between the observed value and the predicted value was tested using the Hosmer-Lemeshow test and displayed in the calibration curve. The predictive discrimination of the model was evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Decision curve analysis (DCA) was employed to judge the clinical applicability of the nomogram by quantifying the net benefits at different threshold probabilities 16 . The prediction model was evaluated in the validation datasets. The "rms", "ROCR" and "rmda" packages in R were applied to create the calibration curve, ROC graph and DCA curve.
Prognostic analysis. The largest Youden index was used as the cutoff value to separate the patients into high-or low-risk groups 17 . Kaplan-Meier analysis with the log-rank test was subsequently performed to assess the differential outcomes in overall survival (OS) or distant metastasis-free survival (DMFS) between the two groups. The Kaplan-Meier plotter (https:// kmplot. com/ analy sis/) database 18 was applied to analyze the difference in recurrence-free survival (RFS) according to target gene expression.

Functional analysis and immune infiltration.
We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on the DEGs using the "clusterProfiler" package in R. ImmuCellAI (http:// bioin fo. life. hust. edu. cn/ ImmuC ellAI#!/) and CIBERSORT were used to estimate tumor immune infiltration in each sample in the TCGA cohort 19,20 . Based on the "Gene" module of the TIMER2.0 database 21 (http:// timer. cistr ome. org/), we further evaluated the correlation between the hub genes and the infiltration of M2 macrophages.
Statistical analysis. Statistical analysis was performed using R software (version 4.0.2). The Wilcoxon rank-sum test is a nonparametric statistical test mainly utilized for comparing two groups, and the Kruskal-Wallis test is suitable for comparing two or more groups. A conventional two-sided P-value < 0.05 was considered significant.

Results
Clinical characteristics of the patients. We developed a risk prediction model for clinical ALNM in T1-2 BC patients. A flow chart of the whole study design is shown in Supplementary Fig. 1. The clinical features of the patients in the training and internal validation sets in the TCGA are given in Table 1, and no differences in baseline characteristics were observed between the two groups (P-value > 0.05). We also summarized the relevant clinical information of the BC samples from the GEO database that met the inclusion and exclusion criteria (Supplementary Table 1).

Development of a risk prediction model for BC. Differential gene expression analysis revealed 256
upregulated and 314 downregulated genes in the TCGA training sets (Fig. 1A). However, in the GSE9893 dataset, 2742 upregulated genes and 2176 downregulated genes were found (Fig. 1B). Therefore, the results revealed 35 common upregulated genes and 22 common downregulated genes (Fig. 1C). A total of 57 mRNAs were selected as biomarker candidates and used together with clinicopathologic factors (T stage, age, ER, PR, HER2, and subtypes) to construct a diagnostic model in the training set using LASSO regression analysis (Fig. 1D,E). The heatmap shows the correlations between the expression of the 57 mRNAs and clinicopathological variables. Compared with non-axillary lymph node metastasis (NALNM) group, the expression levels of many genes were higher in the ALNM group, such as HOXB2, HOXB5, HOXB7 ( Supplementary Fig. 2). In addition, the optimal value of λ in LASSO regression was 0.0185. Subsequently, all significant factors in the univariate logistic regression analysis (Supplementary Table 2) were included in the multivariate logistic regression analysis. Finally, we constructed a risk prediction model that contains the T stage of the primary tumor and 11 genes in T1-2 BC, as shown in Supplementary Table 3.
Nomogram construction and validation. These 12 features that are associated with ALNM in BC were used to construct the nomogram (Fig. 2A). This nomogram can be used to estimate the probability of ALNM through the summation of the points of each variable.
To compare the discrimination ability of the model in predicting ALNM, we conducted ROC curve analysis of T1-2 BC patients. The AUC value of the model in the training set was 0.828 (95% CI: 0.783-0.873; P-value < 0.001), which indicated that the model had high prediction efficacy (Fig. 2B). Furthermore, the calibration curve of the model demonstrated good agreement between the predictions and observations in the TCGA training set (Fig. 2C). The Hosmer-Lemeshow test suggested that the model had good fit (χ 2 = 8.859; P-value = 0.354). Moreover, the prediction model showed a high net benefit to aid in clinical decisions for a risk probability threshold between 2 and 91% in the training set according to the DCA curve (Fig. 2D).
In the internal validation stage of the model, the AUC values were found to be 0.671, 0.746 and 0.783 ( Fig. 3A-C, Supplementary Table 4). In the external validation cohort, the AUC value in the GSE11001 dataset was up to 0.742, while the AUC values in the other GEO datasets were 0.644, 0.661, 0.673, and 0.709 (Fig. 3D-H Fig. 3). Furthermore, the specificity of the risk prediction model in predicting ALNM in T1 BC was as high as 92.3-95.1%, and the false positive rate was between 4.9 and 7.7% (Fig. 4A, Supplementary Table 5). In patients with T2 BC, the sensitivity of the model was between 71.3 and 90.3% (Fig. 4B, Supplementary Table 6). The model also had good specificity in evaluating the risk of ALNM in HR−/HER2− BC patients (Fig. 4C, Supplementary Table 7).
The prognostic role of the risk prediction model. All patients were separated into high-or low-risk groups according to the optimal cutoff value. Kaplan-Meier analyses were subsequently performed to assess the www.nature.com/scientificreports/ differential outcomes between the two groups. Patients in the high-risk group had worse OS (P-value = 9.337e−07) and DMFS (P-value = 3.45e−02) (Fig. 5A,B).
To further evaluate the prognostic value of genes in the risk prediction model, the RFS of BC patients was investigated with the Kaplan-Meier Plotter database. Interestingly, the expression of ACOX1, CD1A, OPN3, REPS1, RTN1, CNP, DUT, HOXB3, and PYGB was significantly associated with the prognosis of BC (P-value < 0.05), which was consistent with the predictive value of the model (Fig. 5C).

Analysis of gene functions and tumor immune infiltration.
We analyzed the common differentially expressed genes by functional enrichment analysis, and the results indicated that the abovementioned genes mostly function in the processes of antigen processing and endogenous peptide antigen (Fig. 6A), which suggests that the common genes may be associated with tumor immune infiltration. Furthermore, we evalu-  (Fig. 6B). We calculated the 22 subpopulations of immune cells in 685 BC patients (according to the previous inclusion and exclusion criteria of the TCGA database, patients were re-incorporated without restricting the expression of ER,PR and HER2) by using the CIBERSORT algorithm and investigated the differences between tissues with different N stages. Surprisingly, the lymph node stage was correlated with the infiltration of M2 macrophages (P-value < 0.05) (Fig. 6C). Based on the TIMER2.0 database, we further explored the functions of the 11 genes in the model, all of which were associated with the infiltration of M2 macrophages (Fig. 6D).

Discussion
ALNs are often the first and most frequent metastatic site and are the most important factor for the diagnosis and prognosis of BC 22,23 . In the current study, we developed and validated a risk prediction model to evaluate the probability of positive lymph nodes in patients with T1-2 BC to improve the ability of preoperative individualized treatment decisions in the future. As mentioned previously, the risk prediction model incorporates 11 genes that may be related to ALNM and the T stage of the primary tumor. The results of this study showed that the risk of ALNM was positively correlated with tumor size, and similar results have been reported 15 .
To confirm the generalizability and repeatability of the established model, we used internal verification and external verification cohorts for further analysis. Generally, an AUC value greater than 0.75 indicates high accuracy, 0.75 to 0.6 indicates general accuracy, and less than 0.6 indicates low accuracy 24 . In this study, the distinguishing ability of the risk prediction model in T1-2 BC was acceptable regardless of whether it was used in either the internal or external validation cohort. Furthermore, the model had good specificity in predicting ALNM in T1 or HR−/HER2− BC patients, which means that SLNB may be omitted in patients who are classified as low risk according to this model based on the patient's condition. In stage T2 patients, the model showed good sensitivity, suggesting that patients who are classified as high risk need to receive neoadjuvant therapy or accept ALND directly. Interestingly, although the model was designed to predict ALNM in T1-2 BC, we found that it could be used to predict patient survival, supporting the clinical and prognostic value of the model.
The expression of acyl-CoA oxidase 1 (ACOX1) is associated with brain metastasis in BC 25 . CD1a molecule (CD1A) may predict regional lymph node invasion and prognosis in BC 26 . Deoxyuridine triphosphatase (DUT) is correlated with the treatment of BC 27 . FKBP prolyl isomerase 9 (FKBP9) has been shown to be related to www.nature.com/scientificreports/ distant metastasis in prostate cancer 28 . The frequency of FKBP9 mutation is relatively high in BC 29 . A previous study revealed that low expression of homeobox B3 (HOXB3) was associated with a poor prognosis in hormone receptor-negative BC 30 . Glycogen phosphorylase B (PYGB) has potential applications in the prevention of BC metastasis 31 . Signal recognition particle 14 (SRP14) plays a role in OS in acute myeloid leukemia patients 32 . The expression of 2′,3′-cyclic nucleotide 3′ phosphodiesterase (CNP) correlates with glioblastoma patient survival 33 . Opsin 3 (OPN3) promotes epithelial-mesenchymal transition and tumor metastasis in lung www.nature.com/scientificreports/ adenocarcinoma 34 . RALBP1-associated Eps domain-containing 1 (REPS1) may be involved in neurodegeneration with brain iron accumulation 35 . The last gene, reticulon 1 (RTN1), is believed to be associated with prognosis   36 . Furthermore, many genes in 57 candidate biomarkers, such as HOXB2, HOXB5, HOXB7, COL3A1, COL5A2, KRT18 and KRT19, are closely related to tumorigenesis, metastasis and invasion of cancer [37][38][39][40][41][42][43] . In this study, the odds ratios (ORs) of eight genes in the risk prediction model were more than 1, suggesting that the aforementioned genes, such as ACOX1 and DUT, facilitate lymph node metastasis in BC. In addition, we found that the high expression of ACOX1, CNP, DUT and OPN3 was associated with poor survival. This finding is consistent with the hypothesis that the above genes may serve as adverse prognostic indicators of survival by affecting ALNM in BC patients. Similarly, BC patients with high CD1A or REPS1 expression had longer survival times than those with low CD1A or REPS1 expression.
Previous studies have shown that M2 macrophages play protumor roles 44 and that the infiltration of M2 macrophages is correlated with lymph node metastasis of BC 45 . A similar result was found in this study. We found that the risk of ALNM and the 11 genes in the risk prediction model were correlated with the infiltration of M2 macrophages by bioinformatics analysis, which indicates that the above genes may affect ALNM in BC by participating in tumor immune infiltration.
In summary, we innovatively constructed a risk prediction model that contains the T stage of the primary tumor and 11 genes in T1-2 BC, although we used different cohorts for internal and external verification. However, this was a retrospective study, and further multicenter studies with larger sample sizes are needed to demonstrate its potential clinical application value in the future.

Data availability
The data which used in this article are public data.