Identification of a Novel Gene Model-Based Homologous Recombination Deficiency Score to Improve Survival Prediction of TNBC.


 Background

Triple-negative breast cancer (TNBC) is a specific histological type of breast cancer with a poor prognosis, early recurrence, which lacks durable chemotherapy responses and effective targeted therapies. We aimed to construct an accurate prognostic risk model based on homologous recombination deficiency (HRD) - gene expression profiles for improving prognosis prediction of TNBC.
Methods

Triple-negative breast cancer RNA sequencing data and sample clinical information were downloaded from the breast invasive carcinoma (BRCA) cohort in the Cancer Genome Atlas (TCGA) database. Combined with the HRD database, tumor samples were divided into two sets. We screened differentially expressed genes (DEGs) and then identified HRD-related prognostic genes using weighted gene co-expression network analysis (WGCNA) and Cox regression analysis. The least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analysis were used to identifying key prognostic genes. Risk scores were calculated and compared with HRD score, Kaplan–Meier (KM) survival analysis were used to assess its prognostic power. GSE103091 dataset from GEO (Gene Expression Omnibus) database was used to validate the signature. Univariate and multivariate Cox regression were performed to independently verify the prognosis of the risk score. A nomogram was constructed and revealed by time-dependent ROC curves to guide clinical practice.
Results

We found that HRD tumor samples (HRD score > = 42) in TNBC patients were associated with poor overall survival (p = 0.027). We identified a total of 147 differential genes including 203 up-regulated and 213 down-regulated genes, among which 29 were prognosis-related genes. Through the LASSO method, 6 key prognostic genes ((MUCL1, IVL, FAM46C, CHI3L1, PRR15L, and CLEC3A) were selected and a 6-gene risk score was constructed. We found risk score was negatively associated with homologous recombination deficiency (HRD) scores (r = -0.22, p = 0.019). Compared with the low-risk group, Kaplan-Meier survival analysis shows that the high-risk group has an obvious poorer prognosis (P < 0.0001). Finally, we integrated the risk score model and clinical factors of TNBC (AJCC-stage, HRD score, T stage, and N stage) to construct a compound nomogram. Time-dependent ROC curves showed the risk score performed better in 1-, 3- and 5-year survival predictions compared with AJCC-stage.
Conclusions

Based on HRD gene expression data, our six HRD-related gene signature and nomogram could be practical and reliable tools for predicting OS in patients with TNBC.


Introduction
Triple-negative breast cancer (TNBC) is an aggressive subtype of breast cancer, which is characterized by the loss of the expression of estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) and accounts for 10-20% of all types of breast cancer [1]. Compared with non-TNBC patients, TNBC patients have the characteristics of younger age, higher histological grade, larger tumor size, higher positive rate of lymph nodes and easier metastasis to lung, brain and other parts [1,2]. Secondly, TNBC cannot bene t from endocrine therapy and anti-HER-2 targeted therapy, chemotherapy has become the main adjuvant therapy for TNBC patients. However, with the increase of drug resistance, the effective chemotherapy of TNBC is limited, even if active treatment is carried out, the median overall survival of patients with metastatic, triple-negative breast cancer is less than one year [2]. TNBC also has obvious heterogeneity. There are survival differences between different subtypes. Not all patients have a poor prognosis. Traditional clinicopathological indicators and single molecular markers have obvious limitations in predicting prognosis.
Homologous recombination repair (HRR) is an important pathway signal pathway for several cellular processes including the error-free repair of DNA double-strand breaks (DSB) and the recovery of stalled DNA replication forks [3], in which the key proteins are BRCA1 and BRCA2. BRCA1 or BRCA2 loss of function leads to homologous recombination de ciency (HRD). However, germline and somatic alteration of other HRR-related genes, such as PALB2, CDK12, RAD51, CHEK2, ATM or BRCA1 gene promoter methylation can lead to HRD in sporadic cancers, broadly termed BRCAness [4,5]. HRD has a high incidence of ovarian cancer, prostate cancer, breast cancer and pancreatic cancer [6,7]. some clinical studies had found that HRD status is highly correlated with the sensitivity of platinum-based chemotherapeutics and PARP inhibitors and is a key indicator of treatment options and prognosis for a variety of tumors [8]. To this end, the development and clinical evaluation of platforms to identify HR defects have recently been a subject of intense investigation, especially in TNBC, as this subtype is considered to be enriched for the HR pathway de ciency. Approximately 40% and 70% of TNBC tumors have a presumed HRD phenotype [9,10], HRD score based on NGS assay was de ned as the unweighted sum of the loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), and large-scale state transitions (LOS) scores [11]. High HRD scores have shown to be signi cantly associated with sensitivity to neoadjuvant platinum-based chemotherapy in TNBC [10,12]. However, most of the work carried out around HRD is to study the mutations of HRD-related genes, and the accuracy of genomics to identify HRD. there is no transcriptomics research is involved.
In the present study, we gathered information about the clinical features and RNA sequencing data of 123 TNBC tumor samples from the TCGA database and identi ed HRD grouping based on the genome. we then construct a prognostic model of the HRD transcriptome and compare it with the genome HRD score.

Data acquisition and preprocessing
The TCGA (The Cancer Genome Atlas) BC mRNA expression pro le (The TCGA-BRCA cohort) and the associated clinical information were downloaded from Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). Use R (version 3.6.0) software to standardize and process data. The HRD information of tumor samples can be obtained from HRD related data (HRD score see: PMID: 29617664, TCGA_DDR_Data_Resources.xlsx form: DDR footprints), the tumor samples were divided into HRD tumor samples (HRD score ≥ 42) and non-HRD tumor samples (HRD score <42) according to the HRD score. GSE103091 data set was obtained from Gene Expression Omnibus (GEO, https ://www.ncbi.nlm.nih.gov/geo/) for validation.
Identi cation and enrichment analysis of HRD-related differentially expressed genes in TNBC To acquire differentially expressed genes (DEGs) in TNBC (between the HRD and non-HRD samples). The R language limma package (http://www.r-project.org/) was used to analyze for differentially expressed genes between HRD and non-HRD samples, the threshold for a signi cant difference was designated a Pvalue of <0.005. The R package "cluster pro le" was used to carry out Gene Ontology (GO) enrichment analysis including biological process (BP), cell components (CC) and molecular functions (MF) for the differentially expressed HRD-related genes. The same tool is also used for the enrichment analysis of Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. The weighted gene coexpression network analysis (WGCNA) package was used to identify key modules associated with prognosis based on differentially expressed HRD-related genes.
Identi cation and veri cation of prognostic gene signatures We next performed univariate Cox survival analysis on the 199 key module genes to screen cancer-related prognostic factors. Variables with a P value less than 0.1 in the univariate analysis were screed to identify independent prognostic factors. The 29 prognosis-related genes identi ed were included in the LASSO regression analysis by using the R package "glmnet" to screen the genes. Then, the multivariable Cox proportional risk regression analysis was carried out to establish the prognosis model of TNBC.
The risk score of each patient was calculated according to the mRNA expression level of each risk gene and the risk coe cient, we used the following formula to calculate the risk score of each patient.
where Exp i is the expression level of each prognostic gene, and β i is the regression coe cient of it. patients were divided into the high-risk group and low-risk group with the median risk score as the threshold. K-M curve was used to compare the survival difference between the above two groups.
To validate the prognostic value of our risk model, we obtained the GSEGSE103091 data set from the Gene Expression Omnibus (GEO) database, the risk score of each patient was calculated using the coe cients of 6 genes above. Then the patients were strati ed into high-risk and low-risk groups by the median risk score. The correlation between risk score and HRD score was evaluated using Spearman's rank correlation analysis. The KM survival analysis used to validate the multi-gene prognostic signature.
In order to analyze the stability of the risk prediction model in different levels of other clinical prognostic parameters, the KM curves were used to compare the difference of subgroups of AJCC-stage. In addition, in order to independently verify the prognosis of the risk score. single variable and multivariate Cox regression analyses were conducted. Risk score, T stage, N stage, AJCC -stage and HRD score were used as covariates.

Construction of gene prognostic nomogram
A composite nomogram was constructed based on all independent prognostic parameters screened by univariate and multivariate Cox proportional hazards regression analysis above and compared with AJCC staging systems in the clinical cohort. we used the "rms" and "survival" packages in R to predict the probability of 1-year, 3-year and 5-year OS.

Identi cation of DEGs and Functional Analysis
The TCGA-BRCA cohort consisted of 1104 tumor samples (123 TNBC and 981 non-TNBC samples) and 114 normal samples. Combined with the HRD database, we screened out 110 samples with HRD scores in TNBC samples and divided them into HRD tumor samples and non-HRD tumor samples according to the HRD score (>=42 for HRD samples). Kaplan-Meier survival analysis showed that HRD tumor samples in TNBC patients were associated with poor overall survival ( Figure 1). The limma package of R was used for detecting the DEGs between HRD samples and non-HRD samples. There were 417 differential genes including 203 up-regulated and 214 down-regulated genes with signi cant differences (P < 0.05) ( Figure  2A-B). full result can be obtained in the supplementary Table1 and Supplementary Table2. all the differentially expressed genes (DEGs) were further analyzed by carrying out gene ontology and KEGG enrichment, The DEGs based on the HRD score, were mainly enriched in BP related to digestion and inner ear development, CC was associated with extracellular matrix and collagen−containing extracellular matrix, and MF terms were associated with ion channel activity ( Figure 2C-E). The KEGG pathway identi ed was protein digestion and absorption pathway ( Figure 2F).

Identi cation of Key Prognostic Genes in TCGA -TNBC dataset
To evaluate the prognostic effect of DEGs, the weighted gene co-expression network analysis (WGCNA) package was used to identify key modules associated with prognosis. Finally, the key nding of our study is that brown and blue modules containing 99 genes and 100 genes respectively were found to have a stronger correlation with overall survival (Figure 3). 199 key module genes were analyzed by single variable Cox regression (Supplementary Table3). We identi ed a set of 29 genes whose P values were less than 0.1, part of the results is shown in Table 1. At the same time, we selected the genes included in the subsequent risk model to draw the Kaplan-Meier curve, and the result is shown in Figure 4. To further identify the 29 candidate genes that were signi cantly correlated with the prognosis of TNBC patients, LASSO regression was performed, which was related with 6 genes (MUCL1, IVL, FAM46C, CHI3L1, PRR15L and CLEC3A) in DEGs that signi cantly associated with OS ( Figure 5).
Construction and Estimation of a 6-Gene Risk Score According to the gene expression level and the risk coe cient of each gene, the risk score of each patient was calculated (Supplementary Table4), the median risk score of each model was used as the threshold to divide the samples into high-and low-risk groups to draw the Kaplan-Meier (KM) curve and the risk factor linkage diagram of model evaluation (Figure 6), compared with the low-risk group, Kaplan-Meier survival analysis shows that the high-risk group has an obvious poorer prognosis (P<0.0001) ( Figure 6A). Besides, in order to analyze the stability of the risk prediction model, Kaplan-Meier survival curve analyses showed that the low-risk group was signi cantly correlated with better OS in N0 stage (P = 0.0032), N1-N3 stage (P = 0.00033), stage I + stage II stage (P = 0.00011) and T1-T2 (P = 0.00015) patients (Figure7).
Interanion and Validation of the 6-gene risk score Firstly, we conducted a Spearman's correlation test to evaluate the correlation between the HRD score and the 6-gene risk score. Scatter plot of the 6-gene risk score and HRD score showing the negative linear relationship between the two variables (Pearson correlation coe cient = -0.22) (Figure 8). In additional, to verify the predictive value of the six-gene prognostic signature, we conducted internal and external veri cation. The risk score in univariate analysis was signi cantly correlated with overall survival (OS) (HR = 0.074, 95% CI = 0.017-0.032, P =0.00056) ( Figure 9A). Multivariate analysis showed that the risk score was an independent prognostic indicator (HR = 39.373, 95% CI = 7.059-219.624, P < 0.001) ( Figure  9B). Besides, we obtained the GSEGSE103091 data set from the Gene Expression Omnibus (GEO) database and used our risk scoring model for survival analysis. Finally, the results were consistent with the training group ( Figure 10). All in all, the above data showed that 6-gene risk score was an independent risk factor of patients with TNBC.

Construction of nomogram
To better predict patients' prognosis and guide clinical practice, we integrated the risk score model and clinical factors of TNBC (AJCC-stage, HRD score, T stage, and N stage) to construct a compound nomogram (Figure.11A). The calibration curve of the nomogram was abnormal, so we used 1-year, 3-year, 5-year ROC curves instead. The 1-,3-and 5-year time-dependent ROC curves were used to evaluate the predictive ability of the risk score (AUC = 0.785 of 1 year, 0.710 of 3 years, and 0.847 of 5 years), and its predictive ability was found to be better than AJCC-stage ( Figure.11B-D).

Discussion
Breast cancer is the most frequently diagnosed cancer with an incidence rate of 11.7% of newly diagnosed female cancers and the rst most common cause of cancer mortality in women worldwide [13]. Triple-negative breast cancer (TNBC) accounts for 20% of all molecular subtypes of breast cancer [14]. Although signi cant bene ts of new targeted therapy and immunotherapy have been reported in the past few decades. Due to the high heterogeneity and few genetic targets, these tumors have the worst prognosis among all of the breast cancer subtypes and the overall survival rate within ve years is less than 70% [15]. Traditional clinicopathological parameters, tumor-node-metastasis (TNM) staging system and single molecular markers have obvious limitations in predicting prognosis. It is necessary to identify the effective prognostic biomarkers of TNBC and establish the relevant prognostic risk prediction model. Multiple studies have reported that mutations in HRD-related genes are associated with the prognosis of multiple tumors, as well as the e cacy of PARP inhibitors. However, there is no research involving any transcriptomics about HRD. In this study, we construct a prognostic risk model based on transcriptome data of HRD. This signature could be used to e ciently determine the overall survival time of TNBC patients.
In this study, we screened out 110 TNBC samples with HRD scores according to the TCGA breast cancer dataset and HRD database. the DEGs were identi ed from HRD tumor samples and non-HRD tumor samples in TNBC patients. After WGCNA analyses, univariate, LASSO and multivariate Cox regression analyses, MUCL1, IVL, FAM46C, CHI3L1, PRR15L and CLEC3A were screened out as prognostic genes ultimately to develop the prognostic model. These genes contained in the signature have previously been reported to be associated with different cancer in various ways.
Mucin-like 1 (MUCL1) is a gene encoding a low molecular weight glycoprotein with high similarity to sialomucins, which was only expressed in salivary glands and breast tissues. It was identi ed as a breast-speci c gene for breast cancer micrometastasi [16]. Most recently, Liu Liang et al. used IHC technology to detect the expression level of MUCL1 in para n-embedded tissues of 89 triple negative breast cancer patients and found that high MUCL1 expression is signi cantly correlated with high recurrence and death rates in triple negative breast cancer patients [17]. In additional, several studies have shown that MUCL1 expression strongly correlates with clinical stage of TNM and the status of axillary lymph node metastasis [17]. Involucrin (IVL), a component of keratinocyte crosslinked envelope, is found in the cytoplasm and crosslinked with membrane proteins by transglutaminase. This gene is mapped to 1q21, among calpactin I light chain, trichohyalin, pro llaggrin, loricrin, and calcyclin. Recently, IHL has been identi ed as a novel hub gene that shows a signi cant up-regulation in colon adenocarcinoma as compared to normal tissue [18]. So far, there is little research on IVL in TNBC. only one study reported that 6-mRNA signature including IVL may act as a potential prognostic biomarker in patients with TNBC [19]. Family with sequence similarity 46, member C (FAM46C) is a member of the FAM46 family, it is located on chromosome 1p12 and seems to play a role in the regulation of translation by acting as an mRNA stabilizing factor. its abnormal deletions in tumor tissues were con rmed in multiple myeloma [20] and gastric cancer [21]. Zhang, et al reported that FAM46C was downregulated in hepatocellular carcinoma (HCC)and induced cell apoptosis through regulating Ras/MEK/ERK pathway [22]. In additional, FAM46C was downregulated in prostate cancer to inhibit cell proliferation and cell cycle progression and promote apoptosis through PTEN/AKT signaling pathway [23]. However, there is no research on FAM46C in TNBC. CHI3L1, on human chromosome 1q32.1, encodes a secreted glycoprotein called YKL-40, which plays an important role in in ammation, angiogenesis, radioresistance, and cancer progression. Overexpression of CHI3L1 have been de described in various types of cancer, including oligodendroglia, glioblastoma, osteosarcoma, breast, and small-cell lung cancers [24,25]. YKL-40 expression was signi cantly upregulated in NSCLC tissues, and associated with poor prognosis and shorter survival [25]. PRR15L, also known as ATAD4, which encodes a protein of unknown function, to date, no report has been ascribed to this function of this gene. C-type lectin domain family 3 member A (CLEC3A), belonging to the superfamily of C-type lectins, is known to associate with cell adhesion which in uenced results in tumor cell proliferation and metastasis [26,27]. It was reported that CLEC3A expressed initially in cartilage and was associated with osteoarthritis. Recently, Ni, J et al [28] gured out that high CLEC3A expression signi cantly correlated with poor prognosis in IDC patients and promoted invasion and metastasis of breast cancer through activating PI3K/AKT signaling. Our ndings suggest that these genes may be acted as important biomarkers to predict survival outcomes in patients with TNBC. If we can explore their speci c mechanisms of action in triple-negative breast cancer more extensively and in-depth, it is likely that they can be used as new cancer biomarkers.
After identifying the six prognostic genes, the risk score model of HRD signature was developed and investigated for its prognostic value in TNBC patients, a clear separation was observed in the survival curve between patients in high-risk and low-risk subgroups, which was evaluated as a category variable (divided by median cutoff). We also found that the low-risk group had a very low proportion of deaths. Furthermore, we performed a strati cation analysis, and the results suggested that 6-gene risk score in clinical subgroups (N0 stage, N1 + N2 + N3 stage, stage I + stage II, T1 + T2 stage) could still better predict the prognosis of TNBC. Then the univariate and multivariate Cox regression analysis showed that 6-gene risk score could be an independent factor to evaluate the prognosis. Finally, we developed a nomogram to guide clinical practice including AJCC-stage, HRD score, T stage, and N stage, and risk score to construct a nomogram to predict the 3-year and 5-year survival of TNBC patients. when compared with the TNM stage, 6-gene risk score showed the even better predictive ability in the ROC analysis.
However, we acknowledge several limitations in our study. Firstly, our study only focused on the largescale mRNA sequencing data from the TCGA platform. the results report may be biased. Secondly, we searched the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) for external validation, but many data sets have no prognostic OS information. Thirdly, in this study, the gene function and participation mechanism of six-gene models have not been clari ed, and the relationship with the occurrence and development of breast cancer needs to be further con rmed by research under in vitro and in vivo conditions.

Conclusion
In summary, our study developed a six HRD-related gene risk score for the prognostic prediction of TNBC based on samples deposited in the TCGA database. which can reduce the waste of medical resources and contribute to personalized treatment decisions. Note: Cutoff, the threshold to distinguish the high and low expression of the gene. Figure 1 Kaplan-Meier curves of HRD samples with overall survival time.      The correlation between the 6-gene score and HRD scores.  Validation of the 6-gene signature. GSE103091 was regarded as the external validation set. Kaplan-Meier survival analysis of the 6-gene signature in external validation set.

Figure 11
Construction of a nomogram for overall survival prediction in TNBC. a the nomogram consists of AJCCstage, HRD score, T stage, N stage and the risk score based on the six-gene signature. b-d ROC curves for predicting 1, 3 and 5-year overall survival between 6-gene score and AJCC-stage.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.