Establishment and Validation of a Novel Metabolism-related Gene Signature for Predicting Overall Survival of Patients with Breast Cancer

It is well known that Breast cancer is a heterogeneous disease.Although the current recurrence and mortality rate have been greatly improved, many people still suffer relapse and metastasis.Metabolic reprograming is currently considered to be a new hallmark of cancer.Therefore,in this study, we comprehensively analyzed the prognostic effect of metabolic-related gene signatures in breast cancer and its relationship with the immune microenvironment.We constructed a novel metabolic-related gene signature containing 6 genes to distinguish between high and low risk groups by univariate Cox regression and least absolute shrinkage and selection operator (LASSO) regression, and validated its robustness and accuracy through multiple databases. The metabolic gene signature may be an independent risk factor for BC both in the training and the testing set,the nomogram has a moderately accurate performance,and the C index was 0.757 and 0.728 respectively.The signature can reveal metabolic characteristics based on gene set enrichment analysis and at the same time monitor the status of TME.This gene signature can be used as a promising independent prognostic marker for BC patients, and can indicate the current status of TME, providing more clues for exploring new diagnostic and treatment strategies.


Introduction
Breast cancer (BC) has emerged as the highest incidence in women accounting for nearly 24.2% of female new cases worldwide and the leading cause of death according to the latest report of Global Cancer Statistics 2018 (1).The incidence and mortality of breast cancer are rising year by year (2).Although big progress has been made in the treatment of breast cancer over the previous decades,the long-term prognosis for patients remains poor due to the high rate of recurrence and metastasis. Breast cancer is a heterogeneous disease consisting of ve molecular subtypes and different treatments approaches are needed according to the type of biological molecule markers (3).Therefore looking for new targets for treatment of breast cancer and the development of diagnostic and prognosis markers is an urgent need and could provide early and effective treatment.Improved prognostic models will help to individualize treatment regimens for breast cancer patients (4).
Tumor cells have powerful characteristics of adapting to adversity and growing rapidly.This adaptability is achieved by changing the way of energy metabolism of tumor cells, called metabolic-related reprogramming (5).Metabolic-related reprogramming has become one of the ten hallmarks of tumors, which is also one of the four emerging characteristics of tumors (6).Changes in tumor cell metabolism enable tumor cells to obtain growth advantages,escape metabolic-related pressure,promote tumor cell immune escape, and enhance their invasion ability (7). Metastasis formation accounts for the majority of tumor-related deaths and consists of different steps, each of which is characterized by a unique adaptive phenotype of cancer cells. Metabolic reprogramming represents one of the main adaptive phenotypes utilized by cancer cells in all major steps of tumor and metastasis progression.However,the underlying molecular mechanism and pathways of metabolic-related reprogramming are still unclear, especially in breast cancer.
To further investigate the prognostic role of these metabolic genes in BC, we conduct this study.In this study, we downloaded the data from the The Cancer Genome Atlas (TCGA) databases to identify the differentially expressed metabolism-related genes(DE-MRGs) in breast cancer and established prognosis molecular signatures and validated it through the Gene Expression Omnibus(GEO) database.The important prognostic role of metabolic-related genes(MRGs) in breast cancer has been elucidated.

Data collection and Identi cation of differentially expressed MRGs (DE-MRGs)
We downloaded the messenger RNA(mRNA) expression pro les(FPKM format) and the corresponding clinical information of breast cancer patients from TCGA database(https://portal.gdc.cancer.gov) concluding 113 normal breast and 1109 BC cases.Missing data was removed.The data were all female patients.The GSE20685 gene expression pro le and clinical information was collected from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) ,which contains a total of 327 BC cases. TCGA dataset were used as the training cohort and the geo dataset were used as the validation cohort.Samples within 30 days of follow-up data were deleted.
The MRGs were obtained from the downloads section of GSEA website(http://software.broadinstitute.org/gsea/index.jsp).GSEA v4.1.0 for Windows and c2.cp.kegg.v7.2.symbols.gmt were downloaded from the GSEA website for the further extraction analysis. Genes enriched in metabolism pathways of the KEGG database were utilized in this study as MRGs.In total, 948 MRGs were obtained, of which 861 MRGs were common between TCGA and GSE20685 dataset.The mRNA expression of common MRGs in the TCGA database and GEO database was extracted respectively.Then, an adjustment was given to adjust different mRNA expression levels of metabolicrelated genes between TCGA and GEO databases by 'sva' package (8) in R software (version 4.0.2).
Genes were selected as candidate metabolic-related genes for meeting the following two conditions: (a) they have the same expression pattern in TCGA database and the GEO database; and (b) they were listed in the metabolic-associated pathways.The R package 'limma'(9) was used to screen the DE-MRGs in the TCGA database.The screening criteria were set as |logFC| > 0.5 and P-value < 0.05.

Prediction of potential drugs for BC treatment by Cmap
The Cmap database(http://www.broadinstitute.org/cMAP/ is a computer simulation method to predict small molecular compounds for a speci c disease.The DE-MRGs were divided into upregulated and downregulated genes group which were subsequently uploaded to the Cmap,and searches for small molecule drugs that may treat BC.The higher negative connectivity score of drugs indicate greater suggestive of its possible treatment.Drugs with a score of ≤0.75 were considered as candidate drugs.Tomograph of these relative molecular drugs was searched in Pubchem database.

GO and KEGG Analysis
The functions of the DE-MRGs were analyzed using the Gene Ontology (GO) tool and the Kyoto Encyclopedia of Genes and Genomes (KEGG) by using the "clusterPro ler"package (10). A P-value of <0.05 was considered signi cant.
Construction and Assessment of the Metabolic Gene Signature andSurvival Analysis Univariate Cox regression analysis was used to identify prognostic associated MRGs in the TCGA database.Genes with P<0.05 were de ned as candidate genes to further construct the least absolute shrinkage and selection operator (LASSO)-penalized Cox regression model.The LASSO regression was performed to construct the prognostic signature and avoid over tting of this model by glmnet package (11).The risk score of every patient was calculated with this model. The formula of risk score was as follows: riskscore = (Coe cientmRNA1 × expression of mRNA1) + (Coe cientmRNA2 ×expression of mRNA2) + ⋯ + (Coe cientmRNAn × expression mRNAn).
Based on the median of risk score, samples were classi ed into high-risk group and low-risk group. A Kaplan-Meier (K-M) survival curve was plotted to compare the predictive survival time between the two groups.The model was evaluated by the area under the curves (AUCs) of the receive operator characteristic (ROC) curve.The risk score was also proved as an independent risk factor by univariate and multivariate Cox regression respectively. During the analyses, the TCGA cohort was used as a training set and the GSE20685 cohort was used as an external testing cohort.

Construction and Validation of a Prognostic Nomogram
Based on the lasso Cox regression model results,a prognostic nomograms including clinical features and risk scores were constructed and assessed to predict 3 and 5 year survival rates in BC patients for TCGA and GEO cohorts respectively.The calibration curves was plotted to evaluate the prediction probabilities and tness of the metabolic signature.The distinguishing ability of the nomograms was evaluated by the concordance index (C-index), time-Dependent ROC Curve and the area under the curve (AUC). Finally, a net bene t curve for patients was plotted to re ect the potential utility and evaluate the clinical value of thismodel by decision curve analysis (DCA).
The genes included in the signature validated their expression signi cance in TIMER database( https://cistrome.shinyapps.io/timer/).The cBioportal database (https://www.cbioportal.org/) was used to investigate the overview of the alteration that occurred in BC for the novel metabolic-related genes.Moreover,The protein expression levels was validated in the Human Protein Atlas database (https://www.proteinatlas.org/) to visually compare the differentiation between tumor and control tissues.

GO and KEGG analyses by GSEA
We utilized GESA software(12)(version 4.1.0,permutations was set as 1000) to perform KEGG pathway analyses for the TCGA cohort.The potential molecular mechanism of genes in this signature was elucidated. Enrichment results satisfying a nominal P-value cutoff of <0.05 with FDR <0.25 were considered statistically signi cant.
Clinical Application in Tumor Microenvironment.
The ESTIMATE scores for each sample was calculated and the comparison between high-and low-risk groups was made.The ESTIMATE scores contained immune score, stromal score,and tumor purity which respectively re ected the in ltration level of immune cells,the stromal content, and the estimated tumor purity (13).Furthermore,we calculated speci c in ltration levels for 22 subtypes of immune cells through the CIBERSORT system to extend the utility of this metabolic signature (14).

Statistical analysis
All the statistical analyses were conducted by R software v4.0.2.The Cox and LASSO regression were employed to screen the survival related variables.The survival curves were compared by the log-rank test. The differences for the independent samples were analyzed by the Wilcoxon rank-sum test.The coe cient of correlation was calculated by Pearson correlation analysis. p<0.05 was considered statistically signi cant.

Extraction of MRGs from the TCGA database and identi cation of DE-MRGs
We downloaded a total of 1222 cases from the TCGA database and extracted 927 MRGs from the TCGA gene expression matrix.After adjusting with GSE20685 expression pro le,861 MRGs was nally selected as candidate MRGs.After screening, we identi ed 315 DE-MRGs in the TCGA database under the screening criteria (|logFC|>0.5 and the P-value<0.05), including 163 up-regulated MRGs and 152 downregulated MRGs. The top 10 differentially up-regulated MRGs and the top 10 differentially down-regulated MRGs were demonstrated in Table 1.The heatmaps of these top differentially up-regulated and downregulated MRGs were demonstrated in Figure 1. The potential drug treatments for breast cancer From the prediction of the Cmap dataset,6 candidate drugs that scored ≤-0.75 were considered as potential drugs for BC treatment.Strong negative correlationwas found between BC and LY-294002, tanespimycin, apigenin, omeprazole, liothyronine, sirolimus.Strong positive correlation was found between BC and adiphenine,viomycin,Prestwick-692,genistein,iso upredone and atractyloside( Table  2).These drugs might have therapeutic effects on BC.The tomographes of the top 3 associated molecule drugs were investigated in Pubchem database (Figure 2a-c). The GO analysis includes three categories (biological process, molecular function, and cellular component),and the 10 signi cant enrichment terms were displayed for each category ( Figure 3A).From the biological process,it was observed that DEGs were predominantly associated with small molecule catabolic process,purine-containing compound metabolic process and nucleoside phosphate biosynthetic process.For the cellular component,these DEGs were enriched in mitochondrial matrix,organelle outer membrane and outer membrane.In molecular function,these DEGs were associated with coenzyme binding,oxidoreductase activity,acting on CH-OH group of donors and oxidoreductase activity.
Furthermore, analysis of the KEGG pathways of these metabolic genes showed that 32 KEGG pathways were enriched,mainly including purine metabolism,Carbon metabolism,Fatty acid degradation, Tryptophan metabolism and Pyruvate metabolism (Fig. 3B) Construction and of validation the prognostic metabolic gene signature Univariate Cox regression analysis and lasso-penalized Cox regression analysis (15) were performed to identify the prognosis related metabolic genes and construct the prognostic gene signature.P<0.01 in univariate Cox regression analysis were considered statistically signi cant.LASSO analysis identi ed six DE-MRGs(NT5E,PAICS,PFKL,PLA2G2D,QPRT and SHMT2) which were included in the classi er. The riskscore = 0.0697*expression of NT5E + 0.0182*expression of PAICS + 0.0220*expression of PFKL -0.1127*expression of PLA2G2D + 0.0239*expression of QPRT + 0.0068*expression of SHMT2. All patients were assigned to high-(n = 519) and low-risk(n = 520) groups based on median risk scores. The AUC of ROC to the risk score was the best in both training set and testing set and the K-M survival analysis indicated signi cantly different survival time between the two groups ( Figure 4).In both univariate and multivariate Cox regressions, the hazard ratio of risk score was maximal compared with other clinical features.The univariate Cox regression focused on the individual variable but may be affected by the confounding factors.The multivariate Cox regression avoided this limitation. These analyses complemented each other and indicated that the risk score could be a de nitely independent risk factor for the prognosis of BC ( Figure 5).

Constructing and validating a predictive nomogram in the TCGA and GEO cohort
Nomograms was created for OS based on their common clinical traits.The prognostic nomogram for 3-,5-and 10-year OS of the TCGA and GEO cohort were shown in Figure 6A,B.In order to identify the discriminating superiority of nomograms,various methods were used in this study, including calibration curves C-index values and DCA curves.Calibration plots showed that the performance of the nomogram was best in predicting 3-,5-and 10-year OS ( Figure 6C,D). The C-index of the nomogram for the prediction of OS were 0.757 in training group,and 0.728 invalidation group.The DCA curves showed some net bene t for predicting OS, especially for 3-,5-and 10-year survival both in training group and validation group( Figure 6E,F).

Page 9/16
External Validation of the Prognostic Signature.
All these ve genes have been con rmed that they were signi cantly differentially expressed between BC and control tissues in the TIMER2 database, with signi cant differences in expression (Figure 7), which is consistent with our results.Through the Human Protein Atlas Database (https://www.proteinatlas.org/),we compared the protein expression levels in breast cancer and normal tissues( Figure 8A).In the cBioportal database,the mutation rate of QPRT is the highest, 4% in the sample, and other genes also show changes( Figure 8B). Correlation with Tumor Microenvironment.

Gene set enrichment analyses
Compared with the high-risk group, the Stromal score, Immune score and ESTIMATEscore of the samples in the low-risk group were higher, and the tumor purity in the high-risk group was higher with signi cant differences Figure 10 .

Discussion
The incidence of breast cancer increase continuously in the world and had been a serious women disease (16).Even with current signi cant progress in breast cancer treatment,25% to 50% of breast cancers would eventually develop metastasis, leading to poor prognosis (17).The elds of personalized prognosis and precise prediction in medicine are rapidly growing ones (18).In the era of precision medicine,it is vital to include as much prognostic and predictive information as possible for decisionmaking (19).
In this study, we rst performed a differential analysis of the metabolic genes in breast cancer and normal tissues in TCGA and GSE20605, and then performed enrichment analysis, and found DEGs mainly involved in various DNA, protein and fat metabolism processes.Furtherly,through Cmap database, we mainly screened 6 kinds of drugs that may treat breast cancer, among which the rst 3 kinds are LY-294002, tanespimycin, apigenin. LY294002 is a PI3K inhibitor that has been used to treat breast cancer in clinical practice LY294002 has a signi cant inhibitory effect on the proliferation of TNBC cells,and LY294002 is more effective on TNBC cells lacking BRCA1 (20).Tanespimycin is a quinone quinone antitumor antibiotic, derived from the anti-tumor antibiotic geldanamycin.Tanespimycin binds to and inhibits the cytoplasmic chaperone function of heat shock protein 90 (HSP90), It has been reported in the literature that Tanespimycin has strong anti-proliferation activity against breast cancer, colorectal cancer and cervical cancer (21) and can reshape the immunosuppressive microenvironment of triple-negative breast cancer, which is conducive to checkpoint blocking immunotherapy (22).Acquired resistance to adriamycin is a major obstacle to triple negative breast cancer (TNBC) treatment,apigenin can enhance the cytotoxic effect of adriamycin on breast cancer cells (23).
It has been reported that the occurrence of tumors depends on the direct and indirect consequences of oncogenic mutations, namely the reprogramming of cell metabolism and has been proposed to be a core hallmark of cancer (24). Therefore,identifying new metabolic genes in cancer to predict the risk of cancer death has become a hot spot.
In this study, we performed LASSO regression analyses and identi ed a novel six metabolic prognosisrelated genes signature including NT5E,PAICS,PFKL,PLA2G2D,QPRT and SHMT2 base on TCGA data set and veri ed its e ciency through the data downloaded from the GEO database (GSE20685).Among them, PLA2G2D was considered a protective factor while others were risk factors.It has satisfactory robustness and internal predictive ability in both data set.According to the all patients'riskscore results, the low-risk and high-risk patients are effectively strati ed.The data from the GEO database further validated the e cacy, indicating that the predictive e cacy of these metabolism-related genes has strong prognostic value. Abnormal metabolic function in the tumor microenvironment may lead to various prognosis of patients,and genes related to metabolism can be used as prognostic indicators for tumors.thus,we explored the correlation between signatures and TME to expand clinical applications and provide more clues for the choice of treatment strategies.
Previous studies have reported the study of genes in the model. NT5E, 5'-nucleotidase ecto, also known as CD73, is one of the key components of tumor immunosuppressive microenvironment formation. Studies have reported that the high expression of NT5E has a poor prognosis in breast and ovarian cancer, but has a protective effect in lung and gastric cancer. (25).Buisseret L etal reported that the expression of CD73 is related to poor prognosis of human TNBC and low anti-tumor immune function. Targeting CD73 may be a promising strategy to reshape the tumor microenvironment of this subtype (26). CD73 is highly expressed in breast cancer tissues, and increases with the increase of tumor grade and lymph node metastasis. CD73 is highly expressed in more malignant cells, and CD73 overexpression promotes the proliferation of breast cancer cells in vivo and in vitro. It activates the akt/gsk-3 and β/βcatenin/cyclinD1 signaling pathways through mechanisms such as CD73 enzyme activity (27).
PAICS(phosphoribosamidoimidazole carboxylase and phosphoribosamidoimidazole succinyloxycarbonylacetamide synthetase) catalyze two basic steps in the de novo purine biosynthetic pathway and overexpressed in many cancers and may become a promising target for the development of cancer treatments (28). The depletion of PAICS largely eliminates the expansion of breast cancer, exemplifying a prognostic gene related to breast cancer activity (29). Knockdown of PAICS inhibits malignant proliferation of human breast cancer cell lines, These ndings demonstrate that PAICS plays an essential role in breast cancer proliferation in vitro, which provides a new opportunity for discovering and identifying novel effective treatment strategies (30).PAICS is a therapeutic target for pancreatic cancer and can be targeted by small molecules (31).
Phosphofructokinase(PFK) is a key regulator of glycolysis and a key control point for glycolytic ux by catalyzing the phosphorylation of fructose 6 phosphate to fructose 1,6 bisphosphate. PFK is a tetramer composed of three subunits M (muscle), L (liver) and P (platelets), which are respectively encoded by three different genes PFKM, PFKL and PFKP (32).PFKL rs2073436C>G can predict the e cacy of NSCLC chemotherapy (33).Study reported A20 is an E3 ubiquitin ligase that interacts with the liver-type phosphofructokinase (PFKL) in liver cancer A20 to promote the degradation of PFKL, thereby inhibiting glycolysis of liver cancer cell lines (34). The PFKL/miR-128 axis regulates glycolysis by inhibiting AKT phosphorylation and predicts the low survival rate of lung cancer patients (35). PFKs are key genes responsible for glycolysis, and their expression is up-regulated in bladder cancer.Targeting this pathway can inhibit the growth of bladder cancer cells (36).
PLA2G2D, a secreted PLA subtype ,Quantitative PCR analysis of bovine PLA2G2D transcripts showed that their expression levels in breast samples were different between the dry period and lactation period, and their expression in liver tissue was polymorphic (37). Recent studies have shown that secreted phospholipase A2 (sPLA2s) represents an attractive potential tumor biomarker and a therapeutic target for various cancers, and the expression of PLA2G2D is reduced by up to 23 times. May be an ideal candidate for new biomarkers that in uence colon cancer (38). In the context of cancer, the lack of Pl2g2d leads to a signi cant reduction in the occurrence of skin cancer, which may be due to enhanced antitumor immunity. In summary, these results emphasize the general role of sPLA2-IID as an immunosuppressive sPLA2, which allows the lipid balance of the microenvironment to enter an antiin ammatory state, and exert bene cial or harmful effects according to different pathophysiological backgrounds in in ammation and cancer (39).
Quinolinate phosphoribosyltransferase (QPRT) is a key enzyme in the de novo synthesis of NAD.Compared with normal tissues, DCTPP1 and QPRT are highly expressed in BC. The overexpression of DCTPP1 and QPRT is related to the slow progress of BC, which promotes the growth, migration and invasion of MCF7 and T47D cells, but inhibits cell apoptosis DSCAM-AS1 gene knockout reduced the expression of DCTPP1 and QPRT, and inhibited the growth, migration and invasion of estrogen receptor-positive BC (40). Phosphoribosyl transferase (QPRT) is the terminal rate-limiting enzyme in KP Kynurenine pathway ,and its expression is down-regulated, Widespread dysregulation of KP in renal cell carcinoma is common and may lead to tumor immune escape, which is of great signi cance for effective targeted therapy of this key pathway (41).
Serine hydroxymethyltransferase 2 (SHMT2) is a protein-coding gene that regulates the production of glycine in mitochondria, which is an important intermediate in purine biosynthesis, It is a valuable marker in a variety of cancers, including intrahepatic cholangiocarcinoma, large B-cell lymphoma,Kidney Cancer and breast cancer (42).Studies have shown that SHMT2 expression is an independent prognostic factor for breast cancer (43).
GSEA analysis revealed that the difference in enrichment pathways between high-risk and low-risk groups.The results show that the two risk groups have signi cantly different metabolic characteristics.The pathway in the high-risk group is mainly related to DNA synthesis metabolism, and the pathway in the low-risk group is related to lipid metabolism.
Changes in tumor cell metabolism have caused a very large impact on the tumor microenvironment, causing corresponding changes in the pH value and metabolites of the microenvironment.In this study,we have not only clari ed the metabolic characteristics and mechanisms under the conditions of different risk strati cations, but also predicted the metabolic microenvironment and prognostic characteristics through different immune scoring methods. Correlation analysis shows that our metabolic signature is statistically correlated with approximately half of immune cells,therefore,we can monitor the in ltration composition of immune cells and the degree of immune response through metabolic characteristics.Our metabolic signature can re ect the changes in TME from different aspects,and may provide clues for explaining immunological resistance and immunotherapyapplications.
Our study has several limitations.Firstly,the included metabolic genes were from the TCGA and GSE20685 databases, and some metabolic genes may be missed.Secondly,although our signatures had been validated in multiple databases,the experiments are still needed to further con rm its clinical application value.Thirdly,our data was based on transcriptomics data, it was di cult to re ect the overall landscape of tumor metabolism.

Conclusions
Based on the TCGA data set,we determined a new six-gene metabolic gene signature for BC prognosis prediction.Our signature may re ect the dysregulation of the metabolic microenvironment and provide potential biological markers for metabolic therapy and treatment response prediction.However,more functional experiments are needed to validate our signature.

Declarations
Data Availability