Metabolic genes show excellent prognostic ability for clear cell renal cell carcinoma CURRENT STATUS: POSTED

Renal cell carcinoma (RCC) is one of the major malignant tumors of the urinary system, with a high mortality rate and a poor prognosis. Clear cell renal cell carcinoma (ccRCC) is the most common subtype of RCC. Although the diagnosis and treatment methods have been significantly improved, the incidence and mortality of ccRCC are high and still increasing. The occurrence and development of ccRCC are closely related to the changes of classic metabolic pathways. This article aims to explore the relationship between metabolic genes and the prognosis of patients with ccRCC.

in the expression of these MGs between high-risk patients and low-risk patients

Conclusion
We have successfully established a risk model (riskScore) based on 4 MGs, which could accurately predict the prognosis of patients with ccRCC. Our research may shed new light on ccRCC patients' prognosis and treatment management.

Background
In the United States, the death rate of cancer ranks second among all diseases [1]. Renal cell carcinoma (RCC) is a common tumor in the urinary system, ranking among the top ten cancer diagnoses in the world. Clear cell renal cell carcinoma (ccRCC) is the most common histological subtype in kidney cancer, accounting for about 80%. Compared with other subtypes of renal cell carcinoma (including papillary renal cell carcinoma, chromogenic renal cell carcinoma and collecting duct carcinoma), ccRCC showed worse survival outcomes [2]. Due to the lack of obvious clinical symptoms, many patients are already in advanced stages at the time of diagnosis. Therefore, the prognosis of patients becomes more complicated [3].
Glycolysis, fatty acid metabolism, tyrosine metabolism, nucleotide anabolism and other metabolic pathways are necessary to maintain normal cell homeostasis [4,5]. Many scientists believed that abnormal metabolic processes could be a key factor in the development of cancer [6][7][8]. In breast cancer and colorectal cancer, studies have reported that abnormal metabolic pathways are associated with poor prognosis [9,10]. Marin-Rubio et al believed that since abnormally expressed metabolites could be detected in various stages of tumor tissues, metabolic genes (MGs) could be used as prognostic markers for cancer [11]. The prognostic value of metabolic genes has been verified in a variety of cancers including prostate cancer, lung adenocarcinoma, liver cancer, head and neck squamous cell carcinoma, rectal cancer and so on [12][13][14][15][16][17].
Clear cell renal cell carcinoma has been demonstrated that it is usually accompanied by reprogramming of glucose metabolism, reprogramming of fatty acid metabolism and reprogramming of the tricarboxylic acid cycle. The metabolism of tryptophan, arginine and glutamine is also reprogrammed in many ccRCCs [18]. Therefore, these changes in metabolic pathways provide new possibilities for ccRCC treatment strategies and biomarkers. However, few studies have reported the prognostic role of metabolic genes in ccRCC.
In our study, we used gene expression data and related clinical data of ccRCC patients to explore the relationship between metabolic genes and ccRCC prognosis. Subsequently, we used the metabolic genes to establish a reliable prognostic model of ccRCC, which was verified in The Cancer Genome Atlas (TCGA) database, in GEO database and in ArrayExpress database respectively.
The "sva" package was used to correct the transcriptome data from different databases. The "Limma" package was used for further difference analysis.

Identification of MGs with prognostic value
Firstly, we excluded patients with a follow-up time of less than 90 days. Then, the univariate Cox regression analysis was used to identify genes with prognostic value from differentially expressed MGs. We used the "survival" package in R software for univariate Cox regression analysis.

Establishment of an independent prognostic index (PI, riskScore) based on MGs in training cohort
We divided patients in TCGA cohort with complete records of clinical information such as age, gender, pathology stage, histological grade and TMN stage into training cohort and testing cohort. From the perspective of the distribution of patients in various clinical features, the patient composition in the training cohort and the patient composition in the testing cohort were similar.
In order to further identify the key prognostic MGs, the lasso regression and the multivariate Cox regression analysis were performed. According to the weight of each gene in multivariate Cox regression analysis, we obtained the correlation coefficient in the model formula for predicting the prognosis of patients. Combined with the expression of various prognosis-related genes, we established an independent prognostic model. The PI was calculated using the following formula: β 1 × gene 1 expression + β 2 × gene 2 expression + ····· + β n × gene n expression, where β corresponded to the correlation coefficient.
Validation of key genes at the protein level We used The Human Protein Atlas (HPA) database (https://www.proteinatlas.org/) to verify the protein level of key MGs in the established prognosis index.
Evaluation of the prognostic index in the training cohort According to our prognostic model, each patient would get a risk score. We set the median risk score as the cutoff value for dividing ccRCC patients into a high-risk group and a low-risk group. Kaplan-Meier (K-M) method was utilized to plot the survival curves. The log-rank test was performed to assess differences in the survival rates between the high-risk group and the low-risk group. The timedependent receiver operating characteristic curves (ROC) were created by the "survivalROC" package. The area under the curve (AUC) values were calculated to evaluate the specificity and sensitivity of the model. The risk score distribution of patients, Survival status scatter plots for patients in the prognostic model and the heatmap of prognosis-related MGs were also displayed.
Verification of the prognostic ability of this model in the testing cohort and the entire cohort First of all, to verify the performance of the prognostic model in the testing cohort, the survival curve was draw with the "survival" package and "survminer" package. A time-dependent ROC curve was also made. Next, the risk score distribution of patients in the prognostic model, Survival status scatter plots for patients in the prognostic model and the heatmap of prognosis-related MGs were also displayed. We then repeated the verification work done on the testing cohort throughout the entire cohort (including TCGA cohort and GEO cohort).
Then, the correlation between the prognostic model (risk genes and riskScore) and each clinical feature was analyzed to illustrate the reliability of the model we built.
To further evaluate whether our model could be used as an independent prognostic factor, we included age, gender, pathology stage, histological grade, T, M, N and riskScore as independent variables. And then we did univariate Cox regression analysis and multivariate Cox regression analysis on the changes of overall survival time and overall survival outcome.

Verification of the prognostic ability of this model in ArrayExpress cohort
According to our prognostic model, each patient in ArrayExpress cohort will get a risk score. Then, we use the cutoff value from training cohort to divide ccRCC patients into a high-risk group and a low-risk group. Kaplan-Meier (K-M) method was utilized to plot the survival curves. And the log-rank test was performed to assess differences in the survival rates between high-risk group and low-risk group. The time-dependent receiver operating characteristic curves (ROC) were created by the "survivalROC" package. The area under the curve (AUC) values were calculated to evaluate the specificity and sensitivity of the model. Next, K-M survival curve analysis was used to assess the prognostic value of each risk gene.

GSEA enrichment analysis
Gene-set enrichment analysis was used to explore the mechanisms that lead to different outcomes between patients in the high-risk group and patients in the low-risk group.
Nomogram development and validation for prognostic risk prediction By means of "rms" package of R software, A prognostic nomogram was also performed to visualize the relationship between individual predictors and overall survival rates in patients with ccRCC based on the Cox proportional hazard regression model.

Establishment of a ceRNA network
In order to further explore the possible mechanism for the difference in the expression levels of key risk genes between the two groups, we downloaded microRNA expression data and lncRNA expression data of ccRCC patients from the TCGA database.
According to the prediction model we established, we divided TCGA-ccRCC patients into high-risk and low-risk groups. Then we used R software to standardize the data and performed a difference analysis with the "Limma" package.
The lncRNA-miRNA interactions and miRNA-mRNA interactions were integrated together to form the molecular mechanism axis of lncRNA-miRNA-mRNA by using Cytoscape v3.7.1.

Statistical analysis
Statistical analyses of all data utilized in this article were completed by R software (version 3.5.1, https://www.r-project.org/). When the difference met a joint satisfaction of FDR < 0.05 and |log2fold changes (FC)| >1, it was regarded to be statistically significant. Student's t test was used for continuous variables, while categorical variables were compared with the chi-square (χ2) test. The Wilcoxon rank-sum test was utilized to compare ranked data with two categories. The Kruskal-Wallis test was utilized for comparisons among three or more groups. The univariate Cox regression analysis and multivariate Cox regression analysis were used to evaluate the relationship between MGs expression and survival data to establish a prognostic model. "rms" package of R software was used to create the nomogram. The receiver operating characteristic curves were created by the "survivalROC" package of R software and AUC values were also calculated by this package. If the AUC > 0.60, we would consider this model to have a certain predictive ability. If the AUC > 0.75, we would consider this prediction model to have excellent predictive value. All statistical tests were twosided and P < 0.05 was considered to be statistically significant.

Differentially expressed prognostic MGs
Through the online TCGA database, we obtained the mRNA expression of 446 ccRCC tissues and 63 normal kidney tissues. We combined the RNA sequencing results of 39 other ccRCC patients in the GSE29609 dataset with TCGA cohort.
After correcting and standardizing the data, we obtained the differentially expressed gene profile between the ccRCC group and the normal group through the "Limma" package. Then, we extracted the information on the differential expression of 944 metabolic genes. The filtration conditions (|log 2 fold-change|>1, FDR < 0.05) were set. Next, 479 differentially expressed MGs were identified, including 196 up-regulated genes and 283 down-regulated genes (Fig. 1). Based on the obtained differentially expressed MGs, we carried out the univariate Cox regression analysis to identify MGs related to the ccRCC patient's overall survival (OS). Finally, a total of 187 genes were considered to be prognostic MGs.

Construction a prognostic model index based on MGs
Due to the lack of clinical data such as pathology stage and histological grade in the GEO database, we use the patients in the TCGA database to divide them into training cohort and testing cohort.
Finally, the patients in the TCGA database and the patients in the GEO database were merged together as the entire cohort.
After excluding patients with a follow-up survival time of less than 90 days (n = 15) and excluding patients with incomplete clinical information (n = 11), we divided the remaining TCGA-ccRCC patients into two groups with similar composition ratios (296 patients for training cohort, 124 patients for testing cohort). Table 1 shows the clinical characteristics of patients included in the study.
Immediately afterwards, the lasso regression and the multivariate Cox regression analysis were performed in the training cohort to target key risk genes (Fig. 2). Finally, we identified 4 optimal risk genes. Among these four risk genes, P4HA3 was considered as predictors of poor prognosis. The

Evaluation of the prognostic model index based on MGs in training cohort and testing cohort
After obtaining the prognostic model index based on MGs for predicting the prognosis of ccRCC patients, we took a series of measures to evaluate the model.
Firstly, we searched The Human Protein Atlas (HPA) database (https://www.proteinatlas.org/) for risk genes and found that at the protein level, the expression of these risk genes was consistent with the results of the mRNA differential analysis we obtained (Fig. 3). This indicated that the model we built is to some extent credible.
Then, we created Kaplan-Meier curves based on the log-rank test to visualize the prognostic value of our established prognostic model in training cohort and in testing cohort (Fig. 4A, Fig. 5A). From the survival curves, we could see that whether in the training cohort or in the testing cohort, patients in the high-risk group have worse prognosis than those in the low-risk group (HR = 1.1, 95%CI = 1.1-1.2, p < 0.001; HR = 1.1, 95%CI = 1.1-1.2, p < 0.001). Figure 4B  From Fig. 4D and Fig. 5D we found that as the risk score increases, the number of dead patients increases. The expression patterns of the risk genes in the high-risk group and the low-risk group were shown in the form of a heat map (Fig. 4E, Fig. 5E), from which we found that whether in the training cohort or in the testing cohort, P4HA3 was up-regulated in the high-risk group, downregulated in the low-risk group. The patterns of ETNK2, PAFAH2 and ALAD were the opposite.
Evaluation of the prognostic model index based on MGs in entire cohort  (Fig. 6A-D). It was worth mentioning that the predictive ability of the prognostic model we established was better than the predictive ability of the pathology stage. Figure 6E shows the result of risk classification of patients according to riskScore. From Fig. 6F we could found that as the risk score increases, the number of deaths increases. The expression patterns of the risk genes in the high-risk group and the low-risk group were shown in the form of a heat map (Fig. 6G), from which we found that P4HA3 was up-regulated in the high-risk group, down-regulated in the low-risk group. The patterns of ETNK2, PAFAH2 and ALAD were the opposite.
Evaluation of the prognostic model index based on MGs in ArrayExpress cohort In order to verify whether our model was reliable, we evaluated the prognostic value of risk score in the external cohort from ArrayExpress database (E-MTAB-1980). The external cohort contained 101 patients with ccRCC. Similarly, we calculated the risk score of each patient based on riskScore. Then we divided the patients into a high-risk group and a low-risk group according to the cutoff value we obtained in the training cohort. A Kaplan-Meier curve based on the log-rank test and the timedepedent ROC curve were created to visualize the prognostic value of our established prognostic model in external cohort ( Fig. 7A and Fig. 7B)

Clinical correlation analysis
Based on the information of all ccRCC patients from TCGA database we obtained, the correlation analysis between risk factors in the prognostic model (riskScore and each component gene) and clinical characteristics such as age, gender, pathology stage, histological grade, TMN was performed. Figure 8 showed the results. We found that there was a significant correlation between the riskScore and gender, pathology stage, histological grade, T, M (Fig. 8A).

Independent prognostic factor evaluation and GSEA enrichment analysis
To further evaluate whether our model could be used as an independent prognostic factor, we included some key clinical characteristics containing age, gender, pathology stage, histological grade, TMN and riskScore as independent variables. By means of univariate and multivariate Cox regression analysis, our established PI (riskScore) remained significant (both P < 0.001, Table 2). At the same time, the results of multivariate Cox regression analysis also showed that histological grade and M could be used as independent prognostic indicators (both P < 0.05). Looking only at the results of multivariate Cox regression analysis, we could find that riskScore, histological grade and M were related to prognosis (p < 0.001, p = 0.041, p = 0.014 respectively).
In addition, in order to further explore the possible mechanisms that caused different outcomes in the high-risk group and the low-risk group, we performed Gene-set enrichment analysis (GSEA) on the gene expression profiles of the two groups of patients. Figure  Then, we used miRcode online software to predict possible matching pairs between differential lncRNAs and differential microRNAs. Finally, 26 of 176 differentially expressed miRNAs and 340 of 535 differentially expressed lncRNAs were successfully matched, and there was a correlation between them. According to the 26 target miRNAs and 4 risk genes, 23 microRNA-mRNA interactions were found in the three different databases of Targetscan, miRTarBase and miRDB, 159 lncRNA-microRNA interactions were also identified. In general, a lncRNA-microRNA-mRNA ceRNA network was established to further explore the possible mechanism for the difference in the expression levels of key risk genes between the two groups (Fig. 11).

Discussion
In recent years, the incidence of asymptomatic ccRCC has been rising. Patients are prone to poor prognosis and high mortality, which brings new challenges to early clinical detection and treatment.
Studies have shown that the effects of chemotherapy and radiotherapy on ccRCC are not ideal.
Although surgical resection can suppress the metastasis and development of the tumor to a certain extent, the postoperative recurrence rate is persistent and it is difficult to achieve the desired therapeutic effect [19,20]. Studies have shown that after surgical removal of cancer tissue, many ccRCC patients still have poor prognosis such as cancer metastasis and death. The 5-year survival rate is low, indicating that the treatment effect of ccRCC patients is limited [21]. Therefore, biological indicators of disease occurrence and development have become the focus of research in recent years [22][23][24].
Changes in metabolic pathways exist in many diseases, including tumors. Metabolic reprogramming is very important to maintain abnormal proliferation of tumor cells [25]. More and more cancer researchers are focusing their attention on the mechanism of tumor metabolic changes. A recent study pointed out that the heterogeneity of metabolic reprogramming in the tumor is clearly related to the outcome of the tumor [26]. New cancer treatments targeting metabolic pathways are being studied in many tumors. A global regulator of glucose metabolism (MUC1), which was found to be a target for pancreatic cancer treatment recently. After MUC1 was knocked out in pancreatic cancer cells, Fu et al found that the metabolic activity of the cancer cells was significantly reduced, the cancer cells were better able to receive chemotherapy [27]. Similarly, in soft tissue sarcoma, targeting glutamine metabolism can well inhibit tumor growth [28]. Drugs that inhibit glucose metabolism and drugs that inhibit lipid metabolism can be used as a new treatment for lung cancer [29]. Recently, some researchers have suggested that ccRCC is a metabolic disease [30]. They believe that some metabolic pathways in ccRCC can be used as potential biomarkers for therapeutic targets, diagnosis and prognosis. Lucarelli et al pointed out that the changes of metabolic activities in cancer are related to the changes of gene expression. They believed that the changes of metabolic genes are closely related to the occurrence and development of tumors. They further determined that NADH dehydrogenase (ubiquinone) 1 alpha subcomplex 4-like 2 (NDUFA4L2) plays an important role in the occurrence and development of ccRCC [31]. However, few studies have studied the relationship between metabolic genes and the prognosis of ccRCC patients.
In our study, we analyzed and screened out new metabolic genes with prognostic value from the TCGA database and the GEO database. We further verified the feasibility of the method through the data downloaded from the ArrayExpress database. Finally, we have successfully established a risk model (riskScore) based on 4 MGs, which could accurately predict the prognosis of patients with ccRCC. The scientific model construction methods, comprehensive evaluation methods of prognosis ability and verification of multiple databases make the prognostic model we have established with high credibility. According to the prognostic model we established, patients with different survival outcomes were accurately classified into high-risk patients and low-risk patients. Which also indicates that metabolic genes have a good prognostic value in ccRCC. In addition, we have further conducted GSEA analysis to explore the possible mechanisms leading to different survival outcomes. The results of GSEA analysis show that many metabolic pathways changed in low-risk patients, which further suggests that metabolism plays a very important role in ccRCC.
P4HA3, ETNK2, PAFAH2 and ALAD are the four risk metabolic genes we identified. Among these four risk genes, P4HA3 was considered as predictors of poor prognosis. The higher the expression of P4HA3, the worse the prognosis of patients. Three other genes, ETNK2, PAFAH2 and ALAD, were protective factors. P4HA3, encoding a component of prolyl 4-hydroxylase, can promote the proliferation and invasion of melanoma cells [32]. In addition, P4HA3 has been shown to be closely related to the poor survival outcomes of breast cancer and gastric cancer [33,34]. ETNK2 (Ethanolamine Kinase 2) is a protein Coding gene. Among its related pathways are phosphatidylethanolamine biosynthesis II and Glycerophospholipid biosynthesis. The decreased expression of ETNK2 is related to the progression of prostate cancer [35]. PAFAH2 is closely related to ether lipid metabolism. Kono et al revealed that PAFAH2 is involved in the metabolism of esterified 8isoprostaglandin F (2alpha) and protects tissue from oxidative stress-induced injury [36]. However, no one has pointed out its relationship with the survival of cancer patients. Our analysis of patients in the TCGA database and ArrayExpress database showed that PAFAH2 could well distinguish the survival outcome of ccRCC patients and could be a protective prognostic factor. ALAD Catalyzes an early step in the biosynthesis of tetrapyrroles. ALAD has been found to be related to the favorable survival outcome of breast cancer patients. Overexpression of ALAD can inhibit the proliferation and invasion of breast cancer cells [37]. In summary, the four risk genes we identified show a relatively consistent prognostic value in the TCGA database and ArrayExpress database, and have been verified to a certain extent in the IHC data of the HPA database and in other tumors. All these suggest that the prognostic model we established has a high degree of credibility. In addition, we successfully established a lncRNA-microRNA-mRNA ceRNA network to further explain the differences in the expression of these genes between high-risk patients and low-risk patients.