Development of Prognostic Model Based on Five-gene Related to Tumor Mutation Burden in Hepatocellular Carcinoma

Background Hepatocellular carcinoma (HCC) , the incidence rate ranks sixth and regards as the second leading cause of cancer-related death in the world. Tumor mutation burden (TMB) , a novel biomarker featured with microsatellite instability, has been thought to be closely related to tumor microenvironment and immunotherapy. Methods In this study, the 357 HCC samples were download from The Cancer Genome Atlas (TCGA) and analyzed. Results Single nucleotide polymorphism (SNP) and C > T variation had the most missense mutations, and we also observed the higher mutation frequency in TP53, TTN, CTNNB1. With an optimal cut-off value of 4.61, the high level of TMB presented a worse prognosis. Functional analysis of the differentially expressed genes (DEGs) showed they were enriched in the pathway of the formation of extracellular matrix organization and sulfur compound metabolic. A 5-gene-based (including SFRP4, IL7R, FBLN2, COLEC10 and CHGA) model has been constructed to explore the independent prognosis capacity of each HCC patient. The establishment of model well predicting the median of 1- , 3- and 5-year the


Introduction
Hepatocellular carcinoma (HCC) has become the sixth most prevailing type of cancer and the second leading cause of cancer-related death in the world, accounting for a majority of the primary liver cancer 1 . Given the rapid progress and its multiple complications, most patients were diagnosed until the disease is in the middle-advanced stages. Moreover, a high-proportioned of patients with early-stage HCC experienced relapse after surgical interventions. The high mortality rate worldwide mirrors the di culty in clinical diagnosis 2 . Tumor mutation burden (TMB) has been identi ed as a reliable indicator for immunotherapy effect, which provides a possible strategy in the most extensively studied pre-clinical eld. TMB represents the total number of the substitution, insertion or deletion mutation for every megabase in the exon coding region 3 . It is a novel marker to forecast the bene t of the treatment of ICIs for a range of tumors, such as lung, endometrial, breast, and colorectal cancer [4][5][6][7][8] .
In spite of the connection between TMB and the immune microenvironment, as well as that between TMB and immunotherapy in HCC have been preliminarily investigated recently, the correlation between TMB and overall survival (OS) in patients remains unclear [9][10][11] . Additionally, due to the complexity and high cost of sequencing, it is unlikely to be widely used in clinical evaluation of the e cacy of immunotherapy. Meanwhile, studies on HCC mentioned the relationship between TMB and speci c genetic changes were mercifully rare.
The main objective of our research is to explore the potentially predictive value of TMB in terms of the OS of HCC patients, and investigate hub genes related to TMB levels. To predict the likelihood of OS rate of each individual in HCC patients, we identi ed 5 core genes related to TMB and correspondingly construct a 5gene-based prognostic model. In addition, we provided an overview of the overall mutation pro le in the HCC sample to explore new possibilities for individualized treatment.

Data download and processing
The public genomic and matched clinicopathological data related to HCC were downloaded from TCGA databases (https://portal.gdc.cancer.gov), and the genomic data format was FPKM. Some samples were removed due to the loss of data, 357 tumor samples were left to undergo further analysis. Similarly, the somatic mutation data of HCC were obtained from the TCGA databases and visualized with the "maftool" package in R software 12 .

TMB calculation and grouping
TMB is the sum of substitutions, insertions, and deletions per megabyte in the exon in a tumor sample. So we calculated TMB by variants number/ the exon length (usually about 38 million) via Perl scripts (Table  S1) . X-tile was used to nd the optimal cut-off value of TMB combined with patient OS rate, then de ned anyone higher than the threshold as TMB-H, lower one as TMB-L ( Figure S1).

Differentially expressed genes acquiring
Then we used the R package "limma" to process the data and screen out differentially expressed genes (DEGs) associate with TMB. All DEGs were identi ed using |log2 FC| > 1 and false discovery rate (FDR) < 0.05, and the"pheatmap" package was utilized for hierarchical clustering.

Functional enrichment analysis of DEGs
The gene ontology (GO) pathway enrichment analysis and KEGG pathways analysis of DEGs were processed via R package "org. Hs. eg. db". "clusterPro ler" and "ggplot2" package were used for annotation and drawing 13 . Then, we conducted GSEA enriched GO to explore the underlying pathways signi cantly in two groups with p< 0.05.

Construction of TMB Prognostic index (TMBPI)
The OS rate of DEGs was analyzed using the GEPIA database (http://gepia.cancer-pku.cn/) 14 . Then we screened out the hub genes according to logrank p < 0.05. After that, we divided all the patients into high-and low-risk groups based on the risk score (RS). K-M curves were applied to present the relationship between RS levels and OS. Finally,We drawn the survival receiver operating characteristic curve (ROC) and caculate the area under the curve (AUC) for assessing the value of prognosis model.

Statistical analysis
R Studio(https://www.rstudio.com) was used for the statistical analysis 15 . OS was calculated by K-M and log-rank test methods.The comparison of continuous variables in two groups were performed by Student's t-test or one-way analysis of variance. p value<0.05, the difference was statistically signi cant.

Overview of TMB in HCC
First of all, we assessed the mutations of each HCC sample in TCGA cohort to offer insights into the factors linked to HCC mutagenesis. We found that that] SNP, missense mutations, and C > T mutation were most frequent varient type in HCC. C > T variation was the highest mutation type reached 15420 (Fig. 1). The last bar chart showed the top 10 mutated genes in HCC, and the results revealed that the variation of TP53, TTN, CTNNB1, MUC16, PCLO, and ALB, RYR2, ABCA13, MUC4, APOB (p < 0.001) were higher.

Waterfall and interrelationship of top genes
In view of the MutSigCV algorithm, the waterfall diagram mapped and characterized the overview of HCC somatic mutations( Fig. 2A). Interrelationship plot was used to reveal the correlation between the expression of top 20 mutated genes, the results showed that the genes pairs of OBSCN and CTNNB1or FLG(p < 0.001) LRP1B and TTN or APOB, MUC16 and SPTA1 or ABCA13 or RYR2, PCLO and CACNA1E or HMCN1, RYR2 and XIRP2, APOB and LRP1B, CSMD3 and FLG, LRP1B and CACNA1E, HMCN1 and SPTA1(p < 0.05) were co-expression. In addition, AXIN1 and CTNNB1 were mutually exclucised(p < 0.05) (Fig. 2B). 3.3 TMB and clinical relevance X-tile was utilized to determine the optimal cut-off value at 4.61 (Fig. S1). Then we divided 357 samples into TMB-H and TMB-L groups based on this value. The scatter chart showed the distribution of the two groups of data. The mean value of TMB-H group and TMB-L group are 8.19 and 2.79 (Fig. 3A). The results of K-M analysis showed that TMB was related to prognosis(p < 0.05). Consistent with previous studies, patients with high TMB tend to have poor prognosis 16 (Fig. 3B, p < 0.05). As showed in Fig. 3C-E, the relationship between TMB and clinicopathological features including patients' gender, age, and tumor grade (tumor cell differentiation) and TNM staging revealed that TMB was connected to N stage (p < 0.05), age (p < 0.05), and gender (p < 0.05) in HCC patients. Older (age > 65) men without lymphatic metastasis tend to show higher TMB. But no association of TMB levels with other clinicopathologic characteristics was identi ed, such as T/M stage, clinical staging or grading( Figure S2).

Enrichment analysis for the DEGs
39 DEGs associated with TMB in all were identi ed(Table. S2). Fig. 4A showed the hierarchical clustering heatmap. The main biological process (BP), molecular function (MF) and cellular component (CC) associated with hub gene are presented in Fig. 4B. KEGG pathway analysis revealed that the DEGs were primarily enriched in the extracellular matrix organization, extracellular structure organization, sulfur compound biosynthetic process and sulfur compound metabolic process (Fig. 4C). The results of GSEA analysis revealed that RNA metabolic processes were enriched in TMB-H group, while other pathways including regulation of supramolecular ber organization, regulation of organelle organization, positive regulation of organelle organization, actin lament organization, regulation of cytoskeleton organization, Actin polymerization or depolymerization, regulation of actin lament-based process and regulation of actin lament organization enrichment in TMB-L group (Fig. 4D).

Cox regression analysis
Analysis on OS demonstrated that hub genes (SFRP4, IL7R, FBLN2, COLEC10, and CHGA) (Fig. 5) in the model might be regarded as independent prognostic factors for HCC.Based on these ve hub genes, univariate and multivariate cox regression analyses were utilized to establish a prognosis model, ROC was utilized to verify the accuracy of the model. We downloaded the transcriptome data from 357 HCC matched with clinical information by R package "merge". Derived from the multivariate Cox regression model, we calculated the TMBPI as the following formula: PI = (0.08894538×SFRP4−0.00991219×COLEC10−0.00069051×CHGA−0.02461234×FBLN2−0.16547922×IL7R). The RS for each individual was obtained by TMBPI, and then we divided the sample into high and low-risk groups based on the median of the risk score (Table. S3). the AUC of 1-, 3-and 5-yaer survival rate were 0.64, 0.67 and 0.59 (Fig. 6B). And the results exhibited that the patients in the high-risk group tend to have a worse prognosis(P<0.001) (Fig. 6C).

Discussion
HCC is a highly aggressive disease and consistent with megascopic heterogeneity. Some studies have found that several signaling pathways are speci cally disrupted in it. For instance, PI3K/AKT and IKK/NF-κB pathway were activated in HCC so that promote cell proliferation and induce epithelial-mesenchymal transformation 17,18 . In addition, the immune system takes a crucial part in tumor development. Marina et al.
have observed that the activation of CTNNB1 pathway promoted immune evasion and drug resistance 19 .
As a pivotal factor for the immune system to recognize and remove hepatoma cells, TMB affects the prognosis of patients before and after treatment a lot. Moreover, a signi cant correlation between TMB and the objective response rate has been observed in various cancers 20 . In present study, we found the missense mutation was the most frequent type in HCC. Studies have shown that missense mutations in TP53 differentiate HCC into different subtypes and may contribute to the progression of the disease 21 . SNP variants account for the vast majority in HCC, and in our study, the number of C > T variation reached 15,420, which is linked closely to the occurrence and development of HCC 22,23 .
Functional analysis results of TMB-associated DEGs showed they were mainly relateded to the extracellular matrix organization and sulfur compound metabolic process. Extracellular matrix is one of the components of the immune microenvironment of HCC, some achievements have been made in the application of targeted drug delivery 24 . The metabolism of elemental sulfur may be closely related to autophagy in HCC 25 .
Moreover, we found a distinctly inverse association between the activity of RNA metabolic process and TMB. An cummulating evidence pointed out the regulatory effect of different non-coding RNAs (ncRNAs) were obviously in various etiologies, including HBV, HCV and NAFLD, and eventually lead to HCC [26][27][28] . This perhaps one of the most reason for the discouraging overall prognosis in TMB-H group.
Then we combined the TMB-associated DEGs and HCC patients' OS rate for analysis. As the hub genes, SFRP4, IL7R, FBLN2, COLEC10 and CHGA were screened out which were reported to play an important role in HCC. SFRP4 is located in chromosome 7p14.1, it contains a cysteine-rich domain. This domain can regulate the Wnt signaling pathway by directly binding to Wnt thus form silencing complexes and suppress HCC 29 . IL-7R have a place in T cell differentiation and lymphocyte development 30 . The up regulation of IL-7R may active the intracellular pathways, induce expression of associated molecules, and promote the proliferation and migration of hepatoma cells 31 . COLEC10 codes Collectin Liver 1 (CL-L1), a member of the C lectin family 32 . Studies have shown that COLEC10 is regulated by miR-452-5p, promotes proliferation, invasion and migration in hepatoma cells 33 . CHGA (or CGA) is highly expressed in HCC samples, previous study showed it can be used as an auxiliary diagnostic tool for HCC 34 . In addition, risk model was introduced based on RS. This study showed signi cant statistics difference of prognosis between the FBLN2 high and low expression groups. However, the role of FBLN2 in liver cancer and its application value have not been reported, which may be a target worthy of exploration. Based on the above 5 genes, multivariate Cox regression model analysis of speci c overall probability was constructed in each patient. In the current study, we accurately predicted the possibility of OS in HCC patients. The AUC of 1-, 3-and 5-year survival rates were 0.64, 0.67 and 0.59 respectively. The model also showed a high degree of consistency when combined with the clinical data of our center. As far as we know, the joint analysis of ve biomarkers, which have not been put forward before, will provide better clinical indications for patient to explore novel prognostic factors in HCC. Viewed from one perspective, our model dispense with identifying somatic mutations altogether in patients. Importantly, we aim to adopt a more e cient approach to signi cantly reduce the cost of sequencing and make hub gene-targeted sequencing technology more routine. Based on the expression of these ve genes in patients, we intend to calculate the RS, which may prove useful in indicating patient outcomes for those who were not appropriate to undergo surgical treatments. Viewed from another perspective, accurate prognostic assessment is the crux of appropriate treatment. In routine clinical operation, pathological staging is essential for oncologists and patients to evaluate HCC prognosis. Due to lack of adequate clinical training for liver biopsies, a mass of patients had to be evaluated preoperatively by imaging(e.g, CT, MRI) or routine biochemical indexes(e.g, AFP, CEA, CA199). In this study, ve biomarkers, less-invasive and more exible, would be detected based on body uid(e.g, blood, urine) to prognostic prediction. Meanwhile, it was the rst time that TMB,DEGs with HCC clinical data are put together in our model for predicting patient outcomes. Utilizing molecular analysis combined with pathological staging would predict more accuracy than traditional systems for HCC patients in the era of precision medicine.
The substantial predictive ability and value of the genomic model for HCC patients should be recognized, we should also be aware of the several limitations in our study. First, unfortunately, no information about immunotherapy for each patient was retrieved in current study, which limited our study of TMB mediated changes in the HCC immune microenvironment. Additionally, because only a subset of the tumors in the TCGA project had both clinicopathological and TMB data available for analysis, the present object is limited to a small sample. Finally, lackness of information about the whole exon sequencing for HCC patients in our unit, We don't have much clinical data on our own to test this model. Thus, the e ciency of the prognosis model deserves more external validation using independent cohorts in the future.

Conclusion
In conclusion, we established a novel 5-gene-based prognosis model for outcome prediction for individual patient in HCC. This model might be a promising tool for risk strati cation of prognosis in HCC.

Authors' contributions
Shujun Zhou designed and mainly executed this project. Qingling Li nished writing and modifying the R language code, and was a major contributor in writing the manuscript. Bin Yu, Qifa Ye and Yanfeng Wang guided the idea of the project and revised the manuscript. All authors read and approved the nal manuscript. Figure 1 Overview of TGCA HCC cohort mutations.   Supplementary Files