Identification of Key Genes in HCV-induced HCC at Early Stage on the Background of HCV-cirrhosis

Background: Hepatocellular carcinoma (HCC) is one of the most common and deadly malignant tumors worldwide. Hepatitis C virus (HCV)-induced cirrhosis (HCV-cirrhosis) is one of the leading causes of HCC. Because effective biomarkers are lacking, most patients with HCV-induced HCC (HCV-HCC) are already at an advanced stage when they are first diagnosed. Our study aims to use a transcriptomic profiling dataset to identify potential biomarkers that can predict HCV-HCC at an early stage on the background of HCV-cirrhosis.
Methods: A dataset from Gene Expression Omnibus that includes HCV-cirrhosis and HCV-HCC subjects at different stages was analyzed to identify a gene signature related to HCV-HCC on the background of HCV-cirrhosis. A multiple-gene risk score prediction model was established to predict HCV-HCC samples at different stages, and receiver operating characteristic (ROC) curve analysis was used to select the best cutoff value of the risk score and to evaluate the prediction model. Samples with risk scores higher than the cutoff value were defined as the high score group.
Results: A highly clustered module of 20 genes was identified, and the expression of all 20 genes increased gradually as disease progressed from HCV-cirrhosis to very advanced HCC. With the ROC-derived cutoff, the prevalence of HCV-HCC in the high score group was 100% at every stage. The risk score prediction model based on the 20-gene biomarker achieved an accuracy of at least 95.2% with an area under the ROC curve (AUC) of at least 0.94. To make the biomarker easier and cheaper to implement in practice, we tried to reduce the number of genes. Interestingly, CDKN3 and FAM83D were significantly decreased in HCV-cirrhosis tissues compared with normal tissues and then increased as HCV-HCC progressed. In the prediction model based on this 2-gene biomarker, the prevalence of HCV-HCC in the high score group was still 100%, and the model achieved a predictive accuracy of over 95.2% with an AUC over 0.96, a slightly better performance than the 20-gene model.
Conclusion: Our study identified a 2-gene biomarker that could detect HCV-HCC at earlier stages on the background of HCV-cirrhosis. It might be a promising biomarker for early diagnosis of HCV-HCC and a potential novel treatment target.


Background
Hepatocellular carcinoma (HCC) is a primary malignant tumor of the liver that threatens human health. HCC mainly occurs in patients with chronic liver disease (often cirrhosis), largely resulting from chronic hepatitis B virus (HBV) and/or hepatitis C virus (HCV) infection [1,2]. With the introduction of universal immunization to eliminate transmission of HBV infection, HBV-induced HCC has been largely prevented [3]. However, the incidence of HCV-induced HCC (HCV-HCC) has increased fifteen- to twenty-fold over a 30-year period [4]. The mortality of HCV-HCC is still very high, as the incidence of HCV-induced cirrhosis (HCV-cirrhosis) continues to increase [5]. Once HCV-cirrhosis is established, the risk of developing HCC is about 1-4% annually [6]. The 5-year survival rate of HCC patients diagnosed at an early stage exceeds 70% with effective surgical treatment, while patients who present at an advanced stage have a median survival time of less than 12 months [7]. Early detection of HCC therefore has important implications for clinical outcomes, and patients with HCV-cirrhosis should undergo surveillance to increase early detection of HCC and improve survival [8].
However, because accurate and effective biomarkers for the early stage of HCV-HCC on the background of cirrhosis are lacking, most patients with HCC are at advanced stages when they are first diagnosed. To improve the early diagnosis rate of HCV-HCC, identification of biomarkers with high specificity and sensitivity is strongly required and is an active field of research. At present, however, there are no molecular biomarkers that precisely predict HCC [9]. It is very difficult to diagnose HCC with a single gene, while several studies have shown that multiple genes could help to predict the occurrence of HCC [10][11][12]. In the past, studies mainly focused on identifying biomarkers by comparing gene expression profiles between tumor and non-tumorous tissues [11][13][14][15], but no study has examined HCV-HCC occurrence on the background of HCV-cirrhosis. It is therefore promising to identify a multi-gene biomarker that predicts HCV-HCC at an early stage on the background of HCV-cirrhosis. There are now rich public repositories of RNA-seq and microarray datasets, such as the Gene Expression Omnibus (GEO) database [16] and The Cancer Genome Atlas (TCGA) [17]. Further analysis of those datasets, combined with computational approaches such as prediction models, provides the opportunity to establish novel biomarkers for disease diagnosis with high accuracy [18,19].
In our study, an open-source dataset from GEO containing HCV-HCC tissues at different stages and HCV-cirrhosis tissues was included. We found 20 HCV-HCC-related genes that were significantly up-regulated, and their expression increased progressively from HCV-cirrhosis to very advanced HCC. More importantly, the risk score prediction model based on the 20 genes achieved a predictive accuracy of over 95% and an area under the receiver operating characteristic curve (AUC) over 0.94, indicating very good power of the 20-gene biomarker to distinguish HCV-HCC at earlier stages from HCV-cirrhosis. To reduce the number of genes for better clinical application in the future, we investigated the role in predicting HCV-HCC of CDKN3 and FAM83D, which were, interestingly, decreased in HCV-cirrhosis tissues compared with normal tissues. We were surprised to find that this two-gene biomarker showed excellent diagnostic ability: the risk score prediction model achieved a predictive accuracy of over 95.2% with an AUC over 0.96. Therefore, this two-gene biomarker might be a promising biomarker for early diagnosis of HCV-HCC on the background of HCV-cirrhosis and could also be a potential novel treatment target in the future.

Materials And Methods
Retrieval of datasets on HCV-cirrhosis and HCV-HCC from public database
We searched the GEO database for transcriptome profiles of HCV-cirrhosis and HCV-HCC tissues. The search terms included "HCV", "cirrhosis" and "HCC". Our selection criteria were as follows: (1) studies used adult liver tissues; (2) studies included HCV-cirrhosis tissues and HCV-HCC tissues at different stages; (3) microarray or RNA-seq datasets.
According to the retrieval criteria, the open-source dataset GSE6764 from the GEO database was included [20]. Expression values for all genes in the dataset were log2-transformed and normalized. Demographics for GSE6764 are summarized in Table 1. HCV-cirrhosis tissues and very advanced HCV-HCC tissues were defined as the discovery group. There were four validation groups: validation group-1 (advanced HCC tissues and cirrhosis tissues), validation group-2 (early HCC tissues and cirrhosis tissues), validation group-3 (very early HCC tissues and cirrhosis tissues) and validation group-4 (normal tissues and cirrhosis tissues). The workflow for the analysis of the dataset is illustrated in Fig. 1.
Identification of gene signature related to HCV-HCC on the background of HCV-cirrhosis
HCV-HCC-related genes were defined as highly clustered differentially expressed genes (DEGs) between HCV-cirrhosis and very advanced HCV-HCC tissues, which might have pivotal implications in driving HCV-cirrhosis to HCV-HCC. First, we used the GEO2R online tool from GEO to determine the DEGs between cirrhosis and very advanced HCC tissues, with cutoff values of adjusted p-value < 0.05 and |logFC| ≥ 4. Next, the STRING database 11.0 (https://string-db.org/) combined with Cytoscape 3.8.0 was used to show the interactions among the DEGs. The MCODE plug-in in Cytoscape was used to find the highly clustered DEGs, which formed the gene signature related to HCV-HCC on the background of HCV-cirrhosis. The cutoff criteria were 'maximum depth = 100', 'KCore = 2', 'node score cutoff = 0.2', and 'degree cutoff = 2' [21].
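The DEG thresholds applied before the network analysis can be illustrated with a minimal Python sketch. It is not the GEO2R workflow itself, which runs online; the record layout (`gene`, `adj_p`, `logFC`), the function name `filter_degs` and the gene names and values are hypothetical and invented only to show how the two cutoffs combine.

```python
def filter_degs(rows, max_adj_p=0.05, min_abs_logfc=4.0):
    """Keep records with adjusted p-value < max_adj_p and |logFC| >= min_abs_logfc."""
    return [
        r for r in rows
        if r["adj_p"] < max_adj_p and abs(r["logFC"]) >= min_abs_logfc
    ]


# Invented example records; these are NOT values from GSE6764.
rows = [
    {"gene": "GENE_A", "adj_p": 0.001, "logFC": 4.6},   # passes both cutoffs
    {"gene": "GENE_B", "adj_p": 0.20, "logFC": 5.1},    # fails the p-value cutoff
    {"gene": "GENE_C", "adj_p": 0.01, "logFC": -4.0},   # passes (down-regulated)
    {"gene": "GENE_D", "adj_p": 0.03, "logFC": 2.2},    # fails the fold-change cutoff
]

degs = filter_degs(rows)
deg_names = [r["gene"] for r in degs]
print(deg_names)
```

Because the fold-change cutoff is applied to the absolute value, both strongly up-regulated and strongly down-regulated genes are retained at this step.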
To visualize the expression pattern of the gene signature, heatmaps of gene expression were generated using R 4.0.1 to differentiate HCV-HCC at each stage from HCV-cirrhosis. Principal component analysis (PCA) was then used to investigate the classification performance of the gene signature, visualized in 3D plots with the scatterplot3d package in R.
Establishing multiple-gene based prediction model
To verify that the gene signature could distinguish between HCV-cirrhosis and HCV-HCC, a multiple-gene risk score prediction model was established. Samples with a risk score above the median were assigned to the high score group, and the remaining samples to the low score group.
The risk score was calculated as [22]:

$$\text{Risk score} = \sum_{i=1}^{n} w_i \, \frac{e_i - u_i}{s_i}$$

where n is the number of genes, w_i is the weight of the i-th gene (Table 2), e_i is the expression level of the i-th gene, and u_i and s_i are the mean and standard deviation of the i-th gene across all samples. The receiver operating characteristic (ROC) curve was used to evaluate how well the gene signature-related risk scores could distinguish between HCV-HCC at different stages and HCV-cirrhosis, using the pROC package in R. We also used ROC curve analysis to determine the cutoff value of the risk score. Samples with a risk score above this cutoff were then assigned to the high score group, and the remaining samples to the low score group.
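A small Python sketch may make the scoring and cutoff steps concrete. It assumes that the risk score is the weighted sum of per-gene standardized expression values, as the symbol definitions suggest. The study itself used the pROC package in R; the code below is a stand-in, not that package. Two further assumptions are labeled in the code: the sample standard deviation is used for s_i, and the "best" cutoff is taken to be the threshold maximizing Youden's J (sensitivity + specificity − 1), since the text does not name the criterion. Gene names, weights and expression values are invented.

```python
from statistics import mean, stdev


def risk_scores(expr, weights):
    """Weighted sum of z-scored expression per sample.

    expr: dict gene -> list of expression values (one per sample)
    weights: dict gene -> weight w_i
    """
    n_samples = len(next(iter(expr.values())))
    scores = [0.0] * n_samples
    for gene, values in expr.items():
        u = mean(values)
        s = stdev(values)  # assumption: sample standard deviation
        for j, e in enumerate(values):
            scores[j] += weights[gene] * (e - u) / s
    return scores


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(scores, labels):
    """Threshold maximizing Youden's J (assumed criterion); positive if score > cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ordered = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for c in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= c)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


# Invented toy data: 3 cirrhosis samples (label 0) followed by 3 HCC samples (label 1).
expr = {"GENE_A": [1, 2, 3, 7, 8, 9], "GENE_B": [2, 1, 3, 6, 8, 7]}
weights = {"GENE_A": 1.0, "GENE_B": 1.0}  # hypothetical weights
labels = [0, 0, 0, 1, 1, 1]

scores = risk_scores(expr, weights)
auc_value = auc(scores, labels)
cutoff, sens, spec = best_cutoff(scores, labels)
print(auc_value, cutoff, sens, spec)
```

Because each gene is standardized across all samples before weighting, the scores are centered on zero, which is why a data-driven cutoff can be negative.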
To investigate the possible mechanism of the gene signature related to HCV-HCC on the background of HCV-cirrhosis, we used the Database for Annotation, Visualization and Integrated Discovery (DAVID 6.8, https://david.ncifcrf.gov) to perform Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the genes in the 20-gene signature. Pathways with a p-value less than 0.05 were selected.

Results
Finding gene signature related to HCV-HCC on the background of HCV-cirrhosis in the discovery group
According to the retrieval criteria, we included the open-source dataset GSE6764 from the GEO database for HCV-cirrhosis and HCV-HCC tissues. We defined highly clustered DEGs between cirrhosis and very advanced HCC tissues as HCV-HCC-related genes, which might have important implications in driving HCV-cirrhosis to HCV-HCC. GEO2R analysis identified 94 DEGs in very advanced HCC compared with cirrhosis in the discovery group (Fig. 2A). Using the STRING database, we found a total of 252 interactions, which were visualized with Cytoscape. One group of genes was highly connected and might play a pivotal role in the network. This highly clustered gene module was then extracted with the MCODE plug-in; it showed high connectivity (cluster score = 19.895) with 20 nodes and 189 edges (Fig. 2B). Notably, all 20 genes were up-regulated in very advanced HCC tissues compared with cirrhosis tissues.
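As a quick consistency check, the reported cluster score can be reproduced from the node and edge counts. The sketch below assumes the usual MCODE definition, in which the score is the cluster density (edges divided by the maximum possible number of edges, without self-loops) multiplied by the number of nodes; the function name is hypothetical.

```python
def mcode_cluster_score(n_nodes, n_edges):
    """Density x node count, assuming an undirected graph without self-loops."""
    max_edges = n_nodes * (n_nodes - 1) / 2
    density = n_edges / max_edges
    return density * n_nodes


# Node and edge counts reported for the 20-gene module.
cluster_score = mcode_cluster_score(20, 189)
print(round(cluster_score, 3))  # 19.895
```

With 20 nodes the maximum is 190 edges, so 189 edges means that the module is one edge short of a complete graph.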
A heatmap was then generated to visualize the expression pattern of the 20 genes. Patients in the discovery group were clearly classified into two groups (Fig. 2C), most of which corresponded to the patients' diagnoses, indicating that the 20-gene signature might be able to differentiate very advanced HCC from cirrhosis. We then performed PCA to verify the classification power of the gene signature. The PCA plot indicated that the 20-gene signature clearly distinguished very advanced HCC from cirrhosis (Fig. 2D). Therefore, this 20-gene signature might have significant power to discriminate very advanced HCC from cirrhosis.
Verifying the gene signature for discriminating HCV-HCC from HCV-cirrhosis using multiple-gene based prediction model
To verify the classification power of the gene signature in distinguishing between very advanced HCC and cirrhosis tissues, we established a multiple-gene prediction model with a scoring system. The score is a weighted linear combination of the expression values of the 20 genes and assigns each sample a measure of risk [22]. The risk scores of very advanced HCC tissues were significantly higher than those of cirrhosis tissues, consistent with our expectations (Fig. 2E). The higher the score, the higher the probability of a diagnosis of very advanced HCC. Next, we used the median risk score to separate the high and low score groups. The prevalence of very advanced HCC was 100% in the high score group and 0% in the low score group. We then used the ROC curve to evaluate this risk score model. The accuracy was 100% with specificity and sensitivity of 100%, and the AUC was 1 (Fig. 2F). Therefore, this risk score prediction model based on the 20-gene biomarker could be used to distinguish very advanced HCV-HCC from HCV-cirrhosis.
Predicting HCV-HCC at earlier stage using the multiple-gene based prediction model in validation groups
The 20-gene signature-related risk score model was able to distinguish HCV-induced very advanced HCC from HCV-cirrhosis. However, we were more interested in the potential role of this 20-gene signature in predicting advanced HCC, early HCC and even very early HCC in the validation groups, which is important for detecting HCC as early as possible. We found that the expression of the 20 genes increased gradually as HCC progressed from cirrhosis to the very advanced stage (Fig. 3), whereas there was no obvious difference between normal tissues and cirrhosis tissues. Therefore, the 20 genes were strongly associated with HCV-HCC progression and might be used to predict HCV-HCC at an earlier stage on the background of HCV-cirrhosis.
Heatmaps were used to display the expression pattern of the 20 genes in the validation groups. Except for validation group-4, in which the expression profile was similar between normal tissues and cirrhosis tissues, the validation groups showed a clearly up-regulated expression pattern of the 20 genes in advanced HCC, early HCC and very early HCC compared with cirrhosis tissues (Fig. 4). We assumed that this 20-gene signature might also be able to discriminate advanced HCC, early HCC or even very early HCC from cirrhosis tissues. PCA plots showed that the 20-gene signature could clearly distinguish advanced HCC tissues, early HCC tissues and very early HCC tissues from cirrhosis tissues, but could not distinguish between cirrhosis and normal tissues.
Next, we used the risk score prediction model to investigate the role of the 20-gene signature in the validation groups. The 20-gene risk scores of advanced HCC tissues were significantly higher than those of cirrhosis tissues and, more importantly, the same was observed for early HCC and very early HCC compared with cirrhosis (Fig. 5A and D). Therefore, this 20-gene signature had the statistical power to predict HCC at an earlier stage. When the median risk score was used to separate the high and low score groups, the prevalence of advanced HCC was 70% in the high score group and 0% in the low score group (Fig. 5B). The prevalence of early HCC and very early HCC in the high score group was 90.9% and 70% respectively, while that in the low score group was 0% and 9.1% respectively. ROC curve evaluation showed that the risk score prediction model based on the 20-gene biomarker achieved accuracies of 100%, 100% and 95.2% for advanced HCC, early HCC and very early HCC respectively, compared with cirrhosis tissues in the validation groups, with AUCs of 1, 1 and 0.94 respectively. ROC analysis showed that the cutoff value of the risk score in the group containing cirrhosis and all HCC tissues was −9.01 (Fig. 5E). We then used −9.01 as the threshold to separate the high and low score groups. The prevalence of advanced HCC, early HCC and very early HCC in the high score group was 100% in every case, while that in the low score group was 0%, 0% and 7.1% respectively (Fig. 5C). Therefore, the risk score prediction model based on the 20-gene biomarker, combined with the ROC-derived cutoff, discriminated HCC at each stage from cirrhosis better than the median split. The 20 genes might be associated with HCV-HCC onset and progression on the background of cirrhosis, and they had the statistical power to predict HCV-HCC at an earlier stage.
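The prevalence figures above follow from a simple grouping step, sketched here in Python under stated assumptions. The function name and the scores and labels are invented for illustration and are not samples from GSE6764; only the threshold −9.01 is taken from the text. A sample is placed in the high score group when its score is strictly greater than the cutoff.

```python
def prevalence_by_group(scores, labels, cutoff):
    """Return the fraction of HCC samples (label 1) in the high and low score groups."""
    high = [y for s, y in zip(scores, labels) if s > cutoff]
    low = [y for s, y in zip(scores, labels) if s <= cutoff]

    def fraction(group):
        return sum(group) / len(group) if group else float("nan")

    return fraction(high), fraction(low)


# Invented risk scores: label 0 = cirrhosis, label 1 = HCC.
toy_scores = [-12.0, -10.5, -9.5, -8.0, -3.0, 4.0, 7.5]
toy_labels = [0, 0, 1, 1, 1, 1, 1]

high_prev, low_prev = prevalence_by_group(toy_scores, toy_labels, cutoff=-9.01)
print(high_prev, low_prev)
```

In this toy case every sample above the cutoff is HCC, while one HCC sample falls below it. This is the same pattern as a 100% prevalence in the high score group combined with a small non-zero prevalence in the low score group.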
Identifying the core genes among the 20-gene signature related to HCV-HCC on the background of HCV-cirrhosis
To explore interactions among the 20 genes, we used the DAVID online tool to analyze the KEGG pathways related to the 20 genes; the top 6 hits are shown in Table 3. The most significant pathway was the cell cycle, which plays an important role in carcinogenesis and tumor progression, consistent with the significant role of cell cycle-related pathogenesis in HCC [23]. Therefore, these genes not only could predict the occurrence of early HCV-HCC but also might be potential therapeutic targets. In our study, CDKN3 and FAM83D were the only two genes whose expression differed significantly between normal tissues and cirrhosis tissues. Interestingly, the expression of these two genes was decreased in HCV-cirrhosis tissues compared with normal tissues and gradually increased as HCV-HCC developed. CDKN3 and FAM83D are important regulators of the cell cycle that have nevertheless received little attention, especially in HCV-HCC occurrence and progression on the background of HCV-cirrhosis [24][25][26]. We hypothesized that CDKN3 and FAM83D play a more significant role in HCV-HCC occurrence and progression and might be better biomarkers for early detection of HCV-HCC on the background of HCV-cirrhosis.
Investigating the role of risk score-related prediction model based on CDKN3 and FAM83D
To make the biomarker easier and cheaper to implement in practice, we tried to reduce the number of genes in the 20-gene biomarker. We established a risk score prediction model based on CDKN3 and FAM83D, motivated by the finding that the two genes were first down-regulated in HCV-cirrhosis compared with normal tissues and then increased as HCV-cirrhosis progressed to HCV-HCC.
The CDKN3- and FAM83D-based risk scores of very advanced HCC tissues were significantly higher than those of cirrhosis tissues and, more importantly, the same was observed for advanced HCC, early HCC and very early HCC compared with cirrhosis (Fig. 6A). The prediction model based on this 2-gene biomarker achieved a predictive accuracy of over 95.2% with an AUC over 0.96, a slightly better performance than that of the 20-gene biomarker (Fig. 6B). Therefore, in this dataset, limiting the signature to only two genes (CDKN3 and FAM83D) slightly improved the performance of the multiple-gene risk score prediction method. ROC analysis showed that the cutoff value of the risk score in the group containing cirrhosis and all HCC tissues was −1.114. We then used −1.114 as the threshold to separate the high and low score groups (Fig. 6C-6F). The prevalence of very advanced HCC, advanced HCC, early HCC and very early HCC in the high score group was 100% in every case, while that in the low score group was 0%, 0%, 0% and 7.1% respectively. Therefore, the 2-gene biomarker could predict HCV-HCC well at an earlier stage on the background of HCV-cirrhosis and might be a potential treatment target in the future.

Discussion
HCC is one of the most common and deadly malignant tumors worldwide [27]. HCV infection is a pivotal cause of cirrhosis, which significantly increases the incidence of HCC [5]. In patients with HCV-HCC, morbidity and mortality remain high because of late diagnosis and poor outcome. Therefore, there is an urgent need to find potential markers for early diagnosis and treatment on the background of cirrhosis. Reliable animal models of HCV-HCC are difficult to establish. Meanwhile, liver biopsy is the most reliable method for the diagnosis of liver cancer, but it is an invasive test with potential complications. At present, a large amount of microarray and RNA-seq data has been published, but these data have not been analyzed sufficiently to provide further guidance for clinical and scientific research. In our study, we used the HCV-cirrhosis and HCV-HCC tissue data from GSE6764 to compare gene expression changes between HCC tissues at different stages and cirrhosis tissues. We found 20 hub genes that might be able to identify asymptomatic HCV-HCC patients at a very early stage; more importantly, these genes could be used for screening HCV-HCC with relatively high accuracy. To make the biomarker easier and cheaper to implement in practice, we reduced the 20-gene biomarker to two genes (CDKN3 and FAM83D). The prediction model based on this 2-gene biomarker performed slightly better than that based on the 20-gene biomarker. Therefore, CDKN3 and FAM83D might play a significant role in HCV-HCC occurrence and progression, and this 2-gene biomarker might be used in the future to predict HCV-HCC at an earlier stage and as a new potential treatment target.
First, the group of HCV-cirrhosis and very advanced HCV-HCC tissues was defined as the discovery group. A highly clustered module of 20 genes was identified in the discovery group. We established the risk score prediction model based on the 20-gene signature in the discovery group to investigate its discriminating ability. The risk score in the very advanced HCC tissues was significantly higher than that in cirrhosis tissues. Interestingly, the prediction model achieved an accuracy of 100% with an AUC of 1. Therefore, the 20-gene signature could be used to distinguish very advanced HCC from cirrhosis, and this multiple-gene risk score prediction model might be used to investigate the role of the 20-gene signature in HCV-HCC at earlier stages.
Next, we were more interested in the potential role of this 20-gene signature in predicting HCC at earlier stages in the validation groups, which is important for detecting HCC as early as possible. The results showed that the expression of the 20 genes increased progressively from cirrhosis to advanced HCC stages. Heatmap and PCA plots showed that these genes could discriminate advanced HCC, early HCC or even very early HCC from cirrhosis tissues. The risk score-related prediction model based on the [...] used for early detection of HCV-HCC. When we then used the ROC-derived cutoff value of the risk score to separate the high and low score groups, the prevalence of HCV-HCC in the high score group was 100% at every stage. The multiple-gene risk score prediction model could predict HCV-HCC well at an earlier stage on the background of HCV-cirrhosis, and it might also be used in other studies to investigate the predictive role of multiple-gene biomarkers.
However, it is better to limit the number of genes in a multiple-gene biomarker so that it is easier and cheaper to implement in practice. Among the 20 genes, interestingly, the expression of CDKN3 and FAM83D was decreased in cirrhosis tissues compared with normal tissues and gradually increased during HCC progression. Therefore, these two genes might have better predictive ability to identify HCV-HCC at an early stage on the background of HCV-cirrhosis. Consistent with our finding, CDKN3 and FAM83D are important regulators of the cell cycle, and studies have shown that they are highly expressed in tumor tissues compared with normal tissues and that overexpression of CDKN3 and FAM83D is correlated with poor outcome in HCC patients [24][29][30][31][32][33]. However, studies focusing on the role of these two genes in HCV-HCC occurrence and progression on the background of HCV-cirrhosis are limited. We therefore established the risk score prediction model based on CDKN3 and FAM83D. The prevalence of HCV-HCC in the high score group was still 100%, and the model achieved a predictive accuracy of over 95.2% with an AUC over 0.96, a slightly better performance than that of the 20-gene biomarker. Therefore, CDKN3 and FAM83D might play a significant role in HCV-HCC occurrence and progression, and this 2-gene biomarker was able to predict HCV-HCC at an earlier stage with high accuracy using the risk score prediction model.
In the past, studies analyzing HCC-related genes usually included patients with HCC caused by various factors, such as chronic hepatitis B or C virus infection and alcoholic liver disease [18,19,34]. Research that addresses only HCV-related liver cancer on the background of HCV-cirrhosis is limited, so we re-analyzed the data in GSE6764. We investigated HCC at different stages on the background of cirrhosis, all of which were caused by hepatitis C. This helped us to understand the development of HCV-HCC from cirrhosis and the process of HCC progression. Unlike other studies, we included normal liver tissues instead of liver tissues adjacent to tumors, which allowed us to compare hub gene expression between cirrhosis and normal tissues and to investigate whether the genes also played a role in the development of cirrhosis. No single biomarker has adequate sensitivity or specificity for HCC diagnosis. A growing number of studies use genome-wide expression analysis to predict and diagnose diseases [35,36]. Compared with single-gene biomarkers, our multiple-gene risk score prediction model achieved high accuracy with high sensitivity and specificity, and it could also be used in other studies.
Taken together, our study found 20 hub genes related to the progression of HCV-HCC on the background of HCV-cirrhosis and, more importantly, the risk score prediction model based on the 20 genes achieved high accuracy in discriminating HCV-HCC at an earlier stage on the background of HCV-cirrhosis. We went further to identify a CDKN3- and FAM83D-based biomarker that could have better predictive accuracy in discriminating HCV-HCC at an earlier stage from HCV-cirrhosis. Much research focuses on developing novel biomarkers for HCC, but few studies focus on HCV-HCC, and further work is needed before novel biomarkers can meet real clinical demand. Identification of cost-efficient novel biomarkers with predictive techniques, such as the risk score prediction model used in our study, is promising for the detection of diseases. In this study, only GSE6764 met the screening criteria, which currently makes it difficult to verify our results in other datasets. Further work on the expression profile and predictive effect of this 2-gene signature in HCV-HCC on the background of HCV-cirrhosis needs to be performed with a larger sample size.

Conclusion
In conclusion, this study found that 20 up-regulated hub genes were closely related to the entire spectrum of development of HCV-HCC on the background of HCV-cirrhosis. A multiple-gene risk score prediction model was established to investigate the role of the 20-gene biomarker as well as the CDKN3- and FAM83D-based biomarker in identifying HCV-HCC at an earlier stage. Interestingly, CDKN3 and FAM83D, which are both important regulators of the cell cycle, predicted HCV-HCC at an earlier stage on the background of HCV-cirrhosis with better performance in the multiple-gene risk score prediction model. Therefore, the CDKN3- and FAM83D-based biomarker might be used in the future to predict HCV-HCC at an earlier stage on the background of HCV-cirrhosis and as a new potential treatment target in HCV-HCC.
Availability of data and materials
The data of this study are from the GEO database (GSE6764). The data that support the findings of this study are available from the corresponding author upon reasonable request.