Identication of a 9-gene signature related to stemness for prognosis in hepatocellular carcinoma basing on weighted gene coexpression network analysis

Cancer stem cells (CSCs) are heterogeneous with self-renewal and differentiation ability. The mRNA expression-based stemness index (mRNAsi) described the similarity between tumor cells and CSCs, which is positively associated with the poor prognosis of cancer patients. However, the key prognostic genes related to mRNAsi in hepatocellular carcinoma (HCC) remains unclear.

through WGCNA (Bai et al., 2020). In the present study, we identi ed a gene signature related to prognosis and mRNAsi in HCC, and analyzed the relationships between those mRNAsi and prognosis-related genes and clinical characteristics, immune subtypes, and tumor microenvironment.

Data download and preprocessing
The RNA-Seq data and the corresponding clinical information of 374 liver HCC (LIHC) and 50 normal liver tissue samples were downloaded from The Cancer Genome Atlas TCGA databases (https://www.cancer.gov/) in November 2019. The human.gtf was downloaded from the ensemble website (https://asia.ensembl.org/info/website/upload/gff.html). The mRNAsi was obtained from a previously published study (Malta et al., 2018). Immune subtypes were downloaded from the xena browser (https://xenabrowser.net/datapages/).

Establishment of WGCNA modules and identi cation of stemness index and prognosis-related genes
The WGCNA R package (Langfelder and Horvath, 2008a) was used to construct a co-expression network.
In the rst, Limma R package (Ritchie et al., 2015) was used to identify the differentially expressed genes (DEGs) with selection criteria: false discovery rate (FDR) < 0.05, and |log2FC| > 1. After deleting the outliers by setting the abline value to 20000, DEGs were used to construct a weighted co-expression network, which is satis ed scale-free topology (soft threshold β = 9; scale-free R 2 = 0.89). The average linkage hierarchical clustering is established by hierarchical clustering based on topological overlap matrix (TOM)-based dissimilarity. Gene clustering tree is obtained to identify the mRNAsi-related gene module and epigenetically regulated mRNAsi (EREG-mRNAsi)-related gene module (minimum gene number size of 50, P < 0.05). Once the modules of interest were selected, hub genes in corresponding modules and gene expression pro les were screened out by thresholds of gene signi cance (cor. GS) > 0.3, module membership (cor. MM) > 0.7, and P < 0.05. Survival analysis of hub genes was performed by the survival(T., 2020) and survminer R package (Kassambara., 2020) with the Kaplan-Meier (KM) method.
The genes signi cantly associated with overall survival (OS) (P < 0.05) were screened out and their receiver operating characteristic curves (ROCs) were performed by using the survival ROC R package (J., 2013). The genes in the module with the highest correlation with mRNAsi and prognosis were identi ed as mRNAsi and prognosis-related genes.

GO and KEGG Analysis
To evaluate the biological functions of mRNAsi and prognosis-related genes, we used the clusterPro ler R package (Yu et al., 2012) for GO functional annotations and KEGG pathway analysis. The threshold values were as follows: p < 0.05, and FDR < 0.05.
To verify the co-expression of mRNAsi and prognosis-related genes, we used the corrplot R package (Wei and Simko, 2013) to obtain the correlation between the expression levels of each gene, as well as the Person's correlation coe cient.
Clustering of tumor samples basing on the hub gene-signature, and comparison of OS and clinical characteristics To further analyze the role of mRNAsi and prognosis-related genes in HCC, we clustered the tumor samples into two subgroups by ConsensusClusterPlus R package (Wilkerson and Hayes, 2010) according to the expression levels of those hub genes, and performed PCA analysis using the limma R package (Ritchie et al., 2015) and visualized by the ggplot2 R package (Villanueva and Chen, 2019). Then, The difference in OS between the two groups was compared using the survival R package(T., 2020). The differences in clinical characteristics (age, gender, T1-T4, N, M, stage I-IV, and grade) between the two groups were also analyzed. For comparison, t-test was used for numerical variables and the nonparametric test was performed for categorical variables. The survival and forestplot R packages (Lumley., 2020; were used for univariate Cox regression analysis, and the survival R package(T., 2020) was used for multivariate regression analysis.

Gene set enrichment analyses (GSEA)
GSEA is a computational method that determines whether an a priori de ned set of genes shows statistically signi cant, concordant differences between two biological states (Subramanian et al., 2005). GSEA v4.0.3 for Windows was used. The gene sets including c5.all.v7.0.symbols.gmt and c2.cp.kegg.v7.0.symbols.gmt were downloaded from the Molecular Signatures Database (http://software.broadinstitute.org/gsea/msigdb/index.jsp). The normalized enrichment score (NES) was acquired by 1000 permutations. A gene set is considered to be signi cantly enriched when P-value is < 0.05 and false discovery rate (FDR) is < 0.05.
Relationships between each mRNAsi and prognosis-related gene and clinical characteristics, immune subtypes, and tumor microenvironment The correlations between each hub gene in mRNAsi and prognosis-related gene pro le and clinical characteristics in HCC were analyzed using the beeswarm R package (Eklund., 2016). The in ltration of immune and stromal cells in HCC tumor was evaluated by immune and stromal scores using the Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data (ESTIMATE) algorithm (Yoshihara et al., 2013), and HCC purity was assessed by ESTIMATE score. Correlations between gene panel and these scores were calculated using Spearman's correlation analysis, correlations between gene expression and immune subtypes (Thorsson et al., 2018) in the tumor microenvironment were analyzed using an ANOVA model, and the correlation between HCC purity and mRNAsi-related gene expression was detected by Spearman's correlation test. Immune subtypes including C1 (wound healing), C2 (IFN-γ dominant), C3 (in ammatory), C4 (lymphocyte depleted), C5 (immunologically quiet), and C6 (TGF-β dominant) were characterized as previously described (Thorsson et al., 2018).

Expression of hub genes in HCC in Human Protein Atlas (HPA)
The protein levels of mRNAsi and prognosis-related genes in HCC were veri ed by pathological staining images from the HPA database (https://www.proteinatlas.org/).

WGCNA analysis of DEGs basing on mRNAsi
The bioinformatics analysis was performed to identify the DEGs between 50 normal (non-tumor) samples and 374 LIHC samples obtained from TCGA. As shown in Fig. 1, there are a total of 1146 signi cantly downregulated and 5593 signi cantly upregulated genes (Supplemental Table 1). To further analyze the DEGs obtained in the study which the highest correlated with mRNAsi, WGCNA was constructed to identify important gene modules (Fig. 2). After eliminating 22 outliers, the samples were clustered into 325 1-cluster and 27 2-clusters ( Fig. 2A). Then, the 1-cluster was selected, and WGCNA analysis was performed on 7667 DEGs, and nally, 13 modules were obtained after merging the similarity modules (Fig. 2B). Among the modules, the grey module was not signi cantly correlated with EREG-mRNAsi, but other modules were signi cantly correlated with mRNAsi and EREG-mRNAsi (Fig. 2C). Among the 6 signi cant negative modules, the magenta module was most signi cantly correlated with mRNAsi, with a correlation coe cient of -0.71 (P < 0.001). Among the 6 signi cant positive modules, the blue module was signi cantly correlated with mRNAsi, with the biggest correlation coe cient of 0.38 (P < 0.001) (Fig. 2C).

Identi cation of mRNAsi and prognosis-related genes
The KM analysis of the DEGs was performed and a total of 245 of the DEGs had signi cantly correlated with the 3-year, 5-year survival of HCC patients (Supplemental Table 2).
To evaluate the relationship between mRNAsi and the survival of HCC patients, we compared the prognostic-related DEGs and genes in all modules. There were not prognostic-related genes in the magenta module. Most prognostic-related DEGs were also appeared in the turquoise (55 DEGs) and blue (50 DEGs) modules. However, the correlation coe cient between the turquoise module and mRNAsi was only 0.12 (Fig. 2C). So, we subsequently analyzed the mRNAsi and prognosis-related DEGs in the blue module.
With cor. GS > 0.3, cor. MM > 0.7, and P-value < 0.05, a total of 51 hub genes were obtained in the blue module ( Fig. 2D, Supplemental Table 3) and 49 genes were obtained in the magenta module (Fig. 2E). The genes in the magenta module were not considered due to their poor prognostic value. Among the hub genes in the blue module, 9 hub genes were prognostic-related genes, including PSMG3, SNRPD1, DTYMK, PIGU, NME1, TXNL4A, IPO4, PES1, and REXO4 (Fig. 3). A better prediction e ciency of 1 year, 3 years, and 5 years of the 9 hub genes was obtained.

Veri cation of expression of mRNAsi and prognosis-related genes in HCC samples
Basing on the expression of those hub genes, the HCC and normal tissue samples were clustered. All the 9 hub genes were signi cantly upregulated ( Fig. 4A, P < 0.001). The expression of 9 genes was positively correlated, and the correlation coe cient was greater than 0.5 ( Fig. 4B, P < 0.001). NME1, DTYMK, SNRPD1 and TXNL4A in a PPI network, and REXO4, PES1 and IPO4 in another PPI network (Fig. 4C). PIGU and PSMG3 were separate with other hub genes, respectively (Fig. 4C).

Identi ed two clusters of HCC samples base on mRNAsi and prognosis-related genes expression
We clustered the tumor samples according to stemness index and prognosis-related gene expression. The clustering results showed that skew of cumulative distribution function (CDF) towards 0 at k = 2 (Fig. 5A), and areas of CDF at k = 2 and k = 3 were bigger than other cluster numbers (k) (Fig. 5B). However, the robustness at k = 3 is poor (Fig. 5C), suggesting that the tumor samples divided into two clusters was the optimal choice. To determine whether our classi cation was correct, we analyzed the clustering of the two clusters of samples using principal component analysis (PCA), the distribution of the two subgroups was signi cantly different from each other (Fig. 5D).
Compared with the clinical outcomes and clinicopathological features between two sample subgroups In order to further analyze the differences between the HCC samples of the 2 clusters, we used KM survival analysis and one-way ANOVA to analyze the differences in survival and clinicopathological characteristics of the two clusters of patients. The results showed that the KM survival curves of OS of the two clusters were signi cantly different, and the survival rates of cluster 1 patients within 9 years were signi cantly greater than that of cluster 2 patients (Fig. 6A, P < 0.01). Consistent with the survival, the prognosis of cluster 1 was better than that of cluster 2 (alive vs. dead, P < 0.05; Fig. 6B). Comparing the differences in hub genes expression and clinicopathological characteristics between the two subgroups, the expression of mRNAsi and prognosis-related genes in cluster 2 was signi cantly higher than that in cluster 1, and the T, stage, and grade between the two subgroups were signi cantly different ( Fig. 6B, P < 0.05). The results suggest that the abnormally high expression of mRNAsi and prognosisrelated genes is closely related to clinicopathological characteristics and prognosis.

Functional annotation of mRNAsi and prognosis-related genes
To further understand the biological function of the mRNAsi and prognosis-related genes, we performed GO and KEGG enrichment analysis ( Fig. 7 and Table 1). GO analysis includes three parts of biological process (BP), cellular component (CC), and molecular function (MF). BP analysis showed that the top 10 enriched terms were pyrimidine nucleoside triphosphate biosynthetic process, ribonucleoprotein complex biogenesis, pyrimidine nucleoside triphosphate metabolic process, nucleobase-containing small molecule interconversion, pyrimidine nucleotide biosynthetic process, pyrimidine-containing compound biosynthetic process, pyrimidine nucleotide metabolic process, spliceosomal complex assembly, pyrimidine-containing compound metabolic process, and nucleoside diphosphate metabolic process. The signi cantly enriched BP terms were all associated with nucleic acid metabolic processes, and the major enriched genes were DTYMK and NME1 ( Fig. 7A and Table 1). Table 1 The enriched GO terms and pathways with associated hub mRNAsi and prognosis-related genes.  7A and Table 1).
The KEGG pathway analysis showed that DTYMK and NME1 enriched in pyrimidine metabolism, SNRPD1 and TXNL4A enriched in spliceosome, and PIGU enriched in glycosyl phosphatidylinositol (GPI)anchor biosynthesis pathways ( Fig. 7B and Table 1).

Prognostic value of mRNAsi and prognosis-related genes
To better understand the prognostic role of mRNAsi and prognosis-related genes in HCC, we performed univariate Cox regression analysis based on the expression status of each hub gene, with the patient's prognosis as the outcome variable, in the analysis we assigned a risk score to each patient, patients were divided into high-and low-risk subgroups (Supplemental Table 4 (Fig. 8A).
As all genes were signi cant in univariate cox regression analysis, we included all genes in the multifactorial regression model for calculation, which showed that PIGU was an independent prognostic factor (HR = 2.01, 95% CI = 1.51-2.66). For the results of multifactorial regression analysis, we applied KM survival analysis to establish a survival model for the difference in OS between the high-risk and lowrisk subgroups, and the results showed that there was a signi cant difference in the survival of patients in the high-risk and low-risk subgroups at different times ( Fig. 8B, P < 0.001), and ROC for model validation showed that the prediction e ciency of the model was 0.672 (Fig. 8C), suggesting that the model has a good predictive role.
Gene set enrichment analysis (GSEA) delineates biological pathways basing on the risk score We performed GSEA analysis based on the risk scores and found that in the high-risk subgroup, GO analysis were predominantly enriched in RNA catabolic process, regulation of cell cycle phase transition, cell cycle G2-M phase transition, mRNA cis splicing via spliceosome, non-canonical wnt signaling pathway, rRNA catabolic process, negative regulation of response to DNA damage stimulus, and regulation of intrinsic apoptotic signaling pathway ( Fig. 9A and Supplemental Table 5). KEGG analysis was predominantly enriched in spliceosome, RNA degradation, pyrimidine metabolism, DNA replication, n glycan biosynthesis, cell cycle, RNA polymerase, pathways in cancer, wnt signaling pathway, mismatch repair, etc. (Fig. 9B and Supplemental Table 5).
In the low-risk subgroup, the GO analysis enriched in cellular amino acid catabolic process, alpha amino acid catabolic process, aspartate family amino acid catabolic process, aromatic amino acid family catabolic process, branched chain amino acid catabolic process, glutamate metabolic process, sulfur amino acid metabolic process, monocarboxylic acid metabolic process, alpha amino acid metabolic process, etc. (Fig. 9C and Supplemental Table 5). The KEGG analysis enriched in tryptophan metabolism, valine leucine and isoleucine degradation, glycine serine and threonine metabolism, arginine and proline metabolism, tyrosine metabolism, phenylalanine metabolism, alanine aspartate and glutamate metabolism, pyruvate metabolism, and starch and sucrose metabolism ( Fig. 9D and Supplemental Table 5).
Relationship between mRNAsi and prognosis-related genes and clinical characteristics Further analysis of the correlation between 9 hub genes and clinical characteristics (Fig. 10) revealed that PSMG3 was signi cantly associated with grade (Fig. 10A), SNRPD1 was signi cantly associated with age and grade (Figs. 10B-C). DTYMK was signi cantly associated with age, stage, grade, and T (Figs. 10D-G). PIGU was signi cantly associated with gender, stage, grade, and T (Figs. 10H-K). IPO was signi cantly associated with grade (Fig. 10L). NME1 was signi cantly associated with stage, grade, and T (Figs. 10M-O). TXNL4A and REXO4 were signi cantly associated with stage, grade, and T (Figs. 10P-U). It was not observed signi cant association between PES1 and the clinical characteristics.
Relationship between mRNAsi and prognosis-related genes and immune microenvironment The relationships between 9 hub genes and immune subtypes were analyzed (Fig. 11). It was demonstrated that patients with C3 and C5 immune subtypes had better survival than those with C4 (Tamborero et al., 2018). No HCC patients were characterized to the C5 immune subtype (Fig. 11A). All the 9 hub genes were highly expressed in LIHC samples, and their higher expressions were correlated with C1, C2, and C4, indicated they promote angiogenesis, M1/M2 macrophage polarization, and M2 response, thus associated with prognosis of HCC.
In the microenvironment analysis (Fig. 11B), the 9 hub genes were negatively correlated with the stroma (stromal score) (P < 0.001), and SNRPD1 and TXNL4 were positively correlated with immune in ltrate (immune score) (P < 0.001). There was a negative correlation between NME1 and tumor purity (ESTIMATE score) (P < 0.05).

Discussion
The origin of hepatoma stem cells is still unclear, and it is generally believed that genetic mutation of normal hepatic stem cells (Oishi et al., 2014) and various factors-induced mutation of some mature hepatocytes to form hepatoma stem cells via de-differentiate and regain the ability to continue to proliferate and differentiate (Bu and Cao, 2012;Karakasiliotis and Mavromara, 2015). Some hub genes and pathways associated with the cancer stem cell characteristics in LIHC were previously identi ed using WGCNA analysis of transcriptome stemness index (Bai et al., 2020). In this study, we found that PSMG3, SNRPD1, DTYMK, PIGU, NME1, TXNL4A, IPO4, PES1, and REXO4 are potential novel biomarkers for prognosis of HCC and functional mechanism exploration of the CSCs in HCC. Those mRNAsi-and prognosis-related genes in HCC was signi cantly higher than that in normal liver tissues. The expression of those hub genes was con rmed by HPA databases, except TXNL4A. The mutation of REXO4 was implied in prolactinoma (Melo et al., 2016). The prognosis of HCC patients with overexpression of those genes was poorer, showing the 9-gene signature is a credible prognostic indicator of HCC. In addition, the nine genes were negatively correlated with stroma in the microenvironment. SNRPD1 and TXNL4 were positively correlated with immune cell in ltration, and NME1 was negatively correlated with the tumor purity, suggesting that those genes may affect the purity of tumor cells and mediate the microenvironment stimulus.
Proteasome assembly chaperone 3 (PSMG3) is signi cantly correlated with tumor grade. However, its antisense strand PSMG3-AS1 that is a lncRNA partially complementary to PSMG3, was found to promote the progression of breast, liver, and lung cancers by increasing the stability of PSMG3 (Cui et al., 2020;Yue et al., 2020;Zhang et al., 2020). PPI analysis showed that PSMG3 not interacted with other identi ed hub genes, its function in HCC should be investigated in the future.
Small nuclear ribonucleoprotein Sm D1 (SNRPD1) is a novel gene signature for prognosis prediction in ovarian cancer . Overexpression of SNRPD1 is signi cantly associated with poor survival of lung adenocarcinomas (Yi et al., 2020). It was demonstrated that SNRPD1 is associated with mesenchymal stem cell osteogenic differentiation by WGCNA (Yang et al., 2019). We found that SNRPD1 is signi cantly enriched in the spliceosome pathway, which might contribute to maintaining the stem cell characteristics of liver CSCs.
Deoxythymidylate kinase (DTYMK) is involved in dTTP biosynthesis, which is part of pyrimidine metabolism. LKB1 mutations in human non-cellular lung cancer exhibit a positive effect on DTYMK sensitivity, suggesting that DTYMK is a potential therapeutic target for lung cancer (Liu et al., 2013). In low-differentiated liver cancer, pyrimidine metabolism rate-limiting enzyme is a signature gene of cancer stemness, which is associated with poor prognosis (Yeh et al., 2017), which is consistent with our results.
Nucleoside diphosphate kinase A (NME1) is a member of the NME gene family (Puts et al., 2018). Extracellular NME has a role in promoting stem cell pluripotency and inducing central nervous system development as well as neuroprotective effects (Romani et al., 2018). Low expression of NME1 is associated with poor prognosis, low survival rate, and lymph node in ltration in breast cancer, ovarian cancer, and melanoma (Hartsough and Steeg, 2000). However, overexpression of NME1 is a poor prognostic signature for HCC (Yang et al., 2017;Hindupur et al., 2018). Functional analysis revealed that NME1 enrichment is associated with pyrimidine metabolism.
Thioredoxin like 4A (TXNL4A) is a pre-mRNA splicing factor gene, which is upregulated in a maternal HF diet caused hepatic hypermethylation in male mice (Seki et al., 2017), and endometrial cancer . The role of TXNL4A in the transformation of cells in HCC tumors to stem cells remains unclear.
Importin-4 (IPO4) acts as a nuclear transport receptor for nuclear protein import. IPO4 was found to be involved in poor prognosis in primary gastric cancer . IPO4-mediated CEBPD nuclear import can increase cervical cancer chemosensitivity by inhibiting PRKDC-driven DNA damage repair, thereby increasing cervical cancer incidence (Zhou et al., 2020). However, IPO4 has not been reported in hepatocellular carcinoma. We found that IPO4 is associated with the grade in HCC. IPO4 might inhibit DNA damage repair of HCC cells, which should be studied in the future.
Cell growth of liver CSCs was regulated by CD44 via a miR-105-5p/PSE1 signaling pathway (Wei et al., 2019). Our study showed that PES1 was not signi cantly correlated with clinical stages and grades. The relationship between CSCs and clinical stages and grades should be carefully considered.
Phosphatidylinositol glycan anchor biosynthesis class U protein (PIGU) is involved in the recognition of either the GPI attachment signal or the lipid portion of GPI (Hong et al., 2003). PIGU overexpression was found to express differentially in different stages of HCC and could be used for prognostic strati cation of HCC patients (Cao et al., 2019;Sarathi and Palaniappan, 2019). We found that PIGU is associated with gender, grade, stage, and T, and is enriched in spliceosomes(El Marabti and Younis, 2018), pyrimidine metabolism (Yeh et al., 2017), DNA repair (Gerson et al., 2006;Maugeri-Saccà et al., 2012) and Notch signaling pathway, etc. The Notch pathway promotes the tumor stem cell properties of CD90 + cells in HCC (Luo et al., 2016). Thus, PIGU might play an important role in HCC CSC stemness.
To sum up, our study found a series of stemness index and prognosis-related genes in HCC. These genes can be used as biomarkers to assist in the diagnosis of tumor stem cells in hepatocellular carcinoma. Among these genes, PIGU can serve as an independent prognostic factor for the survival of HCC patients.
All the 9 hub genes were highly expressed in LIHC samples, and their higher expressions were correlated with C1, C2, and C4, indicated they promote angiogenesis, M1/M2 macrophage polarization, and M2 response, thus associated with prognosis of HCC. Angiogenesis is critical for tumor development and therapy resistance (Zeng et al., 2019;Zeng and M.Fu, 2020). Further function and mechanism investigations need to be taken out for validating the roles of those hub genes.

Declarations ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable.

DATA AVAILABILITY STATEMENT
The datasets supporting the conclusions of this article are included within the article and its additional les.
Supplemental Table   Supplemental Table 5 is not available with this version