Bioinformatic Analysis of Key Genes Related to Tumor Microenvironment in Hepatocellular Carcinoma


 Background: Tumor microenvironment (TME) plays important roles in the development of different types of cancer. However, the critical regulatory members of TME related to hepatocellular carcinoma (HCC) remain unclear. In this study, a bioinformatic analysis based on Cancer Genome Atlas (TCGA) and Estimation of STromal and Immune cells in Malignant Tumor tissues using Expression data (ESTIMATE) datasets was conducted to predict the key genes affecting TME in HCC.Material and Methods: First, 340 patients and 20531 genes’ expression data with ESTIMATE scores were filtered and combined to identify differentially expressed genes. Next, protein-protein interaction (PPI) network and functional enrichment analysis were conducted to find hub genes. Then, log-rank test and functional enrichment analysis were conducted on the consensus genes and hub genes. Finally, Kaplan-Meier curves of the hub genes were drawn. As verification, those genes were searched on Oncomine database.Results: Among all differentially expressed genes, 916 genes were expressed in both the immune and stromal groups. The Gene Ontology (GO) terms they enriched were T cell activation, leukocyte migration, collagen-containing matrix, external side of plasma membrane, receptor ligand and activator activity. Cytokine-cytokine receptor interaction was the most significant Kyoto Encyclopedia of Genes and Genomes (KEGG) term. Furthermore, cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 hub genes were low expressed in 304 patients, participating in a variety of responses including immune response−activating cell surface receptor signaling, immune response−activating signal transduction, clathrin−coated vesicle membrane, immune receptor activity， peptide binding and amide binding pathways. Their low expression was also verified on Oncomine database.Conclusion: cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 participated in many aspects related to TME, and their low expression constructs a signature, may predict a poor 5 years’ survival in hepatocellular carcinoma.

Conclusion: cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 participated in many aspects related to TME, and their low expression constructs a signature, may predict a poor 5 years' survival in hepatocellular carcinoma.

Background
Liver cancer was the sixth most commonly diagnosed cancer and the third leading cause of cancer deaths worldwide in 2020 [1]. Hepatocellular carcinoma (HCC) comprises about 75-85% of liver cancer cases. Despite the developments in diagnostic methods, surgical techniques and novel drugs, late diagnosis and malignancy lead to poor long-term survival rate in HCC. The median survival following diagnosis is approximately 6-20 months [2]. Thus, discovering predictive biomarkers for HCC is vital for early diagnosis, effective treatment and better prognosis.
Apart from novel predictors like lncRNAs and ciRNAs [3,4], factors involved in immune system are also worth exploring in cancer. One such factor, the tumor microenvironment (TME) is de ned as the cellular and physical environment surrounding the primary tumor, which has recently become a research hotspot [5]. It has cellular components like in ammatory and immune cells, stromal components like extracellular matrix (ECM) proteins, soluble cellular factors, etc [6]. The interactions of tumor cells with these molecules and cells in the TME can in uence the biological behavior of cancer cells and the prognosis of cancer patients [7]. For example, a selenoprotein known as glutathione peroxidase 3 (GPx3), has a dichotomous role in different tumor types, which means it could act as both tumor suppressor and tumor promoter [8]. Therapeutic strategies based on TME have been tested in clinical trials and showed promising results [9].
Meanwhile, as an excellent tool, The Cancer Genome Atlas (TCGA) database collects gene expression quanti cation data and clinical information of different types of cancers [10]. Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data (ESTIMATE) is a tool for predicting tumor purity, using gene expression data to detect the presence of in ltrating stromal/immune cells in tumor tissues [11]. These values, combined with the expression data have been successfully applied to predict the overall survival rate in different types of cancers, such as glioma and colorectal cancer [12,13]. TME has also been used for survival analysis in other tumors like thyroid cancer [14]. However, there is insu cient evidence of TME in HCC bioinformatic analysis. This study aimed to nd hub genes associated with HCC and TME using TCGA dataset combined with ESTIMATE immune score, therefore to construct a new model to predict the survival rate of HCC.
Samples with immune score and stromal score were separately studied.

Data analysis
All downloaded data were analyzed with R software (R version 4.0.2). Clinical data from 340 patients and 20531 genes' expression information were ltered from the raw data. ESTIMATE algorithm is based on single sample gene set enrichment analysis and generates three scores: 1) stromal score (that captures the presence of stroma in tumor tissue), 2) immune score (that represents the in ltration of immune cells in tumor tissue), and 3) estimate score (that indicates tumor purity). ESTIMATE score can re ect the purity of cancer, along with many biological behaviors and factors affecting tumor microenvironment.
However, due to heterogeneity of different kinds of tumor, ESTIMATE score could show a variety form of results. It could not simply re ect patients' survival, nor the prognosis. Therefore, ESTIMTE score here were used as a lter, and had to combined with clinical research datasets like TCGA, only genes with a meaningful ESTIMATE score were considered participating in further analysis [16]. Immune/stromal score and mRNA expression information were merged based on sample codes. The ESTIMATE scores for each dataset were calculated and patients were divided into two equal groups of high or low ESTIMATE score by median split [17]. Limma package (ver 3.44.3) [18] was used to identify differentially expressed genes between samples with high score and low score. Genes with fold change (FC) more than two times (i.e. |log2 FC| >1) and p-value <0.05 were considered as differentially expressed [19]. Subsequently, genes expressed differentially in both immune score and stromal score groups were considered meaningful in HCC's TME, and underwent further analysis.

Functional enrichment analysis
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were constructed with clusterPro ler (ver 3.16.0) [20] and BiocManager (ver 1.30.10) package. They contained pathways of different aspects of biochemical functions. Only functions with p-values <0.05 were considered to be signi cant.
Protein-protein interaction (PPI) network construction: STRING database (http://string-db.org/, version 11.0) was used to construct the PPI network [21]. For higher con dence, gene-gene interactions with integrated scores >0.90 were selected rather than 0.4 and visualized using Cytoscape (version 3.8.0) [22]. A total of 30 hub genes were ltered using plug-in cytoHubba (ver 0.1) to demonstrate interactions.
Survival analysis and Oncomine veri cation.
These genes underwent survival analysis with Kaplan-Meier method, and curves were generated. Moreover, GO and KEGG analyses were conducted on hub genes, and the difference between overall genes' function and hub genes' function was demonstrated.

Differences in gene expression
We conducted gene expression analysis to identify the genes that were over-expressed or underexpressed in tumor tissues and stromal tissues. A total of 340 patients with immune score and stromal score were carefully screened from the raw data, whose clinical characteristics are shown in Table 1. Heat map showed rst 50 gene expression data between high and low immune/stromal score. Volcano maps with all genes were also drawn ( Figure 1-a, 1-b; Figure 2-a, 2-b). In the immune score group, 50 genes were over-expressed (marked as up), while 1224 genes were under-expressed (marked as down). In the stromal group, 66 genes were up, while 1863 genes were down. Besides, 916 common genes were matched from 1274 differentially expressed genes in the immune score group and 1929 genes in the stromal score group, which were shown in the Venn map ( gure 3). Because key gene expression pro le would be shown in survival analysis, we intersected the differently expressed genes only rather than providing up regulated and down regulated genes together in the Venn map.

Functional enrichment analysis
We performed functional enrichment analysis to further study the consensus genes. GO contained three groups: biological processes (BP), cellular components (CC), and molecular functions (MF), with 12 pathways in each group. T cell activation, collagen-containing ECM and receptor ligand binding were the most enriched GO terms, while cytokine-cytokine receptor interaction was the most signi cant KEGG term (Figure 4-a, 4-b).

Protein-protein interaction (PPI) network
We performed PPI network analysis to elucidate the interactions between consensus genes. The PPI network of differentially expressed genes contained 893 nodes, 3050 edges, and PPI enrichment p-value: <1.0e-16. Isolated nodes and nodes not connected to the main pathway were excluded. After processing with Cytoscape, 78 genes remained in the main stem of the PPI network, of which 30 hub genes were selected using plugin cytoHubba ( Figure 5-a, 5-b) by Degree algorithm.

Functional enrichment analysis on hub genes:
Speci cally, we conducted GO and KEGG pathway analyses within the hub genes to compare with overall consensus genes (Figure 6-a, 6-b). Those genes were expressed in 304 patients, with 234 alive and 70 dead. The immune response−activating cell surface receptor signaling pathway, immune response−activating signal transduction, clathrin−coated vesicle membrane, endocytic vesicle membrane, immune receptor activity, peptide binding and amine binding were enriched in the GO pathway, while Th1, Th2 and Th17 cell differentiation were enriched in the KEGG pathway.

Survival analysis:
We conducted survival analysis on the 30 hub genes. Among those, cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 were considered most signi cant. Kaplan-Meier curve showed that low expression of these genes was linked to poor overall survival of HCC patients within ve years (p<0.05) (Figure 7).

Oncomine veri cation:
We searched cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 on Oncomine database to recheck their expression. The lters were genes, cancer vs normal group, hepatocellular carcinoma, under-expression gene rank and gene rank within top 10%. As a result, they were indeed low expressed in hepatocellular carcinoma in Roessler Liver group, a non TCGA dataset associated group, with all p value<0.05. Their expression data were listed, making the signature more convincing (Figure 8).

Discussion
Hepatocellular carcinoma (HCC) is one of the most common malignancies worldwide, with an extremely poor prognosis [23]. One possible reason is the tumor microenvironment (TME) in HCC can shift from immunocompetent status to immunosuppressive status due to drugs or pathological condition [24]. Because HCC TME is potentially immunogenic and typically characterized by in ammation, immunotherapy has been proposed as a potential novel therapeutic approach, which has prompted studies on various immuno-therapeutic strategies in advanced HCC patients. Therapies like chimeric antigen receptor T (CAR-T) cell therapy, checkpoint inhibitors, and onco-vaccines have shown promising results [25]. These studies were focused on a speci c target provided by either experimental results or novel discovered genes from data analysis. The GO and KEGG pathways' results in this study showed that the immune system involved in HCC consists of response−activating cell surface receptor signaling pathway and immune response−activating signal transduction. Clathrin−coated vesicle, immune receptor activity, Th1, Th2 and Th17 cell differentiation are also involved. However, these concepts are insu cient to elucidate the mechanisms. Therefore, whether to specify the GO/KEGG method or develop a new analysis method should be examined in future research.
Through bioinformatic analysis, this study found several under-expressed hub genes related to HCC. Of these, cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 may participate in TME, and can be used as a signature. These genes were well documented in previous studies. cd3e and cd3g belong to the CD3 receptor group, which is crucial for T-cell development and regulation. Their defect can cause severe combined T and B-cell immunode ciency, which increases the chances of malignancy [26]. In terms of hla genes, rs3077-C, an hla-dp allele, may affect hepatitis B virus surface antigen clearance [27], therefore, mutation or low expression of HLA-related genes may exacerbate HBV-related liver cancer. Since HLA is highly individualized, therapy can be adjusted accordingly [28]. Lymphocyte-speci c proto-oncogene (lck) belongs to SRC family of protein tyrosine kinases (PTKs). The encoded protein is a key signaling molecule in the selection and maturation of T-cell development. It works in Lnc-Tim3 pathway, promoting T cell exhaustion, a phenotype that is correlated with compromised anti-tumor immunity. It may in uence cancer therapies aimed at modulating the acquired immune system as consequence [17]. Mitogenactivated protein kinase kinase kinase kinase 1 (map4k1), also known as hematopoietic progenitor kinase 1 (hpk1) is a negative regulator of T-cell activation along with DUSP22 (JKAP), and DUSP14 [29,30]. However, as a group, their interactions remain vague. They may participate in a common pathway, and need further study before they can be used for cancer diagnosis and survival prediction.
This study was based on bioinformatic analysis on TCGA and Oncomine database. Further analyses, such as cross comparison with GEO database [31], single or multiple factor Cox regression could also be conducted as supplements. Besides, new treatment strategies, new data can affect the prognosis and the analysis per se, therefore, researchers need to update data base timely, making studies based on data analysis more precise and convincing. Although bioinformatics is a well-accepted research method and provides prospective directions for future research, further analysis with datasets, in vitro experiments like PCR or gene knockout are still required to con rm gene expression and function.
Conclusions cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 participated in many processes related to TME, and their low expression constructs a signature, may predict a poor 5 year survival in hepatocellular carcinoma.   Venn plot of 1274 genes expressed differentially in the immune score group and 1929 genes in the stromal score, 916 consensus genes expressed in both immune score and stromal score. Figure 5 a, Protein-protein interaction (PPI) network of 78 differentially expressed genes and identi cation of 30 hub genes. Figure 5-b, PPI network of differentially expressed genes with integrated scores >0.9. Figure 5c, PPI network of 30 hub genes with their degrees.  Oncomine search results showed cd3e, cd3g, hla-dpa1, hla-dpb1, lck, and map4k1 were low expressed in Roessler Liver group, with all their p value<0.05.