Identification of biomarkers of hepatocellular carcinoma gene prognosis based on immune-related lncRNA signature of transcriptome data

Validated the established lncRNA signature of 343 patients with HCC from The Cancer Genome Atlas (TCGA) and 81 samples from Gene Expression Omnibus (GEO). Immune-related lncRNAs for HCC prognosis were evaluated using Cox regression and Least Absolute Shrinkage and Selection Operator (LASSO) analyses. LASSO analysis was performed to calculate a risk score formula to explore the difference in overall survival between high- and low-risk groups in TCGA, which was verified using GEO, Gene Ontology (GO), and pathway-enrichment analysis. These analyses were used to identify the function of screened genes and construct a co-expression network of these genes. Using computational difference algorithms and lasso Cox regression analysis, the differentially expressed and survival-related immune-related genes (IRGs) among patients with HCC were established as five novel immune-related lncRNA signatures (AC099850.3, AL031985.3, PRRT3-AS1, AC023157.3, MSC-AS1). Patients in the low - risk group showed significantly better survival than patients in the high risk group ( P = 3.033e−05). The signature identified can be an effective prognostic factor to predict patient survival. The nomogram showed some clinical net benefits predicted by overall survival. In order to explore its underlying mechanism, several methods of enrichment were elucidated using Gene Set Enrichment Analysis. five lncRNA signatures has important clinical implications for predicting patient outcome and guiding tailored for with with further prospective validation. surgical resection, transplantation, ablation,


Background
Hepatocellular carcinoma (HCC) is the sixth most common cancer in the world and the fourth leading cause of cancer-related death, and its annual incidence is increasing rapidly (Bray et al., 2018). Treatments for HCC include surgical resection, transplantation, ablation, transarterial chemoembolization, and tyrosine kinase inhibitors such as sorafenib, lenvatinib, and regorafenib have proven to be effective treatments for improving patient survival (Forner et al., 2018). The latest research shows that through immune checkpoint treatment, cytotoxic T lymphocyte antigen 4 (CTLA-4) and programmed death 1 (PD-1) blocking antibodies, and chimeric antigen receptor (CAR) T cells can be used to neutralize cancer cells. Immunotherapy drugs have shown beneficial results in the treatment of other solid tumors (such as kidney and lung cancers, and melanoma) (Thoma, 2018, Rheinheimer et al., 2020, Rodriguez-Cerdeira et al., 2017. At the American Society of Clinical Oncology Annual Meeting 2019 (ASCO 2019), the phase I/II CheckMate-040 trial results showed that a combined objective response rate of nivolumab (PD-1 antibody) and ipilimumab (CTLA-4 antibody) was 31% and the longest duration of response was 17.5 months. These results indicate that immunotherapy has potentially beneficial effects on HCC.
In the past few years, long non-coding RNA (lncRNA) has attracted attention in tumor research. They are a class of non-coding RNAs (ncRNAs) longer than 200 nucleotides that have no protein-coding potential (Gao et al., 2020). At present, there are more than 20 lncRNAs being studied for human cancer and medical applications. They are differentially expressed in different tissues and cancers (including HCC) (Wu and Du Y, 2017). Although the role of most lncRNAs in liver cancer is still not understood fully, a small number of them have been extensively studied. For example, lncRNA HULC is upregulated in liver cancer and has been shown to promote liver cancer growth, metastasis, and drug resistance (Emmrich et al., 2014, Lee et al., 2015, Huang et al., 2020, Chu et al., 2011. These important roles and unique characteristics of lncRNA indicate that they have potential clinical value in the diagnosis and treatment of liver cancer. In this study, a genome-wide analysis was performed for determining the lncRNA expression profile of patients with HCC from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO) databases. This expression profile aided identification of potential immune-related lncRNA signals to predict the survival of patients with HCC. We determined the prognostic markers of five immune-related lncRNAs from the TCGA and GEO datasets. The results obtained in this study can help predict the prognosis of patients with HCC, provide tailored treatments, and further prospective verification.

Differentially expressed immune genes co-expressed in TCGA and GEO databases and their interactions
After raw data pre-processing, samples from 343 patients with HCC (with a survival time longer than 30 days) and 50 healthy subjects contained in TCGA (using only patient samples) were used. An additional 81 HCC samples were retrieved from the GSE27150 data set. These samples were analyzed on the immune-related database through the Gene Set Enrichment Analysis (GSEA; https://www.gseamsigdb.org/gsea/index.jsp). We used two datasets, IMMUNE_RESPONSE (M13664) and IMMUNE_SYSTEM_PROCESS (M19817), to obtain immune-related genes (IRGs). A total of 331 IRGs were downloaded from the TCGA database. Then, by selecting the intersecting genes, 42 IRGs coexpressed in the TCGA and GEO databases were obtained. Subsequently, a total of 24 differentially expressed genes (DEGs) were identified, including five downregulated and 19 upregulated genes. The expression of each gene is displayed in the heat map (Fig. 1a), and the volcano map shows the distribution of genes (Fig. 1b).
Subsequently, the protein-protein interaction (PPI) network of the DEGs was constructed and visualized using the STRING (Search Tool for the Retrieval of Interacting Gene) database, as shown in (Fig. 1c). A total of 22 nodes and 47 edges were present in the PPI network, and the histogram shows the node number of each gene (Fig. 1d).

The acquisition of risk genes and their correlation with clinical prognosis and their previous correlation
In order to obtain IRGs related to clinical prognosis from the TCGA database, univariate and lasso Cox regression analyses were performed. P < 0.001 was set as statistically significant in univariate Cox regression analysis. The prognostic gene signature was calculated as: Risk Score R package "survival" was used to explore the best dividing lines of patient risk score, and the Kaplan-Meier survival curves were constructed. Eight risk genes were obtained (Table 1), and the Kaplan-Meier survival curve (Fig. 2) demonstrates that these eight genes were related to prognosis in both TCGA and GEO databases and shows that the high-risk group had lower 5-year survival rate than the low-risk group of patients with HCC (P < 0.001). Next, construction and analysis of the correlation matrix between different genes (Fig. 2) revealed that the IRG MAP3K7 had the highest correlation with HELLS (helicase, lymphoid specific) gene, and the Spearman's rank correlation coefficient (rs) was calculated to be 0.46. All other rs values were ≤ 0.4. The "CorrPlot" R package was utilized to create this figure.

Acquisition of immune-related risk lncRNAs and principal component analysis
To identify the immune-related lncRNAs, the lncRNAs in the TCGA database were co-expressed with eight previously related IRGs and checked for correlation using the "limma" R package. P values were considered statistically significant for the total score when < 0.001. Therefore, we obtained 678 differentially expressed lncRNAs, 11 of which were related to negative regulation, and the rest were related to positive regulation.
Subsequently, univariate and multivariate analyses of survival prognostic factors of these immunerelated lncRNAs was performed using "survival" R package. In univariate Cox analysis, hazard ratio (HR; 95% confidence interval (CI)) > 1 indicated high risk, and HR < 1 indicated low risk. P value (survival time) < 0.0002 was considered statistically significant. A total of 14 related lncRNAs were obtained, of which one was low risk, and 13 were high risk (Fig. 3a). Then, multivariate Cox analysis was performed to obtain five immune-related risk lncRNAs (Table 2).
Then, we conducted principal component analysis (PCA) on immune genes, immune-related lncRNAs, and immune-related risk lncRNAs to test the ability to identify the prognosis of patients with HCC. The results show that immune genes (green and red circles) were clustered together, and immunerelated lncRNAs (red-green and red circles) were more dispersed. Immune-related risk lncRNAs (redgreen and red circles) were essentially separated (Fig. 3b).

Correlation between prognosis model and clinicopathological characteristics
The TCGA database included 343 patients with information on their clinical features such as gender, age, T (tumor), grade, clinical stage, lymph node metastasis, and survival time and status. Selecting 1.0 to be the median risk score for the entire discovery group as the cutoff value, 343 patients were classified as high-risk or low-risk patients (Fig. 4). The figure also shows the expression value, risk score, and survival time and status of the five lncRNAs of each patient. Further analysis of the data revealed that the Kaplan-Meier curve shows that the survival probability of the high-risk patients was significantly lower than that of the low-risk patients, and the P value of the log-rank test was 3.033e−05 (Fig. 5a).
In addition, the area under the curve (AUC) of patients' survival time from the TCGA database revealed that the AUC of risk genes = 0.762, the AUC of stage = 0.743, and the AUC of tumor (T) stage = 0.752, indicating that immune-related lncRNA signatures have a moderate potential to monitor survival (Fig.   5b). Compared with other clinicopathological features, our model achieved the largest AUC value, which also reflects its excellent predictability.
We then analyzed the relationship between the expression of risk genes and clinical features such as T, grade (G), and stage (S). In the TCGA database, the expression of AC099850.3, AL031985.3, and PRRT3-AS1 was found to be associated with T, AL031985.3, PRRT3-AS1, and MSC-AS1 were found to be associated with grade, and AC099850.3 and AL031985.3 were found to be related to the stage (P < 0.05; Fig. 5c).

Building and validating the predictive nomogram
The TCGA data set included 343 patients with complete clinical information. Based on the stepwise Cox regression model, a prognostic nomogram that predicts the overall survival rate at 1, 2, and 3 years was established (Fig. 6). The parameters such as risk score, age, gender, grade, stage, and tumor, node, metastasis (TMN) were included in the nomogram.

Gene set enrichment analyses
We further analyzed immune-related lncRNAs and the enrichment score from the GSEA analysis.
This analysis used the permutation lists, and the leading-edge proteins were considered the hits before the enrichment score for the gene set. GSEA was performed for elucidating Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) analysis. For GO pathways, most enrichment pathways in high-risk patients were linked to cell cycle, gene signal, and RNA binding. In addition, in the low-risk patients, most protein-related pathways were found to be enriched. KEGG pathway analysis showed that in the high-risk group, most cell cycle-related pathways were found to be enriched, and in the low-risk patients, most metabolic pathways were found to be enriched (Fig. 7).

Discussion
As a heterogeneous disease with poor prognosis and high mortality, it is crucial to investigate novel biomarkers that can effectively predict the survival of patients with HCC . In recent years, with the development of sequencing, many human transcripts without protein coding potential called non-coding RNAs (ncRNA), including microRNA (miRNA), small nucleolar RNA (snoRNA), long non-coding RNA (lncRNA), and circular RNA (circRNA) have been identified. Our understanding of lncRNAs has greatly expanded, and it is known that they are more abundant than standard protein-coding mRNA Conn, 2017, Shi et al., 2013). LncRNAs also have important roles in cancer immunity, such as antigen release, immune cell migration and filtration, antigen presentation, and immune activation , Atianand et al., 2017. Immunotherapy for liver cancer has also received extensive attention (Dal Bo et al., 2020, Sim andKnox, 2018). TCGA is an open database that contains samples of hundreds of patients with various malignancies. The GEO database has been widely used in bioinformatics analysis because it is a comprehensive database that stores large amounts of gene expression data. In this study, we downloaded RNA sequencing data from TCGA and GEO databases, and obtained the lncRNA expression profiles of patients with HCC in the dataset for analysis.
In this study, we used a TCGA training cohort including 343 patients with HCC, examined the relationship between the expression of immune-related lncRNAs and the prognosis of patients with HCC, and verified their effectiveness in the GSE27150 dataset. First, by co-expressing genes from two databases and performing Cox regression analysis, 24 genes related to immunity were obtained.
Univariate and lasso Cox regression analyses were used to identify related immune genes to predict and construct prognostic gene signatures. Eight genes related to immunity were generated as prognostic factors for HCC. Then, through co-expression, univariate, and multivariate analyses, five immunerelated lncRNAs affecting the prognosis were obtained. The overall survival rate of patients was found to be significantly correlated with lncRNAs. Through multiple Cox regression analysis and risk-scoring methods, five immune-related lncRNA signals were obtained, and patients with HCC were divided into high-risk and low-risk groups. Survival analysis showed that the overall survival rates of the two groups were significantly different. Since there was a possibility of overturning or five lncRNA signatures being false positive, we used a separate cohort of 343 patients with HCC to further validate the prognostic value of the five lncRNA signatures. The results of the independent verification showed that the five lncRNA signatures had good repeatability and robustness in predicting the prognosis of patients with HCC. However, due to molecular heterogeneity, traditional clinical predictors had limited success in predicting the survival of patients with HCC. The results of multivariate analysis indicate that the five lncRNA signatures did not depend on traditional clinical risk factors and molecular characteristics.
In addition, in order to detect the correlation of the survival time of patients with HCC using the AUC, we found that the AUC of five immune-related lncRNAs = 0.762, then the AUC at S = 0.743, and the AUC at T = 0.752, indicating that the related lncRNA signal has medium potential. Compared with other clinicopathological features, our model obtains the largest AUC value, which also reflects its excellent predictability. In addition, we found a close relationship between dangerous lncRNA expression and clinicopathological factors (T, G, S). In order to make lncRNAs more effective and clinically meaningful, we established and validated the predicted nomogram, which was able to effectively determine the patient's risk. Finally, the results of GSEA showed that risk lncRNAs are associated with multiple pathways in KEGG and GO. On the other hand, the results further demonstrate a strong link between the signature and the immune system, and indicate the potential of signatures in predicting the survival of patients with HCC.

Conclusions
In summary, this data analysis study identified five immune-related lncRNA signals that can predict the prognosis of patients with HCC. Our findings also provide a basis for the selection of therapeutic targets and prognostic molecular markers.

Identification of DEGs
The immune-related gene set in the GSEA database (gene set: IMMUNE_RESPONSE, M19817, and IMMUNE_SYSTEM_PROCESS, M13664) was identified, then the IRGs were obtained using genetic data from TCGA through the co-expression method (co-data moderated random forest (CoRF) > 0.4, P < 0.001). Then, using the intersection, 24 IRGs co-expressed in the TCGA and GEO databases were obtained. Subsequently, DEGs were identified using "limma" package (http://www.bioconductor.org/) in R software version 3.6.2. The threshold value of DEGs was set as logfold change (FC) > 1.0 and P value < 0.05. Further, "pheatmap" package in R software was used for clustering analysis.

PPI network analysis
The PPI network was constructed based on online search databases (STRING; http://string-db.org) (version 11.0) (Franceschini et al., 2013) for identifying interactive genes to make predictions. Analyzing functional interactions between proteins may provide insights into disease development or development mechanisms, including direct (physical) and indirect (functional) associations. In this study, the STRING database was used to evaluate their functional associations and construct a PPI network, and interactions with a combined score > 0.4 were considered statistically significant. Subsequently, R software was employed to obtain the histogram of node number.

Construction of the prognostic immune gene signature
Univariate and lasso Cox regression analyses (O'Connor, 2000) were used to identify relevant immune genes to predict and construct prognostic gene signatures. P < 0.001 was considered statistically significant in univariate Cox regression analysis. The prognostic gene signature was calculated as: Risk Score = (coefficient of gene1 × gene1 expression) + (coefficient of gene2 × gene2 expression) + ⋯ + (coefficient of geneN × GeneN expression). The R package "survival" was used to explore the best dividing line of patient risk score, and draw Kaplan-Meier survival curve (P < 0.001). Next, the "CorrPlot" R package was used to construct the correlation matrix between different genes.

Acquisition of immune-related risk lncRNAs and PCA
The "limma" R software package was used to screen differential genes between lncRNA and the eight genes obtained, and the absolute value of cor > 0.4. When cor is positive, it is related to positive regulation and vice versa. P < 0.001 indicates statistically significance.
Then, we conducted univariate and multivariate analyses on the survival prognostic factors of these immune-related lncRNAs using the "survival" R package. In univariate Cox analysis, HR (95% CI) > 1 indicated high risk, and HR < 1 indicated low risk. P value (survival time) < 0.0002 indicated statistical significance.

Correlation between prognosis model and clinicopathological characteristics
Using the "pheatmap" package for analysis, the median risk score of the entire discovery group was set to 1.0 as the cutoff value, and 343 patients with HCC were divided into low-risk and high-risk patients.
The relationship between the expression value of five lncRNAs and risk score, survival time, and status of each patient was analyzed. The "survival" and "survminer" packages were used to obtain Kaplan-Meier curves and P < 0.01 was considered to be statistically significant. The R software package "survivalROC" was used to study the prognostic value of gene signatures over time (Lambert and Chevret, 2016). Then, the R software package "ggpubr" was used to analyze the relationship between risk gene expression and clinical characteristics (such as T, G, and S).

Building and validating a predictive nomogram
Nomogram (Iasonos et al., 2008) was established by including all the prognostic factors. The calibration chart and the consistency index were used to study the calibration and identification of the nomogram (using the bootstrap method with 1,000 replicates), respectively. Based on the stepwise Cox regression model, the prognostic nomograms predicting the total survival time of 1, 2, and 3 years were established.

Statistical analysis
R (version 3.6.2) was used to analyze the data. The "limma" package was used to identify DEGs.
The threshold value of DEGs was set to log2 fold change (FC) > 1.0 and P value < 0.05. The "survival" and "survminer" R packages were used to explore the best dividing line of risk score, and draw the Kaplan-Meier survival curve (P < 0.001). The "CorrPlot" R package was used to construct the correlation matrix between different genes. In univariate Cox analysis, HR (95% CI) > 1 indicated high risk, and HR < 1 indicated low risk. P value (survival time) < 0.0002 indicated statistical significance. The R software package "survivalROC" was used to study the prognostic value of gene signatures over time.
The R software package "ggpubr" was used to analyze the relationship between risk gene expression and clinical characteristics.

Declarations
Ethics approval and consent to participate: Ethical approval was waived since we used only publicly available data and materials in this study.

Competing interest:
All authors have read and approved to submit it to your journal. There are no conflicts of interest of any authors in relation to the submission.  PCA for all immune genes, immune lncRNAs, and risk lncRNAs; red dots represent high-risk genes, and green dots represent low-risk genes