Identication of an Epithelial-Mesenchymal Transition-Related Long Non-coding RNA Prognostic Signature for Hepatocellular Carcinoma

Introduction: Hepatocellular carcinoma (HCC) is one of the most common malignant tumors with poor prognosis. Epithelial–mesenchymal transition (EMT) is crucial for cancer progression and associates with a worse prognosis. Thus, we aimed to construct an EMT-related lncRNAs signature for predicting the prognosis of HCC patients. Methods: We built an EMT-related lncRNA risk signature in the training set by using Cox regression and LASSO regression based on TCGA database. Kaplan-Meier survival analysis was conducted to compare the overall survival (OS) in different risk groups. Cox regression was performed to explore whether the signature could be used as an independent factor. A nomogram was built involving the risk score and clinicopathological features. Furthermore, we explored the biological functions and immune states in two groups. Results: 12 EMT-related lncRNAs were obtained for constructing the prognosis model in HCC. The Kaplan-Meier curve analysis revealed that patients in the high-risk group had worse survival than low-risk group. Time-dependent ROC and Cox regression analyses suggested that the signature could predict HCC survival exactly and independently. The prognostic value of the risk model was conrmed in the validation group. The nomogram was built and could accurately predict survival of HCC patients. GSEA results showed that in high-risk group cancer-related pathways were enriched, and exhibited more cell division activity suggested by Gene Ontology (GO) analysis. Conclusions: We established a novel EMT-related prognostic risk signature including 12 lncRNAs and constructed a nomogram to predict the prognosis in HCC patients, which may improve prognostic predictive accuracy for HCC patients and guide the individualized treatment methods for the patients with HCC.


Introduction
Hepatocellular carcinoma (HCC) is one of the most common aggressive tumors that ranks as the fourth leading cause of cancer-related mortality worldwide [1]. Despite the proven e cacies of several treatment methods, such as surgical resection, liver transplantation, ablation therapy, radiotherapy, targeted therapy, systemic chemotherapy, and immunotherapy [2], the prognosis remains unsatisfactory in HCC patients. According to the World Health Organization estimates, more than 1 million patients will die from HCC by 2030, with the 5-year survival rate being no more than 18% [2,3]. Such poor survival rates are mainly due to the low rates of early HCC diagnosis, rapid progression of the disease, resistance to the treatment, high heterogeneity, and lack of early prognostic indicators for HCC. Therefore, there is an urgent need to identify highly sensitive and speci c prognostic biomarkers for predicting the clinical outcomes for HCC patients.
Epithelial-mesenchymal transition (EMT) is a process that leads to morphological and functional changes, in which the polarized epithelial cells convert from epithelial to mesenchymal-like or mesenchymal broblastoid phenotypes, which is characterized by the dissolution of cell-cell junctions, synthesis of extracellular matrix, and increasing cell motility that consequently result in the cells gaining invasive and metastatic properties [4,5]. Numerous studies have shown that EMT plays vital roles in the development and progression of diverse types of cancer, including HCC [6,7]. Transforming growth factor-β (TGF-β), one of the most potent inducers of EMT, is able to promote brogenesis and carcinogenesis in HCC. Furthermore, it is correlated with the invasion and metastasis of HCC [8].
Epithelial cadherin (E-cadherin), an important marker of EMT, is directly associated with the worse prognosis and short survival rates in HCC patients [9]. In a previous study, it was reported that the downregulation of E-cadherin led to the microvascular invasion and metastasis of cells in 323 HCC patients [10]. Moreover, the EMT-transformed HCC cells stimulated with TGF-β were resistant to sorafenibinduced apoptosis [11]. Increased expression of EMT markers may be related to the poor prognosis in HCC patients exhibiting aggressive local recurrence after radiofrequency ablation [5]. Therefore, EMTrelated genes might be used as novel and ideal prognostic and therapeutic targets for the treatment of HCC.
Long non-coding RNAs (lncRNAs), which account for 80-90% of all ncRNAs, structurally contain more than 200 nucleotides. Recently, it have attracted wide attention as biomarkers in the early diagnosis and prognosis tumors, thus lncRNA have been regarded as potential therapeutic targets of tumors [12][13][14].
However, the potential values of EMT-related lncRNAs as prognostic indicators and therapeutic targets have not yet been explored in the patients with HCC. Therefore, in this study, we rst built and veri ed an EMT-related lncRNA prognostic model and nomogram based on the data extracted from The Cancer Genome Atlas (TCGA) and then explored the biological functions and related immune responses of EMT-related lncRNAs in the patients with HCC. A owchart of this study is shown in Fig. 1.

Methods And Materials
Data acquisition and processing The RNA sequencing data of 424 HCC patients and the corresponding clinical data were obtained from the TCGA data portal based on the fragments per kilobase of transcript per million mapped reads (FPKM) values (https://cancergenome.nih.gov/). The genes with low expression levels with an average value < 0.5 were eliminated and the included data were commented and sorted into 710 lncRNAs and 11282 To extract the prognosis-associated EMT-related lncRNAs, we performed the univariate Cox regression analysis and log-rank tests. We then incorporated the prognostic EMT-related lncRNA candidates into the least absolute shrinkage and selection operator (LASSO) regression and multivariate Cox regression to construct the prognostic risk models in the training set. According to the multivariate Cox regression coe cient and expression value of each lncRNA in the patients with HCC, the risk score of each patient was evaluated using the risk score equation as follows: Risk score = coef (lncRNA1) × exp (lncRNA1) + coef (lncRNA2) × exp (lncRNA2) +…+ coef (lncRNAn) × exp (lncRNAn).
where coef (lncRNA) represents the regression coe cient and exp (lncRNA) is the expressive value of each EMT-related lncRNA. The patients were divided into high-and low-risk groups based on the median cut-off of risk scores. We then performed the Kaplan-Meier survival analysis to compare the differences in the overall survival (OS) between the two groups in the training and validation cohorts. Meanwhile, the strati cation analysis was performed based on the following clinicopathological features: age (≤ 65 and Then, univariate and multivariate Cox regression analyses were performed to explore whether the constructed prognostic model could be used as an independent factor to determine the prognosis of HCC patients by integrating the following clinicopathological factors: age, sex, grade, clinical stage; P < 0.05 was considered to be statistically signi cant. A multi-indicator receiver operating characteristic (multi-ROC) curve was drawn and the area under the curve (AUC) of the ROC curve was estimated to assess the predictive accuracy of the risk score and other clinicopathological features for the survival of patients.
In addition, the relationship between the prognostic EMT-related lncRNAs and their co-expressed mRNAs was analyzed using a co-expression network. The Cytoscape v.3.8.2 software (http://www.cytos cape.org/) and "ggalluvial" R package were used to visualize the co-expression correlation. Construction and validation of the predictive nomogram Based on the risk scores and conventional clinical characteristics of the training group, we established an OS-associated nomogram to predict the 1-year and 3-year survival probability of HCC patients by using the "survival" and "rms" R packages. We then performed internal veri cation to assess the predictive precision of this nomogram. A calibration curve was drawn to appraise the uniformity of the predicted result and the ROC curve was sketched to assess the predictive accuracy of the nomogram model.

Enrichment analysis
The transcriptome data of the entire group from TCGA were selected for gene set enrichment analysis (GSEA) using the GSEA v.4.1.0 software (http://www.gsea-msigdb.org/gsea/index.jsp). Based on the prognostic signature, 343 samples were divided into high-and low-risk groups. The c2.cp.kegg.v7.4.symbols.gmt downloaded from MSigDB ( http://www.gseamsigdb.org/gsea/msigdb/collections.jsp ) was used as the reference and a false discovery rate (FDR) of q-value < 0.05 was used to identify the signi cantly enriched pathways. Thereafter, the signi cant differentially expressed genes (DEGs) were extracted from TCGA database depending on the risk of grouping and the statistical signi cance was de ned as FDR < 0.05 and |log2FC| ≥ 1. According to the DEGs, Gene Ontology (GO) analysis was used to evaluate the related biological pathways in the two risk groups.

Immune cell in ltration and immune-related pathways
According to the EMT-related lncRNA prognostic model, single-sample gene set enrichment analysis (ssGSEA) was conducted to quantitatively analyze the differences in the in ltrating levels of immune cells as well as the immune-related pathways between the high-and low-risk groups in the HCC patients by using "GSVA" and "GSEABase" R package [25]. The result was visualized with boxplot using the "ggpubr" R package. P < 0.05 was considered statistically signi cant.

Statistical analyses
Statistical analyses were performed using the R software v.4.0.4 (https://www.r-project.org/ ) and SPSS v.24 (IBM, New York, USA). Chi-square test was used to analyze the baseline characteristics of patients with HCC. LASSO and Cox regression analyses were used to build a prognostic model. First, univariate Cox analysis and log-rank tests were conducted to screen for the prognostic EMT-related lncRNA candidates with P < 0.05. Then, the LASSO algorithm was used to reduce the data dimensionality. The risk score for the 12 selected lncRNAs was calculated using a multivariate Cox model. Kaplan-Meier analysis was used to evaluate the differences in the survival of the two risk groups. The ROC curve was used to estimate the predictive ability of the risk models. The predictive values of the nomogram for 1-year and 3-year OS were assessed using calibration and ROC curves. All statistical tests were two-sided and P < 0.05 was considered statistically signi cant.

Identi cation of EMT-related lncRNAs in HCC
First, the gene expression matrix and clinical information of 374 HCC samples were acquired from TCGA database. After removing the repeated samples, patients with incomplete clinicopathological data and less than 30 d follow-up time, 343 patients with complete follow-up information and 319 patients with complete clinicopathological data were included in this study for subsequent analysis; 343 patients were divided into the training group (n = 240), testing group (n = 103), and entire group (n = 343) at a ratio of 0.7:0.3:1. The clinicopathological features of the HCC patients included in this study are shown in Table   1. Then, 200 EMT-related genes were downloaded from MSigDB and 41 genes were excluded owing to their low expression levels. A total of 159 EMT-related genes were identi ed in HCC derived from the TCGA database. According to the Pearson correlation coe cient analysis, 449 EMT-related lncRNAs were screened based on the ltering criteria of correlation coe cient < 0.3 and P < 0.001.  Fig. 2A-B). Twelve lncRNAs were ltered out to construct a prognostic signature with multivariate Cox regression coe cients, including AC103760.1, AC015908. ; the EMT-related lncRNA prognostic signature was built in the training set. Then, we calculated the risk score of HCC patients in the testing and entire groups according to the above-mentioned formula. HCC patients in the three groups were further divided into two groups: low-risk and high-risk groups (median value = 0.8488). In the training set, Kaplan-Meier survival curves for OS indicated that the HCC patients in the high-risk group had shorter survival than those in the low-risk group (P < 0.001) as shown in Fig. 2C. Time-ROC curve analysis was performed to evaluate the precision of the prognostic model and the results demonstrated that the risk score model provided a precise predictive effect of 1-year, 3-year, and 5-year OS, the AUC values were 0.828, 0.811, and 0.758, respectively (Fig. 2D). Risk score curves and scatter plots were used to explain the relationship between the risk score and survival status of patients with HCC and the results revealed that the higher the score at risk, the higher the mortality rate observed in the affected patients ( Fig. 2E-F). The heatmap also showed that seven lncRNAs (LINC00942, PRRT3-AS1, AC012146.1, AC092171.2, AC099850.3, CASC19, and AL158206.1) were upregulated in the high-risk group, whereas the other ve lncRNAs (AC103760.1, AC015908.3, LINC02362, LINC02499, and F11-AS1) were upregulated in the low-risk group (Fig. 2G). In the validation cohort, Kaplan-Meier analysis of the testing and entire groups was the same as that of the training cohort ( Fig. 3A and D). The AUC values of the risk score for 1-year, 3-year, and 5-year OS were 0.880, 0.835, and 0.867, respectively, in the testing group and 0.810, 0.770, and 0.735, respectively, in the entire group ( Fig. 3B and E). The results of the risk score curves, scatter plots, and heatmap in the validation groups showed similar trends to those observed in the training group ( Fig. 3C and F).  To further explore the prognostic value of the EMT-lncRNA signature in HCC patients, strati cation analysis was performed in the entire group based on age (≤ 65 and > 65 years), gender (male and female), grade (grade 1-2 and grade 3-4), and clinical stage (stage I-II and stage III-IV), and the results showed that male patients, patients with grade 1-2, different age, and clinical stage in the high-risk group had worse survival than those in the low-risk group; however, there was no signi cant difference in female patients and patients with grade 3-4 mainly owning to the small sample, indicating that the prognostic signature was applicable in most different subtypes of HCC (Fig. 4).
Furthermore, univariate and multivariate Cox proportional hazards regression analyses were performed to evaluate whether the risk score model could be independently associated with the OS of HCC. The results showed that the 12 lncRNAs signature was an independent predictor in the training group and two validation groups (Fig. 5A-F). Multi-ROC curve analysis was implemented to identify the accuracy of the risk score model, and the results showed that the risk score model provided a more precise predictive role for survival than other clinical parameters with AUC = 0.825 in the training group, AUC = 0.823 in the testing group, and AUC = 0.794 in the entire group (Fig. 5G-I). The above results demonstrated that the EMT-related lncRNA prognostic model had an accurate predictive ability for prognosis in HCC patients.
We constructed a co-expression network of the 12 EMT-related lncRNA-mRNAs using Cytoscape and Sankey diagrams to visually present the correlation among EMT-related genes, EMT-related lncRNAs, and risk types of lncRNAs ( Fig. 6A-B).

Construction and evaluation of the predictive nomogram
We integrated age, gender, grade, and clinical stage with risk scores to establish a nomogram to predict the 1-year and 3-year OS of HCC patients. A comprehensive score was obtained by combining each clinical factor; the higher the total score of patients, the worse the prognosis (Fig. 7A). The predictive consistency and accuracy of the nomogram for 1-year and 3-year OS were assessed using calibration and ROC curves. The calibration curves indicated the prognostic nomogram model nearly in accordance with reality, and ROC curve analysis demonstrated that the nomogram provided an accurate survival prediction with AUC = 0.795 for 1-year OS and 0.773 for 3-years OS as shown in Fig. 7B-E, demonstrating that the nomogram exhibited preferable clinical practicality in predicting the prognosis of HCC.

Enrichment analysis
GSEA analysis was conducted in the entire group to probe the potential biological mechanism of the EMT-related lncRNA prognostic signature. The GSEA results revealed that tumor-related pathways, including bladder cancer, thyroid cancer, pathways in cancer, VEGF signaling pathway, cell cycle, and DNA replication were highly enriched in the high-risk group with FDR (q-value) < 0.05 as shown in Fig. 8A.
To further determine the biofunctions associated with the risk signature, differentially expressed genes were extracted from the TCGA database in the low-and high-risk groups. A total of 969 DEGs (upregulated 832 genes, downregulated 146 genes, fold change > 1, P < 0.05) were identi ed in the highrisk group compared to the low-risk group. We then selected the DEGs for GO enrichment analysis, and the results showed that these DEGs were enriched in terms associated with cell division process ( Fig. 8B-C) and suggested that the proliferation ability of HCC cells in the high-risk group was notably more rapid than that in the low-risk group. Immune cell in ltration and immune-related pathways Lastly, ssGSEA was conducted to compare the difference in in ltrating levels of immune cells and immune-related pathways between high-and low-risk groups in HCC patients, and we calculated the score of 16 immune cells and 13 immune function-related pathways based on ssGSEA analysis. The results revealed that the in ltration level of immune cells, including aDCs, macrophages, Th2 cells, and Treg cells, were signi cantly upregulated in the high-risk groups, but natural killer (NK) cells were upregulated in the low-risk group (P < 0.05; Fig. 9A). As for the immune-related pathways, antigenpresenting cell (APC) co-stimulation, checkpoint, and major histocompatibility complex-I (MHC-I) were signi cantly upregulated in the high-risk group; nevertheless, the INF-I and INF-II responses were upregulated in the low-risk group (P < 0.05) (Fig. 9B).

Discussion
Hepatocellular carcinoma is one of the most common malignant tumors worldwide, exhibiting high recurrence rates, high mortality rates, and poor prognosis in patients [26]. Thus, identifying potential prognostic biomarkers is critical for predicting the overall survival of patients with HCC and guiding the individual therapeutic strategies for high-risk patients. Previously, most studies emphasized the impact of EMT on the tumor metastatic progression and resistance to therapy, while only a few studies have focused on its prognostic value in cancer. In addition, a wealth of recent studies have shown that lncRNAs could serve as novel and effective biomarkers and therapeutic targets for tumors [27]. Therefore, it will be bene cial to explore the e cacies of the EMT-related lncRNAs as biomarkers for predicting the prognosis in HCC patients.
In this study, we focused on the prognostic value of EMT-related lncRNA signature. Based on public expression data and clinical information from the TCGA dataset, we identi ed 12 prognostic EMT-related lncRNAs that could e ciently predict the survival of HCC patients. First, the 343 HCC patients with completed follow-up information were randomly divided into training and validation groups. In the training HCC patient cohort, we screened out 12 prognostic EMT-related lncRNAs to construct a prognostic signature for predicting survival using Cox regression and LASSO regression. The risk scores were calculated based on the 12-lncRNA expression pro les and coe cients, and Kaplan-Meier analysis showed that the survival of HCC patients in the high-risk group was worse than that of patients in the lowrisk group, and the risk score model could correctly predict the 1-year, 3-years, and 5-years OS with AUC = 0.828, 0.811, and 0.758, respectively. In addition, the prognostic availability of the risk model was validated in the testing and entire groups, and univariate and multivariate Cox regression analysis and ROC curve analysis demonstrated that the EMT-related lncRNA signature provided more sensitivity and speci city, and acted as independent indicators of HCC prognosis compared with other clinical parameters. Next, we established a nomogram based on age, gender, grade, clinical stage, and risk score and veri ed predictive consistency and accuracy with calibration curve and ROC curve, all of which indicated that the nomogram had a satisfactory uniformity with actual survival and provided better clinical practicality than the traditional tumor grade or clinical stage system. tissues, and its low expression was also remarkably correlated with poorer overall survival based on bioinformatic analysis [28]. lncRNA F11-AS1 was under-expressed and attenuated tumor growth and metastasis capacity via the F11-AS1/miR-3146/PTEN axis in HCC [29]. In addition, the low expression of lncRNA F11-AS1 was correlated with poor prognosis in patients with HBV-related HCC via regulation of the miR-211-5p /NR1I3 pathway [30]. A study revealed that LINC00942 acts as an oncogene that promotes cell proliferation and colony formation and inhibits cell apoptosis, subsequently elevating METTL14-mediated m6A methylation levels in breast cancer initiation and progression by LNC942-METTL14-CXCR4/CYP1B1 signaling axis [31]. The lncRNA CASC19 was widely reported and markedly upregulated in multiple tumor tissues and cells, higher CASC19 expression is associated with tumor progression and poorer prognoses, including colorectal cancer, advanced gastric cancer, non-small lung cancer cells, and pancreatic cancer [32][33][34][35]. Two other studies revealed that a high expression level of CASC19 was positively correlated with radio resistance in cervical cancer and nasopharyngeal carcinoma [36,37]. However, the other seven prognostic EMT-related lncRNAs (AC103760.1, AC015908.3, LINC02362, AC012146.1, AC092171.2, AC099850.3, and AL158206.1) have been rarely reported, and our research may provide a new perspective for their potential functions. Moreover, most previous studies have emphasized that single lncRNAs function as prognostic indicators and relevant mechanisms; thus far, no research has applied EMT-related lncRNAs to the prognosis of HCC patients. Therefore, the construction of a prognostic EMT-associated lncRNA signature may provide invaluable insights for current prognostic predictions of HCC.
Subsequently, GSEA analysis was performed to investigate the related functional differences and underlying mechanisms of the risk model constructed using the 12 lncRNA signature in HCC. The results demonstrated that various tumor-related, proliferation, and metastasis-related pathways were highly enriched in the high-risk group, including the VEGF signaling pathway, DNA replication, and cell cycle. The VEGF signaling pathway is composed of the VEGF superfamily, VEGF receptors, some signal proteins, and nuclear transcriptional regulatory factors, which can regulate cell biological behavior and participate in tumorigenesis, tumor development, invasion, and metastasis [38]. The expression of VEGF and its receptors, including VEGFR1, VEGFR2, and VEGFR3, is elevated in HCC cell lines and tissues, and increased VEGF expression is associated with high HCC tumor grade, vascular invasion, and portal vein invasion [39]. Hence, VEGF signaling pathway inhibitors have been implicated as a therapeutic strategy for the prevention of HCC, for example, sorafenib for advanced-stage HCC [40]. A study reported that increased expression of p28 in HCC induces EMT by activating the PI3K-AKT-HIF-1α pathway to promote Twist1, VEGF, and metalloproteinase 2 expression, which consequently leads to tumor angiogenesis, vascular invasion, and metastasis [41]. The VEGF signaling pathway was enriched in the EMT-related lncRNA signature high-risk group, which might be one of the contributors to the poor prognosis. Moreover, according to the GO analysis, DEGs in the high-risk group were focused on cell division activities, which promote the proliferation of cancer cells and accelerate tumor growth, which may be a reason for unfavorable prognosis in the high-risk group.
Given the signi cant role of the tumor microenvironment in the process of tumorigenesis, involving immune cell in ltration and immune-related pathways, we investigated the roles of immune in ltrating cells and immune function-related pathways of the tumor microenvironment in the prognosis of HCC.
ssGSEA analysis revealed that immune cells, including aDCs, macrophages, Th2 cells, and Treg cells, were enriched in high-risk groups, but NK cells were upregulated in the low-risk group. In addition, APC costimulation, check-point, and MHC-I were signi cantly upregulated in the high-risk group; however, the INF-I and INF-II responses were upregulated in the low-risk group. Macrophages play a prominent role in the tumor microenvironment, acting as a cancer promoter in HCC by presenting M1 and M2 polarization, and M2 polarization induces cellular proliferation, angiogenesis, and EMT in HCC [42]. A study found that HIF1α/ IL-1β signaling promotes hepatoma EMT through macrophages in a hypoxic-in ammatory microenvironment [43]. It has been reported that the in ltration of macrophages is proportional to Tregs and inversely proportional to CD8 + T cells in HCC, and the absence of NK cells may trend toward M2 macrophages and accelerate liver brogenesis [42]. Given the above results, we speculated that the immune response ability in the high-risk group was lower in the low-risk group owing to immunosuppressive cells, such as macrophages and Tregs. Conversely, the anti-tumor immune response pathway was enriched in the low-risk group, which may contribute to a favorable prognosis in this group.
Although we successfully established a prognostic signature model and the nomogram performed well in predicting the survival of HCC patients, our study did have a few limitations. First, the 12 EMT-related lncRNAs prognostic signature was constructed and evaluated using limited data and clinical information from TCGA database and the 12 EMT-related lncRNAs identi ed in the training set were not completely comprised either in the Gene Expression Omnibus database or in the International Cancer Genome Consortium dataset, so this study was performed without any external validation, which may have limited the practicality and generalizability of our model. In addition, some of the model-building lncRNAs reported in this study have not been previously studied; therefore, further investigations are necessary to elucidate their functions. Finally, it is critical to verify the functional features and molecular mechanisms of the risk signature model by conducting further biological experiments and clinical research.
In conclusion, our study established a novel EMT-related prognostic risk signature including 12 lncRNAs and constructed a nomogram to predict the prognosis in HCC patients, which may aid in improving the prognostic predictive accuracy of HCC and guide the individualized treatment methods for the patients with HCC.