Identification of a Three-gene Signature Acting as a Novel Prognostic Biomarker in Intrahepatic Cholangiocarcinoma Patients

Background: Intrahepatic cholangiocarcinoma (iCCA) patients have poor outcomes due to the lack of biomarkers for the selection of treatment options. The present study was conducted to find biomarkers with independent prognostic vaule in iCCA patients. Methods: Gene transcriptome profiles of E-MTAB-6389, TCGA-CHOL and GSE26566 were obtained from ArrayExpress, The Cancer Genome Atlas and the Gene Expression Omnibus databases, respectively. Bioinformatic analyses were performed to screen novel biomarkers for predicting the prognosis of iCCA patients. Using multivariate Cox regression analyses, a 3-gene signature (BTD-FER-COL12A1) with potential prognostic value was identified and validated in both a training cohort and two validation cohorts. Results: A total of 177 iCCA patients were included in this study. From the key gene modules significantly associated with liver cirrhosis and overall survival (OS) of iCCA patients, we identified 89 hub genes for functional analyses. Cox-regression analyses in both the training and validation cohort indicate that FER, COL12A1 and BTD were independent risk factors for iCCA patients. A 3-gene signature (BTD-FER-COL12A1) with independent prognostic value in iCCA patients was validated in the training cohort, as well as in two validation cohorts. In terms of predicting the prognosis of iCCA patients, the receiver operating characteristics (ROC) curves showed that this 3-gene signature had superior prediction power to BTD, FER, and COL12A1 alone, as well as known biomarkers (MUC1, MUC13) of iCCA. Immunohistochemical staining of samples from The Human Protein Atlas showed that FER and COL12A1 were positively expressed in iCCA tissue, although BTD was not, while none of these genes was detected in normal tissue. These findings were consistent with the expression status of BTD, FER and COL12A1 at the transcriptional level. In addition, we found that FER and COL12A1 were significantly associated with the degree of infiltration by tumor-infiltrating immune cells. Conclusion: We discovered a three-gene signature with independent prognostic value as a novel biomarker for prediction prognosis of iCCA patients. Our findings may help to find novel therapeutic targets for precision treatment of iCCA. These results that the of multiple tool Mutations in the collagen XII gene define a new form of extracellular matrix-related


Gene set enrichment analysis (GSEA) and construction of the protein-protein interaction network
To find the significant enrichment gene set in iCCA, we used the software "GSEA4.0.2" [14] to analyze the expression matrix of the E-MTAB-6389 dataset. In GSEA, we selected "h.all.v7.0.symbol.gmt gene sets" as the gene set database and "HTA-2_0" as the chip platform. The STRING database (version 11.0) and Cytoscape software (version3.7.2) [15] were used to construct the protein-protein interaction network with a minimum interaction score of 0.9.

Survival analysis and risk score model construction
Survival analysis was performed for hub genes using the R package "survival" (https://cran.rproject.org/web/packages/survival/index.html) and "survminer" (https://cran.rproject.org/web/packages/survminer/index.html). The R package "pROC" [16] was used to plot receiver operating characteristic (ROC) curves and calculate the area under the ROC curve (AUC), which was used to evaluate the prognostic capability of hub genes. Kaplan-Meier survival curves were generated to assess the overall survival (OS) status of iCCA patients. In risk score model construction, we first used univariate Cox regression analysis to identify hub genes that significantly (P < 0.01) affect the prognosis of iCCA patients. Multivariate Cox regression analysis was then performed to determine the independent prognostic factors among those hub genes and other variables such as age, sex and cirrhosis of the liver. Next, the regression coefficients and expression values of the hub genes found to significantly affect the prognosis of iCCA were used to construct the risk score model, according to the formula ∑coef(gene i )* exp(gene i ). After iCCA patients were divided into high-risk and low-risk groups based on the optimal cut-off value for the risk score, we finally investigated the prognostic value of the risk score model in both the training cohort and two validation cohorts.

Methylation and gene expression analyses
We used the online platform MEXPRESS (http://mexpress.be) [17] to investigate the association of the abnormally expressed hub genes with the methylation status of their promoter region.

Analysis of the correlation between the expression of hub genes and tumor-infiltrating immune cells
To analyze the association between the expression of hub genes and tumor-associated infiltrating immune cells (CD4 + T cells, CD8 + T cells, macrophages, dendritic cells, and B cells), we utilized the website tool TIMER (https://cistrome.shinyapps.io/timer/) [18] to explore the association of hub gene expression with the degree of infiltration by tumor-infiltrating immune cells, as well as with tumor proportion.

Validation the expression status of hub genes in protein level
The Human Protein Atlas (https://www.proteinatlas.org/) [19] of gene-encoded protein levels in tissues was used to validate the expression status of our identified hub genes.

Identification of robust and stable DEGs
After screening and checking the detailed sample information and the matrix design of these three downloaded datasets, we included a total of 177 iCCA patients (32 from TCGA-CHOL, 77 from E-MTAB-6389, and 68 from GSE26566) with gene expression data in our study. Figure 1 shows the detailed workflow for identification, validation, and functional analyses of the intersecting DEGs. As shown in Figure 2A and C, we used R package "TCGA-biolinks" to process the raw matrix data of TCGA-CHOL for background correction and normalization. Similarly, we used the R package "oligo" to process the raw matrix data of E-MTAB-6389. As shown in Figure 2B and D, we identified one sample (JFE-058G-HTA) as an outlier in the process of sample by sample correlation after normalization; this outlier was excluded from E-MTAB-6398 in the subsequent analyses. A total of 2,696 intersecting, stable DEGs were identified between the iCCA samples and matched normal tissue samples ( Figure 2E). As shown in Figure 2F, 1,470 DEGs were upregulated and 1,226 were downregulated in iCCA patients.

WGCNA and identification of hub genes
After extraction of the intersecting DEGs with expression values from the E-MTAB-6389 dataset, we conducted WGCNA to identify key genes modules significantly associated with clinical traits of iCCA patients. As shown in Figure 3A-B, in the analysis of network topology for selection of the optimal 7 soft-threshold (power), a soft-threshold value of 3 (the lowest power for scale-free topology that fits an index of 0.9) was identified to construct a hierarchical clustering dendrogram. As shown in Figure   3C, no outliers sample was found in the sample clustering analysis. By setting the dendrogram height cut-off value to 0.25 for module merging, eight modules were generated (Figure 3D-E). From the heatmap of module-trait relationships (Figure 3F), we found that the black module was significantly correlated with alcohol drinking (correlation coefficient = 0.37, P = 9e-04) and cirrhosis of the liver The turquoise module contained genes that could not be placed in any module; these data were excluded from the subsequent analyses. According to the MM and GS values of each gene, we identified a total of 85 hub genes from the black and pink modules for further functional analyses.

GSEA and construction of the protein-protein interaction network
To find significant enrichment gene sets/functional pathways involved in the pathogenesis of iCCA, we conducted GSEA on the E-MTAB-6389 dataset. As shown in Figure 3G, the "hallmark-mitotic-spindle" functional pathway (NES = 1.55, NOM p-val = 0.002) was significantly upregulated in iCCA compared with normal tissue; the corresponding heatmap of this functional pathway is shown in Figure 3H. By using the STRING database (version 11.0) and Cytoscape software, we constructed protein-protein interaction network for the selected hub genes ( Figure 4A).

Functional enrichment analysis
The GO enrichment analysis of cellar components showed that the top five components were collagen-containing extracellular matrix (ECM), collagen trimer, microbody lumen, microbody part and peroxisomal matrix (Supplementary Figure 2A). In terms of biological process associated with those hub genes, alpha-amino acid metabolic process was the most significantly enriched functional 8 process (Supplementary Figure 2B). In the KEGG pathway analysis, these genes were predominantly associated with glycine, serine and threonine metabolism, retinol metabolism, cysteine and methionine metabolism, beta-alanine metabolism and primary bile acid biosynthesis (Supplementary Figure 2C). Based on the results of PPI and functional analyses, as well as a literature review, we finally selected FTCD, BTD, FER, FETUB, F13B, COL12A1, COL1A2, COL1A1and COL5A2 for survival analyses. We constructed a heatmap visualizing the expression status of these nine hub genes in 31 normal tissue and 77 iCCA tissue samples from the E-MTAB-6389 dataset. We also visualized the clinical traits (alcohol drinking, OS, cirrhosis of the liver and sample types) in these 108 samples. As shown in Figure 4B, BTD, FTCD, FETUB and F13B were downregulated in iCCA samples, while FER, COL1A1, COL1A2, COL12A1, COL5A2 were upregulated.

Establishment of a three-gene signature for predicting the prognosis of iCCA in the training cohort
The characteristics of iCCA patients from the TCGA-CHOL and E-MTAB-6389 datasets are summarized in Table 1. We randomly divided the E-MTAB-6389 dataset into the training cohort (n = 38) and validation cohort 1 (n = 39). Univariate Cox proportional hazard regression analysis was performed to identify the hub genes significantly associated with the OS of iCCA patients from the training cohort.
As a result, a total of six hub genes (SULF1, LAMA4, COL12A1, FER, BTD and CTH) were found to be significantly associated with the OS of iCCA patients (P < 0.05). Subsequently, multivariate Cox regression was performed for these six hub genes as well as the clinical characteristics of iCCA patients. As shown in Supplementary Figure 3A-C, BTD, FER and COL12A1, as well as the cirrhosis of liver were identified as independent prognostic factors for iCCA patients. As shown in Figure 5A-C, BTD was identified as a favorable prognostic factor for iCCA patients, while COL12A1 and FER were unfavorable prognostic factors. Based on the results of the multivariate Cox regression analysis, we constructed a risk model for predicting the OS of iCCA patients. The risk score (RS) formula = BTD 1.927 + COL12A1´(-2.429) + FER´(-1.313). As shown in Supplementary Figure 3D, the RS was identified as an independent prognostic factor for iCCA patients. In addition, as shown in Figure 5D, iCCA patients with high RS had an unfavorable outcome compared with that of patients with low RS (P 9 = 0.00047).

Validation of the independent prognostic value of the three-gene signature in iCCA patients from two validation cohorts
The prognostic capability of this three-gene signature (RS model) was validated in both the internal (validation cohort 1 from E-MTAB-6389) and external (validation cohort 2 from TCGA-CHOL) validation cohorts. Based on the expression value of BTD, COL12A1 and FER, the RS was calculated according to the formula described previously. In validation cohort 1, we found that iCCA patients with high RS had a poor outcome (P = 0.006) compared with that of patients with low RS ( Figure 6A). As shown in Figure 6B, we also found that iCCA patients with high RS in validation cohort 2 had an unfavorable outcome compared with that of patients with low RS (P = 0.019). In addition, the results of multivariate Cox regression applied in both validation cohort 1 ( Figure 6C; P = 0.038) and validation cohort 2 ( Figure 6D; P = 0.015) indicated that this three-gene signature (RS model) was an independent risk factor for OS of iCCA patients.

Evaluation of the predictive capability of the three-gene signature using ROC analysis
We constructed ROC curves to assess the performance of the three-gene signature in predicting survival of iCCA patients in validation cohort 1. As shown in Figure 6E, the AUC of the three-gene whereas BTD was not. None of these three hub genes were detected in the bile duct cells in normal liver tissue. The expression status of these three hub genes at the protein level was consistent with that at the transcriptional level. These results demonstrate the validity of the hub genes identified in this study.

The relationship between the expression level of the hub genes and their methylation status
To clarify the potential mechanism of these three hub genes that were found to be abnormally expressed in iCCA tissue (FER and COL12A1 were upregulated, while BTD was downregulated in iCCA), we screened the MEXPRESS (http://mexpress.be) database to investigate the association of the expression level of these three hub genes with the methylation status of the corresponding promoter region. As shown in Supplementary Figures 5-7, the DNA sequences of FER, BTD and COL12A1 contained numerous methylation sites, which were negatively correlated with their expression level.

The association of the expression level of the hub gene with the degree of infiltration by tumor-infiltrating immune cells and tumor purity
We used TIMER (https://cistrome.shinyapps.io/timer/) to explore the potential associations of the expression level of these three hub genes with tumor purity, as well as the degree of infiltration by tumor-infiltrating immune cells (B cells, CD8 + T cells, CD4 + T cells, macrophages, neutrophils and dendritic cells). Unfortunately, no significant associations were found between these three hub genes and tumor purity (Supplementary Figures 8A-C). Conversely, moderate or strong positive correlations were found between FER and the degree of infiltration by CD4 + T cells, macrophages, and neutrophils (Supplementary Figures 8A). Weak or moderate associations were found between COL12A1 and the degree of infiltration by B cells, macrophages, neutrophils and dendritic cells (Supplementary Figures 8B). However, no significant associations were observed between the infiltration of tumor-infiltrating immune cells and BTD expression level (Supplementary Figures   8C). These results indicated that BTD and FER regulate the degree of infiltration by tumor-infiltrating immune cells in iCCA.

Discussion
New systemic therapies, including immune-based therapies and biomarker-driven therapies, have 11 been applied to treat iCCA [4]; however, the prognosis of this intractable disease remains dismal.
Identification of novel biomarkers for iCCA is necessary for the discovery of potential therapeutic targets and elucidation of the molecular mechanisms of tumorigenesis in iCCA. In our present study, Univariate Cox regression analyses showed that SULF1, LAMA4, COL12A1, FER, BTD, CTH and cirrhosis of the liver were significantly associated with the OS of iCCA patients. In contrast, in the multivariate Cox regression analysis, BTD, FER and COL12A1, as well as cirrhosis of liver, were identified as independent prognostic factors in iCCA patients. Given that BTD was found to be a favorable prognostic factor in iCCA patients, whereas COL12A1 and FER were unfavorable prognostic factors, a three-gene signature (BTD-FER-COL12A1) was established to predict the survival of iCCA patients. We further validated the independent prognostic value of this three-gene signature in two validation cohorts (Figure 6C-D). In addition, we found that the prognostic accuracy of this threegene signature was superior to that of BTD, COL12A1 and FER alone ( Figure 6E). In comparison with other known prognostic biomarkers such as MUC1 [20] and MCU13 [21], we found that this threegene signature provides distinctly higher accuracy in terms of prognosis prediction for iCCA patients.
These results indicate that the application of multiple biomarkers is a power tool for predicting the prognosis of iCCA patients.

Searches of The Human Protein Atlas to validate the expression status of FER, COL12A1 and BTD at
the translational level showed that FER and COL12A1 were positively expressed in iCCA tissue, whereas BTD was not. While none of these three hub genes were detected in bile duct cells of normal liver tissue, the transcriptional expression status of these genes was consistent with that at the translational level. Thus, these findings indicate that these three abnormally expressed hub genes in iCCA are reliable biomarkers involved in the pathogenesis and tumorigenesis of iCCA. In addition, using MEXPRESS, we found that many methylated sites in the promoter regions of the hub genes correlated negatively with expression levels. This finding indicates that regulation of the expression of these hub genes by demethylation intervention as a therapeutic strategy for iCCA patients. In our present study, we discovered that FER was upregulated at both transcriptional and translational levels. In addition, we found significant correlations between the expression of FER and the degree of infiltration by CD4 + T cells, macrophages, and neutrophils (Supplementary Figure 8A), thus implicating FER as a new therapeutic target for iCCA treatment. COL12A1 (collagen type XII alpha 1 chain), also known as COL12A1L, encodes the alpha chain of type XII collagen that is a member of the FACIT (fibril-associated collagens with interrupted triple helices) collagen family. Abnormal COL12A1 expression can cause myopathy and is correlated with poor survival in gastric cancer patients [28,29]. Similar to the expression status of FER in iCCA, we found that COL12A1 was upregulated in iCCA tissue compared with normal tissue. Further studies are 13 required to clarify the exact mechanism of the increased expression of COL12A1 and FER in iCCA.
Some limitations of our present study should be noted. First, the small sample size in both the training cohort and the two validation cohorts may influence the stability of our results. Second, some clinical characteristics, such as the surgical resection margin status and lymph node status, were not included in our multivariate analysis model since the original data were unavailable. Finally, all our work is based on bioinformatic analyses and data mining from online databases, and further experimental studies are required to further confirm our findings.

Conclusion
We identified a three-gene signature that can be considered as a novel independent prognostic biomarker in terms of predicting the OS of iCCA patients. This information may aid in the discovery of novel therapeutic targets for precision treatment of iCCA patients.

Ethics approval and consent to participate
This study did not require ethical approval and consent to participate as the data used have been published previously or have been uploaded on online databases, and hence are already in the public domain.

Consent for publication
This study did not contain any individual person's data in any form.

Conflicts of interest
The authors declare no conflicts of interest.