3.1 Screening of DE-ARGs and candidate prognostic markers
Based on the TCGA and GeneCards, we isolated ARGs expression data and compared them between 82 ESCC and 11 paracancerous samples. As shown in the volcano map (Fig. 1A), 53 DE-ARGs (46 upregulation, 7 downregulation;Additional file 1) were screened by differential expression analysis. Heatmap of DE-ARGs was depicted in the Fig. 1B. Survival analysis of DE-ARGs was performed to identify candidate genes with p < 0.005. Ultimately, four genes including PBK, LAMC2, TNFSF10 and KL were selected for the following analysis. Figure 1C-F shows the survival probability was lower in low PBK and LAMC2 expression group than in high expression group, while survival probability was lower in high TNFSF10 and KL expression group than in low expression group. The enrichment of samples on the gene set was calculated by ssGSEA. Survival curves indicated that high levels of four candidate genes were implicated in shorter OS of ESCC patients (Fig. 1G).
3.2 Determination of prognostic genes and establishment of a risk model
To find vital genes in the prognosis of ESCC, univariate Cox regression was done based on four candidate genes. According to results (p < 0.05), TNFSF10 and PBK were identified with significantly effects on patient prognosis (Fig. 2A). Lasso regression was utilized to avoid overfitting in identifying stable markers from two survival related candidates (Fig. 2B, C). Ultimately, a prognostic signature comprising the two genes (TNFSF10 and PBK) was constructed using stepwise multivariate regression (Fig. 2D). Prognostic risk model was generated: riskscore = 0.4861524*Exp (TNFSF10) – 0.3919675*Exp (PBK). Furthermore, we found that TNFSF10 was a risk factor with HR > 1, while PBK was a favorable factor with HR < 1.
3.3 Verification of risk model
To explore predictive performance of the two-ARG signature (TNFSF10 and PBK), we computed risk scores. Using median risk score of 2.4077608, patients in training set were assigned into high- (n = 47) and low-risk (n = 34) groups. PCA showed that samples could be distinguished by the risk score profile (Additional file 2). A comparison of survival times revealed that patients in low-risk group outlived those in high-risk group (Fig. 3A). Then, ROC analysis investigated diagnostic value. AUC calculated respectively from ROC curves were 0.780 and 0.791 for 1 and 3 years, indicating that the risk model had some degree of predictive power (Fig. 3C). Thereafter, Fig. 3E presents the risk curve and the gene expression heatmap of two hub ARGs of patients. Furthermore, we tested prognosis effects of risk model in another independent dataset, GSE53622. K-M survival curves displayed that patients in high-risk set had a shorter OS (Fig. 3B). AUC for 1- and 3-year OS were 0.684 and 0.627 respectively (Fig. 3D). Patients’ risk score, survival time, and the two ARGs levels were also presented (Fig. 3F). In summary, risk model had a satisfying performance for OS prediction.
3.4 Functional annotation analysis on prognostic gene
To clarify the potential function and involved pathways of the two prognostic genes, GSEA was done based on hallmark gene set and immunomic signatures gene set. For TNFSF10, 32 entries were enriched in hallmark, such as a subgroup of genes regulated by MYC - version 1 (v1), genes encoding cell cycle-related targets of E2F TFs, genes involved in DNA repair, and so on (Fig. 4A). Besides, 4558 enriched entries were obtained in immunomic signatures, containing genes downregulated CD8 T cells (mock transduced versus wildtype), etc (Fig. 4B). For PBK, hallmark results indicated that 34 pathways were enriched including a subgroup of genes regulated by MYC - version 1 (v1), a subgroup of genes regulated by MYC-version 2 (v2), genes encoding proteins involved in glycolysis and gluconeogenesis (Fig. 4C). Based on immunomic signatures, PBK was involved in 4262 enrichment pathways, including genes downregulated in dendritic cells and B cells (Fig. 4D).
3.5 Correlation analysis of risk scores and clinical features
To investigate clinical value of prognostic signature, association of risk score with clinical features was investigated. Correlation analysis unveiled significant differences in risk scores between pathologic-N0, pathologic-N1, pathologic-N2 and pathologic-N3, and highly significant differences in the alive and dead categories of status (Fig. 5A, B). We also validated the prognostic prediction capability in different subgroups. K-M survival curves demonstrated that survival rates of pathologic-T2, pathologic-N0, pathologic M0 and tumor stage II were lower in high-risk group (Fig. 5C-F). Taken together, risk signature could function as a great potential indicator for OS prediction in ESCC.
3.6 Independent prognostic analysis and establishment of nomogram
To find out if our model was a clinically independent prognostic factor for ESCC patients, univariate and multivariate regression analysis were employed. It was concluded from the forest plot that risk score, pathological stage and N stage were significant factors associated with survival and could be utilized as independent prognostic factors (Fig. 6A, B). According to these three factors, we then built nomogram for 1- and 3-year OS prediction (Fig. 6C). C-index values for prognostic were 0.759. Calibration curve analysis revealed that observed versus predicted rates of 1- and 3-year OS displayed good consistency (Fig. 6D). DCA displayed that combined model including three independent prognostic factors (riskscore + N + stage) had best net benefit for 1- and 3-year OS than the alternative options (that of each factor alone) (Fig. 6E).
3.7 Annotated analysis of the function of DEGs
To further explore biological functions and pathways of ARGs, we analyzed DEGs between high- and low-risk groups in the training set. 1852 DEGs (722 upregulated; 1130 downregulated) were screened by differential analysis (Fig. 7A). A heat map was displayed in Fig. 7B. GO and KEGG analyses presented enrichment of DEGs in 450 GO terms and 39 pathways. GO pointed out the enrichment in biological processes like cellular response to interferon-gamma and type Ⅰ interferon signaling pathway (Fig. 7C). KEGG unveiled enrichment in cell adhesion molecules, and phagosome (Fig. 7D).
3.8 Immuno-infiltration analysis
The tumor microenvironment was analyzed to screen differences in immune infiltrating cells in risk groups. By using the CIBERSORT algorithm with values of p < 0.05, the relative proportions of 22 tumor-infiltrating immune cells were acquired. As illustrated by box plot in Fig. 8A, T.cells.CD4.memory.resting exhibited an increase in low-risk group, whereas expressions of B.cells.memory and T. cells.Follicular.Helper were significantly increased in high-risk group. To further investigate correlation between immune infiltrating cells and the prognosis genes, Spearman’s correlation analysis with p < 0.05 and | cor | > 0.3 was applied. The statistical correlations were summarized in Fig. 8B. The Scatter plot revealed that PBK was positively linked to Macrophages M0 cells, whereas TNFSF10 was positively correlated with Macrophages M1 cells in Fig. 8C.
3.9 Establishment of regulatory network
TRRUST database was applied to extract the TFs of prognostic genes and 17 pairs of mRNA-TF relationships were obtained. The miRNA-mRNA interaction relationships were downloaded from miRTarBase and TargetScan to screen for miRNA interacting with the two prognostic genes. A total of 34 overlapping targeting relationship pairs existed in the two databases. After miRNA-mRNA network and mRNA-TF network were predicted, miRNA-mRNA-TF relationships were tested. miRNA-mRNA-TF relationship pairs such as hsa-miR-562-TNFSF10-FOXO3 and hsa-miR-216b-5p-PBK-ATM were obtained by taking the intersection of mRNA-TF and miRNAs-mRNA (Fig. 9A, B). The lncRNASNP was utilized to predict lncRNAs targeted by 34 identified miRNAs. It can be seen from the miRNA-lncRNA regulatory relationship network diagram that 602 miRNA-lncRNA regulatory relationship pairs were predicted, including hsa-miR-130b-3p-LINC00339, hsa-miR-26a-5p-lnc-ATP6V1G3-4, etc. (Fig. 9C).
3.10 Chemotherapy drug susceptibility analysis
Based on the oncoPredict R package, association of risk group with drug sensitivity was explored by calculating the IC50 value. The box plot illustrated that IC50 values of predicted drugs, in the case of Tozasertib 1096 and WIKI4 1940, were significantly variant between risk groups by rank sum test at p < 0.05 (Fig. 10). We observed that patients in low-risk group were more sensitive to Tozasertib 1096, while patients in high-risk group were more sensitive to WIKI4 1940. It can be indicated that risk score can be a favorable source indicator for providing basis for drug selection.
3.11 Single cell analysis
ScRNA sequencing data of seven normal samples and seven ESCC samples were downloaded from the GSE145370 dataset and analyzed to identify and characterize cell populations. As with other clusters, significant DEGs in every cluster were identified by “FindAllMarkers” function and then cell types were identified in an unbiased manner by the specific markers. We observed that ten cell clusters including NK cells, T cells, monocytes/macrophages, B cells, mDC, pDC, plasma cells, mast cells, fibroblasts, and epithelial cells were identified by UMAP analysis (Fig. 11A). Subsequently, two prognostic gene expressions in each cell type was visualized by a dot plot in Fig. 11B. TNFSF10 was higher in epithelial cells and mast cells, while PBK was higher in epithelial cells than other cell types. The violin plot in Fig. 11C showed the expression of epithelial and mast cells in ESCC samples and normal samples. Additionally, the correlation value between nFeature-RNA and nCount-RNA was 0.93 and 0.96 for epithelial and mast cells (Fig. 11D). Subsequently, we further clustered the two-cell category and calculated the correlation between DEGs and prognostic gene expression based on |Cor| > 0.3 and p < 0.05. 964 DEGs were screened by cell clustering for epithelial (Fig. 11E, F), and 252 DEGs were screened for mast cells (Fig. 11G, H).
3.12 Validation of prognostic genes expression by qRT-PCR
In the training set, both TNFSF10 and PBK were increased in ESCC compared to the paracancerous tissues (Figs. 12A, B). Consistent with the results in TCGA, the average expression levels of TNFSF10 and PBK in validation set were significantly upregulated in the cancer tissues (Figs. 12C, D). To confirm expression pattern of the two biomarkers, qRT-PCR was performed. Results indicated that two gene expressions were completely congruous with data mining (Figs. 12E, F).