A Novel Prognostic Signature Based on Cell-In-Cell-Related Genes for Predicting Survival and Tumor Microenvironment in Pancreatic Cancer

Background: Pancreatic cancer (PC) is a highly malignant tumor featured with high intra-tumoral heterogeneity and poor prognosis. Cell-in-cell (CIC) structures have been reported in multiple tumor types, and their presence is thought to promote clonal selection and tumor evolution. Here, we aimed to establish a CIC-related gene signature for predicting the prognosis and evaluating immune microenvironment in PC. Methods: In this study, the gene expression data, as well as corresponding clinicopathological data of PC and normal pancreatic tissues were collected from The Cancer Genome Atlas (TCGA), Genotype-Tissue Expression (GTEx), International Cancer Genome Consortium (ICGC) and Gene Expression Omnibus (GEO) databases. Differential gene expression analysis, random forest screening, least absolute shrinkage and selection operator (LASSO) regression and multivariate Cox regression analysis were performed on 101 CIC-related genes to construct a prognostic gene signature. The effectiveness and robustness of the prognostic gene signature were evaluated by receiver operating characteristic (ROC) curves, Kaplan-Meier survival analysis and establishing the nomogram model. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted to annotate the biological functions of the differentially expressed genes (DEGs). Quantitative real-time PCR (qRT-PCR), western blotting and immunohistochemistry (IHC) staining were validated the core gene expression in both mRNA and protein levels. Results: A 4-gene signature was constructed to stratify patients into the low-risk and high-risk groups with distinct survival outcomes, somatic mutation proles and immune features. The high-risk group had poorer prognosis than did the low-risk group. This signature was found to be an independent prognostic factor for PC patients with favorable predictive eciency. Functional enrichment analyses showed that numerous terms and pathways associated with invasion and metastasis were enriched in the high-risk group. Moreover, the high-risk group had a higher tumor mutation burdens and lower immune cell inltrations. KRT7, as the most important risk gene, was signicantly associated with the worse prognosis of PC. CIC formation assay performing in PC cell lines indicated that KRT7 expression was correlated with CIC frequency. Conclusions: The signature based on four CIC-related genes could be applicable for predicting the prognosis of PC, and targeting CIC processes may be a potential therapeutic option. Further studies are needed to reveal the underlying molecular mechanisms and biological implications of CIC in PC progression.


Introduction
Pancreatic cancer (PC) is a highly lethal malignancy with rising trend in incidence and mortality worldwide, estimating that it will be the second leading cause of cancer related death by 2030 [1,2].
Currently, curative surgery followed by adjuvant chemotherapy remains a standard therapeutic approach, however, because of the concealed anatomical location of the pancreas, non-speci c symptoms and de ciency of reliable biomarkers, effective screening is not available for PC, and most patients present with locally advanced or metastatic disease at the time of initial diagnosis, leading to missed opportunities for surgical intervention [3]. Therefore, it is essential to clarify the mechanisms of PC progression and to develop more effective therapeutic strategies.
Cell-in-cell (CIC) structures refer to the presence of one or more living cells internalized into another living one with the formation of "bird-eye cells", which was rst reported approximately 120 years ago in tumor tissues [4]. CIC is a long-standing phenomenon that was virtually neglected for decades but is attracting great interest in recent years. CIC structures have been observed in various types of tumors and are associated with the worse prognosis, such as breast cancer, lung cancer and PC [5][6][7]. From cellular mechanisms, cell cannibalism and entosis are two of the best-characterized CIC processes in cancers [8].
Cell cannibalism generally refers to the engulfment of live or dead cells within cancer cells through a mechanism involving actin, ezrin and caveolin. Previous studies have reported that cancer cells exert potent engulfment activity directed toward homotypic cancer cells, lymphocytes, neutrophils, natural killer cells and mesenchymal stem cells [4,[9][10][11]. Entosis is a speci c form of cell cannibalism mainly induced by the extracellular matrix detachment, aberrant mitosis and glucose starvation [12]. It is thought to occur mostly between homotypic epithelial cells following the establishment of adherens junction via Ecadherin or P-cadherin, formation of mechanical ring enriched in vinculin and actomyosin contraction mediated by the Rho−ROCK−DIAPH1 signaling pathway [13]. Entosis is unlike cell cannibalism engulfed by outside cells, in which the internalized cells actively penetrate into outside cells driving by contractile actomyosin at the opposite cell cortex and subsequently die by lysosome-dependent degradation, leading to a non-apoptotic cell death [14]. In the past decade, numerous studies have provided evidence that the cannibalistic behavior is a hallmark of cancer, conferring cancer cells metabolic advantages under starvation conditions [8, 9,15]. Moreover, researchers have shown that entosis can promote direct competition between cancer cells in mixed populations and ploidy changes of outside cells, affecting the clonal selection and evolution of cancer cell populations [16,17]. A recent study has reported that in pancreatic ductal adenocarcinoma (PDAC), the most common type of PC, CIC structures were more prevalent in liver metastasis than primary tumor and poorly differentiated adenocarcinoma or adenosquamous carcinoma than well or moderately differentiated adenocarcinoma, suggesting that CIC phenomenon is associated with aggressive biology in PC [7]. Therefore, evaluating the CIC status is a powerful method for prognosis estimation. However, there are no studies focusing on the prognostic value of CIC-related gene signature and the molecular functions of them in PC.
In the present study, we systematically analyzed the expression pro les and prognostic values of CICrelated genes in PC patients from public datasets. Corresponding prognostic gene signature and a nomogram model were established and validated. To explore the potential mechanisms, we further performed functional enrichment analyses, and compared the somatic mutation pro les and immune features between two risk subgroups. The results of this study may help improve the current plight of PC treatment by designing combination therapeutic strategies based on targeting CIC-related processes.

Datasets and Processing
The RNA sequencing (RNA-seq) data of the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) datasets were downloaded from the University of California Santa Cruz (UCSC) Xena website (https://xenabrowser.net/datapages/), which included 167 normal samples and 179 tumor samples. The expression data were normalized to transcripts per million (TPM) values and transformed to log 2 (TPM+1) for further analyses. The RNA-seq data of the International Cancer Genome Consortium (ICGC) dataset was also downloaded from UCSC Xena website, which included 96 tumor samples. The expression data were normalized to counts per million (CPM) and transformed to log 2 (CPM). The normalized expression matrix from microarray datasets of GSE21501 and GSE62452 were downloaded from Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/), which included 132 and 69 tumor samples, respectively. The corresponding clinicopathological information and somatic mutation data of TCGA and ICGC datasets were obtained from TCGA (http://portal.gdc.cancer.gov/) and ICGC (http://dcc.icgc.org/) o cial website. By combining expression data with corresponding clinicopathological information, samples with absent information of survival status, age, gender, TNM, AJCC stage, grade and patients losing to follow-up or follow-up time less than 30 days were removed. Meanwhile, two samples from GTEx dataset were eliminated because of having low expression levels (less 10,000 genes). Ultimately, GTEx (165 normal samples) and TCGA (167 tumor samples) datasets were chosen as the training cohort. ICGC (87 samples), GSE21501 (98 samples) and GSE62452 (64 samples) datasets were applied to external validation. The clinical characteristics of patients were presented in Table 1. The microarray and RNA-seq data of PC cell lines were obtained from GSE21654, GSE40098 and the Cancer Cell Line Encyclopedia (CCLE) database (https://sites.broadinstitute.org/ccle/). The single-cell RNA-seq data of the CRA001160 dataset, including over 50,000 individual pancreatic cells from 24 primary PDAC samples and 11 normal samples, were analyzed and visualized using the Tumor Immune Single-cell Hub (TISCH) website (http://tisch.compgenomics.org/home/).

Identi cation of Differentially Expressed CIC-Related Genes
A total of 101 CIC-related genes were extracted from GeneCards website (http://www.genecards.org/), and prior literatures [4,8,12,14], searching by the keywords "cell-in-cell", "cell cannibalism" and "entosis". The list of 101 CIC-related genes was shown in Supplementary Table 1. The "limma" R package was used to identify the differentially expressed genes (DEGs) between normal and tumor samples. False discovery rate (FDR) < 0.05 and |log 2 (Fold Change)| ≥ 1 were determined as signi cance criteria for selecting DEGs.
The results of DEGs were visualized by volcano plot and heatmap.
Establishment and Veri cation of the CIC-Related Prognostic Signature The random forest screening (using "randomForest" R package) and least absolute shrinkage and selection operator (LASSO) regression analysis (using "glmnet" R package) were applied to screen important DEGs in TCGA cohort. And then, the overlapping CIC-related DEGs between these two methods were further selected to construct the best regression model by stepwise multivariate Cox regression analysis. Finally, four CIC-related genes prognostic signature was established and risk score of each patient was calculated by the regression coe cient of each gene and corresponding expression level using the following formula: Risk score = (Expr gene1 × Coef gene1 ) + (Expr gene2 × Coef gene2 ) + … + (Expr genen × Coef genen ). For external validation cohorts, the risk score of each patient was calculated by the same formula. The patients were divided into low-risk and high-risk subgroups according to the median of risk score. Principal component analysis (PCA) was applied to visualize the clustering conditions of the prognostic signature. The Kaplan-Meier survival analysis was used to compare the differences in overall survival (OS) probability between two groups, and time-dependent receiver operating characteristic (ROC) analysis was used to evaluate the predictive accuracy of the prognostic gene signature. The univariate and multivariate Cox regression analysis were performed to determine the independent prognostic factors associated with OS.
The nomogram was established to predict the 1-, 2-, 3-year survival probability based on the risk score and other clinicopathological characteristics. The corresponding calibration curves were drawn to assess the e ciency of the nomogram. The differential expression of the four genes signature in normal pancreas and pancreatic cancer samples was veri ed by the Human Protein Atlas (HPA) online database (http://www.proteinatlas.org) from protein level in forms of immunohistochemical staining images.

Functional Enrichment Analysis for DEGs Between Two Risk Groups
Based on Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases, functional pathway enrichment analyses of DEGs were conducted to clarify the functions of genes by applying the "clusterPro ler" R package [18]. GeneMANIA (http://genemania.org/), a platform for gene prioritization and predicting gene function, was used to predict the related networks and functions of CICrelated genes [19].

Somatic Mutation Analysis and Immune Feature Estimation
The landscape of somatic mutation was analyzed and visualized by using the "maftools" R package. Tumor mutation burden (TMB) was calculated and compared of individuals between the low-risk and high-risk groups. The estimate score, stromal score, immune score and tumor purity were calculated by the ESTIMATE algorithm [20]. The CIBERSORT algorithm was performed to quantify the relative abundance of 22 immune cells in ltrated in tumor microenvironment (TME) [21]. The immune subtypes were classi ed by using the "ImmuneSubtypesClassi er" R package. Representative immune checkpoints were extracted from previous literatures and the expression levels of them were compared. The immunotherapy response was analyzed by ImmuCellAI website (http://bioinfo.life.hust.edu.cn/ImmuCellAI) [22].

RNA Extraction and Quantitative Real-Time PCR Analysis
Total RNA was extracted from cultured cells by the TRIzol reagent (Life Technologies, #15596-026) and cDNA synthesis was performed using the RevertAid First Strand cDNA Synthesis Kit (Thermo Scienti c TM , #K1622) following the manufacturer's instructions. Quantitative real-time PCR (qRT-PCR) was performed in triplicate using SYBR Green Master Mix (Applied Biosystems, #A25742) [23]. The expression levels of GAPDH were used as the endogenous control and relative expression of KRT7 was calculated using the 2 -ΔΔCt method. The primer sequences were used for qRT-PCR as follows:

CIC Formation Assay
CIC formation assay was performed as previously described [13]. Brie y, about 2.0 × 10 5 cells were suspended in a six-well plate precoated with 1 mL solidi ed 0.5% soft agar for 8 h and then mounted onto glass slides by Cytospin preparation. Cells were xed by 4% paraformaldehyde solution and immunostained with Phalloidin (Abcam, #ab176753) and DAPI (Abcam, #ab104139). Structures with more than half of cell internalized were counted as CIC structures.

Clinical Specimens and Immunohistochemical Analysis
A total of 24 patients with primary PDAC who underwent surgical resection at the Peking Union Medical College Hospital (PUMCH) were recruited in this study following the guidelines set by the Ethics Committee of Peking Union Medical College Hospital. The tumor and adjacent normal tissues were xed by 10% formalin and embedded by para n. The sections of tissue specimens were used for immunohistochemistry (IHC) incubated with antibody against KRT7 (1:2000; Proteintech, #17513-1-AP). Manual staining and the estimation of IHC score were performed as previously described [23,24].

Statistical Analysis
Statistical analyses and visualization were performed using R (version 4.1.0) software and GraphPad Prism 9 (version 9.3.0). Kaplan-Meier analysis and log-rank test were used to evaluate associations with survival time. Student's t test, Mann-Whitney test, and chi-square test were utilized for the comparison between two groups. For frequencies of CIC formation performed in cell lines and relative mRNA expression levels con rmed by qRT-PCR, data are means ± standard deviation (SD) of three independent experiments. All P values of statistical results were based on two-sided statistical tests, and a P value < 0.05 was considered to be statistically signi cant.

Differential Gene Expression Analysis and Functional Enrichment Analysis of CIC-Related Genes
The expression levels of 101 CIC-related genes were explored in normal pancreas and PC samples using GTEx and TCGA datasets. PCA showed that the distribution difference between normal and tumor samples ( Figure 1A). A total of 49 DEGs were identi ed, including 42 upregulated genes and 7 downregulated genes, and visualized by the heatmap ( Figure 1B) and volcano plot ( Figure 1C). GO analysis suggested that these DEGs were mainly involved in reactive oxygen species metabolic process, receptor-mediated endocytosis, cell leading edge, endocytic vesicle, tubulin binding and cytokine activity (Supplementary Figure 1A). Moreover, KEGG pathway analysis indicated that apoptosis, phagosome, regulation of actin cytoskeleton, ferroptosis, transcriptional misregulation in cancer and focal adhesion were enriched (Supplementary Figure 1B).

Construction of the CIC-Related Prognostic Signature in PC
To reduce the number of genes needed for constructing the prognostic signature, we rstly utilized random forest screening to assign an importance factor to each CIC-related DEGs. And then, top 20 genes, sorted by importance, were selected into further analysis ( Figure 1D). Meanwhile, LASSO regression analysis was performed on 49 DEGs, and 18 candidate genes were retained by the most proper value of lambda (λ) ( Figure 1E). Subsequently, we combined 10 overlapping candidate genes between these two methods to establish the best regression model by stepwise multivariate Cox regression analysis ( Figure 1F). Finally, four signi cantly CIC-related genes contributing to OS in PC patients were con rmed ( Figure 1G), and the risk score of each patient was calculated using the following formula: Risk score = (0.362× expression level of KRT7) + (0.302 × expression level of AURKA) + (−0.114× expression level of CDKN2A) + (−0.377 × expression level of RARB). The correlation analysis among these four genes was shown in the circle plot ( Figure 1H). Moreover, the related networks and functions were predicted using geneMANIA website (Supplementary Figure 1C). We found that related functions mainly involved in the regulation of cell cycle, mitosis-related process and kinase regulator activity.

Evaluation and Validation of the CIC-Related Prognostic Signature
Based on the median of risk scores, patients in TCGA cohort were separated into the low-and high-risk groups. The scatterplots showed that, as the patient's risk score increased, the number of deaths increased and the survival time decreased. Compared with the low-risk group, the expression levels of KRT7 and AURKA in the high-risk group were upregulated, while the expression levels of CDKN2A and RARB were downregulated (Figure 2A). The Kaplan-Meier survival analysis indicated that patients in the high-risk group were signi cantly associated with shorter OS than those in the low-risk group ( Figure 2B). The PCA revealed that patients with different risk levels were distributed into two clusters ( Figure 2D). The area under curves (AUC) were 0.760, 0.766 and 0.807 in the 1-year, 2-year, and 3-year ROC curves, respectively ( Figure 2C). We also found that, compared with other clinicopathological characteristics, the AUC value of risk score was much higher, suggesting that it was a better prognostic indicator for PC patients ( Figure 2E). To demonstrate the robustness of the prognostic signature, the predictive e ciency was evaluated in three independent validation cohorts, including ICGC, GSE21501 and GSE62452. The patients from these three validation cohorts were strati ed into the low-and the high-risk groups based on the median of risk scores calculated by using the same formula as in TCGA modeling cohort.
Consistent with the results of TCGA cohort, patients in the high-risk group demonstrated the worse prognosis than those in the low-risk group. Similarly, the expression levels of KRT7 and AURKA in the high-risk group were increased, while the expression levels of CDKN2A and RARB were decreased ( Figure  2F  The univariate and multivariate Cox regression analyses were performed to evaluate the prognostic power of the signature. Based on the multivariate Cox regression analysis, the risk score was con rmed to be an independent prognostic factor for OS prediction in all four cohorts (Supplementary Table 2). In order to further improve predictive e ciency, the risk score and other clinical characteristics such as age, gender, grade, and stage were together used to establish the predictive nomogram in TCGA and ICGC cohorts. The C-index for the nomogram was 0.704 (95%CI 0.645−0.763) in TCGA cohort and 0.725 (95%CI 0.644−0.805) in ICGC cohort, indicating that the two nomograms both had well predictive performance ( Figure 3A and 3B). Subsequently, time-dependent ROC curves, the calibration curves and the decision curve analysis (DCA) were applied to further evaluate the effectiveness of established nomograms. The AUCs of ROC curves for predicting 1-, 2-, and 3-year survival were 0.716, 0.785 and 0.810 in TCGA cohort ( Figure 3C), 0.762, 0.787 and 0.890 in ICGC cohort ( Figure 3D). In addition, the calibration curves presented satis ed coherence between observed and predicted 1-year, 2-year and 3year OS in both cohorts ( Figure 3E and 3F). DCA was performed to further assess the predictive e cacy between the risk score and nomogram, and the results demonstrated that the nomogram achieved the highest net bene ts, suggesting that it was an e cient model to predict the prognosis of PC patients ( Figure 3G and 3H).

Functional Enrichment Analyses and Somatic Mutation Pro les of the CIC-Related Prognostic Signature
To further explore the biological functions and pathways associated with the established prognostic signature, we rstly analyzed the DEGs between the high-risk and low-risk groups ( Figure 4A and 4B). A total of 212 DEGs were identi ed in TCGA cohort, including 189 upregulated genes and 23 downregulated genes. Moreover, in ICGC cohort, 162 DEGs were identi ed, including 52 upregulated genes and 110 downregulated genes. Notably, KRT7 was one of the top 10 upregulated genes sorting by adjust P value in both TCGA and ICGC cohorts (Supplementary Figure 3A and 3B). The GO analysis showed that the DEGs were enriched in several cell differentiation and tumor metastasis-related processes, such as epidermal cell differentiation, extracellular matrix organization, ameboidal-type cell migration, intermediate lament cytoskeleton and cell-cell junction ( Figure 4C and 4D). And the KEGG pathway analysis demonstrated that the DEGs were mainly enriched in several pathways associated with tumor invasiveness and metastasis, such as PI3K−Akt signaling pathway, Wnt signaling pathway, Hippo signaling pathway, Focal adhesion, ECM−receptor interaction and Regulation of actin cytoskeleton ( Figure 4E and 4F). To clarify whether the risk score was associated with the mutational landscapes of PC patients, we compared the somatic mutation pro les between the high-risk and low-risk groups in TCGA cohort ( Figure 5A−5D). Notably, the mutation frequency in the high-risk group was 96.25%, while 78.21% in the low-risk group, indicating that the mutation frequency increased with the risk score elevation. Moreover, KRAS and TP53 were the top two genes with the highest mutation frequencies in both subgroups. We also found that more co-occurrence and mutually exclusive mutations were observed in the high-risk group when compared with the low-risk group ( Figure 5E). In addition, patients with higher risk scores demonstrated higher TMB levels (P < 0.001; Figure 5F). We further conducted the same analyses in ICGC cohort and similar results were veri ed (Supplementary Figure 4).

Analysis of Immune Features between Two Groups
PC harbored a highly heterogeneous TME. To further investigate whether the differences in prognosis between two risk groups were associated with immune cell in ltration, we rstly utilized ESTIMATE and CIBERSORT algorithms to explore the immune in ltration levels in TCGA and ICGC cohorts. We found that the high-risk group was characterized with lower estimate score, stromal score and immune score, while higher tumor purity ( Figure 6A; Supplementary Figure 5A). The composition and correlation of tumorin ltrating immune cells showed that CIC-related prognostic signature was positively correlated with T cells regulatory (Tregs) and Macrophages M0, and negatively correlated with T cells CD4 memory resting, natural killer (NK) cells activated, Monocytes, Mast cells activated ( Figure 6B and 6C) and Macrophages M2 (Supplementary Figure 5B and 5C). Patients in the high-risk group exhibited signi cantly higher in ltrating levels of B cells memory, Macrophages M0 (P < 0.01; Figure 6D) and Mast cells activated (P < 0.05; Supplementary Figure 5D), while lower in ltrating levels of B cells naive, Dendritic cells resting, Monocytes (P < 0.05; Figure 6D) and T cells CD8 (P < 0.01; Supplementary Figure   5D). Subsequently, the classi cation of immune subtypes showed that four subtypes and ve subtypes were identi ed in TCGA cohort and ICGC cohort, respectively ( Figure 6E; Supplementary Figure 5E). Patients in the high-risk group had more C1 (wound healing) and C2 (IFN-γ dominant) subtypes, and less C3 (in ammatory) subtype, suggesting the unfavorable prognosis. Emerging evidence has shown that immune features of TME are associated with the immune checkpoint blockade (ICB) therapeutic responses. The expression levels of some representative immune checkpoints including CD274 (PD-L1), CD276 (B7-H3), CTLA4 (CTLA-4), HAVCR2 (TIM3), LAG3 (LAG-3), PDCD1 (PD-1), TIGIT (VSIG9) and VTCN1 (B7-H4) were compared between two risk groups. We found that CD276 ( Figure 6F), CD274 and VTCN1 (Supplementary Figure 5F) were upregulated in the high-risk group, while TIGIT was downregulated ( Figure 6F). Besides, the results of ICB responses prediction demonstrated that the response rates were higher in the high-risk group, and responders had higher risk scores ( Figure 6G; Supplementary Figure 5G). Although the rate of response and the risk score between two groups were not statistically signi cant in TCGA cohort, an increased tendency was observed in the high-risk group and responders, respectively. These ndings indicated that patients with higher risk scores might be in an immunosuppressive state.

KRT7 is correlated with the unfavorable prognosis and CIC formation
As shown in the results of multivariate Cox regression and correlation analysis, KRT7 was the most important risk gene in the established CIC-related gene signature for signi cantly predicting prognosis of PC patients and positively correlated with other three genes expression ( Figure 1G and 1H). We investigated the protein and mRNA expression levels of KRT7 by HPA database, GTEx and TCGA datasets. We found that the expression of KRT7 in PC tissue was signi cantly higher than normal pancreas tissue and the expression level increased with the risk score elevation ( Figure 7A and 7B). Moreover, the survival analysis showed that, based on the median of KRT7 expression level, patients with KRT7 high expression had shorter survival than those with low expression ( Figure 7C). Given the extensive degree of intra-tumoral heterogeneity in PC, we further examined the expression of KRT7 among diverse cell types in TME at single-cell level. The results demonstrated that KRT7 was mainly expressed by malignant ductal cells and expression levels varied among different subpopulations of malignant cells ( Figure 7D). We also analyzed the gene expression features and correlations with prognosis of AURKA, CDKN2A and RARB, but none of them was as signi cant as KRT7 (Supplementary Figure 6). The related networks and functions of KRT7 mainly involved in cytoskeleton remodeling, cell differentiation and protein localization-related functions ( Figure 7E). To clarify whether KRT7 is associated with CIC formation, we compared the expression levels of it among different PC cell lines in three independent datasets ( Figure 7F). Two cell lines with relatively higher and lower expression of KRT7 at both mRNA and protein levels were selected respectively for CIC formation assay ( Figure 7G). We found that BxPC-3 and CFPAC-1 cell lines with relatively higher expression levels of KRT7 showed higher frequencies of CIC formation than PANC-1 and MIA PaCa-2 with relatively lower expression levels of KRT7 ( Figure 7H). Then, we performed IHC staining of KRT7 in 24 PDAC samples from the PUMCH cohort to further validate that high KRT7 expression was associated with unfavorable prognosis in PC (Table 2; Figure 8A). According to the median of IHC score, we next divided patients into low expression and high expression subgroups. Combined with KRT7 IHC scores and clinicopathological characteristics, patients from high expression group showed higher proportions of male, IIB-IV stage, lymphatic metastasis and death ( Figure 8B). Although the results of survival analysis were not statistically signi cant (P = 0.1026), KRT7 as a risk factor (hazard ratio = 2.901), high expression group had poorer prognosis ( Figure 8C).

Discussion
The global burden of PC has increased continuously over the past few decades, and as the leading cause of cancer-related death worldwide, its 5-year survival rate approached 10% for the rst time in 2020 [1]. Because of no effective screening for PC diagnosis at an early stage, the vast majority of patients are found to be locally advanced or metastatic disease at the time of initial diagnosis. Even after standardized treatment, including surgery and adjuvant chemotherapy, patients still have to face the great risk of recurrence and death [3]. Therefore, it is urgent to con rm reliable biomarkers for early detection screening and predicting OS of PC patients.
Acquiring necessary nutrients from a frequently nutrient-poor environment and utilizing these nutrients to maintain rapid proliferation and progression is a common feature of cancer cell metabolism [25]. In addition to scavenging nutrients from extracellular microenvironment, cancer cells can also engulf and digest whole living cells via two mainly CIC processes, cell cannibalism or entosis, for nutrient recovery [26]. Previous studies have demonstrated that cannibalistic activity is a hallmark of cancer, conferring metabolic advantages on cancer cells under energy stress [8,15]. Meanwhile, cancer cells with higher aggressiveness can become the "winner" subpopulation by eliminating their less competitive neighbors, which promote the ttest clones expanding within heterogeneous cancer cell populations [16]. CIC structures are prevalent in PC tissues and homotypic CIC (cancer cells internalized other cancer cells) constitutes the main subtype of overall CIC structures in PC [7,27]. Considering that CIC phenomena in cancer represent a highly aggressive behavior, indicating that CIC-mediated cell competition, by selecting the best competitive clones, may be a potential mechanism to promote PC progression. However, there are no studies to investigate the relationship between CIC-related genes and PC patient's OS through comprehensive bioinformatics analysis.
In the present study, we developed a risk scoring model based on four CIC-related genes (KRT7, AURKA, CDKN2A and RARB) in TCGA cohort, and further performed external validation for its robustness.
According to the value of hazard ratio, KRT7 and AURKA were considered as the risk genes, while CDKN2A and RARB were considered as the protective genes. KRT7 (Keratin 7) belongs to type II cytokeratin involving in cytoskeleton remodeling, epithelial intermediate laments formation and motility enhancement of cells [28]. Previous studies have observed KRT7 overexpression in many cancers, including ovarian, gastric, colorectal, and pancreatic cancers, which can facilitate cancer cells migration and metastasis [29][30][31][32]. AURKA (Aurora Kinase A) is a cell cycle-regulated kinase involving in microtubule formation, spindle pole stabilization during chromosome segregation, and contributing to tumorigenesis and progression [33]. AURKA-mediated phosphorylation is necessary for CIC-related processes, which promotes entosis in breast cancer cells via regulating microtubule plus-end dynamics and cell rigidity [34]. CDKN2A (Cyclin Dependent Kinase Inhibitor 2A), a well-known tumor suppressor, can induce cell cycle arrest in G1 and G2 phases by inhibiting binding of CDK4 or CDK6 with cyclin D1 and initiating p53-dependent cell cycle arrest. CDKN2A inactivation is found in the vast majority of PC patients, increasing cellular tness and proliferation, and promotes homotypic CIC formation in breast cancer cells [35,36]. RARB (Retinoic Acid Receptor Beta), by binding retinoic acid (biologically active vitamin A), can limit cell proliferation and abnormality to inhibiting tumorigenesis. A number of studies have documented that cancer cells show higher levels of RARB promoter methylation when compared with their normal counterparts, leading to functional silencing [37,38].
Based on the risk score of individuals, patients were divided into the high-risk and low-risk groups. Our results showed that PC patients in the high-risk group had signi cantly poorer OS than those in the lowrisk group, and the risk score was an independent prognostic factor with a credible e cacy evaluating by ROC analysis and nomogram model. To further explore the associations between established signature and differences of prognosis observed, biological functions, mutation pro les and immune features between two risk groups were compared in TCGA and ICGC cohorts. Functional enrichment analyses showed that several cancer-related terms and pathways were enriched, such as extracellular matrix organization, cell-cell junction, ECM-receptor interaction and PI3K-Akt signaling pathway, suggesting that patients in the high-risk group may be at higher degree of cancer-related pathways activation and immunosuppressive status [39]. Furthermore, a higher proportion of patients with KRAS and TP53 somatic mutations were detected in the high-risk group, as well-known driver genes in PC, increasing cancer risk of individuals. TMB, as a numeric index, re ects cancer mutation quantity. Higher TMB results in more tumor neoantigens, which increases chances for immunotherapy and is clinically associated with better ICB responses [40]. In this study, patients in the high-risk group had signi cantly higher TMB, indicating that these patients may bene t from ICB therapy. Previous studies have shown that the immune microenvironment plays a pivotal role in PC progression [39,41,42]. However, the roles of CICrelated genes for PC immune microenvironment are still not clear. In our results, higher in ltration levels of M0 macrophages and lower in ltration levels of CD8 + T cells were observed in the high-risk group, suggesting the presence of an immunosuppressive microenvironment [43]. Besides, it has been reported that tumor-in ltrating B cells contribute to responses to ICB therapy, and tumors of responders showed a higher in ltrating level of memory B cells, while lower in ltrating level of naïve B cells than tumors of nonresponders [44]. This nding may be the reason for our results that there are a signi cantly higher frequency of memory B cells and a signi cantly lower frequency of naïve B cells in the high-risk group. A previous study, by analyzing TCGA data, has identi ed six immune subtypes spanning cancer tissue types and molecular subtypes−wound healing (C1), IFN-γ dominant (C2), in ammatory (C3), lymphocyte depleted (C4), immunologically quiet (C5), TGF-β dominant (C6). C3 subtype has the best prognosis, while C1 and C2 subtypes present less favorable outcomes [45]. In our results, nearly half of patients in the low-risk group were C3 subtype, but most patients in the high-risk group were C1 and C2 subtypes, which was consistent with the association of immune subtypes and prognosis. ICB therapy is one of the most successful anti-cancer immunotherapies [46]. However, low response rates limit PC patients to bene t from ICB therapy [3]. We analyzed the expression levels of eight representative immune checkpoints between two risk groups and predicted ICB responses of patients. As shown in our results, the high-risk group exhibited higher expression of CD274, CD276 and VTCN1 than did the low-risk group. CD274 encodes PD-L1 protein, the immune inhibitory ligand of PD-1 death receptor, that is observed in various types of tumors, and serves as an immune suppressor by blocking T cells activation and cytokine production [46]. CD276 is widely expressed on a range of solid tumors, and plays a dual role in anti-tumor immunity, as a co-stimulatory regulator by enhancing the activity of T cells, or as a co-inhibitory regulator by inhibiting T cells and NK cells functions [47]. VTCN1 is mainly expressed on tumor cells and tumorassociated macrophages, which promotes immune escape by inhibiting the proliferation of T cells and enhancing the function of regulatory T cells [48]. TIGIT, as the only downregulated immune checkpoint in the high-risk group, is primarily expressed on T cells and NK cells, which can inhibit anti-tumor immunity by impairing T cell functions, preventing NK cell-mediated lysis, and enhancing the suppressive activity of regulatory T cells [49]. Therefore, these ndings may explain why a higher frequency ICB response rate was observed in the high-risk group. Highly intra-tumoral heterogeneity is the main obstacle to effective PC treatment. [50] The results of single-cell transcriptome analysis for PC have found that malignant cells contain distinct subpopulations, including those enriched for either proliferative or migrative features [41,51]. Meanwhile, this heterogeneity is also found in established cancer cell line such as neuroblastoma, melanoma and breast carcinoma cell lines [52,53]. Heterogeneous tumor populations can be divided into separate clusters with differently mechanical deformability [53]. Considering that softer tumor cells preferentially internalize stiffer neighboring cells in CIC processes, we speculate that CIC structures observed in PC may be a positive selection to promote the survival of clones with metabolic advantages and higher deformability, which confers a greater degree of the capability to invasion and metastasis on tumor cells [16,54]. To further verify the correlation of KRT7 expression and unfavorable prognosis for PC patients, we performed IHC scoring on PUMCH cohort. According to the comprehensive analysis of IHC scores and clinicopathological characteristics, KRT7 as the most important risk gene in this prognostic signature, was demonstrated to be signi cantly associated with male, higher stage, and lymphatic metastasis. Since we only chose a small sample size (n = 24), this may be the reason why Kaplan-Meier survival analysis had no obviously statistical differences. A validation cohort with greater sample size and long-term follow-up was warranted in the future.
To the best of our knowledge, this study is the rst attempt to construct a prognostic signature based on CIC-related genes and has shown a favorable e cacy on survival prediction in PC patients. However, our study has several limitations. First, this study mainly collected retrospective data from public databases, and thus prospective studies are needed to further validate our results. Second, since there are insu cient researches to delineate the dynamic CIC processes in PC, related genes included in this study may not be the core regulators of CIC for PC cells. Third, due to the limitations of bioinformatics analysis, future studies are needed to systematically elucidate the molecular mechanisms and implications of CIC-related genes such as KRT7 in PC.
In conclusion, our study developed a prognostic signature for PC based on four CIC-related genes to screen patients at high risk and predict survival. Further studies will be able to reveal new insights about CIC phenomena in cancer progression, which may be leveraged for PC therapy. Hospital. Written informed consent were obtained from all the patients enrolled in this study.

Consent for publication
All of the authors agreed to publish this paper in Cancer Cell International.  Median (range) 6 (1-12) 9 (8-12) 6 (1-6) OS, overall survival.     Differential gene expression analysis, GO and KEGG enrichment analysis between two risk groups.
(A and B) Heatmap of the DEGs between the high-risk and low-risk groups in TCGA (A) and ICGC (B) cohorts. (C−F) Representative terms of GO enrichment analysis and representative pathways of KEGG enrichment analysis between the high-risk and low-risk groups in TCGA (C and E) and ICGC (D and F) cohorts.

Figure 5
Somatic mutation pro les between two risk groups in TCGA cohort.
(A−D) MAF-summary plots and waterfall charts of somatic mutations in the high-risk group (A and B) and low-risk group (C and D). The top 10 mutated genes were shown.
(E) Correlation heatmaps of co-occurrence and mutually exclusive mutations in the high-risk and low-risk groups. (F) Distribution of TMB (left) and comparison between two risk groups (right).

Figure 6
Page 29/33 Estimation of immune cell in ltration and prediction of ICB responses in TCGA cohort. between two risk groups and the risk score between responders and non-responders. ns, not signi cant; *, P < 0.05; **, P < 0.01.