3.1 Landscapes of cell composition in normal, HSIL, MIC and CSCC tissue
To explore complex microenvironmental shifts during the various stages of HPV infection in the cervix, single-cell RNA sequencing was utilized to analyze the celluar composition of 20 fresh samples, including normal cervix (n = 5), HSIL (n = 4), MIC (n = 5) and CSCC (n = 6) (Fig. 1A). After implementing quality control measures and removing batch effects, 93,589 cells from these patients, with an average of about 22,701 expressed genes per cell. To reduce the dimensionality of all cells, we employed Uniform Manifold Approximation and Projection (UMAP) to identify cell clusters (Fig. 1B). The cell distribution included 19,169 cells from normal, 16,651 cells from HSIL, 27,593 cells from MIC and 30,176 cells from CSCC (Figure S1B). After UMAP, 25 clusters were identified and annotated into nine cell types based on the expression of marker genes. These cell types included 1,934 B cells (2.1%, marked with IGKC and JCHAIN), 6,774 columnar epithelial cells (7.2%, marked with MUC5B and TFF3), 2,964 endothelial cells (3.2%, marked with PLVAP and EMCN), 4,480 fibroblasts(4.8%, marked with APOD and COL3A1), 756 mast cells(0.9%, marked with TPSAB1 and TPSB2), 3,244 myeloid cells(3.5%, marked with C1QC and CYBB), 710 smooth muscle cells(0.8%, marked with MYL9 and RGS5), 71,085 squamous epithelial cells(76.0%, marked with KRT5 and TP63) and 1,642 T cells (1.8%, marked with CD3D and TRAC) (Fig. 1C).
Following the initial analysis, we delved deeper into the differences in cellular composition among samples and across various normal and three disease stages. As expected, squamous epithelial cells were the predominant cell type in most samples (Fig. 1D). Interestingly, we observed a gradual increase in the proportion of squamous epithelial cells with the stage of the disease severity (Fig. 1E). This trend indicated alterations in the microenvironment at different stages of response to progressive HPV infection. Notably, the proportions of each cell type varied significantly, suggesting the presence of intertumoral heterogeneity across different stages of HPV-induced cervical lesions.
3.2 Distinguishing tumor and normal cells based on squamous epithelial cell subclusters features
We performed further clustering specifically for the epithelial cells, dividing them into 24 subclusters (Fig. 2A). The distribution of samples or disease stages within each sub-cluster underscored the heterogeneity within the epithelial cells (Figure S1C). We calculated the CNV score using InferCNV for each cluster (Figure S1F). Interestingly, Cluster 1, 3, 7, 12, 19, and 21 exhibited lower scores (Fig. 2B). Surprisingly, cluster 23, which primarily consisted of normal samples, showed higher CNV scores. Further analysis revealed that Cluster23 exhibited a elevated expression of SOX2, a gene known for its critical role in maintaining stemness[38]. We also found that cells from normal samples were mainly enriched in cluster 3, 7, 12, 19, 21, 22, and 23. We conducted GO analysis within each cluster to identify distinct functional characteristics. As a result, Cluster 1 and Cluster 3 showed a significant proportion of genes associated with epidermis development, while Cluster 7, 12, 19, 21, 22, and 23 collectively displayed changes in genes related to cell structure (Figure S1E).
Given the heterogeneity in squamous epithelial cells, it was crucial to distinguish between tumor and normal cells. Consequently, we categorized the epithelial cells into two distinct groups: tumor cells and normal cells. Specifically, Cluster 1, 3, 7, 12, 19, 21, 22, and 23 were classified as normal cells, while the remaining clusters were characterized as tumor cells. Marker genes such as EPCAM, MUC5B, and SOX17 were predominantly expressed in normal cells, while CDKN2A, CEACAM6, and MKI67 were primarily expressed in tumor cells (Fig. 2C, S1D). We observed 20,577 normal cells, constituting 43.8% of normal samples and 50,508 tumor cells, making up 99.9% of non-normal samples. This subclassification of epithelial cells confirmed the presence of normal cells within the squamous epithelial cells of diseased samples.
Upon further analysis of our squamous epithelial cells sub-clustering, we discovered that genes associated with viral pathways, such as TSPAN6 and CD74, were highly enriched in tumor cells. In contract, genes associated with the biological oxidation process, such as ND5 and COX1, were predominantly found in normal cells (Fig. 2D). These evidences conclusively demonstrated significant differences between tumor and normal cells.
3.3 HPV expression showed significant differences across disease stages and cell types
We observed a high-frequency of HPV subtype expression in all cells, correlating with disease stages. We quantified HPV expression, calculating the expression level for each HPV gene from various HPV subtypes as well as the expression level of the entire HPV genome (total HPV expression, equivalent to the sum of all individual HPV genes). Surprisingly, we detected no HPV expression in normal cells. The expression levels of total HPV differed significantly between any two of the three stages, with the CSCC stage exhibiting the highest total HPV expression (p-value < 2.22e-16). This suggests that HPV expression in cells progressively increases with the severity of the disease stage. Interestingly, the majority of cells expressing HPV were squamous epithelial cells (97.02%), with 88.08% of these being tumor squamous epithelial cells (Fig. 3A). The HPV virus remained overexpressed in squamous epithelial cells compared to T cells (p-value < 2.2e − 16, t-test, Table S1), potentially due to the presence of HPV integration events within squamous epithelial cells. Furthermore, we found that genes for each HPV subtype were differentially expressed across different cell types (Figure S2A).
Focusing on squamous epithelial cells, we discovered a significant difference in HPV expression between tumor and normal cells (p-value < 2.2e − 16, t-test, Fig. 3B). As anticipated, significant differences in HPV expression across disease stages (HSIL versus MIC p-value < 2.2e − 16, MIC versus CSCC p-value < 2.2e − 16, HSIL versus CSCC p-value < 2.2e − 16, t-test) were observed in tumor squamous epithelial cells (Figure S2B). Upon investigating the expression of individual HPV genes within tumor squamous epithelial cells, we found that, compared to gene expression of other HPV subtypes, HPV33_ E5 (E5 gene in HPV33) showed the highest expression in HSIL, HPV16_E7 was more expressed in MIC, and HPV16_E5 had the highest expression in CSCC. This highlights different expression preferences for various HPV types across different disease stages (Fig. 3C). The expression levels of total HPV were significantly different between any two of the three stages, with the CSCC stage having the highest total HPV expression (p-value < 2.22e − 16). This suggests that HPV expression in cells progressively increases as the severity of the disease stage increases.
Interestingly, we observed that signaling pathways associated with the cell life cycle, such as the oxidoreductase complex and apical plasma membrane, were expressed in cells with HPV expression than in cells without HPV expression (Figure S2C, 3D). In conclusion, we observed different expression levels across different cell types, disease stages, and HPV-integration status, suggesting that HPV infection might affect the composition of the tumor microenvironment.
3.4 The preference of HPV integration events in squamous epithelial tumor cells
To examine the landscape of HPV integrations at the single-cell level, we devised a new pipeline to identify virus integration events in the human genome (We uploaded the code to https://github.com/fpengstudy/virus_detection_in_BD_scRNA-seq ). We used a combined reference sequence containing the genomes of both humans and the HPV virus (including common HPV subtypes: HPV16, HPV18, HPV33, HPV52 and HPV31) during sequence alignment. We then extracted reads that could map to both the human genome and the HPV genome. The alignment results of these selected reads were used to define the integration sites of the human genome and the breakpoints of the viral genome.
We identified instances of integration within individual cells. Through the implementation of our pipeline, we found HPV integration events in 4,250 out of 93,589 cells across 14 different tissues. This included 4,181 out of 66,904 squamous cells (4,067 out of 50,508 were tumor cells and 114 out of 20,577 were normal cells), 2 out of 1,934 B cells, 1out of 6,774 columnar epithelial cells, 2 out of 2,964 endothelial cells, 4out of 4,480 fibroblasts, 1 out of 756 mast cells, 10 out of 3,244 myeloid cells, 49 out of 1,642 T cells and 0 out of 710 smooth muscle cells (Table S2). It was evident that tumor squamous epithelial cells constituted a substantial proportion (95.69%) of the entire cell population. Squamous epithelial cells accounted for the greatest number of integration events in the individual samples. The number of integration events and the number of integration cells varied across samples and disease stages. Surprisingly, no breakpoints were detected in normal tissues. Notably, the proportion of cells with breakpoints varied among samples of different disease stages caused by HPV infection, revealing that HPV integration frequency differed for each disease stage and patient (Fig. 4A, S3A).
We classified squamous cells into two groups: tumor squamous epithelial cells and normal squamous epithelial cells. We observed a higher frequency of integrated cells in squamous epithelial cells compared to other cell types, both within the entire epithelial cell population and specifically within tumor epithelial cells. It is well-established that different stages of HPV-induced lesions exhibit distinct characteristics [39]. In this study, we investigated the proportion of cells with HPV integration across various disease stages. We noted an increase in the ratio of HPV-integrated cells among all cells from normal tissue to CSCC, which eventually plateaued (Fig. 4B). Given that squamous epithelial cells had the highest HPV-integration ratio, accounting for 98.38% of all integrated cells. Furthermore, the HPV-integration ratio of squamous epithelial cells was 6.25%, which was significantly higher than other cell types (B cells: 0.10%, columnar epithelial cells: 0.01%, endothelial cells: 0.07%, fibroblasts: 0.09%, mast cells: 0.13%, myeloid cells: 0.31%, T cells: 2.98%, and smooth muscle cells: 0%). Therefore, our subsequent analysis primarily focused on the squamous epithelial cells.
In a separate analysis, we compared the percentage of HPV-integration cells across three disease stages (HSIL, MIC, and CSCC). The results highlighted a significant difference in the integration rate between tumor and normal squamous epithelial cells. Additionally, we compared the integration ratios for HSIL versus MIC, MIC versus CSCC and HSIL versus CSCC in tumor cells and in normal cells. Notably, we found significant differences in HSIL-versus-MIC (p-value = 0.00306) within tumor cells and HSIL-versus-MIC (p-value = 6.846e − 05), MIC-versus-CSCC (p-value = 6.846e − 05), and HSIL-versus-CSCC (p-value = 0.001935) within normal cells (Fig. 4C).
When compared with normal squamous epithelial cells, genes that were up-regulated in integrated cells demonstrated a clear inclination toward viral-related and immune-related signaling pathways (Fig. 4D, right panel). Conversely, genes that were up-regulated in non-integrated cells were associated with biological signaling pathways related to ATP and oxidation reactions (Fig. 4D, left panel). This observation aligns with our findings regarding cells with HPV gene expression, suggesting that HPV integration influences microenvironmental changes in cells, irrespective of whether HPV integration detected in the transcriptome
3.5 The distribution of HPV-integrated genes across disease stages and the impact on the gene expression
To identify key genes with HPV integrations at the single-cell level, we annotated all HPV integration sites within the human genome and breakpoints within the viral genome, obtained through our pipeline. Given the limited number of integration events in certain cell types, our analysis concentrated on three cell types with a sufficient number of integration events: squamous cells, T cells, and myeloid cells. Each of these three types exhibited integration in more than 10 cells.
Out of 66,904 squamous cells, we detected a total of 4,605 integration events in 4,181 individual cells. PAN3, BABAM2, TCIM-SIRLNT, TEX41-PABPC1P2, KCNV1-LINC01608 and SPEN were identified as genes with high frequency HPV integration. In the 1,642 T cells analyzed, we observed 50 integration events in 49 cells, with a significant presence of integration events in the genes of BABAM2, KCNV1-LINC01608, and TCIM-SIRLNT. In the 3,244 myeloid cells examined, we found 10 integration events in 10 cells, with the genes BABAM2and TEX41-PABPC1P2 showing a significant presence of HPV integrations (Table S3).
Focusing on squamous cells, we listed all annotated genes within the human genome and examined them across various disease stages and patient samples. Our analysis revealed several genes, highlighted in Fig. 5A, that demonstrated a notably high frequency of integration in specific samples or disease stages (Table S4). It is worth noting that integrations in RNA-seq were distributed across the entire human genome, with a few hotspot genes being more prone to integration. This observation suggests that HPV DNA may exhibit a preference for specific genomic locations. Subsequently, we investigated potential variations in the preference of targeted locations for HPV-integration events among different disease stages across the human genome. As a result, we found that the integration location preferences varied. More specifically, more integration events in S100A8 and S100A11 were observed at the HSIL stage, more integration events in BABAM2 and TCIM-SIRLNT were observed at the MIC stage, while SPEN, TEX41-PABPC1P2 and KCNV1-LINC01608 were uniquely observed with HPV integrations at the CSCC stage (Fig. 5A).
We categorized high-frequency genes into two types based on HPV-integration characteristics. Genes that were consistently present with HPV integrations in a substantial number of individual cells (more than 40 cells) were classified as high-frequency-cell genes. Genes that contained HPV integrations in at least three disease stages (HSIL, MIC, and CSCC) were considered high-frequency-disease-stage genes.
In the category of high-frequency-cell genes, it is noteworthy that these genes typically appear in a single sample. Specifically, we identified three genes (PAN3, BABAM2, and SPEN) and three intergenic regions (TCIM-SIRLNT, TEX41-PABPC1P2, and KCNV1-LINC01608), that exhibited a high degree of integration across numerous cells. Notably, these integrated genes accounted for over 80% among all HPV-integration sites in the specific individual identified in this study (Table S5). Integration events in PAN3, BABAM2, and SPEN were detected in both tumor and normal cells, each gene expression showing significant difference in terms of the ratio of cells containing the integrations in that gene (p < 2.2e − 16, χ² test). It was reported that PAN3 was closely related to pancreatic cancer [40], previously linked to pancreatic cancer and acute lymphoblastic leukemia[41], was identified as a biomarker gene in ulcerative colitis in recent research [42], and BABAM2 was found to form a novel BABAM2-ALK fusion, potentially contributing to the pathogenesis of a small subset of gynecologic clear cell carcinomas [43]. SPEN, known for its high mutational frequency variety in mantle cell lymphoma [44], was the only high-frequency-cell gene that exhibited significant differences on gene expression between cells with and without integration.
In the category of high-frequency-disease-stage genes, we initially identified genes that were integrated at each of the three disease stages. We discovered eight genes (EGR1, S100A11, S100A8, KRT5, RPL34, ATP1B1, and EEF2) with HPV integrations in the disease stages of HSIL, MIC and CSCC, spanning various stages of the disease. Further analysis of high-frequency-disease-stage genes with HPV integrations across samples from all three disease stages revealed that S100A8 was the only gene with integration events in more than two cells at all three disease stages (Fig. 5B).
An intriguing observation from our study is that each cell typically harbors fewer than six integration sites. Furthermore, the average number of individual cell integration events exhibited a decreasing to increasing trend from HSIL to CSCC (Fig. 5C). We calculated the ratio of cell integration and patients with HPV-integration events at each disease stage, which revealing noticeable variations across different diseases (Fig. 5D).
Based on the subclusters of squamous epithelial cells, we next focused on both high-frequency-disease-stage genes and high-frequency-cell genes in tumor and normal cells. Interestingly, all high-frequency-disease-stage genes and high-frequency-cell genes were found to have HPV integrations (Fig. 5E) in tumor cells, while only a few were detected in normal cells (Fig. 5F). For instance, EGR1, a high-frequency-disease-stage gene, was found to be integrated by HPV in five tumor cells and one normal cell. In contrast, PAN3, BABAM2, and SPEN, classified as high-frequency-cell genes were found to contain HPV integrations in both tumor and normal cells. EGR1, a well-known cancer-associate gene, has been proven to enhance cervical cancer progression [45]. Additionally, we explored the expression of eight high-frequency-cell genes acorss four stages (Figure S3B). We also compared these genes between normal epithelial cells and tumor epithelial cells, finding that EGR1, S100A11, S100A8, KRT5, ATP1B1, RPL34 and RPS4X showed significant differences (Figure S3E). Upon examining the impact of HPV integrations for high-frequency-disease-stage genes within tumor and normal cells separately, we were surprised to find that all genes had significant differences between HPV-integrated and non-HPV-integrated cells in tumor cells. However, in normal cells, only EGR1, ATP1B1 and EEF2 exhibited a significant difference between HPV-integrated and non-HPV-integrated cells (Fig. 5F, S3F).
We extracted cells that had an expression level greater than zero for each of eight high-frequency-disease-stage genes. Among these genes, we discovered that EGR1, S100A11, S100A8, KRT5, RPL34 and RPS4X exhibited significantly different expressions when integrated by HPV integration versus those without HPV (Figure S3C, left panel). Additionally, we categorized the cells by disease stages and observed that the expression of these eight genes demonstrated differences between integrated and non-integrated cells across various disease stages (Figure S3C, right panel). We conducted distinct GSVA survival analyses for all genes in total disease stages and each stage separately, and we also explored eight high-frequency-disease-stage genes with integration across various disease stages. However, we did not identify any gene that could be associated with survival time (Figure S3D). Further biological experiments are necessary to validate for the involvement of these genes in the cervical carcinogenesis.
Furthermore, we observed that the presence of numerous identical breakpoints in many different cells within a specific sample. For instance, we identified several similar HPV integration events in more than 1,400 cells at chr8 in the case10 patient. Further pseudotime trajectory analysis of the cells for individual patients revealed that integration events at the same genomic sites in different cells persisted across all time nodes (Figure S3G), suggesting that they might be clonal expansion events in the tumor environment.
3.6 Breakpoints on the HPV genome across disease stages
In addition to studying the human genome at the integration level, we also examined the corresponding breakpoints (within the viral genomes. These breakpoints represent the points where the integration HPV fragment enters the human genome). We counted the number of breakpoints across different HPV subtypes, and observed 1,648 events in HPV33, 28 events in HPV35, 168 events in HPV58, 1,536 events in HPV73, 336 events in HPV18 and 889 events in HPV16 within all samples. We noted that the integrations of HPV33 primarily occurred in HSIL and MIC stages (3 samples), HPV35 in MIC and CSCC stages (3 samples), HPV58 in CSCC stage (1 samples), HPV73 in CSCC stage (1 samples), HPV18 mainly occurred in CSCC stage (1 samples) and HPV16 across HSIL, MIC, and CSCC stages (8 samples). These results suggested that different HPV types exhibit distinct preferences on disease stages in terms of integration events (Fig. 5G). We also found significant differences in the subtype distribution of HPV integration (p < 2.2e − 16, χ² test). Notably, HPV16 integrations were present at all disease stages in a total of 8 samples. We then divided all samples into three stages and performed the analysis for each disease stage separately. We observed an increase in the number of HPV subtypes as disease severity. (Fig. 5G, S4A). In a more comprehensive examination of HPV16, we found that the E1, E2, E5, and E6 genes contain breakpoints related to the integration events at all three stages, suggesting that these genes were truncated before being integrated into the human genome. This finding is consistent with previous studies and suggests a potential preference for the integration of these genes (Fig. 5H). A previous study reported that the E protein of the HPV virus, known as the major contributor to disease stages, is positively related to the oncogenesis and malignancy of HPV-induced cancers, which aligns with our results [46].
We proceeded to explore the potential relationship between integration events and HPV expression in relation to disease stages. We examined the expression of individual HPV genes within squamous epithelial cells that had been integrated. Compared to the expression of other HPV subtype genes, HPV33_E5 exhibited the highest expression in HSIL, HPV33_E1 and HPV33_E2 were more expressed in MIC, and HPV73_E6 showed the highest expression in CSCC. This highlights the varying expression preferences for different HPV types across different disease stages (Figure S4B). We also observed that the total HPV expression for all HPV subtypes genes varied across different disease stages. Furthermore, we compared the total HPV expression between samples with and without HPV integrations across different disease stages. Significant differences (p-value < 2.2e − 16, t test) between HPV integration were observed in all disease stages, revealing that HPV expression was higher in cells with integration than in those without at the cellular level (Fig. 5I). Additionally, we compared the expression of the HPV genes of each HPV subtype in integrated and non-integrated cells. We found that the expression of most subtypes was higher in integrated cells, but at the CSCC stage, the expression of HPV16_E5 was higher in non-integrated cells (Figure S4B, S4C).