Characterization of HPV integration in cervical cancer
The HPV positive rates significantly differed between the two populations with 90.0% (81 of 90) in the ZJU group and 94.1% (269 of 286) in the TCGA group (Table S1 and Figure S2a, binomial test; p = 4.7e-2). Overall, the proportion of HPV subtypes also differed between the ZJU and TCGA samples, such as the rate of HPV subtypes 16 and 58 which were much higher in the ZJU while the HPV18 samples accounted for more in the TCGA (Figure S2b). These results imply that the HPV-based tumor types vary in the two populations.
HPV integration events were observed in 57 of 81 (70%) and 239 of 269 (88%) HPV-positive tumor tissues in ZJU and TCGA, respectively (Fig. 1a). The binomial test proved that the integration rate was significantly different between the two populations (p = 1.3e-5). This difference is not negligible as longer reads and larger amounts of sequence data were used in the ZJU group, which favor detection of HPV integration (Figure S2c). The overall different rate of un-cleared HPV infection and HPV integration in the two populations may be due to the local virus sublineage diversity by geographical disparities through evolution or different genetic susceptibility among different races or ethnicities. In both populations, HPV integration occurred in all viral genes and affected all human chromosomes. In addition, MIPOL1, PTPN13, VMP1 and TP63 which have been previously reported were identified in both the ZJU and TCGA samples [5, 18], and we also found LINC00393 and HSPB3 as new common integration hotspots in both populations. Several integration breakpoints such as PVT1, RAD51B, MYC and ERBB2 that were previously reported in TCGA samples were also identified in our analysis. Furthermore, we found new hotspot sites such as SH2D3C and CASC8 in the TCGA samples, and SCGB1A1 and ABCA1 in the ZJU samples. Validation experiments proved the reliability of these HPV integration events (Table S2). In contrast to a previous study that revealed that common fragile sites (CFS) are preferential targets for HPV integration [19], we did not observe such a preference for HPV integration in CFS (Figure S3).
Genomic loci affected by HPV integration were significantly associated with higher gene expression (Mann–Whitney two-sided test; p = 1.1e-4 in ZJU and p = 2.3e-21 in TCGA), which is in good agreement with a previous study [5]. To explore how HPV integration affects the expression of genes around integration sites, we divided the integration sites into five categories according to the relative expression levels of integrated genes and investigated the expression of adjacent genes in each category. Density plots highlighted that the higher the expression of an integrated gene, the higher the expression of adjacent genes (Fig. 1b). This phenomenon was not likely ascribed to the direct effect of integration on transcription factor binding sites or the overall expression levels of samples with HPV integration (Figure S4a). A similar trend was observed in TCGA samples (Figure S4b). These findings revealed several novel characteristics of host genome alteration resulting from HPV integration.
Microhomologous sequences between human and HPV at integration sites
Then we evaluated the identity of human and virus sequences of ± 15 bp around all of integration sites identified in ZJU and TCGA samples (Fig. 2, and Figure S5.). We first randomly selected the human sequence and virus sequence to draw scatter plots for the number of identical nucleotides and the maximum number of consecutive identical nucleotides. As for the results from the human and virus sequences around the integration sites, the points drifted more to the right top of the coordinate system than any two randomly selected human and virus sequences. This supported the idea that HPV may integrate into the host genome through a microhomology-mediated DNA repair pathway during host DNA damage response [6].
The relevance of HPV gene expression to cervical cancer patient outcome
Unsupervised hierarchical clustering of HPV genes detected five distinct clusters consistently in the ZJU and TCGA tumor samples: group 0 with an overall low expression level consistent with defined HPV negative; group 1 expressing mainly E6 and E7; group 2 additionally expressing E6SI compared with group 1; group 3 with higher expression of E1 and E5, partly with expressed E4; and group 4 expressing mainly E5 and E6 but not E4 (Fig. 3a and 3b, and Table S1). E6SI is a spliced transcript encoding active E6 oncoprotein. As to histological types, group 0 contained a mixture of adenocarcinoma, adenosquamous and squamous cell carcinoma samples, whereas squamous cell carcinoma accounted for the majority of samples in the other groups. Except for the above features, we did not find any significant difference in stage, tumor size, and smoking status among these five groups.
Then we investigated whether there were any molecular features and clinical outcomes associated with these groups. Consistently in these two populations, we found enrichment of HPV18 samples in group 2 (hypergeometric test, p = 1.1e-2 in ZJU and p = 1.9e-16 in TCGA) and HPV58 samples in group 3 (hypergeometric test, p = 3.4e-6 in ZJU and p = 1.5e-2 in TCGA). HPV58 is relatively prevalent in China and other Asian countries. Notably, our data showed that E6 and E6SI were not detectable in most HPV58 samples, whereas E1, E4, E5 and E7 were expressed in the HPV58 samples. Meanwhile, HPV18 subtypes were correlated with high expression levels of E6SI (t-test, p = 1.6e-2 in ZJU and p = 1.7e-6 in TCGA) as previously reported [18]. These characteristics suggested that the expression of HPV genes not only varies within an HPV subtype such as HPV16, but also depends in part on a specific tumor subtype, giving us new insights into the potential role of HPV infection via its gene expression pattern beyond HPV subtypes. Furthermore, HPV integration frequency was low in groups 3 and 4 compared with nearly 100% in groups 1 and 2, implying that lack of E5 expression is perhaps propitious to HPV integration.
The E2/E6 ratio using quantitative PCR is often used to detect HPV integrations. The rationale behind the detection is that HPV integration can lead to the disruption of the viral E2 gene, whereas E6 and E7 are retained [20]. However, our data showed it is not always true that samples with HPV integration lack E2 expression. E2 expression was observed in some samples with HPV integration in our analyses. On the other hand, E6 expression was disrupted in some cases. Therefore, the E2/E6 ratio may not be effective for the detection of HPV integrations.
Kaplan–Meier curves revealed that the overall survival rate and recurrence-free internal survival rate were significantly different among the above five HPV-based groups (Fig. 3c; p = p = 9.14e-3 and 1.52e-3, respectively) in the TCGA samples. Group 4 exhibited the most favorable outcome, yet poorer prognosis could not be distinguished among the other groups in terms of overall survival. The five groups showed distinct outcomes in disease recurrence, with group 4 performing the best and group 2 displaying the worst outcome. Overall, the groups with expressed E5 performed better than the groups without E5 expression. These results highlighted the relevance of E5 to cancer recurrence, which not only offered a powerful prognosis biomarker, but also revealed the potential role of E5 in the development of cervical cancer. A specific HPV gene expression pattern is a better prognostic predictor than a single HPV gene; E5 seems to have a better prognostic value than well-focused E6 and E7.
Although there were similar HPV-gene based tumor subgroups and their associated host gene expression profiles in the two populations, the constituents of subgroups varied substantially. The proportions of group 2 in the two populations were nearly identical, while group 4 with the best survival outcome accounted for the most part of cases (35.5%) in ZJU, compared to 15.0% in TCGA, and group 1 with the worse outcome accounted for the most part of the cases (34.6%) in TCGA, compared to 20.0% in ZJU (Figure S2d).
To assess whether the above HPV subgroups are recapitulated at the gene level on the host genome, we detected genes associated with the HPV subgroups by Kruskal–Wallis test. Among 6,095 and 10,360 genes associated with HPV subgroups in ZJU and TCGA, respectively, there were 3,569 overlapped genes between the two populations. Both populations showed the most distinct clusters between group 0 and other groups, indicating there was a mostly different gene expression pattern between HPV negative and positive samples (Fig. 3a and 3b). Further comparison of enriched GO terms revealed the functional similarity of genes in subclass A between the two populations, as well as genes in subclass B between the two populations (Fig. 3d). Overall, genes in subclass B were mainly related to cell cycle and DNA repair. In total, these results represented the robust host response to viral gene expression within specific HPV subgroups.
RNA splicing affects HPV gene expression
We also explored potential factors affecting HPV gene expression. We first examined the relationship between integration and HPV gene expression among HPV positive samples. E7 was significantly up-regulated and E5 was significantly down-regulated in tumor samples with integration (Fig. 4a and Figure S6a). In most integration events, the virus gene is expressed as a virus-host fusion transcript and some of them undergo RNA splicing. Such a virus-host transcript can be spliced from a splice donor in the viral genome to a splice acceptor in the host DNA [21]. In our HPV16 positive tumors, nucleotides (nt) 226 and 880 were the most common splice donors during the splicing of virus-host transcripts (Fig. 4b and Figure S6b). Although both breakpoints of nt 226 and 880 were detected among the samples of the four groups (68% and 74.5% in ZJU, 52.2% and 59.4% in TCGA), these integration events were more apt to occur in group 1 and group 2 (hypergeometric test, p = 5.3e-4 and 2.4e-6 in ZJU, p = 7.5e-7 and 1.5e-7 in TCGA). Interestingly, we also found the complementary situation that the transcript was spliced from a splice donor in the host DNA to a splice acceptor in the virus genome, though its incidence was less common, of which nt 3,358 was the most common splice acceptor during splicing in both populations (17.0% in ZJU and 25.7% in TCGA). Notably, breakpoints of nt 3,358 were significantly enriched in group 3 and group 4 (hypergeometric test, p = 0 in ZJU and 1.2e-11 in TCGA). These results suggest that both breakpoints on the virus genome during integration and subsequent splice events of virus-host transcripts have an important impact on HPV gene expression. HPV integration not only causes dysregulation of host genes, but also shapes distinct HPV gene expression patterns influencing patient survival outcome.
Landscapes of tumor-infiltrating immune cells in cervical cancer
We investigated the immune cell infiltrative levels from RNA sequencing data in the ZJU and TCGA tumor samples and two additional microarray datasets of cervical cancer including GSE6791 [22] and GSE63514 [23]. We first compared compositional differences in immune cell populations between tumors and adjacent normal tissues in ZJU and the other two microarray datasets. Generally, tumor tissues detected a higher fraction of T lymphocytes and yet a lower fraction of B lymphocytes and mast cells than their adjacent normal tissues. For the same dendritic cells, the activated form has a higher fraction in tumor tissues than adjacent normal tissues, whereas the resting form is just the opposite. Immune cells that have higher fractions in tumors than in their adjacent tissues are more likely associated with poorer outcomes (Fig. 5). CD8 T cells and neutrophils were the most favorable and adverse prognostic populations, respectively, which are consistent with previous reports in anal squamous cell carcinoma and gastric cancer [24, 25].
Except for the significantly higher proportion of plasma cells in ZJU tumors, there was no other significant disparity in the immune cell compositions between ZJU and TCGA tumors (Fig. 6a). This suggests that overall levels of immune infiltration are consistent and reproducible across tumors of the two populations. To determine whether immune cell infiltrative levels are associated with HPV infection or integration, we compared immune cell compositions in HPV negative samples, and HPV positive samples with integration and without integration (Fig. 6b). There were more kinds of immune cells showing higher fractions in HPV positive samples compared with HPV negative samples, indicating stronger immune infiltration in tumors with un-cleared HPV infection. For instance, dendritic cells and regulatory T cells were significantly and consistently elevated in HPV positive samples compared with HPV negative samples in both populations. Except for the naïve B cell in TCGA, there was no significant compositional differences in immune cells between HPV positive samples with and without integration, especially in ZJU. These results suggest that HPV integration has little impact on immune infiltration in tumor microenvironments.