Despite screening and the introduction of prophylactic human papillomavirus (HPV) vaccination in developed countries, cervical cancer continues to be one of the leading worldwide causes of cancer-related deaths in women1. Prognosis for patients with metastatic disease remains poor, thus new treatments and effective molecular markers for patient stratification are urgently required. Cervical cancer is caused by at least 14 high-risk human papillomaviruses (hrHPVs), with HPV16 and HPV18 together accounting for over 70% of cases worldwide, with some variation by region1,2. Cervical squamous cell carcinoma (CSCC) is the most common histological subtype of cervical cancer, accounting for approximately 60-70% of cases, again with some variation seen across different populations2. Adeno- and adenosquamous histology are both associated with poor prognosis3–6, while the relationship, if any, between HPV type and cervical cancer prognosis remains unclear7. HPV type is also associated with histology, with HPV16 more commonly found in CSCC, while adenocarcinomas are more likely to harbour HPV182. Previous landmark studies described the genomic landscape of cervical cancer in different populations8–11 and in some cases identified subtypes based on gene expression, DNA methylation and/or proteomic profiles8,9. The Cancer Genome Atlas (TCGA) network identified clusters based on RNA, micro-RNA, protein/phospho-protein, DNA copy number alterations and DNA methylation patterns and combined data from multiple platforms to define integrated iClusters8. In their analysis, only clustering based on the expression levels and/or phosphorylation state of 192 proteins as measured by reverse-phase protein array (RPPA) was associated with outcome, with significantly shorter overall survival (OS) observed for a cluster of cervical cancers exhibiting increased expression of Yes-associated protein (YAP) and features associated with epithelial-to-mesenchymal transition (EMT) and a reactive tumour stroma. Since TCGA’s RPPA analysis was restricted to 155 tumours including SCCs, adeno- and adenosquamous carcinomas, we set out to test the hypothesis that with data from more samples, we could identify a set of transcriptional and epigenetic features associated with prognosis within CSCC and to establish whether it is also present in independent patient cohorts representing different geographical locations and ethnicities. To identify molecular subtypes and prognostic correlates, we identified a set of 643 CSCCs (all HPV-positive), for which clinico-pathological data and genome-wide DNA methylation profiles were either publicly available or generated in this study, and for which in most cases, matched gene expression and somatic mutation data were also available (Table 1).
Table 1: Summary of clinicopathological characteristics for five cervical cancer cohorts.
Identification of two gene expression-based clusters in cervical squamous cell carcinoma
Molecular and clinical differences between cervical adeno/adenosquamous and CSCCs are well documented12–14 and gene expression differences were apparent in multi-dimensional TSNE analysis based on the top 10% most variable genes of three previously published cervical cancer cohorts8,9,11 with available RNA-seq data (Supplementary Fig. S1a-d). To examine molecular and clinical heterogeneity specifically within SCC we focused all subsequent analysis on a collection of confirmed HPV-positive CSCCs from the USA, Europe and Uganda, as shown in Table 1.
236 cervical SCCs profiled by TCGA were defined as our discovery cohort (Table 1, Supplementary Table S1) and consensus clustering was performed using the top 10% most variable genes (n=1377 genes, Supplementary Table S2). Consensus cluster membership heatmaps, delta area plot, consensus cumulative distribution function (CDF) and proportion of ambiguous clusters (PAC) indicated the optimal number of clusters was two (Fig. 1a, Supplementary Fig. 2), the larger of which (n=175) was designated C1 while the smaller cluster (n=61) was designated C2 (Supplementary Table S1). Modelling transcriptomic differences between these two clusters identified 938 differentially expressed genes (DEGs, FDR=0.01, FC > 2) (Fig. 1b, Supplementary Table S3). Tumours in C1 predominantly harbour HPV types from the HPV16-containing alpha-9 clade (150/175) while 38 of 61 C2 tumours contained HPV types from the HPV18-containing alpha-7 clade. C2 tumours were 13.3 times more likely to harbour alpha 7 HPVs than C1 tumours (p = 1.8 x 10-14, Fishers Exact Test) (Fig. 1b).
Univariate analysis of 5-year overall survival (OS) revealed worse outcomes for patients with C2 tumours (HR = 2.54, p = 0.001; Fig. 1c) and in Cox regression including age, tumour stage and HPV type as covariates along with cluster membership, only membership of the C2 cluster (HR = 2.44, p = 0.017 95% CI 1.18, 5.05) and a tumour stage of IV (HR versus stage I = 4.74, p<0.001, 95% CI 2.1, 10.7) were independent predictors of five-year OS (Table 2). The relationship between cluster and OS is also clear when restricting the analysis to HPV16-containing tumours in each cluster in both univariate analysis (HR = 3.39, p = 0.004; Fig. 1d) and multivariate analysis, including age and tumour stage as covariates (HR = 3.89, p = 0.003, 95% CI 1.57, 9.67; Supplementary Table S4).
Identification of C1 and C2 CSCCs and association with prognosis in independent SCC cohorts
To further investigate the association between C1/C2 cluster membership and OS, we assembled a combined validation cohort consisting of 313 CSCC patients treated at three centres in Europe (Bergen (n = 37), Oslo (n = 248) and Innsbruck (n = 28)), for which detailed clinical information were available and for which genome-wide DNA methylation profiles from Illumina Infinium 450k arrays (the same platform used by TCGA) were either available or generated in this study (Table 1). Since RNA-seq data were not available for all European samples, cluster membership was assigned using a support vector machine (SVM) classification model based on 129 CpG sites (methylation variable positions, MVPs) at which methylation differed significantly between tumours in C1 vs C2 clusters in the discovery cohort (Fig. 2a, b; mean delta-Beta > 0.25, FDR < 0.01, Supplementary Table S5), 18 of which were located within 12 genes differentially expressed between the clusters (Supplementary Fig. S3). MVP and DEG signatures were also used to assign cluster membership to 94 CSCCs from a Ugandan cohort originally profiled by the Cancer Genome Characterization Initiative (CGCI)9, for which both DNA methylation and RNA-seq data were available. C2 tumours from all cohorts clustered together using TSNE analysis based on the MVP signature (Fig. 2c) and high concordance between DEG and MVP-based cluster allocation was observed in all cohorts for which both gene expression (RNA-seq for Uganda and Bergen or Illumina bead chip arrays for Oslo) and DNA methylation data were available (Supplementary Fig. S4a, b). Single-sample gene set enrichment analysis (ssGSEA) confirmed differential expression of the signature genes in tumours classified as C1 or C2 using DNA methylation data (Supplementary Fig. 4c). 59 of 313 (18.8%) tumours in the combined European cohort (Fig. 2b, Supplementary Table S7) and 25 of 94 (26.6%) tumours in the Ugandan cohort were classified as C2 (Supplementary Fig. S5a, Supplementary Table S6). As in the discovery cohort, most C1 tumours from the European and Ugandan cohorts harboured alpha-9 HPV types (260/325) while C2 tumours were 3.9 times more likely to harbour alpha-7 HPVs than C1 tumours (p = 1.07 x 10-6, Fishers Exact Test) (Fig. 2b, Supplementary Fig. S5a). Interestingly 80% (20/25) of Ugandan C2 patients were human immunodeficiency virus (HIV) positive, while only 56% (39/69) of C1 patients were HIV positive (Supplementary Fig. S5a).
Univariate analysis indicated lower 5-year OS in C2 tumours from the European cohort (Fig. 2d) and Cox regression controlling for FIGO stage, age, HPV type and treatment (surgery alone, surgery with radio-chemotherapy, surgery with radiotherapy alone, radio-chemotherapy and radiotherapy alone) again identified C2 status but not HPV type to be an independent predictor of 5-year OS (HR = 2.54, p =0.003, 95% CI 1.4, 4.7) along with tumour stage and inclusion of chemotherapy in the treatment regimen (Table 2). As in the discovery cohort, a significant prognostic difference was identified between the C1 and C2 subgroups when considering only the HPV16-positive tumours (n = 204) in both univariate (Supplementary Fig. S5b) and multivariate analyses (HR = 2.64, p = 0.02, 95% CI = 1.16, 6; Supplementary Table S4). Interestingly the prognostic difference was even greater among 78 patients in the European cohort that did not receive chemotherapy (Supplementary Fig. S5c; multivariate HR = 4.4, p = 0.005, 95% CI = 1.58, 12.3). At 94 patients, the Ugandan cohort was underpowered for comparing survival between C1 and C2 tumours and survival rates in the Ugandan cohort were much lower than in the other cohorts (Supplementary Fig. S5d), thus we did not attempt a combined survival analysis including these patients. Taken together, the C1/C2 clusters identified in the TCGA cohort (USA) are apparent in cohorts of CSCC patients from Europe and Uganda and tumours can be accurately assigned to cluster using either gene expression or DNA methylation profiles. C1/C2 cluster is an independent predictor of 5-year OS in both the TCGA (n = 236) and European (n = 313) cohorts and remains so when only HPV16+ tumours are considered. There is no difference in the breakdown of C1 and C2 tumours by stage (Supplementary Table S7).
Table 2 – Five-year survival analysis for all cohorts
Relationships between C1/C2 and clusters previously identified by TCGA
Of the 178 tumour samples that made up the core set in the TCGA’s landmark study into cervical cancer genomics/epigenomics8, 140 CSCCs were present in our discovery cohort of 236 (Supplementary Table S8). This enabled comparisons between our gene expression-based cluster allocations and the subtypes defined by TCGA (Fig. 3). TCGA analysis included integrated clustering using multiomics data (three iClusters, two of which (‘keratin-high’ and ‘keratin-low’ were composed entirely of CSCCs) and clustering based on transcriptomic data (three mRNA clusters). There is considerable overlap between our C1 cluster and TCGA’s mRNA C2 cluster (84/106) and keratin-high iCluster (80/106), and between our C2 cluster and TCGA’s mRNA C3 cluster (19/34) and keratin-low iCluster (27/34). Neither the mRNA C3 nor the keratin-low iCluster were associated with poor prognosis in TCGA’s analysis and given the increased expression of a subset of keratin genes (including KRT7, KRT8 and KRT18) in C2 tumours (Fig. 3), we decided against adopting the keratin-high / keratin-low nomenclature for our clusters. We also examined the relationship between our subtypes and three clusters defined by TCGA based on reverse phase protein array (RPPA) data. Notably, 57% of C2 TCGA tumours with RPPA data available belong to the EMT cluster compared with only 25% of C1 tumours (Fig. 3) and, consistent with the proteomic classification, C2 tumours display higher EMT mRNA expression scores, as defined by TCGA8 than C1 tumours (Supplementary Fig. S6). Although there is greater concordance between C2 and the TCGA EMT cluster compared to C1, it is clearly distinct from the EMT cluster.
Genomic analyses of prognostic clusters
To investigate whether C1 and C2 tumours differ at the genomic level in addition to the transcriptomic and epigenomic differences observed above, whole-exome data was obtained for SCCs from three cohorts, TCGA8, Bergen11 and Uganda9. This amounted to 367 samples, 29 of which were classed as hypermutated by standards set by TCGA8 (>600 mutations). The median tumour mutation burden (TMB) was 2.04/Mb for all tumour, 2.11/Mb for C1 tumours and 1.82/Mb for C2 tumours (1.92/Mb, 1.94/Mb and 1.72/Mb respectively after removal of hypermutated samples). We detected four mutation signatures for the combined cohorts (Supplementary Fig. S7): as expected based on previous studies8,9,11, COSMIC signatures 2 and 13 (characterised by C>T transitions or C>G transversions respectively at TpC sites attributed to cytosine deamination by APOBEC enzymes); age-related COSMIC signature 1 (characterised by C>T transitions attributed to spontaneous deamination of 5’ methylated cytosine) and COSMIC signature 5, for which the underlying mutational process is unknown15 (https://cancer.sanger.ac.uk/signatures/). The proportion of mutations attributable to each signature did not vary between clusters (Fig. 4).
Having excluded the hypermutated samples, we next performed dNdScv analysis16 on each cohort, followed by p-value combination using sample size weighted Fisher’s method followed by FDR correction17 to permit identification of significantly mutated genes (SMGs) across the entire dataset. This combined approach, followed by analysis of individual samples by cluster identified 34 SMGs (Fig. 4, Supplementary Table S9), 21 of which (highlighted by †) have not previously been identified as SMGs in cervical cancer8,9,11. Of the 34 SMGs, 21 were significantly mutated in only C1 samples, two genes in only C2 samples, three genes in both C1 and C2 individual analysis, and eight genes were only significantly mutated when both C1 and C2 clusters were analysed together (Fig. 4, Supplementary Table S9). The frequency of mutations in SMGs that had been previously observed was comparable between combined cohort and each respective SMG study (Supplementary Table S10). Among the 21 genes that have not previously been identified as significantly mutated in cervical cancer, six are SMGs in other SCCs, including head and neck (NOTCH1, JUB (also known as AJUBA), MLL2 (also known as KMT2D), RB1, PIK3R1)18, oesophageal (MLL2, NOTCH1, RB1)19 and lung SCC (NOTCH1, RB1, MLL2, CREBBP (also known as KAT3A))20. Conversely, several genes previously identified as SMGs in cervical cancer, including TP53, ARID1A and TGFBR2 are significantly mutated in adenocarcinoma but not in CSCC8,11. Comparing somatic mutation rates in SMGs between clusters using binomial regression identified PIK3CA (FDR = 0.001) and EP300 (FDR = 0.046) mutations as disproportionally more common in C1 tumours and STK11 (FDR = 0.005) and NF2 (FDR = 0.045) as enriched in C2 tumours (Fig. 4). STK11 is also under-expressed in C2 tumours compared with C1 tumours (Supplementary Table S3).
C2 tumours display Hippo pathway alterations and increased YAP1 activity
Two SMGs from our analysis (LATS1 and NF2) are core members of the HIPPO signalling pathway, while SMGs FAT1, JUB and STK11 are known regulators of HIPPO signalling21–23. Mutations in LATS1, FAT1, JUB, STK11 or NF2 (the latter two of which are significantly mutated specifically in C2 tumours, Fig. 4) result in aberrant activation of the downstream transcription factor, yes1 associated transcriptional regulator (YAP1)24–28, the expression of which is also elevated at the mRNA level in C2 tumours (Table S3).
We generated segmented copy number data for all tumours (combining TCGA and European validation cohort samples for which the necessary data were available for maximum statistical power), which identified 211 focal candidate copy number alterations (CNAs) at FDR < 0.1. Following binomial regression, we identified five discrete CNAs that differed in frequency between C1 and C2 clusters (Fig. 5a; FDR < 0.1, log2 (Odds Ratio) > 1). All five were more prevalent in C2 tumours and included 11q11 and 1q21.2 deletions and 6p22.1, 11q22.1 and 11q22.2 gains. 11q22.2 contains matrix metalloproteinase genes (MMPs) which are well known to be involved in metastasis29, but notably 11q22.1 contains the YAP1 gene. Furthermore, analysis of Reverse Phase Protein Assay (RPPA) data from TCGA revealed significantly higher YAP1 protein expression in C2 tumours (Fig. 5b). We confirmed that of the 137 TCGA cases for which RPPA data were available, cases with YAP1 amplification (8/37 C2 tumours and 6/100 C1 tumours) also showed increased YAP1 mRNA and protein expression (Supplementary Fig. S8). In total 10 genes from a 22 gene signature that predicts HIPPO pathway activity in cancer30 are differentially expressed between C1 and C2 tumours (Supplementary Table S3).
Differences in the tumour immune microenvironment between C1 and C2 tumours.
The nature of the tumour immune microenvironment, particularly the abundance of tumour infiltrating lymphocytes (TILs) is a strong prognostic factor in cervical cancer31–33. We used DNA methylation data to compare the cellular composition of TCGA tumours34, observing differences in the proportions of multiple cell types between the subgroups (Fig. 6a); most notably decreased CD8+ (cytotoxic T lymphocytes (CTL)), and a marked elevation of neutrophil and CD56+ natural killer (NK)-cells in C2 tumours. Repeating this method with the validation cohorts produced results that were remarkably similar (Fig. 6b). Differences in the proportions of cell types between C1 and C2 in the validation cohort mirrored those in the TCGA cohort, decreased CTL, and elevated neutrophil, NK-cell and endothelial cell levels were observed in C2 tumours. Importantly, this was not driven by any single validation cohort, as individual cohorts displayed consistent patterns of differences in the proportion of cell types between C1 and C2 tumours, especially with regards to CTLs, neutrophils and NK-cells (Supplementary Fig. S9a-d). C2 tumours also exhibit markedly higher neutrophil:CTL ratios (Supplementary Fig. S9e, f) and neutrophil:lymphocyte (CTL, B-cell and Treg) ratios (NLR, Supplementary Fig. S10); established adverse prognostic factors in cervical cancer35–37. At 0.7, the NLR in C1 tumours across all cohorts was less than half that observed in C2 tumours (1.85).
Validation of MethylCIBERSORT cell estimates was performed for a subset of samples from the Innsbruck cohort using CD8 (CTLs) and myeloperoxidase (MPO, neutrophils) immunohistochemistry (IHC)-based scores from a pathologist blinded to cluster designation (Supplementary Fig. S11a-c) and for CTLs in the Oslo cohort samples using comparison of MethylCIBERSORT estimates to CD8 IHC-based digital pathology scores (Supplementary Fig. S11d).
Also of potential significance regarding the tumour immune microenvironment, is the presence of two immune checkpoint genes, CD276 (also known as B7-H3) and NT5E (also known as CD73) in the set of 938 signature DEGs that separate the clusters (Table S3). Both B7-H3 and NT5E, along with a third immune checkpoint gene (PD-L2) are expressed at higher levels in C2 tumours (Supplementary Fig. S12) and hypomethylation of two CpGs in the NT5E promoter is evident in C2 tumours (Supplementary Table S5). All three suppress T-cell activity38–40 and B7-H3 expression has been linked to poor prognosis in cervical cancer41,42.
Evidence for differences in stromal fibroblast phenotype between C1 and C2 tumours
Gene set enrichment analysis using Metascape43 suggested increased EMT (Supplementary Table S9) in C2 tumours, with 52 of 200 genes in in the EMT Hallmark gene set upregulated. As noted above there is also greater overlap between the C2 cluster and an EMT cluster defined by TCGA and based on RPPA data (Figure 3). Single-cell RNA sequencing and xenografting studies strongly suggest that rather than arising from the tumour cells (few of which have undergone EMT at any given time44–46), mesenchymal gene signatures in bulk tumour expression data instead derive from stromal cells including fibroblasts, which can adopt various phenotypes and play an important role in shaping the tumour immune microenvironment47,48. In addition to YAP1, which has been linked to the formation of cancer-associated fibroblasts (CAFs)49 (as well as EMT50–52 and angiogenesis53), C2 tumours display increased expression of the CAF marker genes FAP and SERPINE1 (also known as PAI-1)54; the latter evidenced at both mRNA and protein levels (Supplementary Table S3, Fig. 5b). Overall fibroblast content as estimated by MethylCIBERSORT is similar between C1 and C2 tumours (Fig. 6a, b) but given recent findings regarding the extent and prognostic significance of CAF heterogeneity in the tumour microenvironment55–59, we hypothesized that CAF phenotype rather than overall abundance, may differ between C1 and C2 tumours. To examine this, hierarchical clustering was performed based on the expression of eight gene sets (68 genes) curated by Qian et al56, representing CAF-related biological processes and which are differentially expressed across six CAF phenotypes recently identified in a pan-cancer analysis59. C2 tumours cluster together, displaying increased expression of proinflammatory genes associated with an inflammatory (pan-iCAF2) CAF phenotype, C1 tumours appear more heterogenous with respect to expression of the signature genes used to define CAF phenotypes; there is upregulation of assorted myofibroblastic (myoCAF) genes in a subgroup C1 tumours, including various collagens, ECM genes and TGFb-associated genes, as well as ‘contractile’ genes such as smooth muscle actin (ACTA2, Fig. 6c). While ACTA2 is commonly used to identify myoCAF, it is also expressed by pericytes and smooth muscle cells, which share the contractile phenotype (and express for example, MYH11)47,59,60. Consistent with this, C2 tumours are 4.8x (p = 1.78 x 10-9, Fisher’s Exact Test) more likely to be classified as ‘CAF-high’ than C1 tumours using a four-gene CAF index defined by Ko et al48. Indeed, three of the four CAF index genes (TGFBI, TGFB2 and FN1) appear in the 938 DEG signature that separates C2 from C1 tumours (Supplementary Table S3).