We performed a background model to extract significant coding genes to be able to distinguish cancer patients. Single-base mutational profile of samples obtained from the International Cancer Genome Consortium (ICGC) dataset. We clustered 12,270 samples across 19 cancer types into new subtypes using model-based clustering by considering extracted genes. Finally, we performed comprehensive biological analyses on our identified subtypes to investigate each subtype’s biological characterization and gain new insights into cancer subtyping.
This study focuses on somatic point mutations from the ICGC dataset, which contains 19 tissue cancers data. We used 12,270 cancer samples available in the dataset in which 48.5% are female, and 51.5% are male across different projects, including READ-US, COAD-US, COCA-CN, etc. We used the Ensemble gene annotation dataset  to identify several somatic point mutations in coding genes. This dataset contains 20,345 protein-coding genes. We excluded all non-single-base mutations (e.g., insertions, deletions) from our analyses. To identify new subtypes, we only considered coding mutations.
Background model to identify candidate genes
To cluster samples based on their mutational profiles, we first need to identify candidate genes that significantly mutated in each cancer. To do this, we first determined for each cancer, best-fitted distribution to identify genes that significantly mutated in each cancer. We used Cullen and Frey’s graph (see method) with 500-fold bootstrapping to find the best-fitted distribution (Fig. 1.a). Although we examined various distributions, a negative binomial was observed as the best-fitted distribution for all cancer types (Fig. 1.b). Next, we used the Cramer-Von Mises test to confirm the perceived distributions. We considered each cancer type’s perceived distribution to detect candidate genes and then calculated the mutational load’s p-value for each gene separately. We then used a threshold of 0.001 on p-value to determine candidate genes of each cancer type. We then gathered all candidate genes from all cancer types and identified 684 genes significantly mutated in at least one cancer. A complete list of candidates (features) genes and their p-value is provided (Supplementary Table S1). The mutational load of feature genes in each cancer type is shown in (Supplementary Fig. 2). According to this figure, some genes are significantly mutated in multiple cancers, which is in line with the idea presented of pan-cancer research. For instance, TP53 and KRAS are examples of genes among the significant genes of many cancers such as Breast, Brain, and Ovarian. TP53 single-base substitution, known as the primary type of alterations, is associated with cellular proteins’ inactivation and leads to many cancers . As the figure shows, pancreatic and prostate cancers are the most mutated cancers in the candidate genes (43.8% and 44.3% of the candidate genes are mutated in pancreatic and prostate cancers, respectively). Esophagus and nervous system cancers are the less mutated cancers in the candidate genes (only 1 and 2 genes out of 684 candidate genes are mutated in the esophagus and nervous system cancers, respectively).
Model-based clustering to detect new subtypes
Having significantly mutated genes identified, we then used these genes as the features for our multi-level clustering approach to identifying cancer subtypes. For this purpose, we used the Mclust package implemented in R (see method section). We preferred this method over other clustering approaches because it builds a robust model to identify clusters without random initialization. Unlike Mclust, classic clustering algorithms such as k-means need to be randomly initialized and sensitive to initialization. Besides, clustering methods such as dbscan  and hdbscan  require the user to specify the optimal number of clusters optimally. Hence, we applied the method based on Gaussian mixture models to overcome these issues and cluster our samples based on their inherent. As shown in (Fig. 2.a) at the first level of clustering, the algorithm breaks down all samples into 2 clusters (Cluster 1 with 9318 and Cluster 2 with 2952 samples, respectively). Custer1 was divided into eight sub-clusters (from Cluster 1.1 to Cluster 1.8), and Cluster 2 was split into two sub-clusters (Cluster 2.1 and Cluster 2.2). Finally, at the third level, only Cluster 1.5 was divided into eight sub-clusters. We used a threshold of 95% to stop breaking down a cluster meaning that if we observed that applying the algorithm to a cluster would bring us a new sub-cluster with more than 95% of samples in, clustering is not valid. The cluster would be identified as a new subtype. We used this threshold because of some possible outliers in clusters. When a cluster was divided into multiple sub-clusters and a sub-cluster with more than 95% of the father’s samples, we believed that all other sub-clusters could be outliers, so the cluster should not be divided. Eventually, we obtained 17 clusters as our new identified subtypes (17 subtypes from C1 to C17). Supplementary Table S2 shows all samples identified in each subtype.
We then investigated the contribution of samples from different cancers in our identified subtypes. Figure 2.b shows this contribution relatively in a bar plot, and Supplementary Fig. 3 shows the contribution specifically in a heat map. A table of number of contribution of samples from different cancers in our identified subtypes is also provided in Supplementary Table S3. As we can see in both figures, most of the subtypes consist of various cancer types. For example, subtype C7 and subtype C12 contain all cancer types, but subtype C4 and subtype C8 are mainly composed of Head & Neck Cancer with 78% and 82% of their samples. Also, many subtypes mostly contain samples of 2 to 4 cancer types. For instance, Subtype C2 mainly consists of Prostate (48.2%), Blood (17.5%), and Breast (16.8%) cancers which are about 82.5% of all samples in this subtype. Subtype C3 consists of Blood (68.1%) and Lung (24.5%) types which are about 92% of all samples in this subtype, and subtype C9 also contains samples from Bladder (53.3%), Kidney (26.7%), and Cervix (11.1%) types which are about 91% of samples. Moreover, many cancers are primarily scattered in 3 or 4 subtypes. For instance, more than 95% of the Prostate cancer samples are grouped in C1(19.8), C2 (27.9%), C7 (20.6%), and C16 (20%). The esophagus samples are primarily in C16 (26.7%) and C17(30.6%), and more than 75% of Ovary samples are in C16 (44.9%) and C12 (31.4%).
In the following Sections, we present the results of analysis to demonstrate the biological characterization of identified subtypes and hence demonstrate the effectiveness of our new subtyping approach over traditional cancer type classification approaches.
Mutational load of genes for each subtype
We study the mutational load of candidate genes and all protein-coding genes in our subtypes in this analysis. To compute the mutational load of gene ‘g’ in subtype ‘C’ we counted the number of samples in subtype ‘C’ which have a mutation in gene ‘g’ and then normalized it by dividing it into the number of all samples of subtype ‘C’. As Fig. 3 shows, the mutational load distribution of candidate genes in the identified subtypes are different, demonstrating correct pan-cancer clustering of the samples. As the figure shows (Fig. 3), subtypes C16 and C17 are hyper-mutated subtypes (5 genes with at least one mutation in 90% of samples in C16 and 276 genes with at least one mutation in 95% of samples in C17) which can be a reason that samples of these two subtypes were separated from others at the first level of clustering. Notably, 92.6% of samples in C6 had a mutation in TTN; all samples in C8 had a mutation in BRAF; all C9 samples had a mutation in MUC4; 100% and 97.9% of samples in C10 had a mutation in CSDE1 and NRAS, respectively; all samples in C11 had a mutation in PTPN11; all samples in C12 had a mutation in TP53; all samples in C13 had a mutation in PCDHGA1, PCDHGA2, and PCDHGA3. Moreover, 98.6%, 96.2%, 99.1 and 98.1% of samples in this subtype had a mutation in PCDHGA4, PCDHGA5, PCDHGB1, and PCDHGB2, respectively (gamma Protocaderins family highly mutated in this subtype); 96.9% of samples in C14 had a mutation in PCDHGA2,3,4 and PCDHA1,2,3. Moreover, 97.4% of samples in this subtype had a mutation in PCDHGA1; 99.7% of samples in C15 had a mutation in PCDHA1,2,3. Moreover, 99.7%, 96.1%, and 98.8% of samples in this subtype had a mutation in PCDHA4, PCDHA5, PCDHA6, respectively (alpha Protocadherins family highly mutated in this subtype).
However, for some subtypes, we found similar patterns. For instance, CSMD1 and RBFOX1 are highly mutated in both subtypes C1 and C2 (CSMD1 and RBFOX1 are mutated in 80.5% and 87% of samples in C1, respectively, and 95.1% and 96.8% of samples in C2, respectively). Other examples are PCDHGA1 and PCDHGA2 that are mutated in almost all samples of C13 and C14. To understand the difference between similar subtypes (C1 and C2; C13 and C14), we plotted the fraction of samples that had at least three mutations in each candidate (feature) gene (Supplementary Fig. 4). As the figure shows, tumor samples in C2 and C14 have higher mutations than subtypes C1 and C13, respectively. In addition, we observed that CSMD1 is mutated in 74% of the samples in C2. In comparison, only 42% of samples in C1 are mutated within this gene, meaning that the difference between C1 and C2 originated from the different mutation rate in significant common genes. Supplementary Fig. 4 also shows PCDHGA1 mutated in 6% and 43% of C13 and C14 subtypes, respectively, demonstrating the effect of mutation numbers in distinguishing these two subtypes. Another example for C13 and C14 is PCDHGA2 which mutated in 5% and 37% of samples in C13 and C14. Results demonstrate that common genes have more mutations in C14 than C13.
Gene and Gene-motif association as a biomarker of each subtype
We then investigated the top 100 highly mutated genes in each subtype (Supplementary Table S4) and then asked how many of the top 100 highly mutated genes are common between every two subtypes (Fig. 4.a). As shown in this figure, many pairs of subtypes have a few common genes, while some pair with numerous common genes. For example, subtypes C1 and C2 have 93 common significant genes out of 100 in both subtypes. While subtypes C13 and C14 have 34 common genes in their top 100 significant genes. Interestingly, there is no common gene between their top 100 important genes for many of the subtypes.
It has been recently shown in  that gene-motifs are the primary source of disease-related variations in cancer. Gene-motifs refer to the 3-nucleotide sequence mutated within a gene, i.e., NXN-to-NYN (where reference nucleotide X mutated to Y, and N: A, C, G, or T). There are 96 combinations of mutations within 3-nucleotide motifs. For example, MUC16, LRRC4C, and IL1RAPL1 are examples of genes that appeared as significant genes within the top 100 important genes of different subtypes. We investigated each subtype’s mutations in tri-nucleotide motifs to show the motif preferences of mutations in each of these genes (Supplementary Fig. 5).
Interestingly, our result (Supplementary Fig. 5.a) indicates that IL1RAPL1 mutations in C2 samples within T > A and T > C occurred more among C1 samples, and C5 samples were mutated in a smaller number of motifs compared to C1 and C2. According to Supplementary Figs. 5.b and 5.c, the same results are observed for LRRC4C and MUC16.
The large number of shared genes between different pair of subtypes (e.g., C1 and C2) led us to investigate the mutational loads within 3-mer motifs in the top 100 important genes, separately. To do this, we used Fisher exact test (method section), and we identified significantly mutated motifs within the top 100 significant genes in each subtype, separately. Considering the top 100 gene-motifs for each subtype, we identified common gene-motifs between every two subtypes as shown in Fig. 4.b. Interestingly, this analysis showed a more evident difference between the identified subtypes. Compared to Fig. 4.a all pairs have less common significant gene-motifs than significant common genes (except C14 and C15, which have 30 common gene-motifs). There is no common gene-motif between most of paired subtypes (Fig. 4.b). Importantly, subtypes C1 and C2, with 93 common significant genes within their top 100 most mutated genes, have only 48 common gene-motifs (within their top 100 gene-motifs), showing different molecular mechanisms within these subtypes. The complete lists of the top 100 significant gene-motifs for each subtype is provided in Supplementary table S5.
Mutational signature analysis
We also investigated mutational signatures in our identified subtypes. A mutational signature is a fingerprint for a molecular mechanism that is causing mutation across genome. Molecular mechanisms are blind to what location they are causing the mutation. Therefore, to identify the molecular mechanism respective to mutational signature, we have to consider all mutations in whole genome (except mitochondria). We applied CANCERSIGN tool  on complete mutational profiles of each subtype separately and found 121 signatures. We then compared our identified signatures with 67 signatures identified in COSMIC . We calculated the angular similarity between our identified signatures and COSMIC signatures to extract each signature’s biological information and their associated subtypes. Heatmap of similarities between our signatures and Alexandrov signatures is shown in Fig. 5. Hierarchical clustering enables us to find similar signatures. Hierarchical clustering enables finding of similar signatures beside each other. As shown in the figure, COSMIC’s signature 1, whose number of mutations correlates with the individual’s age, is significantly correlated with many signatures in our identified subtypes, including C1.S1, C2.S1, and C12.S1. COSMIC’s signature 1 is shown to be highly associated with breast cancer. Interestingly C1, C2 and C12 contains many breasts cancer samples (27.4%, 16.8%, 22% respectively as shown in Supplementary Fig. 3). Also, COSMIC’s signature 2 which is attributed to activity of the AID/APOBEC family of cytidine deaminases is highly observed in nervous system and is significantly correlated with signatures in C8 (C8.S7 and C8.S1), which consists of nervous system cancer (77.8%). Similarly, COSMIC’s signature 4 is also associated with smoking and highly observed in lung cancer, is highly correlated with two signatures in C6 (C6.S4, C6.S5), a subtype that consisted of lung cancer patients (14.3%) and 20.4% of lung samples are in C6. Similarly, COSMIC’s signature 5 is associated with Skin cancer and is also correlated with a signature of C17 (C17.S8) which consists of skin cancer patients (38.5%). Furthermore, COSMIC’s signature 6 is associated with defective DNA mismatch repair and is correlated with signatures from various subtypes including C7.S4, C8.S3, C6.S1, C15.S1. Also, COSMIC’s signature 7 and 8 are related to ultraviolet light and Skin cancer. These two signatures are highly correlated with signatures of C14 and C17 (C14.S11 and C17.S10), which consists of skin cancer (40.7% and 38.5%, respectively). Similarly, characteristics of COSMIC’s signature 16 is yet unknown but has observed in liver cancer tumors and highly correlated with C7.S8 (9.9% of samples in this subtype are liver and 32.1% of liver samples are in C7) and C16.S5 (13.6% of samples in this subtype are liver). Also, COSMIC’s signature 22 is also highly observed in Eso-AdenoCA cancer. This signature is correlated with C17.S1 (30.2% samples in this subtype are Esophagus). Finally, COSMIC’s signature 34 is observed in samples from individuals with a tobacco chewing habit and have been found in oral and liver cancer. This signature is highly correlated with C6.S5 and C7.S1. 9.3% and 32.1% of liver samples are in C6 and C7, respectively.
The figure also shows that some of the signatures are presented in multiple subtypes. However, most of the signatures identified in each subtype are specific to the subtype, indicating that samples within subtypes have the same mutational process. The exact amount of correlation between identified signatures and COSMIC’s (Alexandrov’s) and correlation between each two identified signatures are provided in (Supplementary Table S7). Molecular mechanism respective to each mutational signature of COSMIC is also provided in this table.
Gene ontology and pathway analysis
We next investigate whether each subtype’s top 100 significant genes are associated with any gene ontology (GO) or gene pathway terms [32, 33]. To do this, we used the enrichr  package in R (see method) for gene ontology and pathway terms analyses. Gene ontology covers three main domains namely, biological process, molecular function, and cellular component. We considered all these domains and only retained enriched terms with FDR < 0.05. In general, we identified at least one GO terms for 10 subtypes out of 17 subtypes (Fig. 6). Most of GO terms are uniquely enriched in one subtype, while others are enriched in multiple subtypes. For example, the “integral component of plasma membrane” is associated with five subtypes (C2, C13, C14, C15, and C17), and “nervous system development” is associated with another five subtypes (C1, C2, C13, C14, and C15). Conversely, the “bitter taste receptor activity,” “MHC class II protein complex”, “anterograde trans − synaptic signaling”, “actin − myosin filament sliding”, and “anion channel activity“ are examples of terms that are uniquely associated with C4, C5, C13, C14, and C17, respectively. Moreover, associated terms in C1 and C2 are almost the same, and only three terms associated uniquely in one of them (“axolemma” and “integral component of plasma membrane” only associated with C2 and “integral component of lumenal side of endoplasmic reticulum membrane” only associated with C1).
Clinical report and survival analysis
We also examined clinical data, such as gender and region (where the data were collected), available for a subset of the ICGC data and identified several interesting results in the gender distribution of subtypes. For instance, C4 and C8, which mainly contains nervous system samples, are female-biased (69% of samples in these subtypes are female) and C2 (48.2% of samples are prostate cancer), C5 (84.3% of samples are prostate, blood, or brain cancers) and C17 (68.7% of samples are skin or esophagus cancers) are male-biased (Fig. 7.a). The geographical distribution of identified subtypes is shown in Fig. 7.b.
We also used molecular data available for a subset of the ICGC dataset to demonstrate the difference between our identified subtypes regarding their survival curve. We begin with excluding samples of patients that were placed in different subtypes. To estimate survival probability over time, we used Kaplan-Meier  method and created survival curves for each subtype (Fig. 8). As shown in this figure, the difference between subtypes is not only demonstrated by the p-value, but it is also observable in the plot itself that the survival times of identified subtypes are different from each other. Since the data we use to cluster samples is entirely based on the somatic mutation data without any clinical information, this survival plot and p-value explicate influential biological signals. As shown in Fig. 8, more than 75% of patients in C1 (significantly mutated in CSMD1/CNTNAP2) have a good survival length of 10 years. C2 and C5 (significantly mutated in DPP10/PTPRD and DMD, respectively) are also subtypes with a high chance of survival (survival of 13 years for more than 50% of their patients). However, patients in PCDHGA-driven subtype (C13), patients in PCDHA/PCDHGA-driven subtype (C14), and patients in PCDHA-driven subtype (C15) have the most unfortunate results since only 25% of patients of these subtypes have an overall survival of only five years. Moreover, NBPF/USP17-driven subtype (C3) and CSDE1/NRAS-driven subtype (C10) have the worst survival time (all patients in these two subtypes have survival length of less than six years). Our results suggest that the Protocadherin family, USP17 family, NBPF family, NRAS, and CSDE1 substantially affect survival time.