Using the degree of association between mutation rates and replication timing to characterize tumor samples
Mutation rates (MR) are associated with replication timing (RT). As a rule, MR are higher in LRR (late replication regions) and lower in ERR (early replication regions) [4, 6]. However, analysis of individual tumors revealed that this association varies considerably, and that there are many tumors with weaker association between MR and RT. In order to systematically investigate this phenomenon, we divided the genome into four equal bins of RT (limiting the analysis to genomic regions with constitutive RT, see Methods), and checked the mutation rate in these RT regions for each tumor. Then, we clustered the 2787 WGS tumors into two clusters according to the MR in each bin (Methods). Cluster 1 contains 1042 samples in which the association between MR and RT is weak, while cluster 2 contains the remaining 1745 samples in which the association is stronger (Fig. 1a-b). For simplicity of further analyses, we defined the RT-MRa (“RT-MR association”) metric as the log2 of the ratio between the MR in the late and the early bins (\(\text{R}\text{T}-\text{M}\text{R}\text{a}= {\text{log}}_{2}\frac{\text{n}\text{o}\text{r}\text{m}\text{a}\text{l}\text{i}\text{z}\text{e}\text{d} \text{m}\text{u}\text{t}\text{a}\text{t}\text{i}\text{o}\text{n} \text{c}\text{o}\text{u}\text{n}\text{t} (\text{L}1+\text{L}2)}{\text{n}\text{o}\text{r}\text{m}\text{a}\text{l}\text{i}\text{z}\text{e}\text{d} \text{m}\text{u}\text{t}\text{a}\text{t}\text{i}\text{o}\text{n} \text{c}\text{o}\text{u}\text{n}\text{t} (\text{E}1+\text{E}2) }\)). This RT-MRa metric ranges from − 1 to 2.5, and most of the samples with RT-MRa score < 0.8 were found in cluster 1 (Fig. 1a (the rightmost column) and 1c).
Next, we examined the possibility of an association of the RT-MRa metric with tumor type (Fig. 1d). There are cancer types (i.e., different projects) that contain almost only tumors with weak RT-MRa (cluster 1), such as kidney renal papillary cell carcinoma (KIRP) and breast cancer (BRCA), while others are heavily biased toward strong RT-MRa (cluster 2), such as esophageal adenocarcinoma (ESAD) and lung squamous cell carcinoma (LUSC). Furthermore, projects of similar cancer types showed similar behavior in this tendency (Fig. 1d – different projects of similar tumor types were given the same color).
The two clusters may be a consequence of confounding factors such as mutation load or age at diagnosis. In order to explore this possibility, we calculated the correlation between RT-MRa and those features. Both correlations were distributed normally with a mean of 0 (Fig. 1e), suggesting that RT-MRa levels are not associated with mutation rate or age at diagnosis.
Mutational Signatures Association With The Rt-mra Metric
In order to explore the factors that influence RT-MRa, we next analyzed the contribution of mutational signatures. The mutational signatures found in each tumor reflect the mutational processes that the tumor underwent. We have shown that mutational signatures differ in their association with RT [6], and thus they may explain RT-MR association (RT-MRa). We would hypothesize, therefore, that a tumor that mostly underwent mutational processes that are ERR-biased will end up with relatively more mutations in ERR, resulting in a weaker RT-MRa metric as seen in cluster 1.
To explore this relationship between mutational processes and RT, we checked the correlation between the RT-MRa metric and the relative contribution of different mutational signatures. We found mutational signatures that display positive correlation with the RT-MRa metric (i.e., samples with larger contribution of these signatures have stronger RT-MRa or belong to cluster 2), and mutational signatures that display negative correlation (i.e., samples with larger contributions of these signatures have weaker RT-MRa or belong to cluster 1) (Fig. 2a). In some cases, the reason for the associations is clear: signatures that are associated with ERR (such as SBS2 and SBS13 [22]) are associated with cluster 1 tumors, whereas signatures associated with LRR (such as SBS17a&b and SBS7a [22]) are associated with cluster 2 tumors. In addition, it is known that defects in DNA repair pathways can cause a different association between RT and MR [9]. Indeed, signatures associated with defects in DNA repair mechanisms (APOBEC, base excision repair, mismatch repair and homologous recombination; Fig. 2a - the bold SBS) display negative correlation with RT-MRa, (i.e., contribute more to cluster 1 tumors). Similar association with signatures associated with DNA repair defects was found using the Kruskal-Wallis rank test (see Methods and Supplementary Figure S1).
Next, we analyzed the association of the RT-MRa metric with mutational signatures for each project separately. This allowed us to identify signatures that are cancer-type specific and thus are less prominent in a pan-cancer analysis. This analysis revealed many signatures that are either positively or negatively associated with the RT-MRa metric in specific projects (Fig. 2b). As expected, signatures with a low mean correlation coefficient value are associated with cluster 1, whereas signatures with a high correlation are associated with cluster 2, which is graphically captured in Fig. 2c (left panel). The association between RT-MRa and mutational signatures is partially explained by the RT basis of each signature [6]. Indeed, plotting the average correlation coefficient as a function of the RT bias (delta ERR-LRR) of each signature, demonstrates this association (Fig. 2c right panel). It should be noted that there are signatures, such as SBS7b and SBS16, that show a strong bias toward ERR [6], yet are not associated with low RT-MRa (absent in Fig. 2b-d). This is because those signatures have a relatively small contribution to the mutation load of the tumor, and thus their contribution to ERR is eclipsed by other signatures that contribute more mutations to LRR. For example, the UV related signature SBS7b is associated with ERR, yet we do not find it in samples with low RT-MRa, since it is always accompanied by the LRR associated signature SBS7a, which has an average higher contribution (47%) than SBS7b (16%).
After examining each signature separately, we checked whether combinations of signatures contribute to the RT-MRa. To this end, we represented each tumor by a vector containing the relative contribution of each signature, and used principal component analysis (PCA) followed by K-means clustering to define subgroups of tumors within each project (see Methods). Our approach revealed that there are distinct subgroups in many projects, and we examined whether these subgroups are associated with the RT-MRa metric using Wilcoxon rank sum test. In many projects, such as COAD-US, a strong association was found (Figs. 2d, Supplementary Figure S3 and Supplementary Table S1). This finding illustrates the importance and impact of the contribution of signatures to the association between RT and mutation rates.
Taken together, as expected, mutational processes appear to have a strong association with the RT-MRa in many tumors. Yet in many cancer types we found variation in RT-MRa that cannot be explained by signatures (Supplementary Figure S3 and Table S1).
Identification Of Pathways Significantly Mutated In Tumors With Weak Rt-mr Association
Next, we explored the possibility that the differences in the RT-MRa scores stem from non-functional components of DNA replication, repair and chromatin structure. To this end, we looked at the abundance of deleterious mutations in tumors with weak and strong RT-MRa (clusters 1 and 2 respectively). Since deleterious mutations are rare, we performed this analysis at a pathway level, meaning that instead of asking whether a mutation in a particular gene is enriched in cluster 1, we asked if mutations in any gene belonging to a particular pathway are enriched in cluster 1. For the same reason this analysis was performed in a pan-cancer manner, pooling together cluster 1 and cluster 2 samples from all projects. We focused on a few candidate pathways that we hypothesized might be associated with changing RT-MRa. These pathways included DNA repair, DNA replication and chromatin organization, and as a control we chose others, non-replication related pathways, as well as random sets of genes. We counted the number of deleterious mutations in each gene in those groups separately for cluster 1 and cluster 2 tumors and calculated the enrichment in cluster 1 using a binomial test (see Methods). DNA repair and chromatin organization pathways were highly enriched for deleterious mutations in cluster 1 samples, whereas no such enrichment was observed in any of the controls (Fig. 3a), suggesting that those pathways are involved in the modulation of the association between mutation rates and RT. This conclusion was independent of the definition of the deleterious mutations since similar results were observed using different definitions of effective mutations (Supplementary Figure S4). Interestingly, we got the same results for different DNA repair pathways (Fig. 3b), suggesting that the association between RT and MR is affected by multiple DNA repair mechanisms and not confined to MMR and GG-NER, as have been previously suggested [9, 10]. The ten most enriched genes in each of the three enriched pathways are shown (Fig. 3c).
Identification Of Differentially Expressed Genes In Tumors With Weak Rt-mr Association
In parallel to analyzing the mutated pathways, we performed differential expression analysis to identify correlation of gene expression and pathways with the RT-MRa metric. This analysis was performed in 803 samples for which RNA-seq data exists in addition to the WGS mutation information in ICGC [21]. We analyzed each project separately, since different tissues differ in their expression profiles. As we have shown (Fig. 1d), the RT-MRa metric varies in most projects and thus can be used to define within each project a set of samples with stronger and weaker association between RT and MR. To this end, we divided the samples in each project into three equally sized groups according to the RT-MRa values and used DESeq2 to identify differential expressed genes between the strong RT-MRa and weak RT-MRa groups (Fig. 4a and Supplementary Figure S5). We found numerous genes in most projects that passed the FDR < 0.1 criterion. In order to control for inflated FDR in DEseq2 analyses [23], we calculated an experimental FDR by randomizing the tumors in each project, performed differential expression analysis and counted the number of genes that passed the FDR < 0.1 threshold in the randomized data. Only projects in which the number of differentially expressed genes in the randomized data that passed the threshold was less than 10% of the number of genes identified by DESeq2 in the original data were considered valid and were kept for further analyses. Following this analysis, we were left with 10 projects of interest. 9 of these had genes expressed higher in the weak RT-MRa group, and 6 had genes expressed higher in the strong RT-MRa group (5 projects contained genes in both groups). We performed GO annotation analysis on these genes using the Metascape tool [24], and identified several categories enriched in the weak RT-MRa group. Interestingly, very similar GO terms were enriched in multiple projects (Fig. 4b, Supplementary Tables S3 and S4), suggesting common mechanisms, despite the wide range of tissues and phenotypes. Surprisingly, the reoccurring enriched categories were associated with communications between cells and with the immune system, suggesting that tumors with weak RT-MRa contain a higher degree of infiltration of immune cells (which most probably cause the enrichment of immunological categories). These results were confirmed using a more stringent differential expression identification approach (based on a Wilcoxon test (following [23])) (Supplementary Figure S6, Supplementary Tables S5 and S6). In order to confirm this surprising finding, we tried to confirm the generality of the association with the immune system. To this end, we looked at the group of genes that were enriched in several projects. Counting the number of projects each gene was enriched in revealed a large number of genes enriched in multiple projects (Fig. 4c). The 421 differential genes that are enriched in at least 3 projects are also enriched for immunity-associated processes (Supplementary Figure S7). Furthermore, we were able to show that immune-related genes are expressed higher in the weak RT-MRa group in multiple cancer types (Fig. 4c and Supplementary Figure S8). By contrast, analyzing genes with higher expression in samples with strong RT-MRa did not reveal any association with cell-cell communication or immunology (Supplementary Figure S9, Supplementary Tables S7 and S8).
The association between RT-MRa and immune genes expression could be either causative or affected by a common confounding factor. It is generally hard to infer causality from associations and thus the only thing that can be done is to look for association between the RT-MRa with other potential confounding factors. Although there was no general association of the RT-MRa metric with either age or mutation load (Fig. 1e), we re-examined those associations in the nine tumor types with genes expressed higher in the weak RT-MRa group. We found that in colon cancer (COAD-US) the weak RT-MRa group is associated with high mutation load and also with neoantigens (see Methods). Moreover, for the 26 colon cancer samples, for which we have information about the status of microsatellite instability (MSI) [21], the weak RT-MRa group is associated with MSI positive tumors (P val = 0.015; Chi square) (Fig. 5a). This suggests that for colon cancer, the association with the immune genes may be a consequence of high mutation load. On the other hand, in all other cases such associations were not found (Fig. 5b-c), suggesting that the involvement of the immune genes may be associated directly with mutation distribution. The associations with age at diagnosis were weak (Supplementary Figure S10).