Determining mutational burden and signature using RNA-seq from tumor-only samples
Background: Traditionally, mutational burden and mutational signatures have been assessed by DNA sequencing of tumor-normal pairs. Obtaining both normal and tumor samples is not always clinically feasible, which led us to investigate whether RNA sequencing of the tumor sample alone can determine the mutational burden and signatures and, subsequently, the molecular cause of the cancer. The potential advantages include a lower cost of testing and the simultaneous availability of the gene expression profile and the gene fusions present in the tumor.
Results: In this study, we devised supervised and unsupervised learning methods to determine mutational signatures from tumor RNA-seq data. We applied the methods to a training set of 587 TCGA uterine cancer (UCEC) RNA-seq samples and evaluated them on an independent test set of 521 TCGA colorectal cancer (COAD) RNA-seq samples. Both diseases are known to be associated with high microsatellite instability (MSI-H) and driver defects in DNA polymerase ɛ (POLɛ).
Among the variants called from RNA-seq, we found that the majority (>95%) are likely germline variants, yielding a C>T-enriched germline signature (derived from dbSNP variants) that is widely applicable to tumor and normal RNA-seq samples. We found significant associations between RNA-derived mutational burdens and MSI/POLɛ status, and no significant relationship between total RNA-seq coverage and the derived mutational burdens. Additionally, we found that over 80% of variants could be explained by COSMIC mutational signatures 5, 6, and 10, which are implicated in natural aging, MSI-H, and POLɛ defects, respectively. For classifying tumor type within UCEC, we achieved a recall of 0.56 and 0.78 and a specificity of 0.66 and 0.99 for MSI and POLɛ, respectively. By applying the RNA signatures learnt from UCEC to COAD, we were able to improve our classification of both MSI and POLɛ.
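The recall and specificity figures quoted above follow the standard confusion-matrix definitions. The short sketch below illustrates those definitions only; the function name `recall_specificity` and the toy labels are assumptions made for this example and are not the study's code or data.

```python
# Illustrative only: recall and specificity from binary labels.
# The labels below are invented and do not come from the TCGA cohorts.

def recall_specificity(y_true, y_pred):
    """Return (recall, specificity) for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Recall: fraction of true positives (e.g. MSI-H tumors) that are detected.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives that are correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return recall, specificity


# Toy example: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rec, spec = recall_specificity(y_true, y_pred)
print(round(rec, 3), round(spec, 3))  # 0.75 0.833
```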
Conclusions: Taken together, our work provides a novel method to detect RNA-seq-derived mutational signatures, with effective procedures to remove likely germline variants. It can lead to accurate classification of the underlying driving mechanisms of DNA damage repair deficiency.
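One common way to express a sample's variants in terms of known signatures, in the spirit of the regression against COSMIC signatures described in this abstract, is a non-negative least-squares fit. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses six substitution classes rather than the usual 96 trinucleotide contexts, the signature values are invented for illustration (they are not COSMIC signatures 5, 6, or 10), and the solver is a simple multiplicative-update rule chosen because it needs no external libraries.

```python
# Toy sketch: decompose substitution-class counts into non-negative
# contributions ("exposures") from a fixed set of signatures.
# ASSUMPTION: all signature values below are invented for illustration.

CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Hypothetical signatures; each row is a probability distribution over CLASSES.
TOY_SIGNATURES = {
    "toy_aging": [0.05, 0.05, 0.10, 0.10, 0.60, 0.10],
    "toy_msi":   [0.05, 0.05, 0.70, 0.05, 0.10, 0.05],
    "toy_pole":  [0.60, 0.10, 0.10, 0.05, 0.10, 0.05],
}


def fit_exposures(counts, signatures, n_iter=1000, eps=1e-12):
    """Fit non-negative exposures h minimising ||counts - W h||^2.

    Uses the multiplicative update h <- h * (W^T v) / (W^T W h),
    which keeps every exposure non-negative at each step.
    """
    names = list(signatures)
    n = len(counts)
    total = float(sum(counts))
    # Start from an even split of the total variant count.
    h = [total / len(names)] * len(names)
    for _ in range(n_iter):
        # Reconstructed counts under the current exposures.
        recon = [
            sum(h[j] * signatures[name][i] for j, name in enumerate(names))
            for i in range(n)
        ]
        for j, name in enumerate(names):
            sig = signatures[name]
            num = sum(sig[i] * counts[i] for i in range(n))
            den = sum(sig[i] * recon[i] for i in range(n)) + eps
            h[j] *= num / den
    return dict(zip(names, h))


# Usage: build a synthetic sample from known exposures, then fit it back.
true_exposures = {"toy_aging": 100.0, "toy_msi": 300.0, "toy_pole": 50.0}
sample_counts = [
    sum(true_exposures[name] * TOY_SIGNATURES[name][i] for name in TOY_SIGNATURES)
    for i in range(len(CLASSES))
]
fitted = fit_exposures(sample_counts, TOY_SIGNATURES)
print({name: round(value, 1) for name, value in fitted.items()})
```

In a real analysis the same idea would be applied to 96-channel profiles with published signature matrices, typically through an established package rather than a hand-written solver.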
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
This is a list of supplementary files associated with this preprint.
Supplementary Figure 1: Background regression coefficients of COSMIC signatures in UCEC samples. Histogram of UCEC regression coefficients against (A) a germline signature derived from dbSNP variants, (B) natural aging signature (COSMIC signature 5), (C) MSI-H signature (COSMIC signature 6), and (D) POLɛ signature (COSMIC signature 10).
Supplementary Figure 2: Unsupervised mutational signatures. The mutational frequencies of all five unsupervised signatures identified from UCEC samples and output by the MutationalPatterns R package.
Supplementary Figure 3: Sample correlations to unsupervised identified signatures. Sample mutational frequencies were correlated with the top five signatures output by the unsupervised method. The samples are color coded (y-axis bars) based on the clinical annotation of the tumor with regard to MSI and POLɛ status.
Supplementary Figure 4: RNA editing and true somatic events in tumor-only cohort. Histogram of the frequency of RNA editing events in the filtered pool of variants from tumor-only RNA samples (A). Scatterplot comparing the total number of somatic variants in a sample calculated from DNA tumor-normal somatic variant calling to the percentage of true somatic variants in the filtered pool of variants from tumor-only RNA samples (B).
Posted 21 Jan, 2021
On 13 Jan, 2021
On 13 Jan, 2021
On 13 Jan, 2021
On 13 Oct, 2020
Received 02 Oct, 2020
Invitations sent on 28 Sep, 2020
On 28 Sep, 2020
On 17 Sep, 2020
On 16 Sep, 2020
On 16 Sep, 2020
On 03 Jul, 2020
Received 17 Jun, 2020
On 07 Jun, 2020
Received 30 Apr, 2020
On 19 Feb, 2020
On 30 Jan, 2020
Invitations sent on 30 Jan, 2020
On 21 Jan, 2020
On 21 Jan, 2020
On 02 Jan, 2020