Determining mutational burden and signature using RNA-seq from tumor-only samples

doi:10.21203/rs.2.21641/v2


Research article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 01 Mar, 2021

Read the published version in BMC Medical Genomics →

You are reading this older preprint version


Background: Traditionally, mutational burden and mutational signatures have been assessed by tumor-normal pair DNA sequencing. The requirement of having both normal and tumor samples is not always feasible from a clinical perspective, and led us to investigate the efficacy of using RNA sequencing of only the tumor sample to determine the mutational burden and signatures, and subsequently molecular cause of the cancer. The potential advantages include reducing the cost of testing, and simultaneously providing information on the gene expression profile and gene fusions present in the tumor.

Results: In this study, we devised supervised and unsupervised learning methods to determine mutational signatures from tumor RNA-seq data. We applied the methods to a training set of 587 TCGA uterine cancer RNA-seq samples, and evaluated them on an independent testing set of 521 TCGA colorectal cancer RNA-seq samples. Both diseases are known to be associated with microsatellite instability-high (MSI-H) status and with driver defects in DNA polymerase ɛ (POLɛ).

Among variants called from RNA-seq, we found that the majority (>95%) are likely germline variants; the germline signature derived from dbSNP variants is enriched for C>T changes and is widely applicable to tumor and normal RNA-seq samples. We found significant associations between RNA-derived mutational burdens and MSI/POLɛ status, and no significant relationship between total RNA-seq coverage and derived mutational burdens. Additionally, we found that over 80% of variants could be explained by COSMIC mutational signatures 5, 6, and 10, which are implicated in natural aging, MSI-H, and POLɛ, respectively. For classifying tumor type within UCEC, we achieved a recall of 0.56 and 0.78, and a specificity of 0.66 and 0.99, for MSI and POLɛ respectively. By applying RNA signatures learnt from UCEC to COAD, we were able to improve our classification of both MSI and POLɛ.

Conclusions: Taken together, our work provides a novel method to detect RNA-seq derived mutational signatures, with effective procedures to remove likely germline variants. It can lead to accurate classification of the underlying driving mechanisms of DNA damage repair deficiency.

Epigenetics & Genomics

Cancer

RNAseq

UCEC

COAD

Mutational Burden

Mutational Signatures

Mutational identity of cancers

Clinical treatment of cancer is largely dependent on the mechanistic defect(s) giving rise to a tumor; consequently, the discovery of biomarkers for identifying these defects has become increasingly important. PD-L1 inhibitors are immunotherapy drugs that function by repressing the ability of tumors to disguise themselves from the immune system. PD-1, a receptor on T cells, binds to its ligand PD-L1 on some types of tumor cells, signaling to the T cell not to target the tumor cell^1,2. Tumors that display high levels of genetic diversity are often susceptible to PD-L1 inhibitors, as variants in the DNA sequences lead to the display of abnormal proteins on the tumor cell surface, allowing the immune system to attack the tumor in the absence of the PD-1/PD-L1 “off switch”³. Classification of patients as promising or unpromising candidates for PD-L1 inhibitors is therefore of great importance in therapeutic decisions.

Tumor mutational burden (TMB) is a metric used to quantify the degree of mutational diversity in tumor cells by calculating the number of somatic variants per megabase. Different tumor types display widely different levels of mutational burden, and high TMB is associated with response to immunotherapies⁴.

Mutations in the catalytic subunit of DNA polymerase epsilon (POLɛ), responsible for DNA repair during chromosomal replication, have been associated with uterine and colorectal cancers with a large mutational burden^5,6. The mismatch repair (MMR) system repairs spontaneous mismatches in the DNA, and disruption of MMR leads to polymorphism in the length of microsatellite regions; tumors are accordingly classified as microsatellite instability-high (MSI-H) or microsatellite stable (MSS)⁷. Additionally, an increased mutational burden has been associated with MSI-H tumors⁸.

Disruption of DNA repair mechanisms also results in a unique mutational signature as certain base substitutions occur more frequently based on the mechanism of repair, and with specific, adjacent sequence context. Mutational signatures are calculated over trinucleotide sequence context (e.g. ACA to AGA) resulting in 96 quantitative variables to compare as opposed to the single variable of mutational burden. The Catalogue of Somatic Mutations in Cancer (COSMIC) is a database of 30 mutational signatures identified in a variety of tumor types, including the signature of natural accumulation of mutations during aging.
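The 96 variables mentioned above can be enumerated directly. The following is a minimal sketch, assuming the common convention of reporting each substitution from the pyrimidine (C or T) strand; the function names and the label format such as `A[C>G]A` are illustrative and are not taken from the pipeline described here.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mutation_channels():
    """Return the 96 labels, e.g. 'A[C>G]A' (5' base, substitution, 3' base)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def channel_for(trinucleotide, alt):
    """Map a reference trinucleotide and an alternate base to a channel label.

    Contexts centred on a purine (A or G) are reverse-complemented so that
    every substitution is reported from the pyrimidine strand.
    """
    ref = trinucleotide[1]
    if ref in "AG":
        trinucleotide = trinucleotide.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"
```

Under this convention, the ACA to AGA example in the text falls into the channel `A[C>G]A`, and 6 substitution classes times 16 flanking-base combinations gives the 96 channels.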

Current method for characterization of tumors

Classically⁹, MMR status has been determined using PCR-based methods that examine five microsatellite regions for polymorphic changes in length; however, these tests are labor-, cost-, and time-intensive. More importantly, there is wide variation in mutational burden across MSI-H cases; hence there is a need for a global assay¹⁰. More recently, mutational signature models have been used to predict MMR and POLɛ status from variants identified by next generation sequencing¹¹. However, routine practice requires both a tumor and a normal sample, to allow private germline variants to be subtracted out. Additionally, it is common to use variants identified from DNA sequencing, which captures variants from the entire exome or genome, including variants that are not expressed. Expressed variants detected in RNA-seq are more likely to result in neo-epitopes, which are likely to be targeted by immunotherapy.

In this study, we investigated the efficacy of MSI-H and POLɛ classification from RNA sequencing, using only a tumor sample without its corresponding normal sample. By using RNA sequencing we also obtain gene expression and gene fusion data for the tumor, and the variants we identify are expressed. Finally, using only the tumor removes the requirement to take a normal sample, but also poses the greatest challenge: contending with germline variants, which can make up well over 99% of called variants. We tested both supervised and unsupervised approaches to identify the mutational signature of samples and created a regression-based model. Based on the identified signatures and associated coefficients, we demonstrated a potentially clinically meaningful application: classifying MSI-H and POLɛ tumors, which are candidates for immunotherapies.

To identify somatic variants from uterine corpus endometrial carcinomas (UCEC) and colon adenocarcinomas (COAD), RNA-sequencing alignment files from 587 UCEC and 521 COAD tumor samples were downloaded from The Cancer Genome Atlas (TCGA), and single nucleotide variants were called against the human reference genome hg38 using RVboost (see Methods)¹². The median sample had approximately 55,000 called variants, a majority of which are expected to be private germline variants. We implemented a series of filtering steps aimed at selectively removing germline variants and enriching for somatic variants (Fig. 1A).

The first step was to select for high-confidence variants, removing false positives such as sequencing errors, by requiring a read depth of at least 50 reads at the position, with at least 20 reads supporting the alternative allele. This and subsequent filtering steps greatly reduced the number of variants from the original pool (Fig. 1B). The second step was to remove all population variants previously characterized in the dbSNP database, as these variants are highly likely to be germline. The third step was to remove variants with allelic frequencies between 0.45 and 0.55 (heterozygous) or between 0.95 and 1 (homozygous). Tumor samples are often not 100% pure tumor, and tumors are often mosaic as mutations accumulate in subclones. As a result, the allelic frequencies of somatic variants deviate from the germline values of 0.5 and 1 (Fig. 1C). The number of variants in the resulting list enriched for somatic variants (referred to as the somatic variants for the remainder of the paper) was reduced by a median of 100-fold from the starting variant pool.

The mutational signature of germline variants was determined using the variants removed in filtering steps 2 and 3 (Fig. 1D), and was used to estimate the remaining germline population in the final pool of enriched somatic variants. The mutational signature for each sample’s somatic variants was determined in a similar manner to the COSMIC signatures and the germline signature mentioned above. Linear regression of the trinucleotide frequencies of the somatic variant signatures against those of the germline signature gave a coefficient representing similarity to the germline (see Methods). We then asked whether mutational burden, the number of somatic variants per sequenced Mb, reflected variability in somatic variant numbers or variability in the filtering process that left residual germline variants. To test this, we correlated mutational burden with similarity to the germline signature (Fig. 1E). Mutational burden was independent of the similarity to the germline signature for all three cancer subtypes, POLɛ, MSI-H, and MSS, suggesting that higher mutational burden is not due to contaminating germline signatures.

Calling tumor mutational subtype using regression

We next determined whether the mutational burden of the POLɛ and MSI-H tumor subtypes, both defined by disruption of DNA integrity mechanisms, was increased relative to MSS tumors, which should only acquire somatic variants from the natural aging process. The mutational burden of POLɛ samples showed significant separation from both MSS and MSI-H tumors (Fig. 2A, p-value < 0.001, unpaired t-test). MSI-H samples also showed statistically significant separation (p-value < 0.01, unpaired t-test); however, there was substantial overlap between the mutational burden distributions of MSS and MSI-H samples. One concern was that samples sequenced at higher depth would show an increase in TMB, confounding the analysis of tumor-only mutational burden. However, comparing the number of megabases with sufficient coverage (>25 read depth, as required by our variant filtering) for each sample to TMB showed no significant correlation (Fig. 2B, r = 0.015).

Classification of MSS, MSI-H, and POLɛ samples based on supervised approaches

The mutational signatures for MSS, MSI, and POLɛ tumors from COSMIC (Fig. 3A-C), and the germline signature derived from dbSNP variants were compared to tumor-only TCGA samples. TCGA samples with clinical information on MSI and POLɛ status were grouped to make corresponding MSS, MSI-H, and POLɛ mutational signatures from the somatic variants passing through the filtering criteria (Fig. 3D-F). Increased frequency of T to C (green) changes suggests additional germline variants remain after filtering, influencing the signatures of MSS and MSI-H more than POLɛ.

Regression of MSS and MSI-H samples against the germline signature showed no significant difference between the two groups (Fig. 3G). To test whether POLɛ and MSI-H samples could be differentiated from MSS cases, we regressed the samples’ signatures against the MSI-H and POLɛ COSMIC signatures. MSI-H samples had higher coefficients than MSS samples against the MSI-H signature, although the overlap in the coefficient distributions is high (Fig. 3H). POLɛ samples showed a stronger separation of coefficient values from non-POLɛ samples, likely because the POLɛ signature is less disrupted by contaminating germline variants and is distinct from both the MSS and MSI-H signatures (Fig. 3I). Next, we tested the performance of the MSI-H and POLɛ signatures derived from the tumor-only TCGA samples against the COSMIC signatures. Training and testing sets were randomly selected 10 times and used to calculate recall and specificity. The POLɛ signature had strong recall and specificity, while MSI-H was moderate, consistent with our previous results (Fig. 4).

To quantify the per signature bias during regression to RNA samples, we ran 8000 normal GTEx samples through the pipeline. We found that background bias for each signature was relatively distinct and consistent (S. Fig. 1). Germline showed the highest coefficients, consistent with the idea that a majority of the GTEx variants should be germline, familial variants. The distributions of background coefficients can be subsequently used to derive statistics on how far samples are outliers and are likely to be a particular subtype.

Identification of mutational signatures using an unsupervised approach

To expand the capabilities of the pipeline to potentially identify rare tumor subtypes, we added the MutationalPatterns R package from Bioconductor, a previously described unsupervised approach¹³. Mutational signatures identified by MutationalPatterns looked comparable to the COSMIC signatures used in the supervised approach (S. Fig. 2). Identified signature 3 corresponds to MSS samples (COSMIC signature 5); identified signature 1 corresponds to MSI-H (COSMIC signature 6); and identified signature 5 corresponds to POLɛ (COSMIC signature 10). A majority of samples were explained by the five major signatures identified, with most showing strong correlation to only one signature (S. Fig. 3).

Validation using colorectal TCGA samples

We next tested the performance of signatures generated from the uterine data when applied to a similar dataset of colorectal (CRC) samples from TCGA. The mutational frequencies for each trinucleotide change in MSS, MSI-H, and POLɛ uterine samples were averaged to generate a de novo signature from the tumor-only mRNA sequencing data. The generated signatures were then regressed against the mutational frequencies for 521 colorectal tumors, of which 208 had an MSS, MSI-H, or POLɛ subtype annotated. The regression coefficients for samples of each subtype were elevated for comparisons against the corresponding de novo signatures (Fig. 5A-C). The de novo signatures were able to distinguish between MSS and MSI-H more effectively, and POLɛ just as effectively, compared to regression against the COSMIC signatures (Fig. 5D-F). Taken together, tumor subtype can be called effectively when using signatures that account for the method-specific biases encountered when using RNA from samples lacking a paired normal.
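The averaging step described above can be sketched as follows. This is a minimal illustration, assuming each sample is represented as an equal-length vector of trinucleotide-change frequencies with one subtype label per sample; the function names and the label strings are hypothetical.

```python
def average_signature(sample_profiles):
    """Element-wise mean of equal-length frequency vectors."""
    if not sample_profiles:
        raise ValueError("at least one sample profile is required")
    n_channels = len(sample_profiles[0])
    if any(len(profile) != n_channels for profile in sample_profiles):
        raise ValueError("all profiles must have the same number of channels")
    n_samples = len(sample_profiles)
    return [
        sum(profile[i] for profile in sample_profiles) / n_samples
        for i in range(n_channels)
    ]


def signatures_by_subtype(profiles, labels):
    """Group profiles by subtype label and average each group."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {label: average_signature(group) for label, group in grouped.items()}
```

In practice each profile would have 96 entries; the resulting per-subtype vectors play the role of the de novo signatures that colorectal samples are regressed against.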

The difficulty in reducing the ratio of germline to somatic variants for the purpose of determining tumor subtype from mutational signature initially resulted in the MSS and MSI-H subtypes being nearly indistinguishable. Filtering of variants using quality scores, databases, and allelic frequency greatly reduced the number of germline variants, but there was still significant germline contamination in the mutational signatures of subtypes with low tumor mutational burden. Additionally, the similarity between the MSS and MSI-H signatures further complicated distinguishing between the two subtypes. The increased mutational burden of POLɛ tumors, resulting in a higher ratio of somatic to contaminating germline variants, and a mutational signature distinct from the germline, MSS, and MSI-H signatures, combined to allow POLɛ status to be called from the mutational signature.

Further improving the method by using uterine samples to generate mutational signatures for MSS, MSI-H, and POLɛ subtypes increased the ability to distinguish between MSS and MSI-H tumors in colorectal samples, while maintaining the performance of POLɛ tumors. The improvement coming from using the generated signatures may be the result of intrinsically modeling in germline contamination and biases specific to mRNA sequencing. Future directions could include clustering MSS or MSI-H signatures and regressing against sub-signatures within the subtypes, which may become increasingly important as the number of tumor types analyzed increases.

The ability to call MSS, MSI-H, and POLɛ status from tumor-only samples would simplify the testing process. Often, paired normal samples are not taken with the tumor, or consist of normal tissue adjacent to the tumor, which may confound results by containing some percentage of tumor cells or may not provide enough material to generate quality data. Although we acknowledge the tradeoff of being unable to fully remove germline variant contamination from the tumor mutational burden and signatures, we believe that the ability to classify tumors from one standalone sample, and the cost and logistical benefits that accompany it, are worth continued development.

Another novel advantage our approach has is the use of RNA sequencing data to identify the variants used to calculate mutational burden and generate the signature. The concept of tumor mutational burden is that more mutations in the DNA will lead to abnormal proteins being expressed on the cell surface, which would allow the patient’s immune system to distinguish the tumor as non-self. Variants from RNA sequencing are only those that are expressed from genes, giving a more biologically relevant calculation of tumor mutational burden. Additionally, RNA sequencing gives information on gene expression, alternative splicing, and gene fusions that is not present from DNA based methods. Altogether, our approach provides another facet of information from an already useful technique.

Correct identification of tumor subtype as MSS, MSI-H, or POLɛ can inform important patient treatment decisions, improving the quality of care offered. As medicine moves towards a more individualized approach, a higher focus is placed on characterizing the tumor and personalizing the treatment. POLɛ tumors are such a case: as previously mentioned, a higher tumor mutational burden leads to tumor cells presenting abnormal proteins. These tumors often compensate by overexpressing PD-L1 to disguise the tumor from the immune system. Consequently, these tumors are good candidates for combination treatment with PD-L1 inhibitors, allowing the immune system to target the tumor in addition to other treatments. Developing easier and more accessible methods to properly classify tumors is therefore important to the advancement of patient care.

The mutational burden and signatures of POLɛ tumors are easily identifiable from tumor-only RNA-seq samples. We can call POLɛ samples using mutational burden and/or mutational signatures, using linear regression or unsupervised learning.

Aim of study

Determine the efficacy of classifying MSI-H and POLɛ status from mutational signatures derived from tumor-only, RNA-seq data.

Materials

All raw UCEC and COAD RNA-seq data was acquired from The Cancer Genome Atlas (TCGA https://portal.gdc.cancer.gov/).

Calling variants from RNA-seq TCGA data

BAM files downloaded from TCGA had variants called using RVboost¹². RVboost provides a Q-score as a measure of confidence in each variant call, which is used in the subsequent filtering steps.

Filtering to enrich for somatic variants

Filtering to enrich for somatic variants was performed in three steps. Step 1 was to reduce the number of sequencing errors and required variants to have a Q-score >0.05 and at least 25 supporting reads for the alternative allele. Step 2 removed common population variants by only accepting variants that were not present in the dbSNP database. Step 3 was to remove variants that had allele frequencies close to perfectly heterozygous (0.5) or homozygous (1). Any variants with an allele frequency between 0.45 and 0.55, or 0.95 and 1 were removed.
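The three filtering steps can be sketched as a single predicate. This is a minimal illustration using the thresholds stated in this section; the `Variant` record and its field names are hypothetical and are not the data structures used by the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record; field names are illustrative."""
    q_score: float     # RVboost confidence score
    alt_reads: int     # reads supporting the alternative allele
    total_reads: int   # read depth at the position
    in_dbsnp: bool     # True if the variant is listed in dbSNP

    @property
    def allele_frequency(self):
        return self.alt_reads / self.total_reads


def is_likely_somatic(v, min_q=0.05, min_alt_reads=25):
    """Apply the three filtering steps to one variant."""
    # Step 1: drop low-confidence calls and weakly supported alleles.
    if v.q_score <= min_q or v.alt_reads < min_alt_reads:
        return False
    # Step 2: drop known population variants.
    if v.in_dbsnp:
        return False
    # Step 3: drop allele frequencies typical of germline variants.
    af = v.allele_frequency
    if 0.45 <= af <= 0.55 or 0.95 <= af <= 1.0:
        return False
    return True


def enrich_somatic(variants):
    """Keep only the variants that pass all three steps."""
    return [v for v in variants if is_likely_somatic(v)]
```

For example, a novel variant with 30 of 100 reads supporting the alternative allele (allele frequency 0.3) passes, while the same variant at 50 of 100 reads is removed as near-heterozygous.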

Mutational burden

Mutational burden was calculated as the number of variants remaining after filtering divided by the number of sequenced mega bases with at least 50 read depth. Read depth was calculated using GATK DepthOfCoverage to find the number of bases at 50+ read depth.
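As a worked illustration of this calculation, the sketch below divides a filtered variant count by the covered sequence expressed in megabases. The function name is hypothetical, and the covered-base count is assumed to come from a coverage tool such as the one named above.

```python
def mutational_burden(n_filtered_variants, n_covered_bases):
    """Variants per megabase of sequence covered at the required depth."""
    if n_covered_bases <= 0:
        raise ValueError("no bases meet the coverage threshold")
    covered_megabases = n_covered_bases / 1_000_000
    return n_filtered_variants / covered_megabases
```

With purely illustrative numbers, 550 filtered variants over 20,000,000 sufficiently covered bases gives 550 / 20 = 27.5 variants per megabase.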

Regression of mutational signatures

Regression of the 96 vector mutational signatures to the COSMIC signatures (https://cancer.sanger.ac.uk/cosmic/signatures_v2) was performed in R.
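The regression itself was run in R; the following Python sketch only illustrates the idea and is not the authors' code. It assumes a simple least-squares fit with an intercept of one sample profile against one reference signature, which the text does not specify, and all function names are hypothetical.

```python
def to_frequencies(counts):
    """Convert per-channel variant counts into frequencies that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no variants")
    return [count / total for count in counts]


def regression_coefficient(sample_freqs, signature_freqs):
    """Slope of an ordinary least-squares fit of sample on signature.

    Assumption: simple regression with an intercept, one signature at a time.
    """
    if len(sample_freqs) != len(signature_freqs):
        raise ValueError("profiles must have the same number of channels")
    n = len(sample_freqs)
    mean_x = sum(signature_freqs) / n
    mean_y = sum(sample_freqs) / n
    sxx = sum((x - mean_x) ** 2 for x in signature_freqs)
    if sxx == 0:
        raise ValueError("signature has no variation across channels")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(signature_freqs, sample_freqs)
    )
    return sxy / sxx
```

Under this assumption, a sample identical to the signature yields a coefficient of 1, and a sample that is an even mix of the signature and a flat background yields roughly 0.5, which is the sense in which the coefficient measures similarity.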

Unsupervised signatures

Unsupervised signatures were determined using the pre-existing R package MutationalPatterns (https://bioconductor.org/packages/release/bioc/html/MutationalPatterns.html).

Validation

Validation was performed by taking random samplings of UCEC or COAD classifications and calculating recall and specificity. ROC plots were generated from classifications of the full dataset. Significance between two groups of data was determined by t-test.
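Recall and specificity follow from the counts of true and false calls. The sketch below is a minimal illustration, assuming binary labels where True marks the subtype of interest (for example POLɛ versus non-POLɛ); the function name is hypothetical.

```python
def recall_and_specificity(truth, predicted):
    """Return (recall, specificity) for paired boolean labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)
```

Recall is the fraction of annotated cases that are called correctly, and specificity is the fraction of non-cases that are correctly left uncalled.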

The Cancer Genome Atlas (TCGA), Uterine Corpus Endometrial Carcinoma (UCEC), Colon Adenocarcinoma (COAD), Tumor Mutational Burden (TMB), Catalogue Of Somatic Mutations In Cancer (COSMIC), Mismatch Repair (MMR), Microsatellite Instability High (MSI-H), DNA Polymerase Epsilon (POLɛ)

Ethics approval and consent to participate

The Cancer Genome Atlas is focused on the protection of privacy for all humans participating in the project. The Ethics, Law and Policy group of TCGA established consent guidelines for the effective and fair use of TCGA data. Access is restricted to research use and all data is deidentified to protect patient identities.

Consent for publication

Not applicable

Availability of data and materials

The tool and an example dataset of 100 annotated vcfs are available at the following Github repository: https://github.com/Jessen-Erik/RNA.TOMS

Raw RNA-seq data is available on TCGA.

Competing interests

The authors declare that they have no competing interests.

Funding

Funding was granted through the Mayo Clinic’s Center of Individualized Medicine.

The Center of Individualized Medicine had no influence over any aspect of our study.

Authors’ contributions

Erik Jessen created the filtering pipeline, validated the results, and wrote the manuscript.

Yuanhang Liu performed all unsupervised methods and generated supporting figures.

Jamie Davila, Wang Chen, and Jean-Pierre Kocher provided instruction, guidance, and project design.

All authors have read and approved this manuscript.

Acknowledgements

We would like to thank the Center of Individualized Medicine for funding this study.

1. C Blank, I Brown, A Peterson, M Spiotto, Y Iwai, T Honjo, T Gajewski. PD-L1/B7H-1 Inhibits the Effector Phase of Tumor Rejection by T Cell Receptor (TCR) Transgenic CD8+ T Cells. Cancer Research. 2004;64:3, 1140-1145.
2. C Blank, T Gajewski, A Mackensen. Interaction of PD-L1 on tumor cells with PD-1 on tumor-specific T cells as a mechanism of immune evasion: implications for tumor immunotherapy. Cancer Immunology, Immunotherapy. 2005;54:4, 307-314.
3. A D'Incecco, M Andreozzi, V Ludovini, E Rossi, A Capodanno, L Landi, et al. PD-1 and PD-L1 expression in molecularly selected non-small-cell lung cancer patients. Br. J. Cancer. 2015;112, 95-102.
4. D Le, J Durham, K Smith, H Wang, B Bartlett, L Aulakh, S Lu, H Kemberling, C Wilt, B Luber, F Wong, N Azad, A Rucki, D Laheru, R Donehower, A Zaheer, G Fisher, T Crocenzi, J Lee, T Greten, A Duffy, K Ciombor, A Eyring, B Lam, A Joe, S Kang, M Holdhoff, L Danilova, L Cope, C Meyer, S Zhou, R Goldberg, D Armstrong, K Bever, A Fader, J Taube, F Housseau, D Spetzler, N Xiao, D Pardoll, N Papadopoulos, K Kinzler, J Eshleman, B Vogelstein, R Anders, L Diaz. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade. Science. 2017;357:6349, 409-413.
5. F Supek, B Lehner. Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature. 2015;521, 81-84.
6. Cancer Genome Atlas Research Network, C Kandoth, N Schultz, A Cherniack, R Akbani, Y Liu, H Shen, A Robertson, I Pashtan, R Shen, et al. Integrated genomic characterization of endometrial carcinoma. Nature. 2013;497:7447, 67-73.
7. SA Kuismanen, MT Holmberg, R Salovaara, A de la Chapelle, P Peltomäki. Genetic and epigenetic modification of MLH1 accounts for a major share of microsatellite-unstable colorectal cancers. Am. J. Pathol. 2000;156, 1773-1779.
8. Z Stadler, F Battaglin, S Middha, J Hechtman, C Tran, A Cercek, R Yaeger, N Segal, A Varghese, D Reidy-Lagunes, N Kemeny, E Salo-Mullen, A Ashraf, M Weiser, J Garcia-Aguilar, M Robson, K Offit, M Arcila, M Berger, J Shia, D Solit, L Saltz. Reliable Detection of Mismatch Repair Deficiency in Colorectal Cancers Using Mutational Load in Next-Generation Sequencing Panels. J. Clin. Oncol. 2016;34:18, 2141-2147.
9. N Suraweera, A Duval, M Reperant, C Vaury, D Furlan, K Leroy, R Seruca, B Iacopetta, R Hamelin. Evaluation of tumor microsatellite instability using five quasimonomorphic mononucleotide repeats and pentaplex PCR. Gastroenterology. 2002;123:6, 1804-1811.
10. Z Chalmers, C Connelly, D Fabrizio, L Gay, S Ali, R Ennis, A Schrock, B Campbell, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Medicine. 2017;9:34.
11. P Jia, W Pao, Z Zhao. Patterns and processes of somatic mutations in nine major cancers. BMC Medical Genomics. 2014;7:11.
12. C Wang, J Davila, S Baheti, A Bhagwate, X Wang, J Kocher, S Slager, A Feldman, A Novak, J Cerhan, E Thompson, Y Asmann. RVboost: RNA-seq variants prioritization using a boosting method. Bioinformatics. 2014;30:23, 3414-3416.
13. F Blokzijl, R Janssen, R van Boxtel, E Cuppen. MutationalPatterns: comprehensive genome-wide analysis of mutational processes. Genome Medicine. 2018;10:33.

SupplementaryFigure1.jpg
Supplementary Figure 1: Background regression coefficients of COSMIC signatures in UCEC samples. Histogram of UCEC regression coefficients against (A) a germline signature derived from dbSNP variants, (B) natural aging signature (COSMIC signature 5), (C) MSI-H signature (COSMIC signature 6), and (D) POLɛ signature (COSMIC signature 10).
SupplementaryFigure2.jpg
Supplementary Figure 2: Unsupervised mutational signatures. The mutational frequencies of all five unsupervised signatures identified from UCEC samples and output by mutationalPatterns R package.
SupplementaryFigure3.jpg
Supplementary Figure 3: Sample correlations to unsupervised identified signatures. Sample mutational frequencies were correlated to the top five signatures output by the unsupervised method. The samples are color coded (y-axis bars) based on the clinical annotation of the tumor with regard to MSI and POLɛ status.
SupplementaryFigure4.jpg
Supplementary Figure 4: RNA editing and true somatic events in tumor-only cohort. Histogram of the frequency of RNA editing events in the filtered pool of variants from tumor-only RNA samples (A). Scatterplot comparing the total number of somatic variants in a sample calculated from DNA tumor-normal somatic variant calling to the percentage of true somatic variants in the filtered pool of variants from tumor-only RNA samples (B).

Editorial decision: Major revision
13 Oct, 2020
Review #1 received at journal
02 Oct, 2020
Reviewers invited by journal
28 Sep, 2020
Reviewer #1 agreed at journal
28 Sep, 2020
Editor assigned by journal
17 Sep, 2020
Submission checks completed at journal
16 Sep, 2020
Editor invited by journal
16 Sep, 2020
