Annotations of Recurrent Structural Variant Events in Pan-Cancer Whole Genome Data for Precision Medicine


In personalized cancer genomic medicine, characterizing a patient’s molecular profile based on comprehensive information is important for maximizing treatment benefits. However, current cancer genome analysis is centered on single nucleotide variation (SNV), gene expression, and copy number variation (CNV) but places little emphasis on structural variations (SV) besides fusions. To date, investigation of SVs has been limited because SV analysis entails a cumbersome annotation process. This study describes the design, development, and implementation of an annotation tool for SVs, termed SVAnnotator. Detailed annotation was performed on the SV detection results of 2,781 whole genome samples from the ICGC/TCGA PanCancer Analysis of Whole Genomes (PCAWG), identifying fusion, exon skipping, gene disruption, and tandem duplication SVs. These SV annotations will facilitate understanding of molecular events and further enhance the utility of precision medicine in stratification, pathogenicity assessment, and drug response. Frequent novel SV events in MACROD2, FHIT, WWOX and CCSER1 were observed across many cancers. Importantly, SV events were frequently identified in well-established tumor suppressor genes including RB1, NF1, PTEN and TP53. As such, it is plausible that potential therapeutic opportunities are overlooked when SV analysis is not appropriately performed. Given the frequency of SVs detected in our study, SV analysis with detailed annotation should be a routine part of comprehensive precision medicine analysis, and further studies are warranted to enhance clinical benefits as well as our understanding of uncharacterized SV events.

SV events are quite common in cancer due to the inherent chromosomal instability associated with the disease [3][4][5][6][7][8] . In some cases, such as chromothripsis [9][10][11] , chromoplexy 12 , and other complex SVs 13 , a tumor may present with multiple SVs acquired as a single event resulting in complex genome rearrangements and genome instability.
In current cancer genomic analysis, copy number variations (CNV) and fusions are typically captured with traditional analytic methodologies; however, subtler events such as gene disruptions, exon skipping, and internal tandem duplications (ITD) are often neglected because annotating detected SVs is laborious with current methodologies. For instance, a typical CNV analysis will likely fail to find partial loss of a gene (defined as disruption in this study) and is also likely to miss exon skipping and ITD events. It is known, however, that genes which are not prone to point mutations can be altered by SV-induced gene disruption and exon skipping. Common fragile sites (CFS) are an example of this phenomenon 14 .
Besides fusions, the current knowledge of SV events in cancer has been primarily limited to EGFRvIII in glioblastoma multiforme (GBM), MET exon 14 skipping in lung adenocarcinoma, AR-V3 and AR-V7 in prostate cancer, and FLT3 ITD in acute myeloid leukemia.
No comprehensive list of recurrent SV events in cancer currently exists, partly due to the complexity of cancer genomes and the variety of signals required for the comprehensive detection of SVs [15][16][17] . The billions of short reads generated by whole genome sequencing make it challenging to develop algorithms and workflows that identify chromosomal break points 18 ; conversely, long read sequencing technologies have been developed, but their increased cost and high error rate are not conducive to accurate detection of SV events 19,20 . Recent technological advances in sequencing and algorithmic detection of SVs, however, provide new opportunities for assessing the role of SVs in cancer 18,[21][22][23][24] .
The purpose of this study was to design, develop, and implement a novel SV annotation tool, SVAnnotator, to discover recurrent SV events using International Cancer Genome Consortium (ICGC)/The Cancer Genome Atlas (TCGA) PanCancer Analysis of Whole Genomes (PCAWG) data in multiple cancer types. Identification of clinically targetable SV events will extend the current spectrum of actionable alterations beyond SNVs, CNVs, and fusions and will provide additional targeted treatment options at the point of care.

Results
We analyzed 2,781 de-identified samples across 31 cancer types from the PCAWG project with the novel SV annotation tool, SVAnnotator, which annotated a total of 200,772 SV events and classified them into the seven SV types shown in Table 1. The most frequent SVs detected were gene disruptions, followed by dysfunctional fusions, fusions, exon skipping, ITDs, deletions, and deletion-insertions. Detailed results of SV annotations for each gene are provided in Table S1. Table 2 lists the top recurrent fusion events identified by SVAnnotator with their frequency by cancer type. We compared the frequency of our most commonly observed fusion event, TMPRSS2-ERG in prostate cancer, with the frequency of the same fusion event in the COSMIC database 25 . We identified 124 TMPRSS2-ERG fusions in 248 ICGC samples (44%) compared with a 38% frequency (2303/6145) in the COSMIC database. Additionally, we identified the KIAA1549-BRAF fusion in 64% (57/89) of pilocytic astrocytoma cases, which is consistent with a previously published study where this fusion was detected in 75% (24/32) of cases analyzed 26 . The complete list of fusions obtained from our analysis can be found in Table S2.  Table S4.
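As a rough illustration of how the KIAA1549-BRAF comparison above could be checked, the two reported proportions (57/89 and 24/32) can be given approximate confidence intervals. The sketch below is not part of the original analysis: it assumes a 95% Wilson score interval, and the function and variable names are illustrative only.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts taken from the text; the interval method itself is an assumption.
this_study = wilson_interval(57, 89)   # KIAA1549-BRAF, pilocytic astrocytoma
published = wilson_interval(24, 32)    # previously published series

# Two intervals overlap when each lower bound lies below the other's upper bound.
intervals_overlap = this_study[0] <= published[1] and published[0] <= this_study[1]
```

Overlapping intervals are only an informal indication that two frequencies are compatible; a formal comparison would need a dedicated two-sample test.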
Dysfunctional fusions do not result in cohesive mRNAs and thus do not engender stable protein products; however, an often-overlooked consequence is that these fusions cause the loss of at least one copy of the genes involved, leading to loss of heterozygosity. In this study we uncovered two novel dysfunctional fusions, ATAD1-PTEN (12) and PTEN-RNLS (13), rendering PTEN haploinsufficient. Table 3 lists the most frequent SVs identified for each cancer type. The disease-specific events were selected using chi-square analysis with a threshold of p-value < 1.0e-25 (Table S5). Notable findings included 14 GBM cases (35%) with exon skipping events leading to constitutively active EGFRvIII and other oncogenic isoforms; 35 (11%) liver cancer and 15 (31%) lung squamous cell carcinoma cases with LRP1B exon skipping; 24 colorectal cancer cases (40%) with RBFOX1 exon skipping; over 20% of bone osteosarcoma cases with DLG2, CSAMP or CNTNAP2 exon skipping; both 5 prime and 3 prime TENM4 fusions in melanoma; and 70 cases (80%) of esophageal adenocarcinoma with FHIT exon skipping located in the FRA3B locus.

Discussion
While advances in NGS technology have focused on SNVs, CNVs, and fusions as primary oncogenic drivers, structural variants have emerged as an under-appreciated yet integral component of the cancer genome. Limitations in analysis methodologies and availability of whole genome sequencing data have precluded the development of a comprehensive database of SVs in cancer. However, in this work, SVAnnotator was able to identify SV events spanning the entire genome. These included SVs in clinically actionable cancer-associated genes, currently non-targetable cancer-associated genes, and genes not yet associated with cancer but representing potentially novel targets for future investigation.
As expected, many of these top recurrent events were identified at CFS, including the FRA3B, FRA2F, and FRA16D loci. Examples from our study include FHIT (localized within FRA3B), which positively regulates antigen presentation 28 ; thus, loss of FHIT might negatively affect immune checkpoint inhibitor response 29 . Conversely, loss of LRP1B (FRA2F) might be associated with positive immune checkpoint inhibitor response 30 . MACROD2 (FRA20B) has recently been reported to be related to chromosomal instability (CIN) in colorectal cancer 31 , and is mutually exclusive with microsatellite instability (MSI), which is known to respond to immune checkpoint inhibitors 32 . WWOX (FRA16D) has been reported to be responsible for metastasis in triple negative breast cancer 33 . Notably, genes like PTPRD, EYS, TMPRSS2, and ERG do not belong to the currently known CFS. The aforementioned four CFS loci are associated with particular cancer types; however, SV events in PTPRD and EYS are more ubiquitous across many cancer types. Our work also reiterates the importance of whole genome sequencing for capturing these non-CFS loci events.
Detection of SVs in well-established cancer genes underscores the fact that clinically actionable alterations may be missed using traditional genomic screening approaches. Most notable in this class of findings was the detection of SV events in NF1, RB1, CDKN2A, SMAD4, TP53, ARID1A/B and PTEN.
NF1 is a known tumor suppressor gene which suppresses the MAPK pathway, and tandem duplication events of NF1 were specifically observed in ovarian cancer in our data set. NF1 loss of function mutations are detected in only 2.5% of the cases in the TCGA ovarian cancer dataset 34 . However, we detected NF1 tandem duplication in 13% (15/112) of ovarian adenocarcinoma patients. As NF1 inactivation affects the mTOR pathway, disruption of NF1 through SV events represents an additional therapeutic opportunity to use inhibitors targeting the mTOR pathway. RB1 repeats were observed in many cancer types including breast, ovarian, pancreatic, and osteosarcoma. This study identified 4%
Our study also identified SV events in many genes whose association to cancer is emerging. As such, these findings may not currently affect clinical decisions but present paths for future investigation.
In summary, the SVAnnotator tool was able to identify various genes that are frequently subjected to SV events in cancer patients. These findings included SV events in well-established cancer genes that are clinically actionable as well as genes with limited association to cancer. Further work is required to elucidate the role of these SVs in tumorigenesis and progression. In conclusion, this study suggests that identification of SVs in cancer can provide additional clinically pertinent insights that may be overlooked by conventional genomic analysis workflows and as such should be integrated as a routine practice of comprehensive precision medicine analysis.

Samples
All samples used in this study were derived from the Pan-Cancer Analysis of Whole Genomes (PCAWG) study subset of the ICGC final consensus genome data (2,781 total samples across 31 cancer types).
We used the results of somatic structural variant detection analysis performed in the PCAWG project.
For 67.3% of samples, tumors were freshly frozen solid samples whereas controls were blood samples.
In contrast, for 20.2% of samples, controls were derived from tumor-adjacent normal tissues or other non-blood tissues, particularly in hematological cancers 49 . 93.8% of the tumor samples came from treatment-free primary cancers, but 6.2% originated from donors with multiple samples of primary, metastatic, and/or recurrent tumors. Due to large contributions from European, North American, and Australian genomic projects, the continental ancestry distribution of the sample was heavily weighted towards Europeans (77%) and East Asians (16%) 49,50 .

Sequencing
Sequencing was performed by each data provider of the PCAWG project using Illumina whole-genome paired-end sequencing, with read lengths of 100-150 bp, on tumor and normal samples.
The mean whole genome sequencing coverage in this set was 30 reads per bp for control samples, whereas tumors had a bimodal coverage distribution with maximums at 38 and 60 reads per bp 49,50 .
The BWA-MEM v0.7.17 algorithm 51 was used to align each tumor and normal sample to Human Genome Reference Build GRCh37 52 , version hs37d5.

Structural variant calling
To generate a consistent call set for cross-tumor type analysis, all samples were analyzed using a uniform set of algorithms for alignment, variant calling, and quality control. The SV call set was generated by one of the working groups of the PCAWG study 50,53,54 . The call set was generated by merging the SV calls from independent calling pipelines, and the merged SV calls were further required to have a consistent copy number change. The analysis results are publicly available at the ICGC Data Portal, https://dcc.icgc.org/releases/PCAWG.
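The merging step can be pictured with a minimal sketch. It does not reproduce the PCAWG working group's procedure: the record fields, the breakpoint tolerance of 200 bp, and the requirement of support from at least two callers are all assumptions chosen for illustration, and the copy number consistency check is not modeled.

```python
from collections import namedtuple

# Hypothetical minimal record for one SV call; field names are illustrative.
SVCall = namedtuple("SVCall", "caller chrom1 pos1 chrom2 pos2 svtype")

def same_event(a, b, tol):
    """Two calls match if type and chromosomes agree and both breakpoints are within tol bp."""
    return (a.svtype == b.svtype
            and a.chrom1 == b.chrom1 and a.chrom2 == b.chrom2
            and abs(a.pos1 - b.pos1) <= tol
            and abs(a.pos2 - b.pos2) <= tol)

def merge_calls(calls, tol=200, min_callers=2):
    """Greedily cluster calls and keep clusters supported by >= min_callers distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom1, c.pos1, c.chrom2, c.pos2)):
        for cluster in clusters:
            # Compare against the first (seed) call of each existing cluster.
            if same_event(cluster[0], call, tol):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]

calls = [
    SVCall("caller_a", "1", 1000, "1", 5000, "DEL"),
    SVCall("caller_b", "1", 1050, "1", 4980, "DEL"),
    SVCall("caller_a", "2", 7000, "2", 9000, "DUP"),  # single-caller call, dropped
]
merged = merge_calls(calls)
```

Comparing each call to the cluster seed keeps the sketch short; a production merger would more likely use interval indexing and orientation-aware breakpoint matching.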

Structural variant annotation; SVAnnotator
We developed an annotation tool named SVAnnotator to automate the cumbersome task of  Figure S3. SVAnnotator identifies loss of an acceptor or a donor side of an exon as exon skipping. However, these losses can be indicative of non-exon skipping events 57 ; it is strongly encouraged that users check actual transcripts to confirm.
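The exon skipping rule described above (loss of the acceptor or donor side of an exon) can be illustrated with a small sketch. This is not SVAnnotator's code: the function name, the plus-strand assumption, and the 1-based inclusive coordinates are illustrative assumptions, and the first and last exons are treated like internal exons for simplicity.

```python
def lost_splice_sides(exons, del_start, del_end):
    """Report exons whose acceptor or donor boundary falls inside a deletion.

    exons: list of (start, end) tuples for one plus-strand transcript,
           in transcript order, 1-based inclusive (assumed convention).
    Returns a list of (exon_number, [lost sides]).
    """
    affected = []
    for number, (start, end) in enumerate(exons, start=1):
        sides = []
        if del_start <= start <= del_end:
            sides.append("acceptor")  # 5' boundary of the exon is deleted
        if del_start <= end <= del_end:
            sides.append("donor")     # 3' boundary of the exon is deleted
        if sides:
            affected.append((number, sides))
    return affected

exons = [(100, 200), (300, 400), (500, 600)]
whole_exon = lost_splice_sides(exons, 250, 450)   # deletion spans exon 2
donor_only = lost_splice_sides(exons, 350, 450)   # removes only the 3' side of exon 2
intronic = lost_splice_sides(exons, 210, 290)     # stays within intron 1
```

A call flagged this way is a candidate only; as noted above, the actual transcripts should be checked before concluding that an exon is skipped.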

Identifying tumor-type-specific SV events
To identify SV events that are specifically enriched in some tumor types, a Chi-Square goodness-of-fit test was employed for each SV event that was characterized by an affected gene name and an SV type.
The test evaluates how a set of observed counts of a specific SV event across multiple tumor types differs from that of expected ones, which are proportional to the sample sizes of the tumor types.
We restricted our analysis to tumor types with greater than 20 samples and focused on the following  (48), and Uterus.AdenoCA (49), where the number in parentheses after each tumor type is the sample size, and the total number of samples is 2,655. We removed 126 samples belonging to 11 other cancer types of smaller cohorts from the analysis to avoid statistical skewing for cancer type specific event extraction.
For each SV event that was characterized by an affected gene name and an SV type, the Chi-Square goodness-of-fit test was applied to the observed counts of the SV across the 27 tumor types. The total number of distinct SV events considered was 99,498. The results, i.e., the p-values and the observed counts of SV events, are shown in Table S5.
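The goodness-of-fit test can be sketched as follows. This is an illustrative reimplementation using only the Python standard library, not the code used in the study: the function names are hypothetical, and the p-value uses a closed form that is valid only for an even number of degrees of freedom (for example, 26 when testing across 27 tumor types).

```python
import math

def expected_counts(observed, sample_sizes):
    """Expected counts proportional to each tumor type's sample size."""
    total_events = sum(observed)
    total_samples = sum(sample_sizes)
    return [total_events * n / total_samples for n in sample_sizes]

def chi_square_statistic(observed, sample_sizes):
    """Pearson statistic: sum over tumor types of (O - E)^2 / E."""
    expected = expected_counts(observed, sample_sizes)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with even df.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, exact for even df only.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Toy data (not from the study): three tumor types, so df = 2.
sample_sizes = [100, 100, 200]
observed = [30, 5, 5]
stat = chi_square_statistic(observed, sample_sizes)
p_value = chi_square_sf_even_df(stat, df=len(observed) - 1)
```

With a general number of tumor types, a library routine such as `scipy.stats.chisquare` with its `f_exp` argument would be the more usual choice than the even-df closed form used here.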

Software Availability
SVAnnotator is available from https://github.ibm.com/ComputationalGenomics/

Legends (Tables and Figures)
Table 1. Number of events by SV type in total and distinct gene and event combinations. SV = structural variant, ITD = internal tandem duplication, DEL = deletion, DELINS = deletion-insertion.
Table 2. Top recurrent fusion events and their frequency in each cancer type.