SAVANA algorithm
SAVANA consists of the following steps:
SAVANA uses aligned reads from tumour and normal BAM files to identify breakpoints. Specifically, SAVANA splits the genome into non-overlapping bins of 100 million bp by default, and the alignments from each bin are examined in parallel. Thus, the analysis of large chromosomes is distributed across multiple processes and small chromosomes are analysed in a single process. Primary and supplementary alignments with a mapping quality over the minimum threshold of 5 are considered for further analysis. To ensure that all potential germline SVs are identified and used to discount germline SVs as somatic, a more lenient mapping quality filter is applied to the normal reads. Specifically, the minimum mapping quality threshold is lowered by a fraction of 0.5 for alignments originating from the normal sample.
For each sequencing read, SAVANA parses supplementary alignments (SA) from the SA tag and creates breakpoint objects storing the location, orientation, phasing, and read name information. If the single breakend option is enabled, supplementary alignments with low mapping quality (default < 5) are considered unmapped and their sequence and location are recorded as potential single breakends.
SAVANA parses the CIGAR string for both primary and supplementary alignments, tracking deletion and insertion operations greater than the minimum SV length (30bp by default), and inserted bases in the case of insertion operations. Large deletions can be interspersed with short regions of mapped sequence. To account for this phenomenon, adjacent deletions are merged and reported as a single putative event when interspersed by mapped sequences shorter than 30bp. Soft-clip operations in the CIGAR string greater than 1000bp (by default) without a corresponding supplementary alignment are also tracked at this stage, as well as the soft-clipped bases. To more accurately remove germline SVs from the tumour SV set, the minimum SV length to track an insertion or deletion is reduced by a fraction of 0.20 for the normal BAM file. This step prevents reporting as somatic those germline SVs falling below the minimum length cutoff used to analyse the tumour reads.
In addition, while iterating through aligned reads to identify putative breakpoints, a coverage array is built to track the count of overlapping primary and supplementary alignments, separated by their haplotype. For each haplotype (including none for unphased reads), an array of zeros is created for each contig, with an element for each bin across the genome. The value of the array position corresponding to the bin overlapped by an alignment is then incremented by 1 for the corresponding haplotype and contig array. This array is later used to annotate the depth of sequencing at breakpoints by accessing the number of overlapping reads from each haplotype, as well as the sum of these. The bin size is defaulted to five bases, which can be increased by the user to reduce memory usage and run time.
Putative SVs are recorded using breakend notation following the variant call format (VCF) specifications. In addition, SVs are annotated using the notation scheme established by the Pan-Cancer Analysis of Whole Genomes (PCAWG) project1: + - for deletion-like SVs, - + for duplication-like SVs, + + and - - for inversion-like SVs, and INS for insertions. Putative deletions derived from the CIGAR string are reported as ‘+-’ to ensure that supplementary and gapped alignments supporting the same deletion event are grouped together.
Once information about individual putative breakpoints is collected, adjacent breakpoints are clustered together if they fall within a clustering window. By default, this clustering window is 10bp for all breakpoint types except insertions. We use a clustering window of 250bp for insertions because we frequently observed that insertions typically have more variance in their mapping location than other breakpoint types. Reads supporting each breakpoint are split into clusters based on the contig of mate breakpoints. A final clustering step is performed on the reads supporting each mate breakpoint using a larger clustering window (default of 50bp) to prevent supporting reads from being incorrectly split into multiple events. Insertion events are clustered by their insert size to ensure that grouped reads support insertions of comparable size. Finally, grouped events are split by their respective breakpoint notation, and their supporting information and location in the genome are stored in a dictionary. Single breakends are reported if they are the only type of event present at a given breakpoint cluster. This is required because the detection of single breakends adjacent to other types of events might indicate that some of the sequencing reads in the cluster were long enough to permit complete reconstruction of the underlying SV, whereas other reads only allowed reliable mapping of one side of the new adjacency, thus resulting in single breakends. For example, an insertion of a repetitive sequence might be fully spanned by some sequencing reads, thus allowing complete reconstruction of the event. However, other reads might span only a small fraction of the inserted sequence, thus hampering unambiguous mapping of the inserted sequence, which would be detected as a single breakend.
- Filtering of false positives using machine learning
Breakpoints are encoded using 38 features, including the average MAPQ of supporting reads, the mean, standard deviation, and median SV length, the standard deviation of the start and end mapping locations of supporting reads, etc. (see Supplementary Table 2 for a description of all breakpoint features computed). Sequencing depth in the tumour and normal sample before, at, and after each breakpoint is also computed by summing the coverage array from the first base of the contig to the breakpoint start coordinate as described above. If the input BAMs provided are phased, the haplotype of supporting reads and their phase sets are also reported in the VCF output by SAVANA.
To differentiate clusters of noise or sequencing artefacts from true somatic or germline events, we trained a Random Forest (RF) classifier using Scikit-Learn50. To this aim, we annotated all breakpoints detected by SAVANA in the long read data sets as true somatic SVs if they were also detected as somatic SVs in matched Illumina data or as noise otherwise. To establish a high-confidence set of SVs for model training, we excluded breakpoints which (1) were not found in Illumina but were however supported by at least six tumour reads in the ONT data; (2) mapped within 500bp of a repetitive element reported by RepeatMasker51; (3) mapped within a microsatellite; (4) mapped to an ENCODE blacklist region31; (5) mapped to regions with high read depth in the tumour sample (> 200 reads); and (6) intersected with the location of four or more putative insertions from across the cohort with at least five supporting reads in either a tumour or normal sample.
Features encoding breakpoints (Supplementary Table 2) are read in from raw VCF files and transformed into a Pandas dataframe52. Breakpoints are then split into training and test sets with a 4:1 ratio. We used default parameter values to train the RF models except for the number of trees and the maximum depth. The optimal values for these parameters were determined using cross validation and grid search using a range of 100 to 1000 for the number of trees and 10 to 25 for the maximum depth. To detect somatic SVs on each tumour, we trained an RF model using the labelled SVs from all other tumours to ensure that no data from the tumour to be analysed was used during model training.
Once the RF model is trained, the raw SAVANA VCF file is labelled with an additional CLASS variable in the INFO column that denotes the model prediction. The predictions of the model are used to set the FILTER column to ‘PASS’ if the breakpoint is predicted to be somatic or ‘LIKELY_NOISE’ otherwise. The model predictions are vetted by additional thresholds on the minimum allele fraction and number of supporting reads, and their filter column is updated to “LOW_SUPPORT” or “LOW_AF” if they do not meet minimum thresholds of 0.01 for allele fraction and 3 supporting reads by default. Conversely, breakpoints classified as false by the model with at least 7 tumour supporting reads, a minimum tumour allele fraction of 0.10, a mean read mapping quality over 15 and no clusters of breakpoints of any type mapping to the same location in the normal sample, are rescued and their FILTER column updated to ‘PASS’.
Mondrian Conformal Predictors were implemented as previously described33,34. In brief, the training data was randomly split into two sets: 70% of the datapoints were used to train an RF model, which was applied on the remainder. The fraction of trees voting each class was used as the nonconformity measure. We generated one list of nonconformity measures for each class, which were then used to compute P values using a predefined confidence level. The P values were then used to assign breakpoints the one of the following categories: somatic, likely noise, Both or Null.
Detection of somatic copy number aberrations
SAVANA provides functionalities for the detection of total and allele-specific copy number aberrations and the estimation of tumour purity and ploidy.
First, SAVANA generates non-overlapping bins of 10kbp incorporating SAVANA breakpoints (if provided) across the reference genome. Each bin is annotated with the percentage of non-N bases and its overlap with blacklisted regions when provided. Next, SAVANA counts the number of tumour and normal sequencing reads mapping to each bin across the genome, excluding bins with more than 5% overlap with blacklisted regions and bins encompassing more than 75% non-N bases in the reference genome. Secondary and supplementary read alignments, as well as reads with a mapping quality < 5, are not considered for this analysis. Bins that do not have any read counts are filtered and removed. Binned tumour read counts are then normalised using normal counts from the matched normal sample while accounting for differences in read depth between tumour and matched normal BAM files. Subsequently, binned read counts are log2 transformed providing relative log2 copy number ratio values (log2R). Next, single point outliers in the log2R read counts are smoothened, as previously described53, to reduce noise prior to segmentation of the log2R data. Subsequently, log2R read counts are segmented using circular binary segmentation (CBS), an algorithm for finding changepoints in sequential data54, as follows. First, copy number changepoints are identified using 1,000 permutations setting the threshold for statistical significance to 0.05. Changepoints are then validated using 1,000 permutations and a threshold for statistical significance of 0.01. To handle potential oversegmentation, resulting genomic segments are subsequently merged if their log2R difference is smaller than the 20th percentile of the distribution defined by the log2R difference between all pairs of segments.
To infer tumour purity, SAVANA computes the B-allele frequency (BAF) for all germline heterozygous SNPs in the genome. Next, SAVANA searches for haplotype blocks showing evidence of LOH. To this end, SAVANA computes the bimodality coefficient of the BAF distribution for each haplotype block larger than 500kbp and with at least 10 heterozygous SNPs. Haplotype blocks with read depths larger than twice the mean read depth are excluded, as potentially amplified or gained regions would bias the BAF distribution. Subsequently, the median BAF values are computed for the top ten haplotype blocks with the highest bimodality coefficient value for each of the parental alleles (BAFA and BAFB), and tumour purity is estimated using the following equation:
$$\:Purit{y}_{tumour}\:=\frac{1}{10}\:\:\sum\:_{i=1}^{10}\frac{(1\:-\:2\:*\:(1-median\:BA{F}_{A\_i}\left)\right)\:+\:(1\:-\:2\:*\:(median\:BAF{\:}_{B\_i}\left)\right)}{2}\:\:$$
Once the tumour purity (Puritytumour) is estimated, SAVANA infers tumour ploidy using grid search, as previously described55, encompassing a purity search space of {0.01𝑘 ∣ 𝑘 ∈ [Puritytumour − 0.05 .. Puritytumour + 0.05]} range, and ploidy search space of {0.01𝑘 ∣ 𝑘 ∈ [1.5 .. 5]} range. Each purity-ploidy combination is subsequently assessed for its goodness-of-fit by computing the root mean squared deviation (RMSD) or mean absolute deviation (MAD) weighted by the segment size between the inferred absolute copy number and the nearest integer values. In addition, for each purity-ploidy combination, SAVANA evaluates whether the proportion of copy number states fitted to zero is < 0.1, the proportion of copy number segments that are considered close to the next nearest integer is > 0.5, and the copy number change step size between the two most frequent copy number states is < 2. Purity-ploidy combinations which do not pass these criteria are discarded. The purity-ploidy combination with the best goodness-of-fit (i.e. lowest RMSD or MAD value) is finally used to compute the absolute copy number for each segment using the observed log2R data as follows55:
$$\:Absolut{e}_{copy\:number\:}=\:ploidy\:+\:{(2}^{lo2gR}\:-1\left)\:*\right(ploidy\:+\:\left(\frac{2}{Purit{y}_{tumour}}\right)\:-2)$$
Finally, the optional tumour purity and ploidy values estimated in the previous step are used to compute the allele specific copy number profile for the tumour as previously described56.
Selection of human sarcoma samples and matched germline controls
Fresh-frozen bone and soft-tissue sarcoma samples were obtained from patients consented and enrolled in both the Genomics England 100,000 Genomes Project (G100k) as well as the Royal National Orthopaedic Hospital (RNOH) Biobank, satellite of the UCL/UCLH Biobank for Health and Disease (REC reference 20/YH/0088). Patients did not receive financial compensation for donating samples. Surplus tumour tissue from resection and/or biopsy samples were collected and frozen as part of routine clinical practice. Matched blood samples were used for germline sequencing. Processing of tissue samples for pathology review and molecular analyses was performed at the RNOH for the 100,000 Genomes Project, founded by England's National Health Service in 2012, as previously described57. Fresh-frozen tissue sections [haematoxylin and eosin (H&E); 5 µm] were used to guide the selection of the most viable areas for each tumour specimen in terms of lack of necrosis and tumour cellularity. DNA was extracted from matched tumour and blood samples using established protocols and in accordance with the 100,000 Genomes Project guidelines. DNA was sent for centralised library preparation and sequencing at the Illumina Laboratory Services (ILS) in Cambridge, UK58. Sequencing was performed as part of the 100K Genomes Project. The DNA from tumour and normal samples was sequenced to an average depth of 116 (median 118) and 42 (median 36), respectively. Dual consented somatic and germline genomic data were shared with RNOH and EMBL-EBI. Data linked to local clinical data (no NHS Digital or NHS England data) were shared by Genomics England with RNOH and EMBL-EBI. No analysis was undertaken in the Genomics England Research Environment or National Genomic Research Library.
Selection of glioblastoma tumour samples and matched germline controls
Glioblastoma and matched blood samples were collected in the Neurosurgery Department at Centro Hospitalar Universitário Lisboa Norte (CHULN) and stored less than 1 h after surgery at Biobanco-iMM CAML (Lisbon Academic Medical Center, Lisbon, Portugal). Ethical approval was obtained from the Ethics Committee of CHULN (Ref. Nº 367/18). Written informed consent was obtained from all patients prior to study participation in accordance with the European and National Ethical Regulation (law 12/2005). Patients did not receive financial compensation for donating samples.
DNA extraction and processing for nanopore sequencing
Tumour genomic DNA was extracted from tissue with the Nanobind tissue kit (PacBio SKU 102-302-100) and Germline DNA from blood with the Nanobind CBB kit (PacBio, SKU 102-301-900). After extraction, the genomic DNA (gDNA) was homogenised by 3–10 passes of needle shearing (26G) and 1h of incubation at 50°C. All samples were quantified on a Qubit fluorometer (Invitrogen, Q33226) with the Qubit BR dsDNA assay (Thermo Fisher Scientific, Q32853) and manual volume checks were performed. The DNA size distributions were assessed at each relevant step by capillary pulse-field electrophoresis with the FemtoPulse system (Agilent, M5330AA and FP-1002-0275). 4–10 µg of gDNA were either fragmented to a size of 10-50kb with the Megaruptor 3 (Diagenode, E07010001 and E07010003) or fragmented to a size of 15-25kb with a gTUBE (Covaris, 520079) centrifuged at 1,500rcf. All samples were depleted of short DNA fragments (less than 10kb long) with the SRE kit (Circulomics, SS-100-101-01, now PacBio) or the SRE XS kit (Circulomics SS-100-121-01, now PacBio) when sample availability or concentration was limiting. To generate matched Illumina and nanopore tumour WGS data, DNA aliquots were tumour matched. Cases for which multi-region nanopore and Illumina tumour sequencing was performed, DNA aliquots were tumour and region matched.
Library preparation and nanopore whole-genome sequencing
Sequencing libraries were generated from 600ng up to 1.8µg of gDNA with 1 library prepared for germline samples and 2 libraries prepared for tumour samples with the SQK-LSK110 or SQK-LSK114kits (Oxford Nanopore Technology, ONT), according to manufacturer’s recommendations with minor modifications. Briefly, the samples were end-repaired by adding 2 µl NEBNext FFPE DNA Repair Mix (NEB, M6630) and 3 µl NEBNext Ultra II End Prep Enzyme Mix (NEB, E7546), incubated for 10 min at room temperature followed by 10 min at 65°C, then cleaned up with 1 × AMPure XP beads (Beckman Coulter, A63880) and eluted in 60 µl of Elution Buffer. The end-repaired DNA was ligated with 5 µl Adapter Mix (ONT, SQK-LSK110) using 8 µl NEBNext Quick T4 DNA ligase (NEB, E6056) at 21°C for up to 1h. The adapter-ligated DNA was cleaned up by adding a 0.4 × volume of AMPure XP beads. Sequencing libraries were quantified using the average peak size of the samples determined after sample preparation on a Femto Pulse system. 20fmol of the obtained sequencing libraries were loaded onto R9.4 flow cell (ONT), or 10fmol of libraries were loaded onto R10 flow cells and sequenced on a PromethION48 (ONT). The libraries were stored overnight in the fridge. After 24h, sequencing runs were paused and a DNAse treatment or nuclease flush (ONT, WSH-003) was performed and 20fmol of the libraries were re-loaded on the flow cells.
Nanopore sequencing data analysis
Base-calling was performed using the high accuracy model of Guppy-4.0.11 and a qscore filter of 7. Sequencing reads were aligned to GRCh38 using either minimap2 (version 2.24)(68) with parameters “-ax map-ont-MD”. QC statistics and plots were generated using cramino59 version 0.14.5.
Analysis of artefactual fold-back-like inversions
Fold-back-like inversion artefacts were detected by the presence of reads characterised by having two nearly identical alignments in forward and reverse orientations. As a result, the first alignment starts where the second alignment ends, and both alignments are of equal length (Supplementary Fig. 2). To estimate the number of fold-back-like inversion artefacts per sample, we analysed all aligned reads per sample and classified them as a fold-back-like inversion artefact when the following criteria were met: (1) the read had exactly one primary and one supplementary alignment, both with a minimum mapping quality of 20; (2) the primary and supplementary alignments overlapped but were mapped in opposite orientations; and (3) the distance between the alignment positions corresponding to the start and end of the read were less than 150bp apart from each other. To calculate the rate of fold-back-like inversion artefacts per sample, we determined the total number of reads by counting those reads with a primary alignment and a minimum mapping quality of 20. Samples with more than 5 million artefactual fold-back-like inversions were not considered for further analysis. The code used for this analysis can be accessed at https://github.com/cortes-ciriano-lab/ont_fb-inv_artefacts.
Detection of SVs using existing algorithms
Sniffles2 (version 2.2.0), cuteSV (version 2.1.0), SVIM (version 1.4.2), Severus (version 1.0), and NanomonSV (version 0.5.0) were used for benchmarking. Germline callers (Sniffles2, cuteSV, and SVIM) were run separately on tumour and normal BAM files, all with a minimum SV length of 32bp. Sniffles2 was run with `--output-rnames`. cuteSV was run with recommended parameters: `--max_cluster_bias_INS 100`, `--diff_ratio_merging_INS 0.3`, `--max_cluster_bias_DEL 100`, and `--diff_ratio_merging_DEL 0.3`. SVIM was run in `alignment` mode as pre-aligned BAM files were used. To subtract germline calls from somatic, coordinates of germline SVs for each tool were extracted from their resulting germline VCF. The start and end coordinates were extended by 1000bp and saved to a BED file. A bedtools60 subtraction of this file from the tumour VCF was performed, requiring a fraction of at least 0.01 of the tumour SV to be overlapped by the extended germline SV coordinates for it to be removed. Only tumour SVs with three or more supporting reads were considered. However, all germline SVs were considered. NanomonSV was run with the recommended, `--use_racon` flag, `--single_bnd`, as well as a control panel provided by the tool developers to reduce false positives (available at: https://zenodo.org/records/7017953). Severus was run using haplotyped BAM files, as recommended by the tool. In all cases, SVs mapping to alternative contigs, chromosome M, unplaced contigs, and the Epstein-Barr virus were removed. SVs detected in Illumina and ONT data were considered to support the same event if the underlying breakpoints mapped within 100bp of each other.
Benchmarking SV detection algorithms
The breakpoint coordinates for the SVs in the COLO829 SV truth set from60 were lifted over to GRCh38. The SV truth set was downloaded from zenodo.org/records/4716169#.YL4yTJozYUE (truthset_somaticSVs_COLO829_hg38lifted.vcf), which was filtered to contain only breakpoints that were detected by the authors in ONT WGS data. The VCF file was curated to report insertions in a single entry.
SAVANA includes a module, termed evaluate, to compare VCF files. Specifically, the evaluate module compares breakends and reports them as matched when they are within an overlap buffer (100bp by default). The VCF files generated by the germline SV detection algorithms benchmarked were compared to the COLO829 somatic SV truth set VCF. Specifically, we assessed whether breakends in the truth set mapped within 100bp of the breakends reported by the algorithms benchmarked. These results were then used to compute the precision, recall and F-measure for each algorithm. The evaluate module in SAVANA accounts for both breakend format VCFs, where each breakend of an SV has a line in the VCF, and for VCF formats where both breakends are reported in the same line. We used the same criteria to benchmark the performance of SV detection methods using replicates.
Assembly of derivative chromosomes at single-haplotype resolution using SAVANA SVs
For each phase set defined by Whatshap61, SV supporting reads identified by SAVANA were used as input to generate an assembly using wtdbg262 and subsequently polished using Racon63.
Processing of short-read whole-genome sequencing data
Sequencing reads were mapped to the GRCh38 build of the human reference genome using BWA-MEM64 version 0.7.17-r1188. Aligned reads were processed following the Genome Analysis Toolkit (GATK, version 4.1.8.0) Best Practices workflow to remove duplicates and recalibrate base quality scores65. NGSCheckMate was utilised with default options to verify that sequencing data from tumour-normal pairs were properly matched66. Somatic SVs were detected using GRIDSS66 (v2.12.0, https://github.com/PapenfussLab/gridss), which we ran using default parameter values. B-allele frequency (BAF) information for heterozygous SNP sites was collected using AMBER (v3.5) and read depth ratios for the reference and alternate alleles were computed using COBALT (v1.11). BAF and read depth ratios for heterozygous SNPs were used as input to PURPLE67 (v2.54) to estimate the purity, ploidy, and somatic copy number aberrations of the tumour samples. Microsatellite instability was assessed using PURPLE. AMBER, COBALT, and PURPLE are developed by the Hartwig Medical Foundation and are freely available on GitHub: https://github.com/hartwigmedical/hmftools. The implementation of the WGS data analysis pipeline used in this study is available at: https://github.com/cortes-ciriano-lab/osteosarcoma_evolution.
Statistical analysis and visualisation
All statistical analyses were performed using R version 4.1.3. The level of significance for all statistical analyses was set at 0.05. No statistical method was used to predetermine sample size. Rearrangement and copy number profiles were visualised using the R package ReConPlot68 v0.1. Structural variants were read into R data frames for visualisation purposes using the R package StructuralVariantAnnotation69.