Generating synthetic FASTQ files.
To benchmark the performance of variant callers at calling low frequency variants in ctDNA datasets, we used a set of synthetic datasets with 303 variants from the Catalogue of Somatic Mutations in Cancer (COSMIC v94) spiked in at 6 different low allele frequencies. The COSMIC variants were randomly selected from a set of variants mapping to target regions in a template cfDNA BAM file.
FASTQ files used to generate this template cfDNA BAM file were acquired from the Sequence Read Archive (SRA) with the accession SRR10296599. The sample was from a cfDNA targeted sequencing of a healthy Han Chinese female. The authors used the Roche ctDNA panel kit targeting 17 cancer associated genes [20]. The sequencing library included UMI adapters, however, these were not made available in the data submitted to the SRA. FASTQ files from this sample were mapped to the GRCh38 human reference using BWA-MEM v0.7.17 [21] with default parameters. The output sequence alignment map (SAM) file was converted to the BAM file format using Samtools v1.2 [22].
Two modified copies of this BAM were produced – the first containing 303 COSMIC SNP variants (supplementary material) at 100% allele frequency, and the second containing reference alleles at the same loci at 100% frequency. These modified BAM files were input for a custom Python script used to generate paired-end reads, and synthetic UMI FASTQ files with reads containing the 303 COSMIC variants at allele frequencies of approximately 0.005, 0.01, 0.02, 0.04, 0.05, and 0.075.
Reads from the reference and alternate BAM files were combined to make up desired allele frequencies. Reads with the same start and stop positions for forward and reverse reads were assumed to be PCR duplicates. Reads within PCR duplicate sets were assigned to UMI families by random assignment by sampling from a Poisson distribution where the lambda parameter was equal to 1/mean family size - mean family size was set to 2. UMI sequences were created by generating a random string of letters (A,T,C or G) for the length of the UMI sequence, which was set to 9. The PHRED qualities were set to 37 for all UMI bases. The synthetic data generation process is illustrated in Figure 1A.
Synthetic 200x, Synthetic 450x and Synthetic 850x datasets at each allele frequency were generated in replicates of 5, at approximately 200x, 450x and 850x depths at the site of spiked in variants. The Synthetic 200x, dataset excluded ∼0.005 and ∼0.01 allele frequency BAM files due to insufficient read depths meaning that variants at these frequencies often had no reads supporting them.
Description of real metastatic Breast Cancer samples.
The real publicly available ctDNA samples originated from a 2021 study [23] utilising samples collected from the COMET clinical trial (Clinical Trials Identifier - NCT01745757). The study included collection of ctDNA from 198 metastatic breast cancer patients at two timepoints – prior to, and four weeks post-treatment. Library construction used a custom panel targeting 54 genes. The sequencing library also included UMI adapters which were not available for download from the SRA. Synthetic UMIs were therefore artificially generated and added. We randomly selected 8 pre-treatment samples from this dataset for benchmarking (SRR15081468, SRR15081470, SRR15081472, SRR15081477, SRR15081480, SRR15081482, SRR15081494, SRR15081493). FASTQ files were downloaded from the SRA using corresponding accession numbers.
Pre-processing data.
To generate BAM files in both synthetic and real data, reads were aligned to GRCh38 with BWA-MEM at default parameters. Output SAM files were converted to BAM file format and indexed using Samtools (v1.2).
For datasets without UMIs, read groups were added to output BAM files using the GATK AddOrReplaceReadGroups (v2.27.1) tool, and the GATK MarkDuplicates (v2.27.1) tool was used to mark and discard duplicate reads. The output BAM files were indexed and used as input for variant callers.
In UMI encoded data, fgbio v1.3.0 [18] was used to annotate BAM files with UMI sequences and to collapse UMI families into consensus reads. Read groups were added to output BAM files as previously described.
Figure 1B illustrates how synthetic data was processed with and without UMI sequences. Input files for UMI-aware variant callers were either BAM files with UMIs stored in the RX tag, or FASTQ files with UMI sequences prepended to read sequences. A custom Python script was used to prepend UMI sequences to FASTQ files. Figure 1 C & D illustrate how real data was processed with and without UMIs respectively.
Variant calling tools and parameters.
The variant callers assessed in this study were bcftools v1.16 [24], FreeBayes v1.3.6 [25], LoFreq v2.1.5 [26], Mutect2 v4.2.6.1 [27], UMIErrorCorrect v0.24 [28] and UMI-VarCal v2.5.0 [29]. While not particularly designed to call low frequency calls, we include bcftools for benchmarking here as one of the most widely used variant callers [30]. Conversely, the authors of LoFreq designed the tool for calling low frequency variants. Several published benchmarking studies have shown Mutect2 and FreeBayes have high sensitivity at calling low frequency variants meriting their inclusion in this study. There is currently only a handful of variant callers that natively support UMI encoded data. UMI-VarCal [29] and UMIErrorCorrect [28] are publicly available, and output the widely used Variant Call Format (VCF) file type to report variants and were therefore included for benchmarking.
All callers were ran using default parameters and only variants reported in VCF files were evaluated. UMIErrorCorrect requires a Browser Extensible Data (BED) file which was generated using bedtools v2.30 [31]. Both UMI-VarCal and UMIErrorCorrect require UMI sequences and were therefore excluded from benchmarking on non-UMI datasets.
Variant filtering and annotation pipelines.
We discarded variants with read depths and quality scores of less than 50. Mutect2 does not report quality scores for called variants, therefore this metric was excluded from variant filtering in Mutect2 VCF files. Next, we used snpSIFT (v4.3t) [32] and bcftools to annotate and discard variants reported in dbSNP v138 at >=5% in all populations. No variant filtering was applied to UMI-aware caller VCF files due to incompatibility with with variant filtering tools. To assess performance of callers on real datasets, snpSIFT (v4.3t) was used to annotate VCF files with COSMIC variants. Figure 2 shows an overview of the variant filtering pipeline
Evaluating variant calls and data visualisation.
We assessed the sensitivity of variant callers on the synthetic datasets using the GATK Concordance tool. The truth set was a VCF file of the 303 spiked in COSMIC variants. For the real mBC data, snpSIFT (v4.3t) was used to annotate COSMIC presence in output VCF files. As the synthetic data originated from a real cfDNA sample, it is difficult to determine if a candidate false postive variant is indeed a false positive, or a real variant in the template data. Therefore, to infer specificity in synthetic datasets, we calculated the total number of variants detected, minus true positive calls. These variants are refered to as putative false positive variants. In the real dataset, the concordance of variants called by each caller across all samples was used to gain an insight into false positive discovery rate.The R package UpSetR [33] was used for visualising intersections of called variants.