We aimed at testing to what extent WGS of FF- and FFPE-derived material results in the identification of the same somatic variants (SNVs). Hereto, we used one FF and one FFPE sample of the same metastatic prostate tumor. A matching blood sample, taken from the same patient, was used as a reference to identify germline mutations. We opted for a metastatic sample as this is in general more homogenous than primary samples, reducing differences in variant calls due to intra-tumor and sampling variation.
Quality checking showed that in the matching FF and FFPE samples a comparable number of variants was called (Suppl. Table 1).. No particular bias towards specific mutational patterns was observed, indicating the absence of prominent artifacts in the FFPE sample (Suppl. Table 1).. Both FF and FFPE samples showed a similar average coverage (Suppl. Table 2).. To ensure that comparing the somatic variants called from the FFPE and matching FF samples would be independent of the specificities of the variant caller, we used four somatic variant callers, i.e. Strelka2, Mutect2, VarScan2 and Shimmer.
Comparison between FF and FFPE sample for each caller
All variant callers except VarScan2 were run under default settings (see Materials and Methods).. In general, more calls were reported in the FFPE than in the FF sample, except for VarScan2 (Suppl. Fig. 1).. Table 1 shows the overlap between somatic calls in the matching FF and FFPE samples in terms of sensitivity and precision using the calls from the same variant caller on the FF sample as gold standard. Overall, variant callers have an average sensitivity and precision of respectively 50.97% and 52.41%, meaning that 50.97% of the variants from the FF sample are also detected in the FFPE sample and 52.41% of the FFPE somatic variants are detected in the FF sample. We also report the F1-score which is the harmonic mean of sensitivity and precision. Among the four variant callers considered in this study, Strelka2 achieved both the highest sensitivity and precision.
To get an idea of how good the sensitivity and precision from Table 1 are, we compared the calls from different variant callers on the same sample. In general, the difference between variant callers on the same sample is larger than the difference between the FF and the FFPE sample (Suppl. Fig. 1 and 2).. This implies that the choice of variant caller is at least as important as whether or not the sample was FF or FFPE. Overall, the overlap between variant callers is low, especially for the FFPE sample (Suppl. Fig. 2).. Shimmer agrees the least with the other variant callers, whereas Strelka2 and Mutect2 tend to mutually agree in both the FF and FFPE samples (Suppl. Table 3 and 4)..
While the choice of variant caller turns out to play a pivotal role in the calls that are obtained, the basic analysis above does not account for the properties of the called variants. Indeed, we will demonstrate that the variants considered in Table 1 represent both true variants and artifacts of the variant caller, consistently made in the FF and the FFPE sample. To assess the relevance of the calls made by each of the callers in the overlap between the FF and the FFPE sample, we first assessed to what extent each of the callers tends to reconstruct the same clonal subpopulations in both the FF and the FFPE sample and secondly whether high confidence calls are consistent between both sample types.
Comparison of the clonal subpopulation
Given that the FF and the FFPE sample should have a similar subclonal structure, as they derive from the same metastatic tumor, we judged the relevance of the variants called by either method by assessing whether they corresponded in the FF and the FFPE to the same clonal subpopulations. To visualize the different subpopulations in each sample, we plotted for all detected variants thecoverage as a function of the VAF and the corresponding histogram representing the distribution of the VAFs (15). Fig. 1 shows the results for Strelka2 in both the FF and the FFPE sample. Similar figures for the three other variant callers can be found in the Supplementary Materials. Note that the VAF estimate of the major distributions slightly differs between the different tools because of small discrepancies in their definition of the VAF (Suppl. Fig. 3)..
According to Strelka2 (Fig. 1),, two distinct clusters of variants can be observed in the FF sample, denoted by the blue bars in the histogram: a small cluster at a VAF of 0.05 and a larger cluster around a VAF of 0.25. In the FFPE sample, only one cluster can be observed around a VAF of 0.15. The calls that are common to the FF and the FFPE sample are shown in orange, demonstrating that the peak at a VAF of 0.25 in FF does indeed shift to 0.15 in FFPE. Almost all variants from the largest cluster in the FF sample, i.e. at a VAF of 0.25, are identified in the FFPE sample but most of the FF calls belonging to the smaller peak at 0.05 are not detectable in the FFPE sample.
We observed similar results for Mutect2 with the peak representing the major cluster at VAF 0.25 in the FF sample being shifted to a lower VAF in the FFPE sample (Suppl. Fig. 4).. For VarScan2, only a small proportion of the variants detected at an average VAF of 0.25 in the FF is recovered in the FFPE sample (Suppl. Fig 5).. Shimmer could barely detect the cluster of variants at VAF 0.25 in the FF and completely misses the major cluster in the FFPE sample (Suppl. Fig. 6).. A deeper analysis shows that Shimmer reports many somatic variants in the tumor sample that have a VAF above zero in the normal sample (putative germline calls). This behavior is not expected and also not observed for any of the other variant callers, explaining the poor overlap between Shimmer and the other variant callers (Suppl. Fig 6, Suppl. Table 3 and 4).. Only keeping for Shimmer the variants with zero VAF in the normal sample filtered those putative germline calls and allows to better recover a cluster of variants at VAF 0.25 in the FF, but still not in the FFPE sample (Suppl. Fig. 7)..
In addition to the cluster at VAF 0.25, VarScan2 also reports a peak of variants at an average VAF of 0.5 in the FF sample. These are likely germline variants as more than 65% of variants with VAF above 0.5 are present in dbSNP database (Suppl. Table 5).. Except for Mutect2, which has a similar number of germline calls above and below a VAF of 0.5, other callers reported up to 8 times more calls from dbSNP at a VAF above 0.5. This suggests that calls reported with VAF above 0.5 are more likely to be false positives. Indeed, most of the calls with VAF above 0.5 detected consistently in both the FF and the FFPE sample tend to be residual germline calls.
Hence, the cluster of variants detected at VAF 0.25 likely represents the major subpopulation in the metastatic sample. All variant callers except Shimmer could at least partially recover this subpopulation in the FFPE sample albeit at lower VAF. This shows that the lower effective VAF in the FFPE sample is independent of the variant caller and thus an intrinsic property of the FFPE sample. As a result, calls common to the FF and the FFPE sample were typically reported with lower VAF in the FFPE sample (Suppl. Fig. 8).. Imposing a zero VAF criterion in the normal sample helped Shimmer to recover the major cluster of variants that was also recovered by the other callers, but only in the FF sample. In addition, only a minority of calls (813) are retained of which most (604) were also reported by other callers (Suppl. Fig. 12).. This indicates that many of the calls uniquely made by Shimmer without imposing the zero VAF criteria and reported in Table 1 are likely spurious calls, despite being consistently detected in both the FF and the FFPE sample. For the remainder of the analysis, variants reported by Shimmer with a positive VAF in the normal sample were filtered.
Assessing the significance levels in FFPE
To assess the relevance of the calls in the overlap between the FF and the FFPE sample, we also compared for each caller the distributions of the significance scores of the somatic variant calls detected in the FFPE sample that were also called in the FF sample, hereby assuming that the FF sample is less prone to artifacts and constitutes the reference as to what should be detected. Table 2 shows that calls reported in both samples typically have lower significance scores in the FFPE than in the FF sample, which can be explained by the discrepancies in VAF observed in Fig. 1.
In addition, the boxplots in Fig. 2 show that the FFPE calls that were also made in the FF sample, received a relatively higher significance score than FFPE calls not made in the FF sample. This shows that the most reliable variants in the FFPE sample generally correspond to those detected in the FF sample.
For each caller, we calculated the correlation between the significance levels for common calls and compared it in FF versus FFPE. Suppl. Table 6 shows that the significance levels of Strelka2 and Mutect2 are significantly correlated in both FF and FFPE samples. In addition, the significance levels of between the callers themselves seems to be consistent between the FF and the FFPE sample. For Shimmer and VarScan2, this consistency between both samples cannot be observed. Furthermore, the ranking of these callers is consistent in the FF, but not in the FFPE sample, again indicating that these variant callers are not performing well in the FFPE sample. A possible explanation for this could be that the underlying hypergeometric testing procedure cannot cope with the lower VAFs present in this sample.
Relation between the subclonal structure and the significance level
To investigate the relation between the major subpopulation and the significance of the calls, we map for each variant caller the 25% highest confidence calls on the Coverage versus VAF plots. The upper panel of Fig. 3 shows how for Strelka2 these most significant calls are located around a VAF of 0.25 in the FF and 0.15 in the FFPE sample, and hence make up the aforementioned major subpopulation that was detected in both the FF and the FFPE sample.
For Mutect2, the majority of significant calls also belonged to this cluster of variants (Suppl. Fig. 9).. For VarScan2 and Shimmer (Suppl. Fig. 10 to 12),, a similar effect could be observed, although not as pronounced as in Strelka2 and Mutect2. Subsequently, many of the highly significant calls belong to the major subpopulation at a VAF of 0.25 in FF and 0.15 in FFPE, such that the analysis of the significance levels and the clonal subpopulations points at the existence of a highly confident subset of variants, present in both the FF and the FFPE sample.
In addition, the lower panel in Fig. 3 shows that when comparing the variants obtained from different callers, many variants that are called by other callers are also located in the variant cluster (see also the lower panels of Suppl. Fig. 9–12).. This is in line with the observation that for each variant caller (except Shimmer), calls made by any of the other three variant callers in general received a higher significance especially in the FF sample (Suppl. Fig. 13 and 14).. Using two criteria, based on clonal subpopulation (Fig. 1) and significance score of the variants (Fig. 2),, we could see that calls common to the FF and the FFPE sample tend to belong to the major subpopulation and are highly significant. Importantly, the lower panel of Fig. 3 and Suppl. Fig. 13 and 14 show that variants that are called by more than one caller tend to satisfy these two criteria as well. In the next section, we investigated how well these calls, reported by more than one variant caller, overlap with the calls common to the FF and the FFPE sample.
Sensitive and precise variant calling on FF, FFPE threshold optimization
Previous analysis also showed that at least some of the calls that were made consistently between the FF and the FFPE sample by the same caller tend to be spurious or at least non-somatic. To assure that only biologically relevant calls are considered, we first identify a highly reliable subset of calls in the FF sample. This subset, referred to as the ground truth, can then be used to more qualitatively assess how well each caller can recover these highly reliable calls in the FFPE sample. From our analysis above, an ideal ground truth would consist of all highly reliable calls made in the reference FF sample that also represent the major subpopulation in the metastatic sample. We have shown that calls made by at least two callers tend to satisfy these criteria. Therefore, the ground truth was defined as the union of all calls that were detected by at least two callers in the FF sample.
Our previous analysis also shows that the subpopulation represented by the ground truth is also present in the FFPE sample albeit at lower effective coverage. Hence, fully recovering the subpopulation of highly significant variants from the FFPE sample will require the identification of a significance threshold in the FFPE sample. Because calls in the FF sample were reported with a higher significance score, the threshold in the FFPE will typically be lower than the threshold that would be necessary in the FF sample to capture the ground truth (Suppl. Fig. 13 and 14).. In addition, previous analysis shows that it is feasible to set a threshold in the FFPE sample as calls common to the FF and the FFPE sample tend to have a higher significance level in the FFPE sample (see Fig. 2).. However, the lower effective coverage in the FFPE sample complicates identifying a threshold that distinguishes the true calls from the noise (the significance distribution of noisy and true calls starts overlapping (see Suppl. Fig. 14).. We determined for each variant caller a threshold in the FFPE sample that optimized the F1-score with the FF-derived ground truth (i.e. that optimizes the tradeoff between precision and sensitivity in recovering the ground truth).
Table 3 shows for the different variant callers their sensitivity and precision in recovering this ground truth. This table shows that Strelka2 and Mutect2 are performing the best in recovering from the FFPE sample the calls that belong to the ground truth. Table 3 now quantitatively illustrates how Shimmer underperforms on the ground truth despite calling consistently the same mutations in the FF and the FFPE sample, which was shown in Table 1. This indicates, as was shown above, that most of the calls made by Shimmer in the overlap between the FF and the FFPE sample are likely spurious. The same is to some extent true for VarScan2 because of the high number of residual germline calls.
Table 3 gives an estimate of the expected overlap between the FF and the FFPE sample after optimizing the thresholds in the FFPE sample using the ground truth based on the variants detected in the FF sample. However, often only FFPE samples are available, such that the stringency thresholds cannot be optimized based on observations in the FF sample. Therefore, rather than optimizing the threshold for each caller separately, we assessed to what extent combining the output of different variant callers in the FPPE sample allows recovering the ground truth from the FF sample.
Table 4 shows how taking the intersection of the four variant callers maximizes the precision but comes at the expense of losing almost all sensitivity. As discussed above, Shimmer can barely retrieve the cluster representing the major subpopulation in the FFPE sample (Suppl. Fig. 7) and most of the calls retrieved by Shimmer in the FFPE sample are unique (even after correcting for the so-called somatic calls with high VAF in the normal sample). Because the intersection seems too strict and limits the sensitivity, we considered calls reported by at least three callers. It results in a high precision but still relatively low sensitivity in recovering the ground truth (Table 4) because VarScan2 reports less calls in the FFPE than in the FF sample and loses many true positive calls. Using the calls returned by at least two of the four callers in the FFPE sample drastically increases sensitivity while only slightly decreasing precision. The performance here is even better than the performance of the best caller (Strelka2) that was obtained after optimizing its stringency thresholds based on the ground truth. This indicates that there is definitely some complementarity in the calls made by different callers. Given the lower performances of VarScan2 and Shimmer in recovering the ground truth, one can wonder what their added value is when taking the union of all variants that were detected by at least two callers. To assess the added value of Shimmer, we compared the performances obtained when retaining the ‘union of all variants obtained by at least two callers’ where Shimmer was respectively included and discarded from the ensemble of used callers (Table 5).. Omitting Shimmer from the ensemble resulted in a slightly higher F1-score while only 4 variants belonging to the ground truth could no longer be detected.
Along the same lines, we assessed the added value of VarScan2 by comparing performances of two ensemble consisting of Strelka2 and Mutect2 respectively with and without VarScan2 using again the criterion being called by at least two callers (Table 5).. Compared to the situation without Shimmer, omitting VarScan2 resulted in a slight increase in precision but came at the expense of a slight decrease in sensitivity, 70 variants could no longer be detected by excluding VarScan2 from the ensemble strategy. Hence, including Varscan2 allows detecting more variants common to FF and FFPE samples, but comes with a few additional false positives.
These results show that the FFPE sample can recapitulate approximatively 80% of the highly reliable calls detected in the FF sample, while maintaining a precision of 0.87. Taking calls reported by at least two callers allows effectively distinguishing the true from the spurious calls in the FFPE sample, removing the residual germline variants obtained by VarScan2 and the spurious calls made by Shimmer. It gives an overlap with the ground truth that outcompetes the best variant caller in our hand (Strelka2, with threshold optimized on the ground truth) in terms of both precision and recall.
The relatively low fraction of variants that are unique to the FFPE sample (539/4219) shows that in our sample FFPE artifacts do not heavily bias the calls. The fraction of calls unique to the FF sample corresponds to the less significant calls in that sample (low VAF) and hence might, because of the lower effective coverage in the FFPE sample, no longer detectable (false negatives). Small discrepancies between the FF and the FFPE sample can of course also be ascribed to sampling and tumor heterogeneity (14).