A robust benchmark for evaluating and improving mosaic variant calling strategies

16 The rapid advances in sequencing and analysis technologies have enabled the accurate detection 17 of diverse forms of genomic variants, including germline, somatic, and mosaic mutations. However, 18 unlike for the former two mutations, the best practices for mosaic variant calling still remain chaotic 19 due to the technical and conceptual difficulties faced in evaluation. Here, we present our benchmark 20 of nine feasible strategies for mosaic variant detection based on a systematically designed reference 21 standard that mimics mosaic samples, with 390,153 control positive and 35,208,888 negative single- 22 nucleotide variants and insertion–deletion mutations. We identified the condition-dependent strengths 23 and weaknesses of the current strategies, instead of a single winner, regarding variant allele 24 frequencies, variant sharing, and the usage of control samples. Moreover, feature-level investigation 25 directs the way for immediate to prolonged improvements in mosaic variant calling. Our results will 26 guide researchers in selecting suitable calling algorithms and suggest future strategies for developers. 27

Post-zygotic mutations continuously occur along the zygote-to-adult trajectory, resulting in 28 genetic mosaicism. Recently, the capabilities to directly detect mosaic mutations at the genome level 29 have led to a series of discoveries, including the mutational processes and landscapes involved in 30 human development 1,2 and aging 3,4 , the causes of neurological disorders 5 , and cancer predispositions 6 .

31
As more research questions are being answered, there is growing attention on the complete 32 investigation of mosaicism, warranted by the accurate detection of mosaic mutations.

33
The detection of mosaic mutations is an intricate process not only technically but also

59
Benchmark setting 60 We constructed a new reference standard for evaluation. The detailed information about the 61 procedure, quality control, and the final specifications is described in a separate data description 62 paper 18 . Briefly, the reference standard material is a collection of 39 mixtures of six pre-genotyped 63 normal cell lines, each one harboring 7,566 to 11,606 known single-nucleotide variants (SNVs) and

68
The uniqueness of the reference standard lies in the internal hierarchical structure that mimics the 69 mutation acquisition process under cellular differentiation, such as during the early embryonic 70 development (Fig. 1a). The 39 mixtures belong to one of the three mixture categories (M1-M3) that 71 symbolize three distinct descendants, each one harboring a common and lineage-specific set of

79
We selected nine mosaic detection strategies for evaluation. The inclusion criteria were: (1) 80 algorithms that explicitly aim to detect mosaic mutations 10,19 , (2) procedures that have been used 81 previously to discover mosaic variants 8,20 , and (3) algorithms that can be applied for mosaic mutation 82 detection via simple modifications 9 (Methods). As not all strategies are primarily developed to target 83 mosaic mutations, some results do not represent the intrinsic performance of the baseline algorithm.

84
The nine strategies were classified into four major categories based on their baseline algorithms: 85 mosaic, somatic, germline, and ensemble (Fig. 1b). The mosaic category includes three algorithms  analysis is shown. The dashed line denotes VAF 10%, the data on the plane and axis refer to the shared (blue mutations by altering the filter usage (Fig. 1b left). Likewise, germline callers can be used to detect 108 low-to medium-level mosaic mutations modifying the ploidy assumptions 8 (Fig. 1b right). Although 109 many callers can be applied to these categories, Mutect2 (MT2) 21  The detection of mosaic variants in a single sample is analogous to conventional somatic variant 116 calling without a matched control. The aim of this task is to sort out true mosaic variants from wild-117 type and germline sites, based on the sequence alignment and VAF. As wild-type sites are generally 118 presented in low VAFs derived from sequencing artifacts (< 1%-5%) and the VAFs at germline sites 119 follow binomial-like distributions centered at 50%, most approaches primarily target the variants with 120 intermediate VAFs (Fig. 2a). Moreover, inherent uncertainty exists in distinguishing high-VAF 121 mosaic variants from germline variants, which requires external information, such as the population-

125
In a high-depth (1,100×) setting, MF and MT2 showed the best F1-score in detecting mosaic 126 SNVs, with robust sensitivity and precision in a wide VAF range (4%-20%); MT2 showed higher 127 sensitivity and lower precision than MF (Fig. 2b). We found that most of the performance gain in MF 128 and MT2 was achieved at low VAFs (<10%), benefiting from the strength of somatic variant calling;

129
MF takes the raw calls of MT2 as input. In the high-VAF area (≥20%), other approaches showed their 130 own strengths; for example, HC showed high sensitivity, and mosaic callers (MH, MF, and DM) 131 showed high precision. For most approaches, increases in sequencing depth resulted in an 132 improvement in sensitivity and precision, especially at low VAFs (<10%) (Fig. 2c and 133 Supplementary Fig. 1a). The only exception was DM, which showed the best performance at 250×, 134 at which the algorithm was trained 11 . This implies that image-based deep-learning approaches can be 135 further improved by diversifying the models for various read depths. For INDELs, the best 136 performing approaches differed by VAF range: MF was the best approach at very-low (<5%) and 137 high (≥15%) VAFs, whereas HC20 was the best at medium VAFs (5%-15%) (Fig. 2d). Similar to

157
Further assessment of the consistency of the call sets towards read depth showed an unexpected 158 behavior (Fig. 2f, Supplementary Fig. 1c (Fig. 2g, Supplementary Fig. 1d and 1e). In this case, 165 however, the low consistency may indicate a possibility of further improvement by referring to the 166 true and false calls from other algorithms, or by composing ensemble approaches.

Evaluation of paired sample-based calling 168
Mosaic variants generally affect multiple parts of an individual; thus, sampling more than a single 169 site is a better option. In a sample pair, mosaic variants can exist either in one or both samples, 170 comprising a non-shared or shared form (Fig. 3a upper). Detection of a non-shared mosaic variant is 171 equivalent to the conventional somatic variant detection problem and can be handled using relevant 172 callers; therefore, it was not prioritized here. By contrast, shared variants are harder to call as they 173 should be distinguished from wild-type and germline variants simultaneously. VAFs comparably 174 deviating from 50% (expected in heterozygous) in both samples are useful, but inter-sample 175 imbalance can be present 1 . Nevertheless, there are few tools that directly detect shared mosaic 176 variants. Instead, the modified use of existing algorithms has been used alternatively 7 .

177
We evaluated nine and five strategies that could be applied to detect mosaic SNVs and INDELs, 178 respectively, in paired samples. These strategies are divided into two major categories: two-single and 179 paired (Fig. 3a lower). Strategies in the two-single category attempt to call mosaic variants in each 180 sample and report their intersections. We applied MH, MF, DM, MT2, HC20, and HC200 to two-181 single approaches for the SNVs, and MF, MT2, HC20, and HC200 for INDELs (we will use the 182 suffix '-ts' to represent two-single). Strategies in the paired category take both samples together to 183 call shared mosaic variants. We found that MH, MT2, and M2S2MH were applicable to SNVs, and 184 MT2 was also applicable to INDELs (the suffix '-p' will be used). Among them, M2S2MH is the only 185 strategy that directly targets shared mosaic variants, whereas MH and MT2 required further 186 modifications (Methods).

187
Our evaluations revealed even more complex relationships among algorithms, strategies, and 188 VAFs ( Fig. 3b-3c, Supplementary Fig 2a-2d for all read depths). With the lack of optimal models, 189 the paired approach did not outperform the two-single approaches. Instead, the benefit was algorithm-190 specific. In particular, MT2-p showed lower sensitivity, especially at low (<5%) and high (>25%)

191
VAFs, than MT2-ts. Conversely, MH-p showed better sensitivity than MH-ts, without an increase in 192 the number of false positives. We assume that the joint genotyping model of MH led to the better 193 utilization of the sample pairs. Regarding other callers, M2S2MH showed a robust performance by 194 reinforcing the sensitivity of MH with a read-level rescuing procedure. Germline approaches (HC20-195 ts and HC200-ts) showed high sensitivity at medium to high VAFs (>10%), but called many false 196 positives at VAFs > 25%, rendering the best performing area at 10%-25%. MF-ts showed a 197 comparable performance to that of MT2-ts, with higher precision at very-low (<5%) and high (>25%)  Fig. 2e-2f). Overall, the 220 development of effective paired sample-based models is demanded.
The algorithm and VAF specificity of the performance suggests an instant way to improve the 222 mosaic variant detection on an application level. We partitioned the answer set into 16 (= 4 × 4) VAF 223 areas: four ranges (very-low: < 5%; low: 5%-10%; medium: 10%-25%; and high: > 25%) for each 224 sample, to describe the landscape of the best performing ranges (Fig. 3e). Although MF-ts and MT2-225 ts marked the best F1-score, in general, other algorithms showed a better performance within 226 particular VAF areas, especially for INDEL detection (Supplementary Fig. 3a). Mapping these 227 "local winners" into the VAF space rendered the current best practice for integrating multiple 228 strategies (Fig. 3f). Compared with the single best performing strategy, an ensemble of the five 229 strategies increased the overall F-score from 0.89 to 0.96 and from 0.52 to 0.60 for SNVs and    We inspected the generalized properties of the informative features (Fig. 4b). The sources, scales, and 274 complexity of the 48 features were highly diverse, but could be categorized into three different levels: grouping of AUCs showed that the genotype-level features were substantially more informative than 281 the other categories (average AUC = 0.77 vs. 0.59, p = 3 × 10 -5 and 2 × 10 -5 for the sequencing-and 282 alignment-level features, respectively) (Fig. 4b). Therefore, we suggest that opportunities for 283 improvement lie in the better use of the genotype-level features.

284
Next, we evaluated another usage of such information: a filter with which the call sets are post-285 processed and refined usually by using a single threshold value. A good filter is expected to remove 286 the false calls, while leaving the true calls, and increase the overall accuracy (e.g., F1-score). We 287 tested the efficiency of the 16 independently adjustable filters used in MT2 and HC200 by disabling 288 and comparing the changes in the call sets (Fig. 4c upper). We found that most of the filters removed 289 a substantial number of false positives (0 to 233,874); however, it was also accompanied by the 290 corresponding number of lost true positives (0 to 160,967). Overall, the contribution of the filters to 291 the overall performance (F1-score) was limited (-0.002 to 0.038, mean = 0.003) (Fig. 4c lower).

292
These results imply that the naïve, single threshold-based filtration is not an effective strategy for

296
Stepping forward: additional strategies for mosaic variant calling 297 We have shown that there is no single algorithm that fits all mosaic variant types and more major 298 breakthroughs are evidently needed. Here, we propose and test two strategies that can direct further 299 advances for reinforcing the current approaches or for developing new algorithms.

300
First, we considered a call set-or feature-level recombination of multiple algorithms. We have 301 already shown that an ensemble of multiple algorithms can instantly improve the performance of 302 paired sample-based detection. Similarly, cross-reference of diverse features applied in multiple 303 algorithms would lead to a more fundamental improvement in the short-term. For example, in a single 304 sample setting, MT2 showed high sensitivity, but was accompanied by many false positives at the 305 VAF areas, mostly germline variants. We found that the MT2 call set could be efficiently improved 306 by applying the foreign feature "alt softclip" developed for MF, which removed 27.3% (228/834) of 307 false calls from wild-type sites and lost only 0.009% (27/289,124) of the true answers (Fig. 5a top).   0.36, respectively (Fig. 5b). We presumed that testing full combinations of call sets 336 and features, preferably in a VAF-specific manner, may bring instant benefits without developing a 337 new algorithm ab initio. Second, the usage of multiple samples was tested. Although no such 338 algorithms have been developed yet, considering ≥3 samples can be useful, particularly for improving 339 precision. For example, variants shared in two samples of distal lineages (e.g., ectoderm and 340 mesoderm) are likely to be present in the third sample within the lineages (e.g., another ectoderm). application of this idea was conducted on the 1,944 possible combinations of three different mosaic 343 samples (Fig. 5c). We found that 67.3% and 55.5% of the shared false positives of SNVs, and

344
INDELs could be removed while losing only a small fraction (1.6% and 3.9%) of true positives, 345 thereby further increasing the F1-score from 0.94 to 0.95 and from 0.18 to 0.29 for SNVs and 346 INDELs, respectively (Fig. 5d). Thereby, we anticipate that a generalized algorithm for using 347 multiple tissues would increase the sensitivity and precision of mosaic variant calling.

349
In recent decades, the technologies used for detecting germline and somatic mutations have been

361
Genome in a Bottle 14 ) for germline variants, and similar efforts are being made for somatic variants 16 .

362
However, deriving a robust standard reference is yet to be accomplished for these variants due to the 363 intrinsic difficulties in finalizing confident true calls. The direct engineering of the genome using gene 364 editing technologies (e.g., CRISPR-cas9) is another approach and has been used to produce 365 commercial products 26 . Nonetheless, the small number of true answers confines its usage on an 366 application level (e.g., validation of clinical panels). Generating in silico simulated datasets (e.g.,

367
mixing BAM files) is a simple but powerful method 16 . Although this can popularize the following 368 benchmark studies, we noted that the error profiles, especially the sequencing errors, are restricted in 369 the source data and cannot represent the real-world level artifacts (Supplementary Fig. 3b). Overall,

370
we believe that our approach to construct a standard reference is costly and time-consuming, but the 371 most proper way for securing both the scale and robustness.

372
Despite all efforts, this benchmark has potential pitfalls, particularly in the interpretation of the 373 analysis results, limitation in the search space, and data dependency. First, the performance of the 374 mosaic calling "strategies" should not be confused with their baseline algorithms, especially when 375 they were used in an unintended way. For example, the performance of MT2-p does not directly 376 indicate the somatic mutation calling performance of MT2, with the modified use of normal filters.

377
Likewise, MH-p originally reported variants with inconsistent genotypes in two samples (e.g.,

378
germline in one and mosaic in the other) and has been modified to call the shared variants by referring 379 to the internal genotype probability matrix. The composition of all the two-single approaches for modifications were conducted to test the potentially applicable strategies in the absence of specifically 383 developed algorithms. Moreover, the complexity of the use of parameters limits the benchmark. There 384 are at least 4 to 110 parameters whose use can be adjusted and the number of combinations of which 385 reaches up to tens of millions. Because of the intractability, evaluations have been conducted using 386 default parameters, assuming that the empirical suggestions from developers is close to optimal.

387
Finally, the composition of the standard reference can affect the results, such as the distribution of 388 VAFs in the datasets, sequencing platform, read length, and error profiles. For example, the 389 cumulative cell line mixing (Fig. 1a) produced a large set of robust mosaic variants; however, it can 390 also alter the composition of germline variant sites. We resolved the problem of the loss of pure 391 germline sites by preparing a separated set of reference standard that reserved germline sites 18 .

392
Notably, there are remaining unavoidable noises at the germline variant sites of the original cell lines.

393
Although the effect is assumed as ignorable (see Supplementary Notes), we should be aware of it 394 when evaluating tools that consider the alignment patterns of the flanking regions.

395
In summary, we present the first systematic benchmark of mosaic variant calling strategies. Our 396 analysis revealed the sequencing depth-, VAF-, and error type-specific strengths and diversity of the 397 current algorithms, feasible strategies for ensemble approaches, and directions for future 398 development. We anticipate that our study will be a good starting point for the technical advances in DeepMosaic 10,11 . Variants tagged as "mosaic" and "PASS" were only kept for downstream analysis.
used whenever it was recommended to remove repeats to filter out confounding regions 10,11,19 . For 434 paired-sample analysis, MT2-p, and MH-p were applied with simple modifications to detect the 435 shared mosaic variants. Variants of Mutect2 paired mode tagged as "normal artifacts" were selected 436 for MT2-p to be exploited as an alternative filtering strategy for shared variant detection. In MH-p, 437 paired naïve mode, if (1) the joint probability of two samples with "mosaic" variants was over 0.05 438 and (2) if it was larger than that of any other genotype combination, the variants were considered as 439 shared, whereas the remaining variants remained as sample-specific. In the M2S2MH approach,

447
Performance evaluation of single sample analysis 448 For each approach, the precision, sensitivity, and F1-score were calculated based on the call set.

461
The variant call consistency within an approach was investigated in three categories; the true positives package nVennR 27 (0.2.3) for the set analysis. We also observed the similarity of variant calls among 465 the approaches using the Jaccard index ( Fig. 2f and Supplementary Fig. 1d- Table 2. We observed that the genotype-level features 505 had significantly higher AUCs than those at the sequencing (p = 3 × 10 -5 , Wilcoxon's rank sum test) 506 and alignment levels (p = 2 × 10 -5 , Wilcoxon's rank sum test).

507
To further investigate the efficiency of the filters that could be assessed independently, 16 post-508 filters from MT2 and HC were tested. By disabling each filter, we compared the new F1-score to the