Alleloscope: Integrative single cell analysis of allele-specific copy number alterations and chromatin accessibility in cancer


 Cancer progression is driven by both somatic copy number aberrations (CNAs) and chromatin remodeling, yet little is known about the interplay between these two classes of events in shaping the clonal diversity of cancers. We present Alleloscope, a method for allele-specific copy number estimation that can be applied to single cell DNA and ATAC sequencing data, either separately or in combination. This approach allows for integrative multi-omic analysis of allele-specific copy number and chromatin accessibility on the same cell. On scDNA-seq data from gastric, colorectal, and breast cancer samples, with extensive validation using matched linked-read sequencing, Alleloscope finds pervasive occurrence of highly complex, multi-allelic copy number aberrations, where cells that carry varying allelic configurations adding to the same total copy number co-evolve within a tumor. The contributions of such allele-specific events to intratumor heterogeneity have been under-reported and under-studied due to the lack of methods for their detection. On scATAC-seq from two basal cell carcinoma samples and a gastric cancer cell line, Alleloscope detects multi-allelic copy number events and copy neutral loss-of-heterozygosity, enabling the dissection of the contributions of chromosomal instability and chromatin remodeling in tumor evolution.


Introduction 36
Cancer is a disease caused by genetic alterations and epigenetic modifications 37 which, in combination, shape the dysregulated transcriptional programming of tumor 38 cells 1, 2 . These somatic genomic events lead to a diverse cellular population from which 39 clones with advantageous alterations proliferate and eventually metastasize 3 . The 40 comprehensive study of cancer requires the integrative profiling of genetic and epigenetic 41 changes at the resolution of single cells. We combined the analysis of two such genomic 42 dimensions -DNA copy number and chromatin accessibility -through massively parallel 43 single cell sequencing assays. 44 First, consider copy number aberrations (CNAs), through which we have derived much of 45 our current understanding of the relationship between genome instability and tumor 46 evolution 4 . Total copy number profiling, which estimates the sum of the copy numbers of 47 the two homologous chromosomes, is inadequate to characterize some types of cancer 48 genomic aberrations. Such events include the pervasively occurring copy-neutral loss of 49 heterozygosity (LOH) 5-8 , intriguing "mirrored events" 9, 10 where a given tumor may have 50 cancer cells carrying amplification of one haplotype are intermingled with cancer cells 51 carrying amplification of the other haplotype, and the even more complex alterations that 52 are only detectable through allele-specific analysis 11 . While the importance of allele-53 specific copy number has been emphasized in bulk DNA sequencing analysis 5-8, 11 , most 54 single-cell CNV analysis considers only total copy number due to low per-cell coverage 12-55 5 10x linked-read sequencing which provides accurate phasing information [34][35][36] . In these 81 datasets, Alleloscope accurately identifies LOH and mirrored-subclonal amplification 82 events, and finds pervasive occurrence of highly complex, multi-allelic loci, where cells 83 that carry varying allelic configurations adding to the same total copy number co-evolve 84 within a tumor. The ubiquity of such events in all three cancer types analyzed reveal that 85 they may be an important overlooked source of intratumor genetic heterogeneity. 86 Having characterized the complexity of allele-specific CNA events at single cell resolution, 87 we turn to scATAC-seq data from two basal cell carcinoma samples with paired bulk 88 whole exome sequencing data 23 and a complex polyclonal gastric cancer cell line that we 89 analyzed by scDNA-seq. In these samples, we evaluate the accuracy of Alleloscope in 90 genotyping and clone assignment and demonstrate its application to the integrative 91 analysis of CNA and chromatin accessibility. 92

Results 93
Overview of Alleloscope allele-specific copy number estimation 94 First, we briefly overview Alleloscope's method for allele-specific copy number estimation 95 ( Figure 1). Clone assignment and integration with peak signals in scATAC-seq data will 96 be described later. Alleloscope relies on two types of data features: coverage, derived 97 from all reads that map to a given region, and allelic imbalance, derived from allele-98 informative reads that cover heterozygous loci in the region. We start with some essential 99 definitions. For a given single nucleotide polymorphism (SNP) site, we refer to its mean 100 coverage across cells as bulk coverage and its mean variant allele frequency (VAF = ratio 101 of alternative allele read count to total read count) across cells as its bulk VAF. Between 102 the two parental haplotypes, we define the term "major haplotype" as the haplotype with 103 higher mean count across cells. Note that a haplotype may be the "major haplotype" of a 104 sample, but be the haplotype with lesser copy number within some cells. For each 105 individual cell , in any given CNA region, we define two key parameters: (1) the major 106 haplotype proportion ( ! ), defined as the count of the major haplotype divided by the total 107 copy number for the region, and (2) total copy fold change ( ! ), defined as the ratio of the 108 total copy number of the region in the given cell relative to that in normal cells. 109 The genotyping algorithm starts by segmenting the genome into regions of homogeneous 110 allele-specific copy number using both the bulk coverage and bulk VAF profiles ( Fig.1  111 Step 2). This can be achieved by multiple existing algorithms, which may be combined to 112 increase detection sensitivity, see Methods for details. In our analyses of scATAC-seq 113 data, the segmentation relied on the matched scDNA-seq data or the whole-exome 114 sequencing data, which ensures that the putative CNA regions considered for genotyping 115 are not confounded by the broad chromatin remodeling that occur in cancer. 116 Now consider each putative CNA region. An expectation-maximization (EM) based 117 algorithm is used to iteratively phase each SNP and estimate the major haplotype 118 proportion ( ! ) for each cell (Fig. 1, step 3). For each SNP , let " ∈ {0,1} be the indicator 119 of whether the reference allele of SNP j is a component of the major haplotype. An initial 120 estimate . The estimates of ! and " usually converge within a few iterations as described in the 124 Methods. If matched scDNA-seq data are available for a sample sequenced by scATAC-125 seq, " values can be estimated from scDNA-seq and then used to compute ! for each 126 cell in the scATAC-seq data, enabling integration of the two data types. 127 The estimated major haplotype proportions ( 0 ! 's), along with a preliminarily normalized 128 coverage statistic ( 1 ! ), are then used to identify a set of normal cells and diploid regions 129 ( Fig. 1, Step 4). This information is used to estimate an improved relative coverage fold-130 change ( 2 ! ) for each cell within each CNA region. If cell 's true allele-specific copy 131 numbers are homogeneous within the given region, then its true value of ( ! , ! ) should 132 belong to a set of canonical points displayed in Step 5 of Figure 1. Thus, the estimated 133 values ( 2 ! , 0 ! ) are clustered across cells and associated with one of the canonical values 134 to yield the cell-level haplotype profiles for the CNV region. These cell-and region-specific 135 haplotype profiles serve as the base for clone assignment and subsequent integration 136 with peak signals in scATAC-seq data. (Fig. 4b). 137 8 Fig. 1: Overview of allele-specific copy number estimation of single cells with Alleloscope. 1. The algorithm operates on raw read count matrices for reference allele (Ref) and alternative allele (Alt) computed from single cell DNA or ATAC sequencing. 2. First, we obtain a segmentation of the genome based on sample-matched whole genome or whole exome sequencing data using FALCON 5 . If scDNAseq is available, cells can be pooled to derive a pseudo-bulk. 3. For each region derived from the segmentation, simultaneously phase SNPs ( " ! ) and estimate cell major haplotype proportion ( $ " ) by expectation maximization (EM) algorithm. Since we are focusing on only one region, the region indicator is suppressed in our notation here. In the E-step, information is pooled across cells to estimate the phasing of each SNP. In the M-step, information is pooled across all SNPs in the region are pooled to estimate the major haplotype proportion $ " for each cell. The toy example shows a scenario with two cells for a region containing 5 SNPs, with cell 2 carrying an amplification of the major haplotype (in pink). For each cell and each SNP, alleles that are observed in a sequenced read are bolded in black (we assume that only one read is observed, reflecting the sparsity of the data). The true phase ( ! ) of the SNPs and the true major haplotype proportion ( $ " ) are shown. 4. For region let { $ "# } be its estimated major haplotype proportions across cells . Pool data across regions to identify candidate normal cells and candidate normal regions for computing a normalized coverage * "# for region in cell . 5. Alleloscope assigns integer allele-specific copy numbers to each cell for each region based on the ( * "# , $ "# ) pairs.

number estimation 140
First, we explore the single cell landscape of allele-specific CNAs using scDNA-seq data. 141 To validate the phasing and genotyping accuracy of Alleloscope in scDNA-seq data, we 142 used matched linked-read whole-genome sequencing data on three gastrointestinal 143 tumor samples: P5931, P6335 and P6198. Linked-read sequencing, in which one derives 144 reads from individual high molecular weight DNA molecules, provides variants that can 145 be phased into extended haplotypes covering Mb 34-36 . As a result, one obtains accurate, 146 Mb-scale haplotype information from cancer genome. To evaluate the accuracy of 147 phasing, we compared the haplotypes estimated by Alleloscope to the haplotypes 148 obtained from linked-read WGS. Additionally, we used the WGS haplotype to evaluate 149 the allele-specific copy number estimation for each cell and to assess the impact of 150 phasing errors on genotyping accuracy (Fig. 2a). 151 Figure 2b shows the results for the gastric cancer sample from P5931, whose genome-152 wide copy number profile indicates clear CNA events on four chromosomes-chr7, chr8, 153 chr20, and chr21. For each event, the scatter plots of 3 0 ! , 2 ! 4 estimated by Alleloscope 154 and colored by haplotype profiles, are shown in Fig. 2c. Note that the 3 0 ! , 2 ! 4 clusters fall 155 almost directly on top of the expected canonical values (e.g. (1/2, 1) for diploid, (2/3, 1.5) 156 for 1 copy gain of major haplotype). Interestingly, chromosomes 7, 8, and 21 each show 157 subclonal clusters have differing allelic ratios but the same total copy number, which 158 would not be detectable without allele-specific estimation. We denote the major haplotype We compared the phasing estimated by Alleloscope ( . " ) against the whole genome 166 haplotypes. The phasing accuracy is 98% for the deleted region (chr21), ~90% for the 167 two clonal amplifications (on chr8 and chr20), and 79% for the subclonal chr7 168 amplification (shown in the titles of the scatter plots of Fig. 2c). Moreover, we evaluated 169 the genotyping accuracy for some of the somatic alterations. Figure 2d shows scatterplots 170 of 2 ! against major haplotype proportion computed using haplotypes derived from linked-171 read sequencing ( 5 ! ), with the same coloring as Figure 2c. Comparing the scatterplots in 172 from linked-read WGS. Specifically, the concordance is ~100% across all four events (the 175 concordance for each event is labeled in the scatter plots of Figure 2d). This shows that 176 the genotyping algorithm in Alleloscope is robust to errors in phasing (e.g. for chr7). 177 Similar analysis performed for P6335 is given in Supplementary Fig. 1. We also applied 178 CHISEL on P5931, yet it did not work well for this sample due to the low coverage 179 ( Supplementary Fig. 2). 180 Hierarchical clustering of cells in the P5931t sample based on allele-specific copy numbers given by Alleloscope, showing normal cells and 4 main clones, as well as a number of small clones marked by highly confident low-frequency mutations. M: Major haplotype, m: minor haplotype. (c) ( * #" , $ #" ) estimated by Alleloscope for four regions, colored by the inferred haplotype profile. Note that clusters fall on canonical points corresponding to discrete allele-specific copy number configurations. Phasing accuracy for each region is shown in the plot title. In the color legend, M and m represent the "Major haplotype" and "minor haplotype" respectively. (d) Similar to (c), with $ " estimated using known SNP phases from matched linked-reads sequencing data, colored by the haplotype profiles assigned in (c) using Alleloscope without the given phasing information. Genotyping accuracy is labeled in the plots.
Since copy neutral LOH events, common in cancer genomes, can only be identified 182 through allele-specific copy number analysis. We examined the accuracy of Alleloscope 183 specifically for copy-neutral LOH events with a colorectal adenocarcinoma from P6198. 184 This tumor sample had a conventional WGS profile revealing several copy-neutral LOH 185 regions that were not evident when considering the copy number heatmap in cellranger 186 (Fig. 3a). Chromosome 5 presents an illustrative example: The bulk VAF clearly 187 separates this chromosome into two main regions, a normal region followed by a copy-188 neutral LOH (Fig. 3b). Concordantly, Alleloscope reveals a cluster centered at ( , ) = 189 (1,1) corresponding to copy-neutral LOH only for the region on the right (Fig. 3b), which 190 cleanly separates the tumor cells from normal cells. Comparing to the haplotype profiles 191 derived using the haplotypes from linked-read WGS for this sample showed that the 192 accuracy of Alleloscope for copy-neutral LOH events is nearly 100% ( Supplementary Fig.  193 3). 194 (c) Single cell allele-specific estimates ( *, $ ), colored by assigned haplotype profiles, for select regions in the samples P6198t (metastasized colorectal cancer sample), SNU601 (gastric cancer cell line), P6335 (colorectal cancer sample), and BC10X (breast cancer cell line). In the color legend, M and m represent the "Major haplotype" and "minor haplotype" respectively. The lower-case letters following the chromosome number in the titles denote the ordered genomic segments.

Alleloscope finds pervasive occurrence of polyclonal CNA regions differentiated 196 by haplotype ratios 197
The scDNA-seq samples included in this study are shown in Table 1  is high for all LOH and amplification event types that create an allelic imbalance 248 ( Supplementary Fig. 3). 249

Juxtaposition of single cell copy number and chromatin remodeling events by 250 integrative scATAC-seq analysis 251
To illustrate the integrative analysis of scATAC-seq data, we first consider two basal cell 252 carcinoma samples with matched whole-exome sequencing (WES) data 37 . Using the 253 matched WES data, the genome of each sample was first segmented into regions of 254 homogeneous bulk copy number (Fig. 4a, middle panel shows the segmentation for 255 SU008). Alleloscope was then applied to the scATAC-seq data to derive allele-specific 256 copy number estimates of each cell in each region. Scatterplots of ( 2, 0 ) for five example 257 data, allows us to detect subclones. In parallel, a peak-by-cell matrix can be computed 279 following standard pipelines. Then, the subclone memberships or CNA profiles can be 280 visualized on the low-dimensional embedding of the peak matrix, and the subclones can 281 be further compared in terms of peak or transcription factor motif enrichment. Precise 282 haplotype profiles for each subclone then allow us to identify significantly 283 enriched/depleted peaks after accounting for copy number differences, thus delineating 284 events that are uniquely attributable to chromatin remodeling. 285 Hierarchical clustering using major haplotype proportion 0 identifies the tumor cells from 286 the normal cells for both SU006 (Supplementary Fig. 11) and SU008, and clearly 287 delineates a subclone in SU008 marked by a copy-neutral LOH event on chr4a (Fig. 4c). 288 Focusing on SU008, we call the cell lineage that carries the chr4a LOH event clone-2, 289 and the other lineage clone-1. In parallel, clustering by peaks cleanly separates the tumor 290 cells from the epithelial cells and fibroblasts ( Fig. 4d: left), and further, demarcates two 291 distinct clusters in the tumor cells (peaks-1 and peaks-2) ( Fig. 4d: middle). What is the 292 relationship between the peaks-1 and peaks-2 clusters obtained from peak signals to the 293 two clones delineated by chr4a LOH? Coloring by chr4a major haplotype proportion ( 0 ) 294 on the peaks-derived UMAP shows that the LOH in this region is carried by almost all of 295 the cells in peaks-2 but only a subset of the cells in peaks-1 ( Fig. 4d: middle). This can 296 also be clearly seen in the density of 0 ( Fig. 4d: right): While 0 is heavily concentrated 297 near 1 for peaks-2, it is bimodal for peaks-1. Since clone-1 and clone-2 are differentiated 298 by a copy-neutral event, this separation by peaks into two clusters is not driven by broad 299 differences in total copy number. Since clone-2 is split into two groups of distinct peak 300 signals, we infer that the chromatin remodeling underlying the divergence of the peaks-2 301 cells must have occurred in the clone-2 lineage, after the chr4a LOH event (Fig. 4e). In 302 this way, Alleloscope analysis of this scATAC-seq data set allowed us to overlay two 303 subpopulations defined by peak signals with two subpopulations defined by a subclonal 304 copy-neutral LOH, and infer their temporal order. 305 heterogeneity of SU008 is shaped by a subclonal LOH of chr4a followed by subsequent genome-wide chromatin remodeling leading to three subpopulations: Clone 1 which does not carry the chr4a LOH (peaks cluster 1), Clone 2 carrying the chr4a LOH (peaks cluster 1), and remodeled clone 2 (peaks cluster 2).

306
Integrative analysis of clonal evolution and altered chromatin accessibility for a 307 complex polyclonal gastric cancer cell line 308 The gastric cancer cell line SNU601 exhibits complex subclonal structure, as evidenced 309 by multiple multiallelic CNA regions (chr3e and chr18a are shown in Figure 3c). In addition 310 to scDNA-seq, we also performed scATAC-seq on this sample to profile the chromatin 311 accessibility of 3,515 cells at mean coverage of 73,845 fragments per cell. This allows us 312 to compare the allele-specific copy number profiles obtained by scATAC-seq with those 313 given by scDNA-seq and integrate the two data types in a multi-omic characterization of 314 this complex tumor. 315 First, we segmented the genome and estimated the allele-specific copy number profiles 316 of single cells at each segment for both the scATAC-seq and scDNA-seq data, following 317 the procedure in Figure 1 with some modifications due to the lack of normal cells to use 318 as control for this sample (see methods). Figure 5a shows the relative total coverage, 319 pooled across cells from scDNA-seq. Figure 5b shows ( 2, 0 )-scatterplots for five example 320 CNA regions in scDNA-seq and scATAC-seq. Compared to the scATAC-seq data, the 321 scDNA-seq data has about 8-fold higher total read coverage and 7-fold higher 322 heterozygous site coverage per cell. Thus, while subclones corresponding to distinct 323 haplotype profiles are cleanly separated in the scDNA-seq data, they are much more 324 diffuse in the scATAC-seq data. Yet, cluster positions in scATAC-seq roughly match those 325 21 in scDNA-seq. As expected, the ( 2, 0 )-scatterplots reveal the high level of chromosomal 326 instability in this sample, with each region exhibiting multiple clusters of different 327 haplotype structures that indicate the existence of subclones carrying mirrored events 328 and, for some regions, the variation of haplotype dosage over a gradient across cells. 329 Figure 5c shows the hierarchical clustering of cells from scDNA-seq based on their allele-330 specific copy number profiles, revealing the subclonal structure and the co-segregating 331 CNA events that mark each subclone. For each cell in each region, Alleloscope also 332 produces a confidence score for its assignment to different haplotype profiles 333 ( Supplementary Fig. 12). Based on visual examination of the confidence scores at the 334 marker regions, we identified 6 subclones for further investigation (Clones 1-6 labeled at 335 the right of the heatmap). The allele-specific copy number profiles allow us to manually 336 reconstruct the probable evolutionary tree relating these 6 clones under the following 337 three rules: 338 (1) Parsimony: The tree with the least number of copy number events is preferred. 339 (2) Monotonicity: For a multi-allelic region with escalating amplifications (e.g. Mm, MMm, 340 MMMm), the haplotype structures were produced in a monotonic order (e.g. Mmà 341 MMmà MMMm) unless a genome doubling event occurred. 342 (3) Irreversibility of LOH: Once a cell completely loses an allele (i.e. copy number of that 343 allele becomes 0), it can no longer gain it back. 344 The evolutionary tree, thus derived, is shown in Figure 6b. The mirrored-subclonal 345 amplifications on chr3q, the deletion on chr4p, and the multiallelic amplification on chr20q 346 allowed us to infer the early separation of clones 3-6 from clones 1-2. Subclones 3-6 are 347 confidently delineated by further amplifications on chr3q, chr20q, chr11, chr13, and chr17. 348 Note that high chromosomal instability led to concurrent gains of 1q and 7p in both the 349 Clone 1-2 and Clone 3-6 lineages. We also observed a large number of low-frequency 350 but high-confidence CNA events indicating that ongoing chromosomal instability in this 351 population is spawning new sporadic subclones that have not had the chance to expand. 352 We now turn to scATAC-seq data, focusing on the 10 marker regions which, together, 353 distinguish Clones 1-6: chr1b, 3b-d, 4b, 7a, 11b, 13b, and 20b-c. The haplotype profiles for each region, can now be integrated with peak-level signals. 363  which gives a two-dimensional visualization of the geometry of the chromatin accessibility 367 landscape of this sample (Fig. 6a). UMAP scatterplots colored by clone assignment show 368 that the 6 clones exhibit marked differences in their chromatin accessibility profiles ( both clones 3 and 4. We expect some of these peak-level differences to be driven by 375

CNAs. 376
To delineate the peaks that differ between clones, and to distinguish peak differences 377 that are not accountable by CNAs, we identified differential accessibility peaks (DAPs) 378 across each split of the tree (Fig. 6b)  Nevertheless, along some branches we find a large number of DAPs not attributable to 391 broad CNAs, and thus must be due to other mechanisms. Two example DAPs of this 392 latter category are shown as insets in Figure 6b, with full list given in Supplementary Table  393 1. The first example is a peak at the transcription start site (TSS) of the REC8 gene, which 394 is located on chr14 where no apparent CNAs were observed across the six major LOH across all tumor cells, there are no detectable subclonal differences, and thus we 406 don't expect the decrease in accessibility at WWOX for subclone 3 to be due to a large 407 copy number event. Since WWOX is a well-known tumor suppressor whose down-408 regulation is associated with more advanced tumors 40, 41 , its decrease in accessibility 409 suggests a more aggressive phenotype for Clone 3. Overall, these two examples show 410 how Alleloscope can be used to dissect the roles of CNA and chromatin-level changes in 411 the identification of gene targets for follow-up study. 412 Fig. 6: Integrative analysis of allele-specific copy number and chromatin accessibility for SNU601 ATAC sequencing data. (a) UMAP projection of genome-wide scATAC-seq peak profile on 2,753 cells. The same group of cells were clustered into one of the six subclones based on their allele-specific copy number profiles across the 10 selected regions. Cells in different subclones are labeled with different colors, using the same color scheme as that for the subclone labels in Fig. 4c. The number of cells colored in each UMAP is shown at the bottom-right corners. (b) A highly probable lineage history of SNU601, with copy number alterations (CNAs) and differential accessibility peaks (DAPs) marked along each branch. P-values of the tests for association between DAPs and CNAs are shown along each branch. For two example DAP genes, pooled peak signals for each subclone are shown as inset plots.

Discussion 414
Despite the recent advances in the application of single cell sequencing to cancer, we are 415 still far from understanding the diversity of genomes that are undergoing selection at the 416 single cell level. Notably, little is yet known about the intratumor diversity of allelic 417 configurations within CNV regions, and to what extent the diversity of cells in chromatin 418 accessibility can be attributed to diversity in allele-specific copy number. We presented 419 Alleloscope, a new method for allele-specific copy number estimation that can be applied 420 to single cell DNA and ATAC sequencing data (separately or in combination). First, on 421 scDNA-seq data of 9 samples from 3 different tumor types, with phasing validation by 422 linked-read sequencing on three samples, Alleloscope revealed an unprecedented level 423 of allelic heterogeneity within hypermutable CNA regions. In these regions, subclones 424 reside on a gradient of allelic ratios that is unobservable in total copy number analysis. In 425 simple cases, these hypermutable regions contain mirrored subclones, as previously 426 identified 9,10 , but are often much more complex. We observed multiple instances of 427 recurrent CNA events, some verified by linked read sequencing, where the same region 428 is mutated multiple times during the evolution of the tumor, arriving at the same haplotype 429 profile in distinct clones. In accordance with the findings in Watkins et al. 42 , we found 430 using Alleloscope that chromosomal instability drives the formation of subclones not only 431 in primary tumors but also after metastasis. 432 Having established the allelic complexity of CNAs at single cell resolution, we next applied 433 Alleloscope to scATAC-seq data, thus enabling the combined study of clonal evolution 434 and chromatin accessibility. First, we considered the analysis of a public data set 435 consisting of two basal cell carcinoma samples, for which matched bulk whole-exome 436 sequencing data was used for initial genome segmentation upon which single cell CNA 437 genotyping was then conducted in the scATAC-seq data. Here we showed that 438 Alleloscope can detect amplifications, deletions, and copy-neutral LOH events accurately 439 in scATAC-seq data, and was able to find a subclone delineated by a copy-neutral LOH 440 event. Juxtaposing this subclone assignment with peak signals allowed us to detect a 441 wave of genome-wide chromatin remodeling in the lineage carrying the LOH. Next, we 442 applied Alleloscope to a complex polyclonal gastric cancer cell line with matched scDNA-443 seq data. We found, by overlaying peak signals with subclones delineated by allele- As expected, accuracy is a function of all three quantities (Supplementary Fig. 14). 454 Coverage at heterozygous SNP sites is especially important for scRNA-seq and scATAC-455 seq data, for which shifts in total coverage is an unreliable proxy for underlying DNA copy

Methods 464
ScDNA-seq Data Sets and Pre-processing 465 Table 1 summaries the nine 10x scDNA-seq samples analyzed in this study: 466 The Cell Ranger DNA pipeline (https://support.10xgenomics.com/single-cell-467 dna/software/) automates sample demultiplexing, read alignment, CNA calling and 468 visualization. We first applied the tool to process the sequencing data (beta version:  To include high-quality SNPs in the later analysis, we filtered out the SNPs with <5 reads 478 for P5846 and P5847, <10 reads for P5915 and P5931, <15 reads for P6335 and P6461, 479 <20 reads for P6198 and SNU601, and <40 for BC10X samples based on the number of 480 SNP detected for each sample. Additionally, SNPs located in the regions of repetitive 481 sequences such as centromeres and telomeres were excluded. To exclude cells that 482 might undergo apoptosis or cell cycles, the cells labeled noisy from the metadata output 483 by the Cell Ranger tool were excluded. 484 The scATAC-seq dataset for the SNU601 sample was generated in this study. About 488 400,000 cells were washed with RPMI media and centrifuged (400g for 5 min at 4°C) 489 twice. The supernatant was removed and chilled PBS + 0.04% BSA solution was added. 490

Single-cell ATAC Data sets, Sequencing and Preprocessing 485
The resuspended pellet was added to a 2ml microcentrifuge tube and centrifuged (400g 491 for 5min at 4°C). After removing the supernatant without disrupting the pellet, 100 µL of 492 Nonidet P40 Substitute, 0.1% Tween-20 and 0.01% digitonin) was added and carefully 494 mixed 10 times. The tube was incubated on ice for 7 min. After incubation, 1 mL of chilled 495 Wash Buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl2, 1% BSA and 0.1% 496 Tween-20) was added and mixed 5 times followed by centrifugation of nuclei (500g for 5 497 min at 4°C). After removing the supernatant carefully, nuclei were resuspended in chilled 498 Nuclei Buffer (10X Genomics), filtered by Flowmi Cell Strainer (40uM) and counted using 499 a Countess II FL Automated Cell Counter. Then the nuclei were immediately used to 500 generate scATAC-seq library. 501 ScATAC-seq library was generated using the Chromium Single Cell ATAC Library & Gel 502 Bead Kit (10X Genomics) following the manufacturer's protocol. We targeted 3000 nuclei 503 with 12 PCR cycles for sample index PCR. Library was checked by 2% E-gel 504 (Thermofisher Scientific) and quantified using Qubit (Thermofisher Scientific). 505 Sequencing was performed on Illumina NextSeq500 using NextSeq 500/550 High Output 506 Kit v2.5 (Illumina). 507 Raw sequencing reads of the SNU601 scATAC-seq sample was de-multiplexed with the 508 10x Genomics Cell Ranger ATAC Software (v.1.2.0; https://support.10xgenomics.com/single-509 cell-atac/software/pipelines/latest/algorithms/overview) and aligned to the human GRCh38 510 reference genome. The aligned scATAC-seq data of the two pre-treatment basal cell 511 carcinoma samples (SU006 and SU008) were downloaded from the Gene Expression 512 Omnibus under accession GSE129785 23 . To obtain all potential SNPs for the SU006 and 513 SU008 samples, GATK Mutect2 was used to call all single-nucleotide variants (SNVs) on 514 the deduplicated bam files by the Picard toolkits of both the t-cell dataset and the tumor 515 dataset from the same tumor. All SNVs from the paired tumor-normal datasets were 516 combined and the read counts of these SNPs were quantified for each cell in the tumor 517 scATAC-seq dataset. The pre-filtered cell barcodes for the two public scATAC-seq 518 datasets were retrieved from the previous study 23 . For the SNU601 scATAC-seq data, we 519 instead quantified the read counts of the two alleles of the SNPs more reliably called from 520 the paired normal scDNA-seq data. Like scDNA-seq, we applied VarTrix to generate two 521 SNP-by-cell matrices for both reference alleles and alternative alleles of all the SNVs for 522 all the scATAC-seq datasets. To obtain a SNP set including only SNVs that are more 523 possible to be germline SNPs, we further filtered out the SNVs <20 reads for the SU008 524 sample and <30 reads for the SU006 sample. SNPs with extreme VAF values <0.1 or 525 >0.9 were also excluded for both samples. Since we used the phasing information from 526 the paired scDNA-seq data to assist the estimation of the haplotype structures for the 527 SNU601 scATAC-seq data, we instead filtered out the cells <5 reads and the SNPs <5 528 reads to improve quality of the downstream analysis. 529

Linked-reads sequencing and data processing 530
The three samples with the linked-reads sequencing data were acquired as surgical 531 resections following informed consent under an approved institutional review board 532 protocol from Stanford University. Samples were subjected to mechanical and enzymatic 533 dissociation as previously described, followed by cryopreservation of dissociated cells 33 . 534 Cryofrozen cells were rapidly thawed in a bead bath at 37 ºC. Cell counts were obtained 535 on a BioRad TC20 cell counter (Biorad, Hercules, CA) using 1:1 trypan blue dilution.

SNP Phasing and Single-cell Allele Profile Estimation 577
For each region after segmentation, an expectation-maximization (EM)-based method is 578 used to iteratively phase each SNP and estimate cell-specific allele-specific copy number 579 states for all scDNA-seq and scATAC-seq data sets. Recall that by "major haplotype" we 580 refer to the haplotype with higher aggregate copy number in the sample. Let " indicate 581 whether the reference allele of SNP j is located on the major haplotype and ! denote 582 major haplotype proportion of cell i. The EM model iterates the expectation step and the 583 maximization step. The complete log likelihood of the model is 584 where !" and !" are the observed read counts for the reference and alternative alleles 587 of cell i on SNP j. In the E-step, we first calculate the expected value of the posterior 588 probability of the hidden variable " to construct a lower bound for optimization 589 is the parameter from the t th iteration. In the M-step, 0 ! is updated by solving 591 . SNPs in a region to be 30,000 in our analysis. For the SNU601 scATAC-seq dataset, 596 since the phase were estimated in the paired scDNA-seq dataset with higher depth, we 597 directly applied the estimated . " 's from scDNA-seq data to estimate the 0 ! 's of the cells in 598 the scATAC-seq data. To improve the estimation results, cells with <20 read counts 599 covering the identified SNPs were excluded for each region. 600

Selecting normal cells and normal regions for single-cell Coverage Normalization 601
Let represent a region in the genome after segmentation. To compute the relative 602 Since amplified regions with both haplotypes equally amplified can also have small sum 617 0 ! distance, adjusted raw coverages are also considered in diploid region selection. The 618 adjusted raw coverage of cell i in region r ( 1 !9 ) is computed by 619 where !5 is the total read counts in region r of cell i and ! is the total read counts of cell 621 i across the regions. 5 is length of the region r and is total length of the genome. For Alleloscope shows a list of potential diploid regions for each cluster by raking the sums 629 of 85 ranks and 85 ranks. Excluding the c th cluster identified as the normal group, 630 Alleloscope proposed a list for the candidate diploid regions across the clusters by 631 selecting the majority region. 632 Since coverage on scATAC-seq data is confounded by the epigenetic signals, 633 chromosome 22 for SU008 and chromosome 18 for SU006 were directly selected as 634 39 normal regions based on the WES data. Individual cells were classified into normal and 635 tumor cells based on the epigenetic signals on the scATAC-seq data. For the SNU601 636 scATAC-seq dataset, chromosome 10 was selected as the normal region based on the 637 paired scDNA-seq data. 638

Cell-level Genotyping 639
The cell-level allele-specific copy number profiles are defined by both relative coverage 640 change ( 2 !5 ) and major haplotype proportion ( 0 !5 ) of region r and cell i. After the normal 641 cells and normal control region are identified, cell-specific relative coverage change in 642 region r is calculated as 643 where !5 is total read counts in region r and !$ is total read counts in a reference region 645 of cell i. $5 is a vector denoting total read counts in region r of all identified normal cells 646 and $$ is a vector denoting total read counts in the same reference region r of all 647 identified normal cells. Since SNU601 is a tumor cell line with no normal cells in the 648 dataset, $5 and $$ were calculated from the cells in the matched normal P6198 sample 649 as a substitute for the scDNA-seq data. For SNU601 scATAC-seq data, we aligned the 650 distribution of the 2 !5 values in paired scDNA-seq data to the distribution of the Next, cells with extreme 2 !5 values larger than the 99 th percentile and smaller than the 655 first percentile across the cells are considered outliers and excluded for each region. With 656 the ( 2 !5 , 0 !5 ) pairs, cells in the scDNA-seq data can be classified into the haplotype profiles 657 (g) with the expected ( ? , ? ) values based on minimum distance. Although signals in the 658 scATAC-seq data are much noisier, the haplotype structures identified in the paired 659 scDNA-seq data can help to guide the genotyping for each region. In region r, the 660 posterior probability of cell i carrying a haplotype profile observed in region r in the paired 661 scDNA-seq data was 662 3 !5 = 5 L 2 !5 , 0 !5 4 = 3 2 !5 , 0 !5 L !5 = 5 4 ? + ∑ 3 2 !5 , 0 !5 L !5 = @ 4 ? + , ? + @ , 663 where 5 denotes the haplotype profiles observed in region r in the paired scDNA-seq 664 data and ? + denotes the prior probability that a randomly sampled cell carrying the 5 665 haplotype profile. A uniform prior can be used for ? + in the absence of external 666 information. In the formula, 667 3 2 !5 , 0 !5 L !5 = 5 4 = 3 2 !5 L = ? ; : = 0.254 × l 0 !5 m = ? ; = n 0 !5 31 − 0 !5 4 !5

o, 668
where !5 is the number of total read counts in region r for cell i. The haplotype profile of 669 cell i in region r was estimated by maximizing the above posterior probability. The 670 haplotype profiles of each region are visualized using different colors in the two-671 dimensional scatter plots for both scDNA-seq and scATAC-seq data with the confidence 672 scores calculated using the distance of the points to the canonical centers and the 673 standard deviations. 674 Validations using paired linked-reads sequencing data 675 We validated our algorithm using paired linked-reads sequencing data with two strategies 676 in one gastric cancer patient sample and two colorectal cancer patient samples. First, the 677 phasing accuracy was assessed by comparing the estimated SNP phases on the scDNA-678 seq data and the known phases of the same SNPs from the linked-reads sequencing data 679 in individual regions. In the linked-reads sequencing data, SNPs within the same phase 680 sets are phased with respect to one another, while those between different SNP sets are 681 not. Therefore, we compared the phases of the SNPs overlapping between our estimated 682 SNP set and the phase set with the largest numbers of SNPs in the linked-reads 683 sequencing data for each region. The reference alleles of the overlapping SNPs with . " 684 >0.5 are estimated to be on the major haplotype. Otherwise, the reference alleles of the 685 overlapping SNPs with . " <0.5 are estimated to be on the minor haplotype. The SNPs with 686 . " =0.5 are excluded. By comparing estimated phases and known phases from the linked-687 reads sequencing data of the overlapping SNPs, the phasing accuracy was computed for 688 each region. 689 Secondly, we evaluated the genotyping accuracy by comparing the estimated haplotype 690 profiles of each cell and the haplotype profiles inferred from the linked-reads sequencing 691 data in individual regions. In the linked-reads sequencing data, the phase set with the 692 largest numbers of SNPs was selected. The known phases of the overlapping SNPs 693 between the phase set and the estimated SNPs were used to infer ! for each cell. Cell-694