HBV Genome-Enriched Single Cell Sequencing Revealed Heterogeneity in HBV-Driven HCC


 Background: HBV-HCC is heterogeneous and frequently contains multifocal tumors. To interrogate heterogeneity of HBV-HCC, we developed a HBV genome enriched single cell sequencing (HGE-scSeq) procedure and a computational method to identify HBV integration sites and infer DNA copy number variations (CNVs). Results: We performed HGE-scSeq on 269 cells from 4 tumor sites and 2 tumor thrombi of a HBV-HCC patient. HBV integrations were identified in 142 out of 269 (53%) cells sequenced, and were enriched in two HBV integration hot spots chr1:34,397,059 (CSMD2) and chr8:118,557,327 (MED30/EXT1). There were also 162 rare integration sites. HBV integration sites were enriched in DNA fragile sites and sequences around HBV integration sites were enriched for microhomologous sequences between human and HBV genomes. Cells were grouped into 4 clonal groups based on CNVs. The HBV integration heterogeneity was associated with single cell’s CNVs. All of 269 cells carried chromosome 1q amplification, a recurrent feature of HCC tumors, suggesting that 1q amplification occurred before HBV integration events in this case study. Further, we performed simulation studies to demonstrate that the sequential events (HBV infecting transformed cells) could result in the observed phenotype with biologically reasonable parameters. Conclusion: Our HGE-scSeq data reveals high heterogeneity of HCC tumor cells in terms of both HBV integrations and CNVs.


Background 21
Hepatocellular carcinoma (HCC) is ranked as the third most lethal cancer worldwide [1], and 54% 22 of HCC cases originate from chronic Hepatitis B Virus (HBV) infection [2]. During HBV infection, a 23 small fraction of viral replication is in double-stranded linear DNA (dslDNA) form, which can be inserted 24 into the host genome at double-stranded break points [3]. HBV integrations only occur in early phase of 25 HBV infection [3,4]. HBV integration into the human genome is one of the most important etiological 26 mechanisms of HBV inducing HCC [5]. Recurrent HBV integrations are identified by sequencing studies 27 [6][7][8][9][10][11]. 28 HBV-HCC tumors are of high heterogeneity in terms of HBV DNA integration patterns and 29 somatic genomic alterations, and the heterogeneity associates with prognosis and drug response in . Both empirical and simulation studies show that only integration events of high allele 31 frequency can be detected at a given sequencing depth [9,13]. It is expensive to implement whole 32 genome sequencing (WGS) with high sequencing depth in a large scale study in general. HIVID (high-33 throughput Viral Integration Detection) by Li et al. [14] describes an efficient way to accurately detect 34 HBV integration in the whole genome. Regions containing virus genome sequences are enriched in the 35 process of preparing DNA library so that the genomic regions to be sequenced are remarkably smaller 36 than the whole human genome. Recently, HIVID has been applied in sequencing of a large number of 37 HBV-HCC samples [15] as well as in detecting Human papillomavirus (HPV) integration sites [16].  Table S7). 180 Next, we compared HBV integration patterns in adjacent normal tissues close to the 4 tumor sites. 181 In total, 17 integration events (Supplementary Table S5) were detected at 13 loci (Supplementary Table  182 S6) in the four adjacent normal tissues. The numbers of HBV integrations in adjacent normal tissues and 183 in tumors were not directly comparable as one based on bulk tissue sequencing and one based on single 184 cell genomic sequencing. In a loose sense, there were more integration events in tumors than in normal 185 tissues than tumors, consistent with previous reports [51]. The integration sites at the two hot spots were 186 also detected in each adjacent normal tissue except that the integration site at chr1 hot spot was not 187 detected in N1 and chr8 hot spot integration was not detected in N2 (in which only one soft clipped read 188 was detected and less than the minimum threshold of two soft clipped reads). The integration events at the 189 two hot spots were the only two recurrent events across 4 adjacent normal tissues. The available 190 information is not sufficient to distinguish whether HBV integrations at the two hot spots in adjacent 191 normal tissues were results of clone expansion or diffusion from tumor tissues. Additional information is 192 needed to inform clonal relationship between cells with HBV integrations at the two hot spots in adjacent 193 normal and tumor tissues.

Properties of HBV integration sites 195
The mechanism of how HBV genome is integrated into the human genome is still under explored. Hu et 196 al. [16] observed significant enrichment of microhomologous (MH) sequences at or near 120 HBV 197 integration sites detected from 31 liver samples from Sung et al [8]. Recently, Zhao et al. [52] sequenced 198 426 HBV-HCC patients and shows enrichment of microhomologous sequences around the HBV 199 integration sites as well. All these observations suggest the potential involvement of MH mediated 200 mechanism in the process of HBV integration. Based on single cell sequencing data we identified 164 201 unique integration sites. Microhomologous sequences between the human genome and HBV genome (an 202 example shown in Figure 2D) were enriched at the HBV integration sites ( Figure 2E). We also found the 203 enrichment of integration sites within the common and rare fragile regions [53] ( Figure 2F). The 204 enrichment of microhomologous sequences near HBV integration and enrichment of HBV integration on 205 fragile region elucidate that the HBV integration is a physical driven process, which is highly related with 206 the sequence content and corresponding physical characteristics of host genome sequence. 207

HBV integration hot spots 208
The two integration hot spots, chr1: 34,397,059 and chr8:118,557,327 locate at the intronic region of 209 CSMD2 and the intergenic region of MED30-EXT1, respectively. The chr1 hot spot could partially be 210 explained by microhomology ( Figure 2D), which led to loss of CSMD2 expression. The integration at the 211 chr8 hot spot resulted over expression of EXT1, which promoted cell growth in vitro and in vivo [42]. 212

Heterogeneity of CNVs 213
In addition to HBV integration, we estimated each cell's CNVs based on the HGE-scSeq data (Methods). 214 The whole human genome was divided into 5000 bins which were then group into 49 super-bins of equal 215 sizes for visualization purpose. As expected, most of the bins had normal copy number of DNA ( Figure  216 3A). All cells carried a DNA copy number amplification at chromosome 1q, which is a recurrent feature 217 of HCC [43] ( Figure 3A). The cells were clustered into 4 clone groups based on CNVs ( Figure 3A), each 218 clone had a distinct pattern of DNA copy number amplifications. And each clone group contained cells with different types of HBV integrations ( Figure 3B). From clones 1 to 4, the ratio of cells carrying rare 220 integrations decreased. 221

Clonal evolution and its relationship with HBV integration 222
Based on CNV pattern, we constructed a phylogenetic tree (detailed in Methods, Figure 4A), which 223 suggests that clone 1 directly developed from the ancestor. Clone 2 and clones 3&4 were derived from 224 clone 1, suggesting there were two different evolution directions. The inner node corresponding to the 225 origin of clone 2 and clones 3&4 as well as the inner node corresponding to the split between clone 2 and 226 clones 3&4 were annotated in Figure 4A. These inner nodes can be directly linked to CNVs on a specified 227 region. The root node in the phylogenetic tree corresponded to the cells with CNVs of 1q. The common 228 origin of clones 2, 3, and 4 had Chr11 amplification. The regions differentiating clones 2-4 from clone 1 229 contained potential genomic regions that may associate with the decreasing ratio of rare integration 230 carrying cells. Cells in clone 2 contained CNVs on Chr11 while cells in clones 3 and 4 contained 231 additional CNVs at Chr8:118,268,000-146,364,000. More CNVs split clones 3 and 4. Supplementary 232 Figure S9 is the same as Figures 4A except nodes colored according to cells with hot spot and rare HBV 233 integrations. It is clear that rare HBV integrations were not randomly distributed in the phylogenetic trees. 234 To identify the potential CNV regions associated with decreasing number of rare HBV 235 integrations, we tested the association between CNV and HBV integrations in the clone evolution process 236 from clone 1 to clones 2-4 and the split between clone 2 and clones 3&4 separately. The significant 237 regions (Supplementary Table S8) associated with the HBV integration difference between clone 1 vs. 238 clones 2-4 were enriched for immune related genes (Table 1) invasiveness of malignant tumor cells [61], which is consistent with the observation that more than 50% 259 cells in the two tumor thrombi were clones 3 and 4 ( Figure 4B). 260

Clone 2 vs. other clones 261
Somatic mutation patterns were derived from bulk tissue whole genome sequencing T1-4 tumors, 2 262 thrombi and a normal tissue [42]. A phylogenetic tree was constructed based the somatic mutation 263 patterns, which suggested that the largest tumor T1 was the primary tumor and other tumors were derived 264 from T1 [42]. Even though all tumors were from the same origin, the clonal composition of each tumor 265 was different. The proportion of clone 2 cells was significantly higher in T1 than in other tumors ( Figure  266 4B). To identify differences between clone 2 cells and other cells, we compared CNVs across all bins and 267 identified 282 bins (consisting of 2246 genes) where clone 2 cells had lower CNVs comparing to cells of 268 other clones. These genes were enriched in the GO term calcium-dependent cell-cell adhesion (p-269 value=9.6E-7, Fold enrichment=3.7, 0 3) and chemokine activity (p-value=6.2E-6, Fold enrichment=4.0, 270 Table 3). N-cadherin promotes cancer cell invasion [62]. Chemokines and their receptors involved in 271 tumor immunogenicity and aggressiveness [63,64]. Lower abundance of chemokines and their receptors 272 might lead to lower potential to metastasis. These may explain why the fraction of clone 2 cells in the 273 primary tumor T1 was higher than the fractions in other tumors ( Figure 4B). 274

Simulation of clonal evolution 275
To assess different clonal evolution scenarios, we performed cell simulations according the birth-death 276 [65,66]. We tested a wide range of parameter space, then calculated the posterior of parameters based on 277 the distance of simulated distribution and the observed data. A simulation starts from a cell after 278 malignant transformation. In the observed data ( Figure 4A), the root node had to carry the chromosome 279 1q amplification. Otherwise, no simulation resulted in the scenario that 100% cells carried the 280 chromosome 1q amplification. In each replication cycle, a cell divided or died at the probability Pdiv and 281 Qdeath, respectively ( Figure 5A). The simulations stopped when the total number of cells reached to 10 7 , 282 corresponding to a tumor of size 0.5cm x 0.5cm x 0.5cm. First, we simulated clonal evolution due to 283 CNV changes without HBV integrations. Each novel CNV change likely alters the fitness of the cell and 284 increases the probability of cell division over the probability of cell death, and the selection coefficient 285 was noted as SC ( Figure 5A). With n CNVs acquired addition to the root event, the division probability 286 was P(1+SC) n , and the corresponding death probability was 1-P(1+SC) n . In a normal cell, the DNA copy 287 number mutation rate (MR) per cell per division is in the range from 10 -10 to 3.4*10 -6 [67] . We 288 simulated HCC cells with the copy number mutation rate (5e-6,1e-5,5e-5,1e-4,5e-4,1e-3) and the 289 selection coefficient (0.01, 0.05, 0.1, 0.2, 0.3) for each additional CNV. For each simulation, a CNV 290 among the CNVs in Figure 3A was randomly draw and introduced to the cell according the mutation rate. Next, we performed simulations to examining HBV integrations with the parameter combination 294 for CNVs fixed as SC=0.01 and MR=0.001 estimated above. We assumed HBV infection occurred when 295 the tumor grew to 10 5 cells and random HBV integrations occurred in 1 out of 50 HCC cells in the tumor 296 ( Figure 6A). Among the HBV integrations, 1% were hot spot integrations, and only cells with hot spot 297 integrations gained cell growth advantage with the selection coefficient SCHBV in (0.01,0.05,  integrations (blue line in Figure 6C). When the simulated tumors reached to 10 7 cells, around 50% of 306 cells carried hot spot HBV integrations with SCHBV in the range between 0.075 and 0.1, close to the ratio 307 52% observed in the patient data (red line in Figure 6C), indicating the ratios of cells with hot spot HBV 308 integrations vs. cells with HBV integrations were close to 1 ( Figure 6D). 309

Discussion 310
HBV genome-enriched single cell sequencing approach can efficiently identify HBV integration sites and 311 genomic alterations. We developed a data analysis pipeline for HBV genome enriched single cell 312 sequencing data. Our analyses revealed both highly recurrent and rare HBV integrations in cells. 313 Especially, a large number of rare HBV integrations were identified in the single cell sequencing study, 314 and these rare HBV integrations suggest that HBV genome was randomly integrated at sites according to There were two HBV integration hot spots ( Figure 2C). The integration hot spot chr1: 34,397,059 323 (CSMD2) could partially be explained by microhomology ( Figure 2D). For the HBV integration hot spot 324 at chr8, EXT1 expressed significantly higher in tumor tissues than in adjacent nonneoplastic liver tissues 325 There are multiple limitations of the HGE-scSeq approach. The sensitivity of the approach is 340 hard to estimate unless an extensive single cell whole genome sequencing was performed as the ground 341 truth to compare with, which is expensive to do. Given the uncertainty of the sensitivity, it is not clear whether some tumor cells lacking chr1 or chr8 hot spot integrations were due to capture/sequencing 343 sensitivity or due to clonal expansion. We compared two scenarios: (1) the root clone had HBV 344 integration, which drive tumorigenesis. In this scenario, all clones should have exact same HBV 345 integration pattern (as HBV integration occurs only in early phase of HBV integration [3,4]), which 346 contradicts with our observation that some clones had more HBV integrations than other ( Figure 3B). (2) 347 the root clone had 1q amplification, and the root clone cells were of genome instability. Then, HBV 348 infection occurred and HBV integration in each cell occurred at different sites and at different frequency 349 depending on cell's molecular state and genome stability, consistent with our observation ( Figure 3B). 350 The cells in clone 4 were more likely missing the hot spot integrations than the cells in clone 1, 351 suggesting that no hot spot integration in these cells was unlikely due to the sensitivity of the assay but 352 due to molecular differences between the clones. 353 The relationship between CNVs and HBV integrations observed in this case study needs to be 354 considered as anecdotal until the relationship can be replicated in more patient samples or validated in 355 vitro experiments that exceeds the scope of this study. 356

Conclusion 357
We developed a data analysis pipeline for HBV genome-enriched single cell sequencing data. HCC 358 tumor cells were heterogeneous in terms of both HBV integration sites and CNVs. The frequency of HBV 359 integration observed in the study was much higher than expected. For the HBV-HCC case in the study, 360 multifocal tumors and tumor thrombi shared common HBV and CNV patterns, suggesting that they 361 shared the same tumor origin. 362 363

Patient and tissue samples 365
The study of tumor cell heterogeneity was approved by the Institutional Review Board of Tongji Hospital, 366 Tongji Medical College of HUST, in Hubei province, China. The signed written informed consent was 367 obtained before patients' recruitment, according to the regulations of the institutional ethics review boards. 368 The patient and sample information was detailed in Chen et al. [42]. The clinicopathological information 369 of the patient is summarized in Supplementary Table S10. In brief, a 47-year-old patient matched with 370 the research design. Obtained medical history indicated that he had no history of alcohol abuse, 371 recognized acute hepatitis, mother-to-child transmission of HBV, blood transfusion, or injection drug use.  Figure S1B). 378 Tumor tissues from the 4 tumor sites (noted as T1-4) and corresponding adjacent normal tissues as well as 379 tissues from two tumor thrombi were collected after surgery. 380

HBV genome enriched single cell sequencing (HGE-scSeq) 381
The fresh (with 1 hour after surgery) frozen (stored in -80°C) tumor tissue samples were thaw in water 382 bath at room temperature and digested into cell solution by collagenase as previously described [68]. With 383 sufficient collagenase dissociation and dilution, the cancer tissues were separated into single cells solution, 384 cell clusters and cell debris. Suspension was filtered by injecting into the membrane filter (pore size = 385 20µm) to filter out the massive cell clusters. To avoid contamination of cell debris, suspension was re-386 suspended and centrifuged in Phosphate Buffered Saline (PBS) for 5 times. After filtration, cell 387 suspension was added into a PBS droplet containing 0.5% BSA. Single cell isolation was performed 388 using micro pipette as previously described [68] under microscope and cells with intact cell membrane 389 were randomly selected for single cell sequencing.

Mapping HGE-scSeq reads 401
On average, 17.39M (17,393,993) reads were generated for each cell. Low quality reads were filtered out 402 according the following criteria. If any single read in a read pair had more than half base of quality less 403 than 5, the corresponding read pair was filtered (Supplementary Figure S11A). If a read pair was 404 contaminated by adaptor sequences, it was filtered. If two read pairs were the same, only one copy was 405 kept in further analysis. After quality filtration 5.49M (5,494,183) reads were kept in further analyses. 406 Among them 77.13% and 0.24% were aligned to human and HBV genome, respectively on average. With 407 paired-end assembly and re-mapping, reads supporting virus integration were identified. The number of 408 reads supporting HBV virus integration in each cell was in a range of 0 to 53,290. The average percentage 409 of human genome covered by sequencing reads was 3.13% with average depth of coverage 3.14. The 410 detailed information of reads distribution can be found in Supplementary Table S2 and Supplementary  411   Table S11. 412 413

Bulk tissue HBV enriched DNA sequencing
Corresponding adjacent non-neoplastic liver tissues for the four independent tumor sites, noted as N1-4, 415 were collected for bulk tissue HBV enriched DNA sequencing. For the 4 adjacent normal tissues, the 416 HIVID procedure was directly applied to the extracted DNA without the WGA step, followed by the 417 same 101 cycles paired-end index sequencing. On average, 45.96M reads were generated for each tissue 418 sample. After quality filtration 12.13M reads were kept for further analyses. Among them 78.48 % reads 419 were mapped to the human genome, and 0.013% reads were mapped to HBV genome. On average only 420 50 reads supporting HBV integration were detected for each control tissue sample. The average 421 percentage of human genome covered by reads was 6.9% with average depth of coverage 1.272. The 422 detailed information of reads distribution can be found in Supplementary Table S2 and Supplementary  423   Table S11. 424 425

Quality check of whole genome sequencing reads 426
Our previously described pipeline [13] was used to process the whole genome sequencing data. In brief, 427 prinseq-lite [69] was used to filter the reads that were exactly the same or of mean reads quality lower 428 than 20 and more than 10% Ns. The remaining reads were mapped to human genome with Bowtie2 (-D   Table S12). The distribution matched with a Poisson distribution until k=28, which corresponding to 87.97% of human genome. When 467 k>=29, the distribution was not a Poisson distribution anymore. Thus, the mapped reads on the majority 468 of human genome following a Poisson distribution, except the region consistently missed by all cells and 469 0.81% of human genome covered by reads from a number of cells significantly more than expected by 470 chance. These observations suggested that CNV profile at single cell level could be accurately estimated 471 with appropriate normalization method. 472 473

Comparing human genome regions with and without HGE-scSeq reads 474
To infer CNVs from reads mapped to the human genome, these reads should be evenly distributed across 475 the human genome and there should be no systematic difference between the regions covered with 476 sequencing reads and the regions without reads. To investigate the property of the regions with and 477 without read sequence coverage, we first constructed a Fisher machine prediction model [48] to 478 distinguish HBV and human genome sequences by randomly sampling 10,000 sequences of 100bp length 479 from HBV and human genomes. Then, we applied to Fisher machine to test whether the sequences in the 480 human genome regions with or without HGE-scSeq reads were similar to HBV or human genome 481 sequences. For each cell, 10,000 sequences of 100bp length were randomly samples from human genome 482 regions with and without mapped reads, and input them to the Fisher machine. There was no difference 483 between scores of regions with and without mapped reads (Wilcox rank sum test p-value=0.3636, 484 Supplementary Figure S15). 485 486

Mapping to HBV virus genomes 487
The filtered reads were aligned to UCSC hg19 with soap2 [75] (Version: 2.20) in paired-end mode 488 (Supplementary Figure S11B). The parameters used were "-s 85 -l 50 -v 2 -r 1 -p 6 -m 100 -x 500". If any 489 read in a pair was not mapped to human genome, the pair was kept as a candidate for virus detection. 490 These reads were collected and transformed from FASTQ to FASTA format. The virus detection part in 491 VirusFinder [76,77] was used to detect the virus. The reads not mapped to human genome were aligned to a virus database which contains genomes of all known viruses (32102 in total) [78]. The reads aligned 493 to virus genome were de novo assembled into contigs. Then, the contigs were aligned to human genome 494 and virus database. The contigs that can be aligned to human genome were filtered out. If the percentage 495 of identity between the contig and virus' genome was less than 85% or less than 75% of the contig was 496 aligned to a reference genome, the alignment was filtered out. The alignment score of contigs was defined 497 as the multiplication of the mapped length of the contig and percentage of identity between the mapped 498 region of the contig and the virus genome. The virus substrains were ranked by the maximum alignment 499 score of contigs aligned to its genome. The top ranked virus substrain was reported as the matched virus 500 substrain in the cell (Supplementary Table S13). The top common substrains were all HBV B subtype and 501 were similar in sequences (Supplementary Table S14). 502 503

Detecting HBV integration sites 504
The reads not mapped to the human genome were aligned to the detected virus genome using soap2 505 (Version: 2.20 with the following parameters "-s 85 -l 50 -v 5 -r 1 -p 6 -m 100 -x 500"). The paired-end 506 reads not mapped to the human genome and virus genome were collected and assembled to long reads 507 using flash (with parameters "-m 5 x 0.2 -p 64") [79]. The designed smaller insertion size compared to 508 the total length of a pair of reads enabled most read pairs to be assembled into one read of much longer 509 length. The assembled reads were aligned to the human genome and virus genome using bwa and bwasw 510 [80] (-a 1 -b 2 -q 5 -r 2). The soft clipped reads with at least 30bp aligned to the human genome and at 511 least 30bp aligned to virus genome were collected for identifying the integration sites. If the distance 512 between two breakpoints was less than 20bp on both the human genome and HBV genome, we defined 513 them as one breakpoint which was supported by reads combined from the two breakpoints. In order to 514 make the predicted integration events between different cells comparable, we also merge integration sites 515 within 20bp when collecting the predicted integration sites across different cells. The number of soft 516 clipped reads was tightly correlated with the number of HBV reads (Supplementary Figure S12F). We 517 normalized soft clipped read against the number of HBV reads. The Optimal threshold of soft clipped 518 reads for HBV integration was selected to minimize the correlation between numbers of HBV reads and 519 detected HBV integrations. Further refinement based on Bayesian model was used to identify recurrent 520 HBV integrations. Steps are detailed as following:  6. Then, we tried to identify more recurrent integrations. We implemented a Pseudo Count Weight 544 Adjustment (PCWA) to rescue the potential false negative integrations given supporting reads in 545 other cells. The pseudo count weight matrix m n W  is defined as: The pseudo count weight adjusted score m n S  is defined as:  Figure S16A). Corresponding to the best cutoff, 568 N_gain=44, N_gain_random=4, N_loss=1 (Supplementary Figure S16B).

Estimating copy number variations 574
Reads mapped to human genome were randomly distributed (Supplementary Figure S14), which enabled 575 us to estimate DNA copy numbers across human genome. Because the sequencing data was based on an 576 enriched single cell sequencing protocol [14], the existing pipelines for detecting copy number variations 577 in single cell sequencing data [17,81,82] are not applicable directly. If applied directly, more regions of 578 copy number aberration than the regions of normal copy number were identified, which is counter 579 intuitive. Therefore, a new pipeline was needed for analyzing the data set. Based on reads mapped to 580 human genome, we developed a pipeline for inferring copy number variations by modifying the method 581 reported by Baslan et al [82]. DNA copy number profiles by removing peaks of mapped reads and compensating according to the size 584 of peaks and average local coverage. However, Kuilman et al.'s method [83] was not directly applicable 585 for this data set due to the sparsity of reads covered region originated from single cell whole genome 586 amplification. Baslan et al. [82] describes a procedure characterizing single cell copy number variation 587 based on flow sorting of single nuclei, whole genome amplification and next generation sequencing. An 588 informatics workflow of inferring CNV from the raw single cell sequencing data is outlined in 589 Supplementary Figure S11C. In addition to correction for mappability, removal of duplication, and GC 590 content normalization, we used soap [84] as the alignment software to be consistent with the alignment 591 software used in HIVID. 592 There are following steps in the pipeline for calling CNV: for each cell was calculated as the ratio between standard deviation and mean of bin's read count. As 625 suggested in Garvin et al. [81], the sample with lowest index of dispersion is mostly likely to be the cell with the most balanced ploidy. Therefore, the read count profiles of all cells were normalized against the 627 one having lowest index of dispersion with LOWESS. 628 12) Similar to Gao et al. [85], outliers were removed with R function winsorize. Then, a multiple sample 629 population segmentation algorithm with default parameter [86] was used to call the segments under the 630 condition that these cells are related. Segments with less than 10 bins were removed and the neighbor 631 segments were joined or separated in the middle of removed segment if they differed significantly. At the 632 end, bins were merged into a total of 49 segments. 633 13) A least-square rounding method was used to get the optimum scaling factor that had the least sum of 634 deviations from the closet integer after rounding. Integer copy number status was further classified into 3 635 cases of loss, normal and amplification and denoted as -1, 0, 1. 636

Evaluation of read count correction 637
Because sequences containing HBV sequence were enriched at the DNA library preparation step, we need 638 to correct read count bias due to enrichment sequencing. For each cell, we collected the rank of read 639 counts for bins with HBV integrations detected. Then, for all the bins with HBV integrations, we 640 calculated the fraction of bins with rank higher than %  , ( ) f  , where  is ranging from 1 to 100. Then  , which is 650 originally used as a QC metric for microarray data. Cai et al. [27] propose to utilize MAPD [87] to 651 measure the quality of read counts in bins. MAPD is shown to be more robust to identify true CNVs [27]. 652 An optimal threshold of 0.45 is suggested by Cai et al. [27], which is also used in other studies [19]. 653 Supplementary Figure

Evaluating the CNV pipeline with reads from normal control 667
As our CNV pipeline was modified from a CNV pipeline for single cell sequencing data, which takes full 668 consideration of correcting for bias incorporated from WGA [88]. Whether our modified pipeline can 669 handle bulk tissue enrichment sequencing data needs to be evaluated. 670 As shown above, the reads mapped to human genome for normal control tissue resulted in higher 671 average coverage and width comparing to ones for tumor single cells. However the improvement on 672 coverage and width was not large. The cases for normal control tissue and tumor single cell were comparable (Supplementary Figure S18). Thus, the sequencing data for normal tissue samples can be 674 used for evaluating the performance of our CNV pipeline in correcting the bias generated from enriched 675

sequencing. 676
The dispersion of the binned reads counts for the four adjacent normal liver tissue samples after 677 mappability and GC content correction was lower than the smallest corresponding dispersion in tumor 678 single cells (Supplementary Figure S19A). Therefore, mappability correction and GC content correction 679 for the normal control tissue data were necessary. Most of the regions across the 4 adjacent normal tissue 680 samples were of normal copy number (Supplementary Figure S19B). Therefore, our CNV pipeline for 681 correcting the potential bias due to enriched sequencing step was validated. 682 683

Association between clone evolution and HBV integrations 684
Parsimony method is mostly recommended for constructing phylogenetic trees from single cell CNV 685 profiles [85,89]. The distance based tree building method generally assumes that evolution drives by 686 mutations independently accumulated one at a time. CNVs estimated in our study contain multiple 687 alterations. Therefore, in this study, we used a parsimony method [85] to build phylogenetic trees based 688 on CNVs at the 49 identified CNV segments. 689 We identified 4 putative clones according to copy number profiles ( Figure 3A). Meanwhile, there 690

were two categories for cells based on HBV integration, cells with only hot spot integrations and cells 691
with extra rare integrations. There was a clear trend that the ratio of cells carrying rare integrations 692 decreased when number of regions with DNA copy number amplification increased ( Figure 3B). 693 A phylogenetic tree based test of association method was used to decode the association between 694 specific CNV and ratio for cells with rare HBV integrations. First, inner node in the phylogenetic tree that 695 separate parent and child clones which are identified by hierarchical clustering was detected. Second, 696 copy number variations at the genomic region corresponding to the detected inner node for the cells 697 belonging to parent and child clones were collected. Third, a contingency table was built based on numbers of cells with extra rare HBV integrations or only HBV integrations at hot spots, copy number 699 amplification, copy number normal. Fourthly, Fisher's exact test p-value corrected by multiple testing 700 was used to assess the association. Last, functional enrichment analysis by DAVID [90] was used to 701 annotate genes in the CNV bins that were significantly associated with rare HBV integration events. 702 For example, CNVs on Chr11 which differentiated Clone 1 and Clones 2-4 were identified based 703 on the phylogenetic tree shown in Figure 4A. Next, the cells from related clones (Clone 1 and Clones 2-4) 704 were compared at bins within the genomic signaling region. For each bin in the region, a contingency 705 remaining authors declare that the research was conducted in the absence of any commercial or 733 financial relationships that could be construed as a potential conflict of interest.   Table 3. GO enrichment of genes in the CNV bins where cells of clone 2 had consistently lower CNVs 1051 than clones 1, 3, and 4 cells. 1052 Figure 1 Overview of the study. 269 cells from 4 tumor tissues and 2 thrombi tissues were extracted. HBV genome sequence enrichment was performed after whole genome ampli cation on the single cell DNA genome. Pair-end sequencing was used. A pipeline was developed for HBV integration identi cation and CNV inference. Tumor clones were inferred based CNV pro le. Association between HBV integration and CNV was assessed based on clone inference and phylogenetic tree. Key CNVs differentiate two clones were identi ed with phylogenetic tree. Statistical test was performed on the key genetic regions while considering only cells belonging to related clones.

Figure 2
HBV integration heterogeneity and mechanisms of HBV integration. A) Fractions of cells in each tissue with or without HBV sequences detected. B) Circos map of integration; each circle indicates integrations identi ed in a tumor tissue. C) HBV integration distribution across the human genome. Each row represents the integration pro le of a cell. The cells are labeled by its tissue source. The columns are loci with HBV integrations along chromosomes. The cells were clustered by hierarchical clustering. D) An example of Microhomolog between sequences of the human genome and HBV genome at an HBV integration hot spot site Chr1 34,307,059. There are two 4bp homologs between human genome and HBV genome (AGAG and TGAA) with 1bp mismatch in the middle. E) Microhomology enrichment. Numbers of HBV integrations carrying different length of homology sequences between human genome and HBV genome near the HBV integration sites were collected (blue). The observed numbers were signi cantly different from the numbers based on random simulations (red). F) Fragile region enrichment. Both common and rare fragile regions on the human genome were enriched for HBV integrations. Copy number variation heterogeneity at single cell level. A) CNV pro les. Each row corresponds to a cell. The cells are labeled with regard to source tissue, clone annotation, HBV integration category and HBV sequence detection result. Each column corresponds to a bin. The bins are ordered by their chromosome locations (chromosome 1 to 22). Cells can be categorized into 4 groups corresponding to 4 clones. White means normal copy number, blue indicates copy number loss, red indicates copy number ampli cation. B) Composition of cells with no HBV detected, cells with HBV sequence detected but no integration, cells carrying rare integration, and cells carrying only hot spot integration only in each clone. The frequency of cells carrying rare HBV integrations is highest for clone 1 and lowest for clone 4. The frequencies for clones 2 and 3 were comparable, and both were lower than the one for clone 1.

Figure 4
Clonal relationship of cells from different tumor sites. A) A phylogenetic tree built based on single cell CNV pro les. Each node corresponds to a cell. The cells are colored according clone annotation. Splitting nodes are marked as squared nodes. The scale of splitting node correlates to the number of its decedent nodes. B) Clone composition of each tumor tissue. Pie plots for each bin on the fractions of 4 clones.
Each tumor tissue had one major clone and 3 minor clones. There was no single major clone in the thrombus tissues, but clones 3 and 4 together accounted for more than 50% of cells in the tumor thrombi, suggesting the two clones were more invasive.

Figure 5
Simulation of clonal evolution with only CNVs. A) The scheme of birth-death clonal evolution model. Cells accumulated CNVs during cell growth. Each additional CNV increased cell's probability to divide over to die. B) Cell populations/tumors were simulated with different combinations of mutation rates (MRs) and selection coe cients (SCs). The posterior probability of each parameter combination was calculated.

Figure 6
Simulation of clonal evolution with both CNVs and HBV integrations. A) The scheme of birth-death clonal evolution model with HBV integration. At the tumor size of 105 cells, cells were infected with HBV and HBV integration events occurred. Simulations were generated with the selection coe cient of the hot spot integrations SCHBV in a wide range. B) The frequency of cells with HBV integrations in the simulated cell populations. The red line is the observed frequency of cells with HBV integrations in the patient data (142/269) and the blue line marks the initial frequency of HBV integration (2%). C) The frequency of cells with the hot spot HBV integrations in the simulated cell populations. The red line is the observed frequency of cells with the hot spot HBV integrations in the patient data (139/269) and the blue line marks the initial frequency of the hot spot HBV integrations (0.02%). D) The ratio of cells with the hot spot HBV integrations versus cells with HBV integrations. The red line is the observed ratio in the patient data (139/142) and the blue line marks the initial ratio (1%).