A Practical Approach for Targeting Structural Variants Genome-wide in Plasma Cell-free DNA

Interrogating plasma cell-free DNA (cfDNA) to detect cancer offers promise; however, no current tests scan structural variants (SVs) throughout the genome. Here, we report a simple molecular workflow to enrich a tumorigenic SV (DNA palindromes/fold-back inversions) that often demarcates genomic amplification and its feasibility for cancer detection by combining low-throughput next-generation sequencing with automated machine learning (Genome-wide Analysis of Palindrome Formation, GAPF-seq). Tumor DNA signal manifested as skewed chromosomal distributions of high-coverage 1-kb bins (HCBs), differentiating 39 matched breast tumor DNA from normal DNA with an average AUC of 0.9819. In a proof-of-concept liquid biopsy study, cfDNA from 0.5 mL plasma from prostate cancer patients was sufficient for binary classification against matched buffy coat DNA with an average AUC of 0.965. HCBs on the X chromosome emerged as a determinant feature and were associated with AR amplification. GAPF-seq could generate unique cancer-specific SV profiles in an agnostic liquid biopsy setting.


Introduction
2][3][4][5] Changes at the single nucleotide level (i.e., mutations) can activate oncogenes or inactivate tumor suppressor genes.Equally common are structural rearrangements of large chromosomal segments (i.e., structural variants, SVs), such as duplications, deletions, and translocations.These genetic changes are the primary drivers of malignant transformation, disease progression, and therapy resistance.In addition to their roles in cancer biology, tumor-speci c DNA changes are targets for detecting cancer, which has increasingly been sought after in the context of liquid biopsy. 6,7Liquid biopsy holds promise as less invasive and more repeatable than currently available assays at the clinic (e.g., needle biopsies or imaging scans).Historically, blood has been a source of protein biomarkers such as prostate-speci c antigen (PSA). 8More recently, with the advent of DNAanalyses technologies for cancer genomes, DNA released into the blood from tumor cells (i.e., circulating tumor DNA, ctDNA) has become a mainstream target for developing liquid biopsy-based clinical tests.
While these DNA tests report residual tumor burden and actionable mutations for advanced cancer patients, mutation detection in plasma cfDNA would also be feasible for cancer screening in the general population.However, even non-malignant hematopoietic cells accumulate clonal mutations in cancerassociated genes such as p53 and KRAS, [15][16][17] which confound the mutations as effective cancer-speci c biomarkers.Furthermore, because ctDNA is only a small fraction of the total cfDNA (tumor fraction, TF) in blood, particularly in patients with early-stage tumors, 18 ctDNA detection requires very high sensitivity.
Alternatively, SVs are plausible biomarkers for cancer, but have not been pursued for liquid biopsy targets as rigorously as mutations and other emerging targets based on epigenetic features, such as DNA methylation and fragmentation patterns.SVs often manifest as DNA copy number alterations (CNA), but detecting CNA also suffers from very low TF in plasma. 19,20Tumor-speci c translocations can be detected by discordant mate pairs in whole genome sequencing (WGS) data 21 However, the detection of SVs by mate-pair analyses requires deep coverage and would be computationally very challenging in cfDNA with low TF.In silico selection of naïve short DNA fragments (<150 bp) for WGS libraries could enhance CNA signal from the tumor fraction. 22However, cfDNA fragment sizes could be more variable (long and short) in cancer patients' plasma than in healthy individuals' plasma 23,24 .Tumor-derived high molecular weight DNA in plasma has also been documented. 25w TF can be overcome by enriching cancer DNA-speci c features in vitro by processing cfDNA.To test the idea, we turned to our SV-targeting technology called Genome-wide Analysis of Palindrome Formation (GAPF).8][29] The procedure for enriching DNA palindromes is relatively simple and can be done in vitro from genomic DNA.While large (>10 kb) germline palindromes are characteristic of sex chromosomes, 30,31 DNA palindromes arise de novo in cancer by the inverted duplication of single-copy DNA sequences. 27Multiple palindromes can form during Breakage-Fusion-Bridge (BFB) cycles initiated by a chromosome with two centromeres (i.e.dicentric chromosomes), arising from common processes in genome instability such as illegitimate repair of DNA double-strand breaks (DSBs), fusions of dysfunctional and critically short telomeres, and translocation gaining two centromere-bearing chromosomes. 32,33Importantly, anaphase bridges between dividing nuclei, an indication of segregating dicentric chromosomes, were observed in preneoplastic regions, 34 suggesting that DNA palindromes can arise very early during carcinogenesis, and thus can potentially be an excellent cancer biomarker.

Palindrome Detection by GAPF-seq
A protocol for the Genome-wide Analysis of Palindrome Formation (GAPF) was modi ed to contend with the low amounts of cell-free DNA (cfDNA) in plasma (Fig. 1A).Brie y, input genomic DNA (as low as 30 ng) was rst digested with relatively infrequent cutting enzymes, boiled for denaturation, and immediately quenched in ice to facilitate the rapid self-annealing of palindromic sequences (i.e., snap-back). 27,35The DNA was then digested using the single-strand speci c nuclease S1.The junctions of palindromic DNA can self-anneal following denaturing by boiling to re-form double-stranded DNA (dsDNA) and resistant to S1, whereas non-palindromic sequences will be digested by S1 after denaturation.The remaining dsDNA was ligated with sequencing adaptors and ampli ed during library construction.Resulting sequencing libraries should thus be enriched with palindromic junctions.
The sequencing reads from GAPF-processed DNA were aligned to the hg38 human reference genome and counted in 1 kb non-overlapping bins (Fig. 1A).An accumulation of reads in these 1 kb bins leads to high coverage in the bin and so, high coverage bins (HCBs) could indicate the presence of DNA palindromes.29,36 Indeed, HCBs were seen in known palindromic junctions on chr1 at RPRD2, ECM1, and CTSK in Colo320DM cell line DNA (Fig. 1B). 29To simulate low tumor fraction (TF) as expected from plasma cfDNA, Colo320DM DNA was diluted with IMR90 normal broblast DNA in vitro prior to GAPF.HCBs were detected at RPRD2, ECM1, and CTSK for all dilutions.No such peaks in this region were apparent in GAPF-seq data from IMR90 DNA (Fig. 1B, 100% IMR90).The average normalized read coverage (number of sequencing reads per bin divided by a per-million scaling factor) for HCBs at CTSK was 65.5 for 100% Colo320 input DNA and decreased exponentially by dilution (Fig. 1C).This exponential trend was similarly observed for RPRD2 and ECM1.The lowest peak was detected at ECM1 in 10% Colo DNA with an average read coverage of 3.67.Since the average read coverage genome-wide was approximately 0.31, GAPF-seq can still reliably detect palindromic signals after 10-fold dilution.
With paired-end sequencing, the relative orientation of mates can corroborate the presence of palindromes at ampli ed genomic segments.From WGS data, mates facing in the same direction (e.g., forward-forward, FF, and reverse-reverse, RR) relative to the reference genome can be found at the ends of the ampli ed region within 1q21.Identifying the FF or RR read pairs was possible in publicly-available deep coverage WGS data (Fig. 1D) 37 , but not our shallow WGS, possibly due to the underrepresentation of secondary structure-forming DNA in sequencing libraries.Navigating the secondary structure of palindromic DNA during PCR, a part of library construction processes, is known to be challenging. 38,39PF-seq for Breast Tumor DNA To evaluate the ability of GAPF-seq to differentially call tumor DNA and normal DNA, we rst collected tumor and normal GAPF-seq data from 39 pairs of fresh frozen breast tumors and matched leukocytes (buffy coat) (Supp Table 1).Breast tumors were either triple-negative breast cancer (TNBC) or luminal breast cancer subtypes, with stages ranging from Stage I to Stage IV (Fig. 2A).After GAPF, sequencing reads were counted in 1 kb bins genome-wide and the read coverage was calculated as described above (Fig. 1A).The average read coverage in each cytoband (instead of 1 kb bins) throughout the genome was evaluated to test the reproducibility of GAPF-seq using replicates from seven pairs of tumor and normal DNA (Fig. 2B, Supp Fig. 1).The read coverage at the cytoband level was almost identical, with an intraclass correlation coe cient of 0.9971 (95% CI: 0.9966-0.9974),con rming the reproducibility of the assay.
A vast majority of 1 kb bins (approximately 3,130,000 bins out of 3,209,513 total bins) from either breast tumor or buffy coat had an average read coverage less than 5, likely due to the digestion of nonpalindromic sequences by S1 nuclease.However, the average number of 1 kb bins with read coverage > 5 differed between tumor and normal samples (Fig. 2C).Especially at higher coverage ranges (read coverage > 10), tumor GAPF-seq data exhibited a greater number of bins than normal GAPF-seq data, suggesting the presence and enrichment of tumor-speci c palindromes.This observation formed the basis for de ning HCBs to differentiate tumor and normal DNA.
Since DNA palindromes are often associated with genomic ampli cation, the average copy number (CN) of chromosomal segments from shallow WGS data was plotted against the read coverage of GAPF-seq data for 26 luminal or 5 TNBC tumors (Fig. 2D).Overall, TNBC showed more CNA than luminal breast tumors, consistent with previous reports. 40Genome-wide GAPF read coverage appeared indistinguishable between subtypes.However, by looking at GAPF-seq and WGS data from individual pairs, we noticed unique subtype coverage patterns.For example, for patient 14-990 with TNBC, there was signi cant CNA genome-wide (Supp Fig. 2A).Notably, there was enrichment of DNA by GAPF-seq in cytoband 8q24.3 that correlated to increased CN.In contrast, a patient with luminal breast cancer (14-752) presented CNA and increased GAPF coverage that were localized primarily to cytoband 11q13 (Supp Fig. 2B).The coverage from GAPF-seq data was plotted against that from WGS at the cytoband level (Fig. 2E).Overall, cytobands 11q13.2-5were enriched in luminal tumors, while cytoband 8q24.3 was predominant in TNBC.
Tumor DNA Calling Algorithm by GAPF-seq Upon further inspection of regions of high GAPF coverage in the genome, it was found that several bins were commonly enriched in both tumor and normal samples (Supp Fig. 3).We considered the possibility We considered the possibility of germline palindromes. 41,42Alternatively, particular DNA sequences were relatively more resistant to denaturation.As a result, these sequences would be resistant to S1 nuclease and subsequently would be ampli ed during library construction.It was shown that GC-rich regions of DNA are not effectively denatured during boiling due to high thermostability and thus, can lead to nonpalindromic enrichment during GAPF. 35Because of these technical issues, 1 kb bins consistently enriched in both tumor and normal GAPF samples (75,765 bins, ~2.5% of the entire 3,209,513 bins) were removed in order to identify tumor-speci c de novo DNA palindromes (Supp Table 2).
The remaining 3,133,748 bins were ranked in descending order according to read depth coverage.The observation that HCBs were tumor-speci c (Fig. 2C) led us to test the differential tumor/normal DNA calling using the chromosomal distribution of the top 1,000 HCBs (as GAPF pro les visualized in pie charts) (Fig. 3A).
It was observed that normal DNA GAPF pro les showed HCBs originating from almost every chromosome (Fig. 3B, Supp Fig. 4 and 5, Supp Table 3).In contrast, HCBs in tumor GAPF pro les appeared to originate primarily from one or a few chromosomes, possibly due to the clustering of de novo palindromic sequences in the breast tumor DNA.The even distribution of HCBs to all chromosomes in the normal GAPF pro les suggests randomness of residual dsDNA ampli ed before sequencing, as would be expected from a sample that does not contain any palindromes in single-copy regions of the genome analyzed.We also validated that GAPF pro les were highly reproducible (Fig. 3C, Supp Fig. 6).The apparent tumor-speci c clustering of HCBs on chromosomes hinted at the intriguing possibility that a global threshold could be identi ed to classify samples as tumor or normal (threshold model) without prior knowledge of the precise locations of DNA palindromes.
To evaluate the optimal number of HCBs that can adequately differentiate tumor and normal DNA, receiver operating characteristic (ROC) curves were generated by taking the top 300, 1,000, or 10,000 HCBs and assessing the chromosomal distribution (Fig. 3C, Supp Table 3, 4, and 5).The classi cation criterion by which each sample was determined to be GAPF-positive or GAPF-negative was whether or not HCBs exceeded the global threshold on any one chromosome.For the top 1,000 bins, this global threshold varied from 70, at which point every sample was classi ed as GAPF-positive, to 915, at which point every sample was classi ed as GAPF-negative.The average area under the (ROC) curve (AUC) for the binary classi er using the top 1,000 HCBs was 0.9819 (s.d.± 0.0249) with three-fold cross-validation and three different random assignment seeds.The optimal global threshold was then determined by Youden's J statistic, which attempts to maximize the sensitivity and speci city of the binary classi er.
Using the top 1,000 bins, the optimal threshold was determined to be 100 bins with a sensitivity of 94.02% (s.d.± 2.96%) and a speci city of 99.15% (s.d.± 1.48%).Using the top 300 or 10,000 HCBs instead yielded AUC values of 0.9628 (s.d.± 0.0266) and 0.9050 (s.d.± 0.0556), respectively, indicating that using the top 1,000 HCBs appeared to be the most effective for the binary classi er for this dataset.
Based on the top 1,000 HCBs and threshold model, there were 36 GAPF pro les with at least one chromosome above the 100 bin threshold (GAPF-positive) and 3 tumors misidenti ed as normal (GAPFnegative) (Fig. 3D).Notably, clustering of HCBs on a chromosome was apparent for all stage I tumors (TNBC or luminal), showing that GAPF-seq is able to detect early stage breast cancers with palindromes (Fig. 3B, Fig. 3D).The three GAPF-negative samples were all stage II A luminal breast tumors.Luminal tumors exhibit fewer SVs than TNBC, 40 which might underlie GAPF-negative cases.Because SV often manifests as CNA, we quanti ed CNA for 31 breast tumors using shallow WGS and ichorCNA, a tool for estimating TF in plasma cell-free DNA (Fig. 3G).GAPF-negative tumors were low in CNA; however, several samples with low CNA were GAPF-positive (14-604, 16-094, and 16-263).To further examine tumors with very low CNA, we ran GAPF-seq with 10-fold in vitro dilutions of tumor DNA with paired normal DNA.All samples yielded the expected tumor DNA content after 10-fold dilution of tumor DNA.Based on the threshold (Fig. 3C), seven out of 8 samples were correctly called as tumor DNA suggesting that GAPF pro les can identify tumors with low DNA content (Fig. 3H, Supp Fig. 7, Supp Table 6).

Subtype-speci c Clustering of High Coverage Bins
By looking at chromosomes frequently carrying top 1,000 HCBs above threshold, chr1 and chr8 appeared to be common between luminal and TNBC subtypes, whereas chr16 and chr11 were unique to luminal and chr7 was unique to TNBC (Fig. 4A).We also examined HCBs at the cytoband level where the number of GAPF top 1,000 HCBs in each cytoband from tumor GAPF pro les was compared to that of normal GAPF pro les (Fig. 4B).In TNBC on average, 197.6 of 1,000 HCBs were in 8q24.3, the telomeric end of 8q, in contrast to 28.0 HCBs in normal DNA.The 8q24.3 cytoband was also a hotspot for enrichment in luminal tumors and normal DNA (Fig. 4C).HCBs were often clustered in cytobands close to telomeres.In addition to 8q24.3, clustering was seen in the telomeric cytobands 13q34 (in TNBC) and 16p13.3(in luminal) in a tumor subtype-speci c manner (Fig. 4B, Fig. 4C).The fusions of critically short telomeres followed by BFB cycles 43 may underlie the enrichment of palindromes and HCBs at the cytobands of chromosome ends.Notable exceptions include cytobands 8q24.21(in TNBC) and 11q13.2-5(in luminal), which harbor prevalent breast cancer oncogenes MYC and CCND1, respectively, and are frequently ampli ed.Indeed, the ampli cation of the CCND1 locus was strongly associated with HCBs (Fig. 4D).Complex CNA patterns at 11q13 were common in luminal, with another amplicon telomeric to the CCND1 amplicon associated with HCBs.
Aberrant read pairs spanning palindromic junctions are likely underrepresented, 38,39 and deep coverage would be necessary to identify FF or RR read pairs (Fig. 1D).Nonetheless, we were able to nd multiple RR read pairs at the copy number breakpoint of the amplicon at 11q13.2 with HCBs clustering nearby (Supp Fig. 8).

Machine Learning Algorithms for Binary Classi cation
Aside from the threshold model for our dichotomous diagnostic test, a variety of statistical, probabilistic, and optimization techniques can be applied to genomic data using machine learning (ML) for binary classi cation.An automated end-to-end ML pipeline called STREAMLINE 44 was used for the binary classi cation of the breast tumor and paired normal DNA (as in Fig. 3).The number of top 1,000 HCBs on each chromosome from the GAPF-seq data was input as a two-dimensional array and STREAMLINE attempted to identify the samples as tumor or normal.After basic data cleaning and three-fold crossvalidation partitioning (Fig. 5A), ML models were trained using Naïve-Bayes (NB), random forest (RF), extreme gradient boosting (XGB), category gradient boosting (CGB), and extended supervised tracking and classifying system (ExSTraCS) algorithms.All models yielded very high AUC values (Fig. 5B).ML modeling was repeated using two additional random assignment seeds in order to minimize batch effects from partitioning, and the AUC results from all three seeds were averaged together (Fig. 5C).Average AUC values were greater than 0.93 for all models, with the best-performing ML model being created using XGB (Avg.AUC = 0.958).Each unique seed yields a predicted class for a sample, and so, a heatmap visualizes the congruence of true positive outcomes (Supp Table 7).Binary classi cation was mostly consistent between models, with a few exceptions (Fig. 5D).We found improvements over the threshold model in calling 14-544T, which was correctly called by RF and CGB models when tumors with low TF are often mis-classi ed as GAPF-negative as in Fig. 3G.In contrast, many non-linear prediction models incorrectly classi ed 14-828T, 16-412T, and 16-298T.
The automated ML pipeline STREAMLINE also estimated the relative importance of each chromosome toward accurately predicting tumor or normal DNA based on the GAPF pro le (Fig. 5E).Feature importance was generated by randomly permutating one chromosome at a time in the testing data and evaluating its impact on the performance of each ML model. 45Interestingly, non-linear classi cation models RF, XGB, CGB, and ExSTraCS all deemed chr2 as most important.In particular, RF and CGB based their models almost exclusively on the GAPF HCB data from chr2, which stand in contrast to our threshold model.The top ve chromosomes (chr2, chr10, chr14, chr15, chr18) identi ed as important for binary classi cation by RF, XGB, CGB, and ExSTraCS all demonstrated depletion in tumor GAPF pro les relative to normal DNA pro les.For example, normal samples had an average of 51.3 of the top 1,000 HCBs on chr2 whereas tumor tissue had only 24.3 (Supp Table 3).In contrast, for chromosomes frequently carrying HCBs above the threshold such as chr8 and chr11, the average numbers of HCBs were 51.3 and 64.8 in normal and 135.9 and 121.2 in tumor, respectively.Only the linear classi er algorithm Naïve-Bayes yielded high feature importance scores for chr1, chr8, and chr11, all of which were frequently above threshold in our global threshold model approach.

Long cfDNA Fragments and Tumor Fraction
With the ability to differentiate tumor DNA from paired normal DNA, we next tested GAPF-seq using plasma-borne cfDNA (cfGAPF-seq).For GAPF-seq, su ciently long strands of dsDNA are needed so that both arms of DNA palindromes are physically tethered together in a continuous fragment to facilitate intrastrand annealing by snap-back.However, current mainstream methods for cfDNA extraction based on a nity-columns or -beads predominantly yield fragments bound to mono-or di-nucleosomes, which would be too short to capture palindromes.Whether very long ctDNA fragments exist in the plasma of cancer patients is under debate. 46To examine whether or not long cfDNA fragments exist and are as feasible for cancer detection as short cfDNA fragments, we isolated cfDNA using 0.5 mL plasma from 10 advanced prostate cancer (PCa) patients using either a nity-beads or traditional phenol-chloroform extraction (PCE) (Fig. 6A).Advanced PCa patient plasma should carry su cient ctDNA for analyses whereas the majority of our breast tumor samples are Stage II or earlier and less likely to have high amounts of ctDNA in plasma.Additionally, the use of PCa cfDNA serves as an initial test for cfGAPF-seq as a pan-cancer approach.The bead-based cfDNA isolation kit extracted 11 ng of cfDNA per 500 mL of plasma on average with only two samples yielding more than 10 ng.In constrast, PCE yielded 49 ng on average with at least 27 ng from every sample.Because of the low amount of cfDNA extracted by the bead-based isolation kit, there often was not enough cfDNA for additional characterization experiments in addition to sequencing.Furthermore, whereas the bead-based cfDNA isolation kit yielded mono-, di-, and even some tri-nucleosomal cfDNA, extraction using PCE yielded a broad range of cfDNA fragment sizes in addition to the nucleosome-sized fragments (Fig. 6B).Notably, there were fragments longer than 10 kb, which would be amenable to snap-back.We then used fragmentation and NGS library construction for shallow WGS to evaluate TF using ichorCNA.Patterns in CNA appeared similar between a nity beadextracted and PCE cfDNA (Fig. 6B and Fig. 6C).Slightly lower TF was observed in PCE cfDNA extracted from either 262L or 322J.To examine whether the decreased TF was due to lower TF in long fragments of the library, we evaluated TF for paired sequencing reads with short (<200 bp) and long (>200 bp) insert sizes (Fig. 6D), assuming that read pairs with >200 bp insert sizes likely came from long DNA fragments after fragmentation rather than mononucleosomal fragments.TF was equally represented in both populations of size-selected reads indicating that long cfDNA fragments would be feasible for cancer detection.
High Coverage Bins on chrX as a Determinant of Binary Classi cation GAPF pro les with the top 1,000 HCBs were generated from cfDNA along with matched leukocyte DNA from the buffy coat from 10 PCa patients (Fig. 7A, Supp Fig. 9, Supp Table 8).As with breast tumor and buffy coat, HCBs from the normal DNA were distributed to nearly every chromosome.In contrast, PCa cfDNA exhibited modest clustering of HCBs.An ROC curve with the top 1,000 HCBs showed that the AUC was 0.887 (s.d.± 0.131) (Fig. 7B).The optimal global threshold determined by Youden's J statistic was 90 bins, yielding a sensitivity of 87.04% (s.d.± 15.65%) and a speci city of 79.63% (s.d.± 21.29%).The TF for the cfDNA samples were grouped based on this output from GAPF pro les by our threshold model (Fig. 7C).Overall, the 5 samples with TF>0.1 were all called GAPF-positive.The lone false negative sample (353B) did not have the lowest estimated tumor fraction and two samples (324D and 213I) were correctly classi ed as GAPF-positive despite an estimated tumor fraction of zero.STREAMLINE was then used to generate binary classi cation models.The AUC values from each cross-validation set from 3 different models were averaged together and the best-performing model, on average, used the NB algorithm (Avg.AUC = 0.905) (Fig. 7D).A heatmap visualized the congruence of true positive outcomes against the castration status, prostate-speci c antigen (PSA) level prior to treatment, and TF of PCa cfDNA (Fig. 7E, Supp Table 9).Increased PSA and TF correlated with more accurate classi cation by GAPF-seq and castration resistance.
Since plasma cfDNA and DNA extracted from the buffy coat have different pre-analytical steps, the binary classi cation results could ostensibly result from differences in sample processing.cfGAPF-seq was performed on 11 plasma samples from males without any known diseases (Supp Fig. 10, Supp Table 10).[mention AUC] (Fig. 7F).These healthy "normal" plasma samples were classi ed along with the 10 plasma samples from PCa patients using the threshold model approach and ML modeling.The average AUC values from three different models for each algorithm were averaged together, and the bestperforming model used the CGB algorithm (Avg.AUC = 0.965) (Fig. 7G).Models using NB, RF, XGB, and CGB all considered chrX as the rst or second most important feature for binary classi cation (Fig. 7H).Because none of the samples had a seemingly signi cant number of top 1,000 bins on chrX, our threshold-based calling was insensitive to chrX.Therefore, we independently evaluated chrX HCBs in calling tumor cfDNA (Fig. 7I).While cfDNA from healthy individuals or buffy coat DNA from PCa patients contained zero or one top 1,000 HCBs, quite a few of these bins were mapped to chrX in cfDNA GAPF pro les from PCa patients.Indeed, HCBs on chrX are statistically different between PCa cfDNA and either matched normal DNA (p < 0.001, Wilcoxon signed-rank test) or healthy individuals' cfDNA (p = 0.009, two-sided Wilcoxon rank-sum test).Because the androgen receptor gene (AR) is located on chrX and causes therapy-resistance in advanced PCa [47][48][49] , we examined AR ampli cation (AR-amp) by shallow WGS.AR-amp was notable in four patients with therapy-resistant tumors and very high PSA, with dozens of top 1,000 bins within the ampli ed regions (Fig. 7J).In other patients, AR-amp was not evident, and HCBs were fewer and scattered throughout chrX.These PCa were therapy-sensitive at the time of the blood draw but would eventually progress to resistance and life-threatening disease.HCBs on chrX, as assessed by cfGAPF-seq could be an initial sign of chrX-wide rearrangement that would eventually leads to AR-amp, which could be consistent with DNA palindrome formation as an initial step of gene ampli cation.

Discussion
In this study, we tested the feasibility of our genome-wide assay targeting DNA palindromes (GAPF-seq) for liquid biopsy-based cancer detection.Because DNA palindromes are mechanistic components of BFB cycles, which are crucial drivers of genome instability and genomic ampli cation, targeting DNA palindromes for cancer detection seemed a rational strategy to test.GAPF is designed to enrich DNA palindromes and eliminate non-palindromic DNA.Therefore, GAPF-seq could detect cancer-speci c DNA with high sensitivity in plasma cfDNA with low TF.The idea was tested in this study using samples with in vitro serial dilution of tumor DNA with normal DNA, and cfDNA with various TFs.
GAPF-seq would be uniquely positioned for liquid biopsy targeting SVs if developed as a cancer detection test.Fold-back inversions can be identi ed successfully by WGS from tumor DNA. 26,50However, WGS requires decent coverage (>30x).In cfDNA, DNA representing fold-back inversions are diluted, the detection of which would require ultra-deep coverage.Ultra-deep WGS would be cost-inhibitory and requires signi cant time and effort for analyses.Targeted sequencing for gene mutation detection in cfDNA would have a higher sensitivity for cancer detection, as exempli ed by detecting tumor-informed mutations. 51,52However, ampli ed oncogenes lack mutations in most instances.Gene ampli cation not only occurs in treatment-naïve primary tumors for activating oncogenes but also occurs later during disease progression to confer therapy resistance, which primary tumors would not inform.Early detection of resistance would allow physicians to consider treatment options early when resistant tumors are small.Thus, frequent monitoring by liquid biopsy and timely detection of gene ampli cation is important.GAPF-seq with simple cfDNA processing, low coverage NGS (~7.5x coverage), and streamlined data analysis seems practical for frequent monitoring.
The proof of concept was tested in this study for AR-amp in lethal PCa.AR-amp is a major genomic change associated with advanced PCa. 49 AR signaling is a major driver of PCa progression.In recent years, contemporary AR signaling inhibitors (e.g.enzalutamide, apalutamide, abiraterone) have signi cantly improved the overall survival of patients. 53,54Despite these advances, resistance eventually develops.The underlying CRPC eventually metastasizes leading to death.The only indication of resistance is a rising PSA which is typically only veri ed later.Given mounting data pointing toward the bene ts of early initiation of chemotherapy, a test indicating resistance to AR suppression early would provide an opportunity to pivot the therapeutic approach that could potentially result in improved survival. 47We found that genomic regions enriched by GAPF on chrX were almost exclusively seen in PCa plasma cfDNA (Fig. 7I): all four cases with anti-AR therapy resistance and ve out of six therapysensitive cases.Resistance cases were associated with high TF in cfDNA and AR-amp (Fig. 7J).These results may suggest that palindrome formation and BFB cycles underlie AR-amp.The other six cases did not show AR-amp; however, HCBs scattered throughout chrX.Such bins may be early signs of BFB cycles and chromosome-wide rearrangements that eventually lead to AR-amp.If that is the case, these patients will eventually develop AR-amp.If not, HCBs in chrX could be a simple indicator of cancer DNA.
Systematic studies with longitudinally collected early and late-stage samples would distinguish these two possibilities.
We employed chromosomal distributions of HCBs (GAPF pro les) as outputs from GAPF-seq and tested their ability to distinguish breast tumor DNA samples from normal DNA.The output worked well in most cases, but several tumors (luminal) DNA and at least one cfDNA case are inconsistently classi ed.We found that all of these cases had low CNA measured by ichorCNA.It is possible that tumor-derived DNA was very rare in these samples.This is most likely the case for PCa cfDNA, because disease burden, estimated by PSA, was low for inconsistently classi ed or GPAF-negative cases.It is also possible that GAPF-seq pro les are not a good measure of tumor DNA when chromosomal rearrangements are rare.In this regard, all three GAPF-negative cases were from luminal tumors, which are known to be driven by hormone-dependent signaling. 55Understanding the power of GAPF-seq to differentiate tumor DNA from normal DNA for each tumor tissue of origin would thus be necessary.
In many cases, the relatively straightforward threshold approach yielded better binary classi cation results with respect to AUC than the ML models in many cases.In contrast with the ML models, which only receive HCBs as input data, the simple threshold models have the advantage of incorporating domain knowledge; i.e. ( 1) the understanding that a burden count above some threshold may be informative and (2) there is an informative higher-level feature abstraction, i.e. one or more chromosomes have a burden above the target threshold.In order for a machine learning algorithm to train a model similar to the threshold models, it would need to rst learn the abstract concept which decides "IF" any of the HCB features in the dataset have a value over a target threshold.The challenge of this abstraction is re ected in long standing computer science benchmark problems such as the "majority-on" or "evenparity" problems wehre simultaneousl knowledge of multiple features values is required to solve the problem (i.e., the majority of features in a set above some threshold, or the sum of feature values in a set is even, respectively). 56,57This ML challenge can be addressed in future work via feature engineering; applying domain knowledge to construct new features that capture information across all chromosomes (e.g.construct a binary feature identifying if any chromosome in the dataset has an HCB above a given threshold).This would be expected to improve ML binary classi cation.Beyond feature engineering, additional opportunities to improve the predictive performance of GAPF-seq with ML include (1) utilizing more sophisticated ML algorithms, (2) conducting a more extensive optimization of algorithm hyperparameters, and (3) combining base models trained by different ML algorithms into a predictive ensemble.
Nevertheless, utilizing chrX-based tumor/normal differentiation for PCa cfDNA/normal DNA was guided by insight from ML (Fig. 7I).Because HCBs were few in chrX, none of the cfDNA samples reached the threshold.However, the estimation of feature importance revealed that ML algorithms strongly relied on HCBs from chrX for binary classi cation (Fig. 7H).Thus, ML pointed to a feature that humans (i.e the authors) did not take into consideration.Feature importance is the most popular technique to explain why the ML models behave the way they do, the importance of which is pronounced in the medical eld. 58he importance of DNA palindromes in chrX is easily interpretable, given that AR-amp is a primary driver -20 o C overnight.The mixture was centrifuged at 14,000 g for 30 min at 4 o C and the pellet was washed with 70% ethanol and resuspended in low TE buffer.cfDNA concentration was measured using the dsDNA HS Assay Kit (Life Technologies, Q32851) and a Qubit 3.0 uorometer.DNA fragment analysis was performed using an Agilent 2100 Bioanalyzer system.

Genome-wide Analysis of Palindrome Formation:
We used a modi ed protocol of the Genome-wide Analysis of Palindrome Formation (GAPF) 35 for small amounts of DNA.Brie y, either 100 ng or 30 ng of input genomic DNA was digested with either KpnI or SbfI, both of which are relatively infrequent cutting enzymes, at 37 o C for 16 hr.The restriction enzymes were heat-inactivated at 65 o C for 20 min and the reaction products were mixed.The mixture was boiled for 7 min in 100 mM NaCl and 50% formamide and immediately quenched in ice water for 5 min.DNA was digested with 100 U/mg of S1 nuclease at 37 o C for 1 hr and processed with the ChargeSwitch PCR Clean-up Kit or Monarch PCR and DNA Cleanup Kit.
Puri ed DNA was fragmented using the Covaris M220 and the libraries were constructed using the NEBNext Ultra II DNA Library Prep Kit for next generation sequencing.Alternatively, GAPF libraries were constructed with the NEBNext Ultra II FS DNA Library Prep Kit, which uses fragmentase.Whole genome sequencing libraries were constructed using 40 ng of genomic DNA and the NEBNext Ultra II FS DNA Library Prep Kit.

GAPF-seq Quality Control:
To determine the quality of palindrome ampli cation by GAPF-seq, we validated the sequencing data using known inverted repeats on chromosomes 1, 9, 17, 19, and X (Supp Table 11).These short, inverted repeats are infrequently cut by the restriction enzymes and readily ampli ed by GAPF.Reads were counted at the inverted repeats at the X chromosome genes DMRTC1, SSX2, SSX4, CXorf49, CXorf51A, PNMA6A, and SPANXA2 as well as the inverted repeats at HIST2H3C on chr1, FAM225A on chr9, SNORD3A on chr17, and 19p13.2.The normalized read counts for each of the regions were averaged to determine an ampli cation score.Because these regions are repeated within the reference genome, ltering based on mapping quality score that only returns reads that uniquely map to the reference genome should deplete the normalized read counts (samtools view options: -b -q 40).Thus, ltered values for each region were averaged to determine a depletion score.GAPF-seq passes quality control with an ampli cation score above 1.0 and a depletion score less than 0.075 demonstrating adequate enrichment of DNA palindromes and elimination of non-palindromic or repeated sequences.
GAPF Background Bins: Due to their repetitive structure and high GC content, certain regions of the genome such as Alus, LINE elements, short tandem repeats, and segmental duplications can result in false positives or obscure the identities of DNA palindromes.To address this, we rst applied a lter based on mapping quality score (MAPQ > 40) that returns the reads that uniquely map to the reference genome (hg38).Then, we removed regions that were commonly enriched in both tumor and normal samples.Out of the original 3,209,513 non-overlapping 1 kb bins, this curation resulted in 3,133,748 bins for further GAPF analysis.(Supp Table 2).

Cytoband Analysis:
Coordinates for cytogenetic bands in hg38 were downloaded from the UCSC genome browser (cytoBand.txt.gz,2022-10-28 21:15).After removing the GAPF background bins (Supp Table 2), reads were counted in cytobands in order to calculate the coverage for each cytoband.There were 1,293 cytobands downloaded from the UCSC genome browser.Unlocalized sequences were removed from the analysis leaving a total of 863 cytobands for subsequent analyses.For gures containing 1-kb HCBs in cytobands, the number of top 1,000 high coverage GAPF bins corresponding to each cytoband were counted.

Automated Machine Learning Pipeline (STREAMLINE):
Machine learning classi cation modeling and evaluation was conducted using version 0.2.5 of the Simple Transparent End-To-End Automated Machine Learning Pipeline (STREAMLINE) 44 applying mostly default pipeline run parameters (exceptions identi ed below).STREAMLINE is a recently developed AutoML tool that facilitates ML modeling while enforcing rigorous modeling algorithm comparisons, analysis transparency, reproducibility, and sensitivity to complex associations in data.In the rst year of its release it has been successfully applied to various biomedical ML modeling and data mining tasks. 61,62Input data were the number of top 1,000 HCBs on chromosomes 1-22 and X (23 features) unless otherwise stated.Three-fold cross-validation (CV) was used in the predictive modeling using the following classi cation algorithms: Naïve-Bayes (NB), Random Forest (RF), Extreme Gradient Boosting (XGB), 63 Category Gradient Boosting (CGB), 64 -wide Analysis of Palindrome (GAPF) enriches structural variants.a DNA palindromes are genomic regions with sequences that are reverse compliments of each other and could span large genomic segments.Genomic DNA was rst digested with restriction enzymes (RE), by which only palindromic junctions folded-back to hairpin-capped dsDNA after denaturing DNA by boiling and rapidly renaturing in an ice water quench.Non-palindromic DNA, including DNA liberated from palindromic junctions by RE, was digested by S1 nuclease.Note that S1 also digested the hairpin part of DNA from palindromic junctions.Following adapter ligation to dsDNA and ampli cation by PCR was done as a part of NGS library construction.Read coverage was quanti ed in 1 kb bins genome-wide.b Read coverage in 1 kb bins for genomic or GAPF DNA is visualized using the Integrative Genomics Viewer (IGV).Colo320 colorectal adenocarcinoma cancer cell line DNA is diluted by IMR90 normal broblast DNA prior to GAPF.Read coverage in GAPF data show peaks at known palindromic junctions on chr1 at RPRD2, ECM1, and CTSK.c The GAPF read coverage decreases following an exponential trend at RPRD2 (R2 = 0.9908), ECM1 (R2 = 0.9848), and CTSK (R2 = 0.9776) as the amount of diluting IMR90 normal DNA increases.d Paired sequencing reads can have anomalous orientations such as "reverse-reverse" (RR) or "forward-forward" (FF) at sites of structural rearrangements near known palindromic junctions in Colo320 cancer cell line DNA (left).A schematic for how palindromes create RR or FF read pairs is presented (right).

Figure 2 High
Figure 2

Figure 3 High
Figure 3

Figure 5 A
Figure 5

Figure 7 GAPF
Figure 7 or Extended Supervised Tracking and Classifying System (ExSTraCS).65Standardized within STREAMLINE, all algorithms with hyperparameters (excluding NB) undergo an Optuna-driven hyperparameter sweep 66 with 200 target trials and a timeout of 15 min across a broad range of hyperparameter options.In addition to evaluating all models across 16 classi cation metrics, STREAMLINE also calculates model feature importance estimates uniformly for each algorithm and model using permutation feature importance estimation.STREAMLINE model training was repeated for each analysis using random seeds of 22, 32, or 42 unless otherwise stated.