Discovery of structural deletions in breast cancer susceptibility genes using whole genome sequencing data from >2,000 African ancestry women

Structural deletions in breast cancer susceptibility genes could contribute to cancer risk, but remain poorly characterized. Here, we conducted in-depth whole genome sequencing (WGS) in germline DNA samples from 1,340 invasive breast cancer cases and 675 controls of African ancestry to discover such deletions. We identified 33 deletions, including five protein-truncating deletions in BRCA1, BRCA2, RAD51C, GEN1, and BRIP1, that were observed only in cases and not in controls. Three deletions, including one protein-truncating deletion in TP53, were found to have a higher frequency in cases than in controls. In total, 4.6% of cases and 0.6% of controls carried any of these 36 deletions, resulting in an odds ratio (OR) of 8.0 (95% CI = 2.94-30.41). In addition, we identified a low-frequency deletion in NF1 associated with breast cancer risk (OR = 1.93, 95% CI = 1.14-3.42). These findings have significant implications for genetic testing for this common cancer.

Structural variant (SV) deletions can lead to the loss of genomic DNA fragments ranging from a few hundred to a million bases [14][15][16]. Consequently, SV deletions could have a larger impact on gene functions than SNVs and indels 15,17. However, it is challenging to systematically investigate SV deletions using the targeted sequencing assays that have been implemented in most previous studies. Deep whole-genome sequencing (WGS) techniques enable a systematic survey of the whole genome to identify all genetic variants. We recently reported the identification of six novel deletions in breast cancer susceptibility genes in a small WGS study conducted among 128 patients of Asian and European descent 10. To expand the study and evaluate breast cancer genetic risk variants in African ancestry women, we analyzed WGS data generated in a large study conducted as part of the African American Breast Cancer Genetic (AABCG) consortium and the Ghana Breast Health Study (GBHS).

Results
The demographic and clinical characteristics of 1,340 invasive breast cancer cases and 675 controls selected from the AABCG consortium and the GBHS are presented in Table 1. The mean age at diagnosis of breast cancer cases was 54.8 years old (standard deviation (SD) = 7.7 years), and the mean age of controls was 58.0 years old (SD = 6.9 years). Approximately 14.8% of cases and 1.0% of controls reported a family history of breast cancer. Approximately 38.1% of the cases had hormone receptor negative tumors.

The summary of the identified potential loss-of-function deletions
We identified a total of 80 deletions in the intragenic regions of the 29 established or suspected susceptibility genes (Supplementary Data 4). The median length of these deletions was 405 bp (range 50 bp to 32.8 kb), and the majority of them were low frequency or rare (78.8% with a deletion carrier frequency < 0.01). Of these deletions, 33 were present only in cases (n = 44) and not in controls. In contrast, 16 deletions were present only in controls (n = 17) and not in cases. The majority of these 49 deletions were located in intronic regions (83.7%). Although overall there was no significant case-control difference in the frequency of carriers of the deletions detected in this study, cases were significantly more likely to carry deletions either in coding/exonic regions or in intronic regions with evidence of epigenetic signals (P < 0.05; 2.6% in cases and 1.2% in controls). We also observed that four particular deletions showed a higher frequency in cases than in controls (odds ratio (OR) > 1.5 for each deletion; Supplementary Data 4).

Potential loss-of-function deletions present only in cases
Of the 33 deletions that were present only in cases (n = 44, 3.3% of 1,340 cases), we found five putative pathogenic deletions in the exonic or coding regions of BRCA1, BRCA2, RAD51C, GEN1, and BRIP1, and 28 in the intronic regions of 18 other genes (Table 2; Supplementary Data 4). Several deletions were found in all participating studies (Supplementary Data 5 and 6). No cases carried more than one deletion. Eleven of these deletions were identified in seven established breast cancer susceptibility genes, including BRCA1, BRCA2, PTEN, CDH1, NF1, STK11 and CHEK2, and the remaining 22 were identified in the 15 putative breast cancer susceptibility genes (Table 2). One of these deletions, located in the PTEN gene, had been reported in our previous study conducted in Asian and European descendants 10; in the present study, the deletion was found in nine cases and no controls (P = 0.03) (Supplementary Data 4).
Three cases carried potential loss-of-function deletions in the BRCA1 or BRCA2 genes, accounting for 0.22% of cases under study (Table 2). Of these, two putative pathogenic deletions are located in coding regions: a 3.34 kb deletion in the BRCA1 gene involving three exons (Fig. 1a) and a 32.8 kb deletion in the BRCA2 gene, with a loss of nine exons (Fig. 1c). The third deletion (3.41 kb) is located in the first intron of the BRCA1 gene. This deletion may involve functional elements, as indicated by ChIP-seq enriched peaks (Fig. 1b).
Sixteen cases carried potential loss-of-function deletions in five other established breast cancer susceptibility genes, including PTEN, CHEK2, NF1, CDH1, and STK11, accounting for 1.2% of the total investigated cases (Table 2). Of these deletions, three (~0.9 kb, 7.1 kb, and 0.5 kb) were identified in PTEN, one (~3.0 kb) in CHEK2, one (~9.8 kb) in NF1, two (~3.8 kb and 1.1 kb) in CDH1, and one (~0.3 kb) in STK11. All of these deletions can lead to the loss of intronic sequences of these breast cancer susceptibility genes. In addition, these deletion regions are likely to involve potential regulatory elements, as indicated by epigenetic signals such as histone modifications, DNase I hypersensitive sites and ChIP-seq enriched peaks (Supplementary Data 4).
Twenty-five cases carried potential loss-of-function deletions in 15 putative breast cancer susceptibility genes, accounting for 1.9% of the total investigated cases. Of the 22 deletions observed in these genes, three putative pathogenic deletions involve exonic or coding sequences, including one (~140 bp) in BRIP1, one (~82 bp) in GEN1 and one (~4.9 kb) in RAD51C (Table 2). The other 19 deletions can lead to the loss of intronic sequences of these breast cancer susceptibility genes (Table 2). Of these, 11 deletion regions are located in nine genes, including BMPR1A, POLE, AKT1, RAD51D, MSH2, MSH6, XRCC2, FANCM and FANCC, and may involve regulatory elements supported by epigenetic signals (Supplementary Data 4).

Rare deletions with higher frequency in cases than controls
We identified three potential loss-of-function deletions in breast cancer susceptibility genes, each showing a higher frequency in cases than in controls (Table 2; Supplementary Data 4). The deletion in TP53 (~1.6 kb in the coding region) results in a loss of the whole last exon and the 3' untranslated region (UTR). This deletion was observed in six cases and one control in the present study and has been reported in our previous study 10. The other two deletions were observed in the intronic regions of GEN1 and MSH6 and overlapped epigenetic signals (Table 2; Supplementary Data 4).
Taken together, the above three potential loss-of-function deletions, together with those observed only in cases, were present in 4.6% of cases (n = 61) and 0.6% of controls (n = 4). Carrying any one of these 36 deletions was associated with an 8.0-fold increased risk of breast cancer (95% CI = 2.94-30.41, P = 1.5 × 10⁻⁷) (Table 2). We observed significantly different frequencies of deletion carriers among studies (e.g. 3.6% and 1.4% for African Americans and Ghanaians, respectively; P = 0.01; Supplementary Data 6).

A low-frequency deletion associated with breast cancer risk
We identified a low-frequency deletion (~72 bp) in the intronic region of NF1 associated with breast cancer risk (OR = 1.93, 95% CI = 1.14-3.42, P = 0.01) (Table 3). The deletion region may involve regulatory elements, as indicated by epigenetic signals including a ChIP-seq enriched peak (Supplementary Data 4). The deletion was carried by 5.3% of cases and 2.8% of controls.

Discussion
This is the first large study that uses deep WGS technology to systematically search for putative pathogenic SV deletions in the established and putative breast cancer susceptibility genes in African ancestry populations. We identified 37 potential loss-of-function deletions that likely contribute to breast cancer susceptibility, including several putative pathogenic deletions with strong evidence of the loss of coding or exonic regions in the susceptibility genes. Of these, 33 deletions, identified in seven established and 15 putative susceptibility genes, were present only in cases. Thirty-six of these deletions were seen in 4.6% of cases and 0.6% of controls under study, resulting in an 8-fold elevated risk of breast cancer among deletion carriers. One of these deletions, located in PTEN, was associated with an increased risk of breast cancer in our previous study of European populations 10. These identified potential loss-of-function deletions should be considered in genetic testing or in determining cancer therapies, such as platinum-based chemotherapy and inhibitors of poly(ADP-ribose) polymerase (PARP) commonly recommended for patients with BRCA1/2 mutations.
Two of the observed deletions, which remove coding sequences of BRCA1 (3.34 kb, covering three exons) and BRCA2 (~32.8 kb, covering nine exons), are highly likely to be functionally significant as they disrupt gene function. In addition to these two well-established breast cancer susceptibility genes, we showed that the loss of coding or exonic sequences in putative breast cancer genes, including BRIP1, GEN1, and RAD51C, could affect their gene functions, thus contributing to breast cancer susceptibility. Of the 31 potential loss-of-function deletions identified in the intronic regions of breast cancer susceptibility genes, 23 (74.2%) may lead to the loss of potential regulatory elements, consequently disrupting the regulation of gene expression. For example, we observed H3K4me1, DNase clusters and TF ChIP-seq binding sites in the deletion in the intronic sequence of PTEN (Fig. 2). Future functional exploration and large-scale case-control association studies are warranted to confirm the role of these deletions in breast cancer susceptibility.
We searched public SV datasets from the gnomAD database for the 37 SV deletions identified in our study and found that 28 deletions (75.7%) have not been reported in this database. For the remaining nine SVs that overlapped with gnomAD, no studies have reported their potential functional roles in disease susceptibility (Supplementary Data 7). In addition to our reported 37 SV deletions, we also identified three deletions, located in the coding or exonic regions of POLE (~505 bp), PPM1D (~107 bp) and STK11 (~1 kb), that were present only in controls (Supplementary Data 4). Interestingly, the carriers of these deletions did not show either a relatively young age or a family history. It is possible that these deletions do not contribute significantly to breast cancer susceptibility, although follow-up studies are needed to confirm these findings.
It should be noted that our computationally predicted deletions lacked technical validation. However, we demonstrated the validity of our deletion calling in our previous study 10. Specifically, to enhance the accuracy of SV deletion determination, we applied multiple commonly used SV detection algorithms to generate a consensus set from at least two overlapping SV callers. In addition, we developed comprehensive bioinformatics analysis strategies that analyze informative reads to re-genotype the identified deletions. In particular, the number of supporting reads and the ratio of mapped reads in the deletion region relative to its flanking region were applied to remove potential false positive findings. Of the 80 identified deletions, 76.3% (61/80) were identified consistently by both LUMPY and Manta, providing some assurance for the validity of SV calling in our study. Furthermore, we manually reviewed each of the identified deletions using the Integrative Genomics Viewer (IGV) and samplot. As demonstrated in Figs. 1 and 2, the sequencing reads were markedly reduced in the deletion regions compared to the flanking regions. These results provide strong evidence that the identified deletions were reliable.
In conclusion, this is the largest study using WGS data to search for potential loss-of-function deletions, especially putative pathogenic deletions, in breast cancer susceptibility genes. We provided strong evidence for the presence of putative pathogenic SV deletions in breast cancer genes. Our study reveals a large number of newly identified potential loss-of-function and putative pathogenic SV deletions in African ancestry women, and these novel findings may improve clinical testing and selection of cancer treatment.

Study populations
This study included 1,340 invasive breast cancer cases and 675 cancer-free controls from five studies, the Southern Community  these studies have been previously published and are briefly described below [18][19][20][21][22]. The SCCS is a prospective cohort study that recruited approximately 86,000 study participants aged 40-79 years between 2002 and 2009 from 12 states in the southeastern U.S. Approximately 32,500 of the SCCS participants were African American women. Cancer cases, including those diagnosed with breast cancer, were identified via linkage to state cancer registries. The NBHS is a population-based case-control study conducted in the Nashville metropolitan area. Participants were recruited between 2001 and 2008. Breast cancer cases were identified through the Tennessee State Cancer Registry and five major hospitals in Nashville that provide medical care for breast cancer patients. Controls were identified via random digit dialing of households in the same geographic area as the cases. The MEC is a prospective cohort study conducted in Hawaii and the Los Angeles area and included 215,251 study participants recruited between 1993 and 1996. African American study participants were recruited from the Los Angeles area. Incident cancer cases were identified through two state-wide Surveillance, Epidemiology and End Results (SEER) registries: the Hawaii Tumor Registry and the California State Cancer Registry. The STSBHS is a population-based case-only study conducted in Tennessee, South Carolina and Georgia. Participants have been recruited since 2012. Breast cancer cases were identified through the Tennessee Cancer Registry, the South Carolina Central Cancer Registry and the Georgia Comprehensive Cancer Registry, also a SEER registry. The GBHS is a population-based case-control study conducted in Accra and Kumasi, Ghana, between 2013 and 2015. Breast cancer cases were identified through three major hospitals and controls were frequency matched to cases by age and district of residence.
In the present study, we selected cases and controls according to the following criteria. For both cases and controls, these criteria were used: 1) blood or saliva samples available; 2) an African ancestry estimate of > 70%, if the estimate was available. For breast cancer cases, the following additional criteria were used: 1) ER status known or tissue available for ER measurement; 2) for ER+ cases, age at diagnosis < 65 or a family history of breast cancer; 3) for ER- cases, age at diagnosis < 70 or a family history of breast cancer; 4) for both ER+ and ER- cases, exclusion of potential BRCA carrier cases with age at diagnosis < 45 and a family history of breast cancer. For controls, the following additional criteria were used: 1) no diagnosis of any cancer except non-melanoma skin cancer and no family history of any cancer at the last follow-up; 2) age at the last follow-up > 60, with 50% of them aged 60-64 years and 50% aged 65-70 years.

Whole genome sequencing library construction
Whole-genome sequencing was performed using the Illumina HiSeq X Ten and BGISEQ-500 platforms. For HiSeq X Ten sequencing, 1 μg of genomic DNA was randomly fragmented by Covaris, followed by purification with an AxyPrep Mag PCR clean up kit. The fragments were end repaired with End Repair Mix and purified afterwards. The repaired DNA was combined with A-Tailing Mix, then the Illumina adaptors were ligated to the 3'-adenylated DNA, followed by purification of the products. The products were selected based on insert size. Several rounds of PCR amplification with PCR Primer Cocktail and PCR Master Mix were performed to enrich the adapter-ligated DNA fragments. After purification, the library was qualified by the Agilent Technologies 2100 bioanalyzer and ABI StepOnePlus Realtime PCR System and was sequenced paired-end using HiSeq X Ten. For BGISEQ sequencing, 1 μg of genomic DNA was randomly fragmented by Covaris. The fragmented DNA was selected with an Agencourt AMPure XP-Medium kit to an average size of 200-400 bp. The selected fragments were subjected to end repair, 3' adenylation, adapter ligation and PCR amplification, and the products were recovered with the AxyPrep Mag PCR clean up Kit. The double-stranded PCR products were heat denatured and circularized by the splint oligo sequence. The single-strand circle DNA (ssCir DNA) was formatted as the final library and qualified by QC. The qualified libraries were sequenced on a BGISEQ-500 platform.

Whole genome sequencing data processing
All sequenced samples reached an average sequencing depth of at least 30X. Bcl2fastq2 Conversion Software (Illumina) was used to generate de-multiplexed FASTQ files. FASTQ files were processed according to the GATK best practices pipeline 23. The sequencing reads were aligned to the human reference genome (GRCh38) using the Burrows-Wheeler Aligner (BWA) program (version 0.79a) 24. The mapped reads were further processed by removing duplicated reads using MarkDuplicates from the Picard tool (http://picard.sourceforge.net/) and recalibrating the base quality scores using BaseRecalibrator. An analysis-ready BAM file was generated for each sample for subsequent SV deletion calling.

SV deletion calling and quality control
We systematically searched SV deletions using six SV callers, including GenomeSTRiP 25  We developed a pipeline to merge and compare SVs from different callers. To generate a consensus call from different SV callers, the initial merging step was implemented using the SURVIVOR tool with the recommended parameters: 1) a maximum distance of 1 kb between breakpoints, 2) detection by at least two SV callers, 3) location on the same strand and 4) a minimum SV deletion length of 50 bp 31. For the consensus set of SV deletions in each sample, we further performed re-genotyping by analyzing informative reads (i.e., split reads) from the analysis-ready BAM file using the SVTyper tool 32. To remove potential false-positive SV deletions, we filtered out deletions if there were fewer than seven mapped reads around the deletion. We also calculated the ratio of mapped reads in the deletion region relative to its flanking 1 kb region using the duphold tool and filtered out deletions if the ratio was > 0.7 33. We further merged the remaining SV deletions across all samples using SURVIVOR, with the recommended parameters. In the end, we generated a Variant Call Format (VCF) file of the high-confidence SV deletions for the study.
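The per-sample filtering thresholds described above can be sketched as a simple predicate. This is a minimal illustration only, not the authors' pipeline: the `DeletionCall` record, its field names and the example coordinates are hypothetical, and in practice these values would be read from the SURVIVOR, SVTyper and duphold outputs.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_CALLERS = 2            # consensus requires at least two SV callers
MIN_LENGTH_BP = 50         # minimum SV deletion length
MIN_MAPPED_READS = 7       # fewer than seven reads around the deletion: filtered
MAX_DEPTH_RATIO = 0.7      # ratio > 0.7 (deletion vs. flanking 1 kb): filtered


@dataclass
class DeletionCall:
    """Hypothetical per-sample deletion record; field names are assumptions."""
    chrom: str
    start: int
    end: int
    n_callers: int       # number of SV callers reporting the deletion
    mapped_reads: int    # informative reads around the deletion
    depth_ratio: float   # read depth inside the deletion / flanking 1 kb

    @property
    def length(self) -> int:
        return self.end - self.start


def passes_filters(call: DeletionCall) -> bool:
    """Return True if the call survives all four thresholds."""
    return (
        call.n_callers >= MIN_CALLERS
        and call.length >= MIN_LENGTH_BP
        and call.mapped_reads >= MIN_MAPPED_READS
        and call.depth_ratio <= MAX_DEPTH_RATIO
    )


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    calls = [
        DeletionCall("chr1", 10_000, 13_340, n_callers=3, mapped_reads=15, depth_ratio=0.48),
        DeletionCall("chr1", 20_000, 20_400, n_callers=2, mapped_reads=4, depth_ratio=0.52),
        DeletionCall("chr2", 5_000, 5_900, n_callers=2, mapped_reads=11, depth_ratio=0.91),
    ]
    for call in calls:
        print(call.chrom, call.start, call.end, passes_filters(call))
```

A ratio near 0.5 is what one would expect for a heterozygous deletion, which is why a cut-off of 0.7 removes calls whose depth is barely reduced.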

Characterization of identified SV deletions
To systematically annotate SV deletions in an intragenic region, we used the tool 'GeneOverlapAnnotator' from GenomeSTRiP (17) to detect the overlaps between deletions and annotated genes. Using the human transcriptome annotation GENCODE version 33 (GRCh38) (https://www.gencodegenes.org/human/release_33.html), we assigned each SV deletion to a gene body region, including the coding, exonic, and intronic regions. Deletions between 50 bp and 1 Mb in length were evaluated in 40 genes, including 11 established breast cancer susceptibility genes and 29 putative breast cancer susceptibility genes (Supplementary Data 3). The function of SV deletions in intronic regions was also assessed through the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway). The epigenetic landscape of histone markers H3K4Me1, H3K4Me3, and H3K27Ac, DNase I hypersensitive sites, and chromatin immunoprecipitation sequencing (ChIP-seq) binding sites of transcription factors in all available ENCODE cell lines 34 was examined through layered tracks from the UCSC Genome Browser. We estimated the association of each deletion with breast cancer risk using Fisher's exact test. The analysis was implemented using R (version 3.4.3).
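As an illustration of the association test, the sketch below computes a cross-product odds ratio and a two-sided Fisher's exact P value for a 2×2 carrier table, using the carrier counts reported in the Results (61 of 1,340 cases and 4 of 675 controls). It is a standard-library Python sketch rather than the R analysis used in the study, and the function names are illustrative. R's `fisher.test` reports a conditional maximum likelihood estimate of the odds ratio, so its estimate can differ slightly from the cross-product value computed here.

```python
from math import comb


def sample_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that x of the col1 carriers fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


if __name__ == "__main__":
    # Rows: cases, controls. Columns: carriers, non-carriers.
    cases_carriers, cases_total = 61, 1340
    controls_carriers, controls_total = 4, 675
    table = (
        cases_carriers, cases_total - cases_carriers,
        controls_carriers, controls_total - controls_carriers,
    )
    print("OR =", round(sample_odds_ratio(*table), 2))
    print("P  =", fisher_exact_two_sided(*table))
```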

Comparison of identified SV deletions with public SV databases
We compared the SV deletions identified in this study with SV deletions from large genomic databases, including the Genome Aggregation Database (gnomAD v2.1.1) 35. We downloaded an SV dataset generated using whole genome sequence data from 10,738 unrelated individuals in gnomAD v2.1.1 through http://gnomad.broadinstitute.org/. The intersection of SV deletions identified in the study and the above databases was analyzed using the 'bedtools intersect' function. An SV deletion in our study and one in gnomAD were defined as the same if their reciprocal overlap (RO) was more than 80%.
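The reciprocal-overlap criterion can be made concrete with a short sketch. Here RO is taken as the smaller of the two fractions "overlap length / length of deletion A" and "overlap length / length of deletion B", with half-open BED-style coordinates; the function names and example intervals are illustrative assumptions, not the authors' code. bedtools intersect provides the `-f` and `-r` options for a reciprocal filter of this kind, although the exact command used in the study is not stated.

```python
def reciprocal_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> float:
    """Reciprocal overlap of two half-open intervals on the same chromosome."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    # The smaller fraction decides: both deletions must be mostly covered.
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_deletion(a: tuple, b: tuple, threshold: float = 0.8) -> bool:
    """a and b are (chrom, start, end); match if RO exceeds the threshold."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    return reciprocal_overlap(start_a, end_a, start_b, end_b) > threshold


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    study = ("chr1", 1000, 2000)
    print(same_deletion(study, ("chr1", 1100, 2050)))  # overlap 900 bp, RO = 0.9
    print(same_deletion(study, ("chr1", 1500, 3000)))  # overlap 500 bp, RO = 1/3
```

Requiring the overlap to be reciprocal prevents a short deletion nested inside a much longer database entry from being counted as the same variant.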

Declarations
Data availability
Access to the whole genome sequencing data can be requested by submitting an inquiry to Dr. Wei Zheng.

Code availability
Access to the custom code can be requested by submitting an inquiry to Dr. Wei Zheng.

Figure 1 SV deletions in the BRCA1 and BRCA2 genes. a An SV deletion in the coding region of BRCA1 (transcript: ENST00000357654) in one patient. b An SV deletion in the intronic region of BRCA1 (transcript: ENST00000634433) in one patient. The epigenetic evidence of TF ChIP-seq clusters was observed in this region. c An SV deletion in the coding region of BRCA2 in one patient.