Patients
Plasma samples from 75 women with BIRADS 4c/5 mammography findings were collected just before tissue biopsy prior to cancer diagnosis and treatment.
Tumour biopsies were extracted using core needle biopsies which were fresh frozen. Immunohistochemical (IHC) analysis was performed to quantify expression of human epidermal growth factor receptor 2 (HER2), hormone receptors (HR) and Ki67. Estrogen receptor (ER) and progesterone receptor (PR) were considered positive in tumours presenting more than 1% nuclear-stained cells. HER2 staining was scored according to guidelines[9]. HER2 status was considered positive when graded as 3+, while 0 to 1 + were negative and 2 + was an inconclusive result and silver in situ hybridization was performed.
Blood sample processing
10 ml of plasma were obtained from each recruited individual in STRECK tubes (Streck, La Vista, NE). Within 2h after collection, plasma was isolated from whole blood by centrifugation for 10min at 3000 rpm at room temperature and stored at − 80 ◦C until circulating-free DNA (cfDNA) extraction.
DNA extraction and quantification from plasma and solid biopsies
cfDNA was extracted from plasma samples using the QIAamp Circulating Nucleic Acid Kit (Qiagen, Hilden, Germany) according to manufacturer’s instructions. Tumour DNA was isolated from fresh frozen tissue samples using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) following manufacturer’s instructions. cfDNA and DNA from solid tumours were quantified using droplet-digital PCR (Bio-Rad, Hercules, CA) and the RNAseP assay (Thermo Fisher Scientific, Waltham, MA, USA) as previously published[10].
Sequencing BC panel design
The genes to be included in the custom panel were selected as follows: i) Genes with mutations in BC in > = 1% of samples from a public database (https://www.cbioportal.org/), ii) Genes analysed and mutated in BC samples from a seminal study[11], iii) Genes with interest in BC biogenesis and iv) other interesting genes showing low mutation frequencies in BC databases but with important roles in other cancer types. Thus, the panel included the coding regions of following gene list: AKT1, ARID1A, ATM, BAP1, BRAF, BRCA1, BRCA2, CBFB, CDH1, CDKN1B, CTCF, ERBB2, ESR1, GATA3, HRAS, KDM6A, KRAS, MAP2K4, MAP3K1, MEN1, NCOR1, NF1, PBRM1, PIK3CA, PIK3R1, PTEN, RB1, RUNX1, SF3B1, SMAD4, TBX3, TP53, USP9X (Additional file 2: Table S1). The custom NGS panel for BC was designed using the SureDesign software (Agilent, Santa Clara, CA) with the next settings: 5x for tiling, least stringent for masking, XTHSBoosting for Boosting and a value of 30 for extension into repeats.
Sequencing library preparation
SureSelectXTHS (Agilent, Santa Clara, CA) methodology was employed to generate sequencing libraries. We constructed libraries using a median input plasma DNA of 39.78 ng (max 173.91 ng – min 5.01 ng) from BC patients and 21.78 ng (max 113.52 ng – min 1.71 ng) from healthy individuals and a median tissue DNA of 199.50 ng (max 200 ng – min 6.95 ng) from tumours. The DNA from tissue was fragmented using the SureSelect Enzymatic Fragmentation kit (Agilent, Santa Clara, CA) and the libraries prepared using the SureSelectXT Target Enrichment System kits (Agilent, Santa Clara, CA) following manufacturer’s indications. All PCR steps were carried out in the C1000 Touch Thermal Cycler (Bio-Rad, Hercules, CA).
Fragment ranges from libraries were assayed with the Bioanalyzer High-Sensitivity DNA chips (Agilent, Santa Clara, CA) and quantified using the KAPA Library Quantification Kit (Roche, Basel, Switzerland). For tumour tissue DNA sequencing, 8 pools containing 8 to 9 library samples per pool were prepared and sequenced. For BC plasma DNA, 8 pools containing 9 to 10 library samples per pool and 3 pools containing 7 to 8 library samples per pool from healthy controls plasma DNA were also prepared and sequenced. 19 lanes (1 lane per pool) were employed to sequence the libraries aiming to obtain ultra-deep sequencing of around 20,000X before de-duplication in the DNBseq-G400 platform (MGI, Hong Kong) at 100 pair-end reads following manufacturer´s instructions for UMIs sequencing.
Sequencing data processing
We created a custom pipeline for the processing of the SureSelectXTHS (Agilent, Santa Clara, CA) sequencing data. We initially performed quality control of the sequencing data using fastQC v0.11.9. Next, we trimmed reads for adapters and quality filtered using trim-galore v0.6.7. To perform the processing steps that involve barcoded data, we used a subset of fgbio tools v1.5.1. We mapped the data to the GRCh38 reference genome using bwa v0.7.17. We next used fgbio GroupReadsByUmi to collapse by barcode using the Identity option to take into account that SureselectXT-HS barcodes are degenerate. Next, we generated consensus reads using fgbio CallMolecularConsensusReads. The generated consensus reads were mapped again with bwa. We then filtered these aligned consensus reads using fgbio FilterConsensusReads requiring a minimum base quality of 30 and keeping consensus reads supported by at least a minimum number of reads. We then used fgbio ClipBam to remove forward and reverse reads overlapping regions.
Finally, we performed variant calling with Mutect2 (gatk v4.2.2.0–1) including a panel of non-cancer DNA and a germline variant annotation file for the GRCh38 genome, obtained from the gatk resource bundle, that we used to annotate variants for filtering and only considering the regions included in the SureSelect panel. We annotated the variants with ANNOVAR[12] v20200608 with custom made databases for COSMIC version 95 and TCGA, downloading the calling results generated with the MuTect2 variant caller from the GDC data portal[13] for the latter.
Variant filtration and analysis
For tumour, we used a more stringent approach in order to create a solid reference to compare with the ctDNA findings. We generated consensus reads requiring a minimum of 3 contributing reads per read family. We accepted as valid calls only variants with VAF > 0.05 that were also present in either COSMIC or TCGA, increasing the VAF threshold to VAF > 0.2 for Formalin-Fixed Paraffin-Embedded tissues.
In case of ctDNA, we identified mutations using two methods: i) Stringent; using the same approach as described above but filtering for a minimum of 1 read per read family, with no VAF threshold applied. To consider mutations not found in tumour as detected in plasma we required them to have a duplex configuration, with at least two fragments mapping to different coordinates and to be present in both COSMIC and TCGA BC. We applied the same processing approach to control samples. ii) Exploratory; visualizing the alignments in the IGV genome browser[14] in order to identify mutations previously found in the corresponding tumours but missed by variant callers. When we detected the presence of the variant not reported by the variant caller, we counted the number of reads carrying the mutation in a given sample and the number of reads not carrying the mutation. Then, we compared them against the same read proportions in controls and BC plasma samples without the corresponding mutation using a Fisher test (Additional file 2: Table S2).
Statistical analyses and data visualization
We performed statistical analyses and plotted data with R (https://www.R-project.org/). Fisher´s exact test or Chi-square test were applied when appropriate both for testing association between clinicopathological variables and plasma sequencing data as well as in sequencing data analyses. Wilcoxon test was also applied to test for differences in sequencing coverage between cases and controls (Additional file 1: Figure S1). The threshold for statistical significance was established at p < 0.05. Sensitivity, specificity and PPV values were calculated using the caret v6.0.93 package. The oncoplot function from the maftools[15] v2.12.0 package was used to plot mutations and clinicopathological data.