4.1 Study design
This multicenter, case-control study aimed to create and assess a 5hmC-based classifier for detecting CRC, even at early stage. Blood samples were obtained from 2483 individuals from 12 commercial suppliers covering 56 individual sites, sourced from both biobanks (~ 40%) and prospective collection (~ 60%). Control samples were from people aged 45–85 years who were at average risk for CRC and had been assessed by colonoscopy with results that showed no presence of CRC or adenomatous polyps (adenomas). Cancer samples were from people aged 45–85 years who underwent colonoscopy and were diagnosed with CRC, lung, breast, bladder, prostate, ovarian, stomach cancer or adenomas. Advanced adenoma was defined as high-grade dysplasia or with ≥ 25% villous histologic features OR measuring ≥ 1 cm in the greatest dimension, in agreement with Imperiale et al.39
We recorded the following data for each blood sample donor: age, sex, ethnicity, smoking, diabetes status, previous medical history, and concomitant medications (including daily NSAID use) and type of diagnosis.
Extraction of cfDNA was attempted from double spun plasma donated by 2483 individuals (Fig. 1). cfDNA from 2198 individuals was passed forward to sequencing. 5hmC was quantified across the cfDNA genome for all individuals via comparison of region read depth in sequencing libraries enriched by highly sensitive capture of 5hmC, to a non-enriched library. The hydroxymethylome of 701 participants was interrogated by machine learning algorithms to produce classifiers distinguishing CRC from a range of other conditions. The classifiers were then tested on previously unseen 5-hydroxymethylome data from 691 other individuals (the validation set).
The sample size of the validation set was computed based on demonstrating improvement over the multitarget stool DNA test (sensitivity 92.3% and specificity 86.6% [95% CI: 85.9–87.2%]) and fecal immunochemical test (FIT) (sensitivity 73.8% and specificity 94.9% [95% CI: 94.4–95.3%])39. Achieving a desired 1-sided sensitivity for CRC, we initially chose a lower bound of 83.0% for the confidence interval and a point estimate of 92.3% for this study. Without accounting for gender, age and stages of disease differences, 95% power,
0.025 alpha and 5% dropout rate, a sample size of at least 330 CRC confirmed cases was deemed required for sensitivity characterization. To achieve approximately the same power, a similar sample size of at least 340 would be required to demonstrate the desired specificity. However, we chose to trade off model training error with validation error, reducing the validation sample size to 220 CRC and 164 controls. Consequently, the ability of 5hmC to detect CRC may be underestimated by this study.
The study was performed in accordance with the Declaration of Helsinki and was approved by the relevant independent ethics committee or institutional review board for each commercial supplier of blood samples. Written informed consent was obtained for all donors of samples.
4.2 Method details
4.2.1 Blood sample collection
Where possible, sampling was performed before colonoscopy. Blood (10 ml) was drawn into a K2 EDTA blood tube, placed on ice and processed within 4 hours. Samples were centrifuged (2000 g for 10 minutes at room temperature). The plasma layer was transferred to a clean tube and was again centrifuged (2000 g for 10 minutes at room temperature), to remove any remaining cellular material. Double-spun plasma was aliquoted into tubes in at least 1 or 2 ml volumes, then immediately frozen and stored at -80°C before shipment to the central laboratory for investigation.
4.2.2 Sample balancing
To avoid confounding the biological signal, the OSAT algorithm40 was utilized to achieve an even distribution of disease state and potential confounders across experimental plates for both cfDNA extraction and hydroxymethylome capture. The associations of these factors with batch were tested using a Chi-square test and the design was modified where necessary. The distribution of disease, sex, ethnicity, and age group did not show statistically significant variation (p < 0.05) over the 96-well plates that were subject to the automated library processing. This ensured that any plate related processing biases were distributed evenly across sample characteristics.
4.2.3 cfDNA extraction and library creation
cfDNA was extracted using the NextPrep-Mag™ kit on the Chemagic Prime platform (Perkin Elmer chemagen Technologie GmbH, Baesweiler, Germany) using 2 ml of plasma, in 48- well plates. Two plates were extracted simultaneously and combined in a single 96-well plate at the end of the extraction process. cfDNA concentration was pre-quantified by PicoGreen (Life Technologies) assay on a CLARIOstar plate reader. cfDNA that reached a threshold concentration by PicoGreen was further quantified and assessed for cfDNA purity by gel electrophoresis (Fragment Analyser, Agilent, Santa Clara, CA, USA). cfDNA samples with yield ≥5 ng were normalized and 5 ng was plated using the Chemagic Prime instrument into 96-well plates ready for library preparation. Five 166bp synthetic controls were included in every sample of the experiment to control the quality of hydroxymethylome capture. The positive controls contain 1, 3 or 6 of 5hmC residues, and negative controls contain 6 of 5mC and unmodified Cs, respectively.
cfDNA samples were end repaired, adenylated (Kapa Hyper Prep kit, Roche Sequencing and Life Science), ligated to unique dual index (UDI) adaptors (Illumina TruSeq DNA Unique
Dual (UD) Indexes, Illumina, San Diego, CA, USA) and purified using SpeedBeads™ magnetic carboxylate modified particles.
Part of each sample (1 µl) was used to create an “input library” by directly PCR amplifying (9 cycles) ligated cfDNA. The remaining 12 µl of each sample was used for hydroxymethylome capture.
4.2.4 Hydroxymethylome capture
After the adapter ligation, the cfDNA strands were denatured and copied using a primer complementary to the sequence in the Illumina adapter using DNA polymerase I Klenow fragment (3′→5′exo) (Enzymatics, QIAGEN). Consequently, all the DNA fragments in the library comprised duplexes where one strand represented the original native genomic DNA, complete with epigenetic marks, and the other strand was an unmarked complimentary copy. 5hmC residues in the original genomic strand were selectively labelled with an azide- modified UDP-Glucose by incubation with UDP-6-N3-Glu (Jena Bioscience, Jena, Germany) and T4-beta-glucosyltransferase (Thermo Fisher Scientific, MA, USA). In turn, the azide groups were biotinylated with DBCO-PEG4-Biotin (Click Chemistry Tools, AZ, USA).
Samples were then purified using the DNA Clean & Concentrator kit (Zymo Research, Irvine, CA, USA). The 5hmC biotin conjugates were selectively bound to streptavidin beads (Dynabeads M-270, Invitrogen, Carlsbad, CA, USA). Finally, the single strand copies of the hydroxymethylome library were liberated from the beads by 0.1M NaOH and were PCR amplified (16 cycles). (Fig. 2A). The input and hydroxymethylome libraries were purified using SpeedBeads™ magnetic carboxylate modified particles (Sigma-Aldrich).
Concentration was determined using PicoGreen, library size and concentration were also determined using Fragment Analyser data.
We prepared 3 nM of non-hydroxymethylome enriched libraries (“input”) and hydroxymethylome library pools, respectively. Libraries were sequenced on the
NovaSeq platform using 100bp paired-end mode, yielding approximately 60 million reads per sample.
4.2.6 Bioinformatic data processing and quality control
Demultiplexing and trimming was achieved using bcl2fastq Read (Illumina Basespace). Reads were aligned to the human genome (GRCh38) using BWA-MEM41, those with a BWA mapping quality score (MAPQ score) less than 1 were filtered. Sequence duplicates were removed using Picard MarkDuplicates. Libraries were scored for quality on 30 criteria. A cumulative quality score of 0 indicates perfect library quality and the absence of quality issues. Libraries scoring over 15 were discarded (along with their mate pair). Libraries scored 5 points if they had < 10M reads post deduplication, or a ratio of reads per kilobase of genebody, per million mapped reads (RPKM) across all for gene bodies divided by the RPKM in intergenic regions of < 1 (for the hydroxymethylome libraries only), or a lack of spike-in control amplification (< 1 for ratio of hydroxymethylation over methylation and cytosines), or a mitochondrial RPKM > 1000, or < 10% of reads mapping to peaks, or a median insert size > 200nt, or uniformity < 0.8 (for input libraries only). Libraries scored 3 points if they had > 1.5x the interquartile range for 26 quality metrics. Hydroxymethylome libraries scored 1 point if they deviated by > 2 standard deviations from ranges of gene body versus intergenic enrichment, duplication rate and coverage of the previously observed in other in- house studies, and to input libraries which deviated by > 2 standard deviations in sequence diversity score to the observed ranges from previous studies.
4.2.7 Feature Definition
To calculate 5hmC levels at gene enhancers, we first calculated read counts using Bam readcounts v0.01. RPKM were calculated over candidate gene-enhancers (adapted from42) downloaded from GeneCards v4.4. 5hmC enrichment was computed as the log2 ratio between the hydroxymethylome library RPKM and the input library RPKM after the inclusion of pseudocounts.
We produced a set of cfDNA fragment features inspired by the DELFI approach adapting the computational methodology, available via GitHub36. Briefly, we divided the genome into 100KB bins and quantified cfDNA fragment sizes per bin. We removed blacklisted regions, genomic gaps (UCSC table) and non-standard chromosomes a priori. We excluded outlier bins in fragment size, only retaining fragments between 100nt to 220nt length. Finally, we split the genome into 100KB bins (in total 26170 non-overlapping genomic regions) and calculated the following characteristics of fragment size distribution per genomic bin: number of short fragments (100-150nt), number of long fragments (151-220nt), ratio between short and long fragments and the total number of fragments. This approach generates 26170 features per metric and per sample. The last step is the averaging of the 100 KB bins into larger non-overlapping genomic regions of 5 MB (in total 512 bins).
The final set of fragmentomics features, referred to as Nucleosome Presence Score (NPS) features, consists of metrics related to nucleosome presence and is capturing information at a highly localized scale. Our approach is inspired by the windowed protection score method of Snyder et al16 but includes some key differences. Firstly, 40 samples (20 CRC and 20 HV) from the training cohort were reserved for developing the NPS approach. Subsequent models were never trained using these 40 samples. Based on 5hmC pulldown libraries in these 40 samples, a total of about 235,000 regions were identified by merging peaks produced by the MACS2 and EPIC2 peak callers. Using the bedtools coverage tool, average per-base coverages were computed for each region for each of the 40 samples. By sorting regions by the median coverage across all 40 samples, the 200 regions with the highest median coverage were chosen. In total, these regions covered about 4.5Mb of the genome. Next, the reads from the input libraries of the 40 samples were pooled, thus producing a single .bam file of depth 110.65X. This pooled sample was used for identifying nucleosome positions in the 200 regions defined above. Nucleosome calling was done by computing NPS profiles and using a simplistic peak calling approach, this approach assigning nucleosomes to NPS maxima in a 151bp sliding window. If multiple maxima existed within 76bp of each other, these were assumed to represent a single nucleosome position located at the midpoint between the maxima. NPS profiles were computed using fragment size data describing the start and end position of fragments. Fragment data were generated from deduplicated bam files with non- properly paired reads removed using Samtools (-f2 flag). Pairs of reads were collapsed into fragments using the bedtools bamtobed command and fragments of length more than 1000bp were removed, these were assumed to be errors. To compute the NPS profile in a given region, a sliding window approach was employed with a window size of 121bp and with NPS values defined for the midpoint of the window. With a view to capturing single nucleosome configurations, fragments less than 120bp or larger than 250bp were discarded. For each window, the NPS was defined as the ratio of the number of fragments spanning the window (n_span) to the number of fragments with at least one endpoint inside the window (n_within). As a result, the metric takes positive values and is independent of read depth. To limit cases of divergence, a pooling approach was applied wherein the NPS at position i was defined using information from +/- 5 neighboring positions:
In events where, the NPS was set as NA and subsequently imputed. Imputation was achieved using a simple “fill in the gaps” strategy where missing values were assigned such as to linearly bridge the nearest non-NA values. Samples where more than 90% of the NPS profile of a given region was NA were categorized as “undefined” for this specific region. Such incidences were addressed by the feature-level imputation strategy. Finally, with a view to constructing a clear nucleosome signal, NPS profiles were smoothed using a Savitzky-Golay filter of degree 2 with a window size of 151bp.
Feature matrices were constructed for all samples using the nucleosomes identified from the 40 left out samples. Features were defined as the minimum NPS value in a +/-50bp neighborhood around the midpoint between two nucleosome positions, provided the nucleosomes were no more than 300bp apart. The midpoint between nucleosomes were found to be marginally more informative than the actual nucleosome positions. Samples where more than 90% of the NPS profile of a given region was NA were categorized as “undefined” for this specific region. Such incidences were addressed by the feature-level imputation strategy. No feature had more than 2% missingness.
4.2.8 Classifier development and internal cross-validation
We trained a Support Vector Machine (SVM) models using a linear kernel function on feature scaled (z-score normalization) 5hmC levels of enhancers quantile-normalized over samples (see Supplementary Fig. 4). An ensemble of 50 models were trained. Each model in the ensemble was trained on a randomly selected 80% of the samples in the training set and the trained models were used to predict 20% of the remaining samples in an internal cross- validation procedure within the training set. We identified features significantly correlated with technical covariates such as age, sex and vendor using linear regression and ANOVA F-
test for continuous and categorical variables, respectively. These features were then excluded. Model hyperparameters (the C parameter) were optimized for the highest AUC within a 10- fold cross validation strategy. Performance was averaged across the 50 individual learners, and the unique features selected by all 50 were retained (Supplementary Fig. 4). 95% confidence intervals of various performance metrics (e.g., AUC, sensitivity) for each of the classifier ensembles were computed using 2000 bootstrap replications of predicted samples. Classifier development and cross-validation within the training set was performed by the model development team, who logged trained and timestamped classifiers in a registry, along with auxiliary information on the training procedures for each model. A MD5 checksum was then computed for each classifier, functioning as a unique identifier.
4.2.9 Assessing classifier performance in validation set
The cross-validated classifier was then assessed in the “held-out” validation set of 691 samples, locked prior to model development process.
To facilitate a blinded model validation strategy a separate team performed the model validation. Version history on the classifier registry verified that hash keys had not been modified since initial logging. The validation team generated feature matrices and metadata files for the validation set and subsequently applied the models on the validation data. The validation team operated on virtual machines and storage belonging to a separate cloud project which was inaccessible to the model development team. During the model validation process, the hash key from each applied model was compared to the logged hash key to ensure model integrity. Here again, prediction probabilities from every learner within the classifier ensemble were averaged to compute the final prediction probability for each sample. Final performance metrics such as the AUC, sensitivity and specificity were computed based on the averaged prediction probabilities. The performance results were automatically uploaded to a cloud database without any intervention from the model development team.
4.2.10 Assessing Tumor Fraction with ichorCNA
We ran the ichorCNA workflow on input libraries sequenced for the internal validation sample set.
This involved first running readCounter from the hmmcopy version 0.1.1 with the following command with the parameters window size set to 1000000 and quality set to 20. We then further ran ichorCNA, using the recommended settings for low tumor fraction samples as per below:
4.2.11 Functional analysis of 5hmC enhancer regions driving classifier performance
-centromere GRCh38.GCA_000001405.2_centromere_acen.txt \
estimateNormal True --estimatePloidy True --estimateScPrevalence False \
scStates 'c()' --txnE 0.9999 --txnStrength 10000 --normal 'c(0.95, 0.99, 0.995, 0.999)' \
ploidy 'c(2)' --maxCN 3 --normalPanel HD_ULP_PoN_hg38_1Mb_median_normAutosome_median.rds
chrs 'c(1:22)' --chrTrain 'c(1:22)'
To develop a better mechanistic understanding of the classifier we ran pathway analysis using the Key Pathway Analysis (KPA) tool on top discriminatory enhancers. Enhancers were ordered based on their average contribution (averaged across the models) to the classifier and from the top 500 enhancers that appeared in at least 5 individual models were selected for the pathway analysis. The top scoring gene target for each 500 enhancers was taken from the ‘connected_gene’ field in the GeneHancer database42and used as the input for pathway analysis.