Construction of cfDNA fragmentation profiles
The cfDNA fragmentation profile consists of a series of ratios to capture the cfDNA fragment patterns across specific genomic regions or the entire genome. Within the profile, each data point represents the ratio of short to long cfDNA fragments within a given genomic region. Cristiano and his colleagues [29] established this profile using 5 Mb nonoverlapping genomic segments over the whole genome. In this work, we used a similar scheme to construct the cfDNA fragmentation profile for the whole-genome cfDNA fragmentation data. We categorized the short cfDNA fragments as those with lengths from 90 bp to 165 bp and the long cfDNA fragments as those ranging from 166 bp to 250 bp (Fig. 2).
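The short-to-long ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the input format (a list of `(chrom, start, end)` tuples), the function name `fragmentation_profile`, and the choice to assign each fragment to the bin containing its midpoint are all assumptions made for the example. Only the length ranges and the 5 Mb bin width come from the text.

```python
from collections import defaultdict

SHORT_RANGE = (90, 165)    # short fragments, bp (inclusive), as in the text
LONG_RANGE = (166, 250)    # long fragments, bp (inclusive), as in the text
BIN_WIDTH = 5_000_000      # 5 Mb nonoverlapping bins


def fragmentation_profile(fragments, bin_width=BIN_WIDTH):
    """Return {(chrom, bin_index): short/long ratio} for a list of fragments.

    `fragments` is an iterable of (chrom, start, end) tuples with 0-based,
    half-open coordinates (a hypothetical input format). Each fragment is
    assigned to the bin containing its midpoint, which is an assumption.
    """
    counts = defaultdict(lambda: [0, 0])  # [short, long] counts per bin
    for chrom, start, end in fragments:
        length = end - start
        key = (chrom, ((start + end) // 2) // bin_width)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
        # Fragments outside both ranges are ignored.
    # A bin without long fragments has an undefined ratio, reported as None.
    return {
        key: (short / long_ if long_ else None)
        for key, (short, long_) in counts.items()
    }
```

Bins whose ratio is undefined are returned as `None` here so that downstream code can decide how to treat them.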
For the targeted panel data, we began by concatenating all the targeted regions in the order of their genomic coordinates. We then divided this concatenated region into nonoverlapping bins, each 20 kb wide. Cristiano et al. suggested that a GC-content correction procedure may reduce the bias introduced during high-throughput sequencing [30–32]. However, because a smaller bin width may lead to greater variation in the number of fragments per bin, we skipped this step to avoid the many NA values that GC correction would introduce. For these targeted panel data, we observed no evident differences between the cfDNA fragmentation profiles with and without GC correction (Figure S1).
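The concatenate-then-bin step for the targeted panel can be illustrated as follows. This is a sketch under stated assumptions: the region format, the helper names `chrom_rank` and `bin_concatenated_targets`, and the rule that a region straddling a bin boundary is split between the two bins are hypothetical choices, since the text does not specify how such regions are handled.

```python
def chrom_rank(chrom):
    """Sort key for chromosome names such as 'chr1', 'chr22', 'chrX'."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return int(name) if name.isdigit() else {"X": 23, "Y": 24}.get(name, 25)


def bin_concatenated_targets(regions, bin_width=20_000):
    """Lay targeted regions end to end and cut the result into fixed bins.

    `regions` is a list of (chrom, start, end) tuples (0-based, half-open),
    assumed not to overlap. Returns (bin_index, chrom, start, end) pieces;
    a region that crosses a bin boundary is split into two or more pieces.
    """
    pieces = []
    offset = 0  # current position on the concatenated axis
    ordered = sorted(regions, key=lambda r: (chrom_rank(r[0]), r[1]))
    for chrom, start, end in ordered:
        pos = start
        while pos < end:
            bin_index = offset // bin_width
            # Space remaining in the current bin on the concatenated axis.
            room = (bin_index + 1) * bin_width - offset
            step = min(room, end - pos)
            pieces.append((bin_index, chrom, pos, pos + step))
            pos += step
            offset += step
    return pieces
```

Each returned piece records which bin a stretch of targeted sequence falls into, so fragments can be counted per bin in the same way as for the whole-genome data.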
Association analyses of cfDNA fragmentation profiles and cancer stages
We constructed cfDNA fragmentation profiles from 214 whole-genome cfDNA cancer samples across seven cancer types: bile duct, breast, colorectal, gastric, lung, ovarian, and pancreatic cancers. In Fig. 3A, we used circos plots [35] to visualize the fragmentation profiles of these cancer types according to cancer stages I, II, and III/IV. The first circos plot depicts the median fragmentation profiles pooled across all cancer samples. The subsequent plots show the median profiles for each cancer type, stratified by cancer stage. As controls, we included the median fragmentation profiles of healthy individuals in these plots [36]. As shown in the first circos plot, only subtle differences appear among the median fragmentation profiles of the three cancer stages when all cancer types are pooled. Within individual cancer types, however, the profiles of the cancer stages diverge to varying degrees. For instance, the fragmentation profiles of colorectal cancer are ordered from healthy through stage I and stage II to stage III/IV. In contrast, for lung cancer, the profiles of stages I and II cluster closely together, while the profile of stage III/IV is markedly distant from both.
Circulating tumor DNA (ctDNA) is more variable in length than cfDNA from healthy cells, and the consensus of studies indicates that ctDNA is, on average, shorter in the mononucleosome range. Hence, we hypothesized that the values of the cfDNA fragmentation profile, which are the ratios of short to long cfDNA fragments over genomic regions, would increase with disease progression. To test this hypothesis, we applied Jonckheere's trend test to the compiled cfDNA fragmentation profiles. Jonckheere's trend test is a nonparametric statistical test designed to determine whether a significant trend exists across ordered groups. Using the fragmentation profiles from healthy samples as a reference, we expected that, if the hypothesis holds, the fragmentation profiles of a given cancer type would follow a sequential order, from the early cancer stage to the more advanced stages. Jonckheere's trend test thus enables a quantitative assessment of the association between cfDNA fragmentation patterns and the progression of cancer.
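To make the trend test concrete, the following Python sketch computes the Jonckheere (Jonckheere-Terpstra) statistic for ordered groups and a one-sided p-value from the large-sample normal approximation. It is an illustration only: the function name is hypothetical, no tie correction is applied to the variance, and the text does not state how each profile was summarized before testing, so the input here is simply one numeric value per sample, grouped in the hypothesized order (for example healthy, stage I, stage II, stage III/IV).

```python
import math


def jonckheere_trend_test(groups):
    """Jonckheere-Terpstra test for an increasing trend across ordered groups.

    `groups` is a list of lists of numbers, ordered by the hypothesized trend.
    Returns (J, z, p), where p is the one-sided p-value for an increasing
    trend from the normal approximation (no tie correction).
    """
    # J counts, over all ordered pairs of groups (i < j), the pairs of
    # observations in which the later group has the larger value.
    j_stat = 0.0
    for i in range(len(groups) - 1):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        j_stat += 1.0
                    elif x == y:
                        j_stat += 0.5  # ties contribute one half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    # Mean and variance of J under the null hypothesis of no trend.
    mean = (n * n - sum(s * s for s in sizes)) / 4.0
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    z = (j_stat - mean) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return j_stat, z, p
```

A small p-value indicates that values tend to increase from the first group to the last; for small samples, an exact or permutation-based p-value would be preferable to the normal approximation.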
Figure 3B summarizes the results of Jonckheere's trend test across the cancer types, ranked by p-value in ascending order. Only the p-value of colorectal cancer (p-value = 0.0158) falls below the conventional statistical significance threshold of 0.05, suggesting a statistically significant association between cfDNA fragmentation profiles and the progression of cancer stages in colorectal cancer.
Validating the association of cfDNA fragment profiles and cancer stages in CRC
To confirm the association between cfDNA fragmentation profiles and the stages of colorectal cancer, we sequenced cfDNA samples from 29 patients with colorectal cancer using a comprehensive cancer gene panel (read depth 40–50×). This panel targets 151 clinically validated cancer-associated genes (Table S1), covering approximately 1.32 Mb, or approximately 0.044% of the human genome.
Figure 4A shows the cfDNA fragment length distributions of the WGS and targeted panel data. Although the targeted panel covers only a small fraction of the human genome, the overall fragment length distributions of the two datasets are congruent. In addition, both datasets display oscillations with a period of about 10 bp in the 100 bp to 150 bp range, a typical trait of cfDNA fragment length distributions, indicating that the targeted panel data reflect the essential characteristics of cfDNA fragmentation patterns.
To compare the cfDNA fragmentation profiles of the two datasets, we constructed profiles for both datasets at the chromosome level (chr1 to chr22). Figure 4B and Fig. 4C show the overall median profiles of these two datasets according to cancer stage. Owing to the different genomic coverage, the overall fragmentation profiles of the two datasets show little similarity (Fig. 4B); however, when compared by cancer stage, the profiles are stratified in the same order in both datasets. Jonckheere's trend test, applied to the fragmentation profiles of the targeted panel data at a bin width of 20 kb, yielded a p-value of 0.01 (Fig. 3B). Overall, we conclude that for colorectal cancer, the cfDNA fragmentation profiles are consistently ordered by cancer stage, and hence the cfDNA fragmentation profiles of CRC could be used as a biomarker to distinguish cancer stages.
Frag2Stage: using cfDNA fragmentation profiles to classify cancer stages of CRC
The preceding findings motivated the development of "frag2stage", a machine-learning model that uses cfDNA fragmentation profiles to classify CRC cancer stages. Frag2stage uses an L1-regularized linear regression model (LASSO) to predict cancer stages from cfDNA fragmentation profiles. We employed 5-fold cross-validation with 10 repeats to avoid overfitting. For imbalanced datasets, we adopted an up-sampling strategy and repeated this step 50 times to offset potential biases introduced by sampling. Our method was implemented in R (version 4.2.1). We used the caret package (version 6.0.94) to control cross-validation and model training, and the glmnet package (version 4.1.8) for LASSO.
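The resampling scheme described above (5-fold cross-validation with 10 repeats, plus up-sampling of the smaller class) can be sketched in plain Python. This is not the authors' R code, which relies on caret and glmnet; the function names are hypothetical, the model-fitting step is deliberately left out, and applying up-sampling only to the training indices of each split is an assumption, since the text does not say at which point up-sampling was applied.

```python
import random


def repeated_kfold_indices(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k roughly equal folds
        for f in range(k):
            test = folds[f]
            train = [idx for g, fold in enumerate(folds) if g != f for idx in fold]
            yield train, test


def upsample_to_largest_class(indices, labels, rng):
    """Resample smaller classes with replacement up to the largest class size.

    `indices` selects samples, `labels[idx]` gives each sample's class, and
    `rng` is a random.Random instance. Returns a balanced list of indices.
    """
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

In use, each training split would be balanced with `upsample_to_largest_class` before a LASSO model is fitted, and the held-out fold would be scored without resampling; repeating the up-sampling with different seeds corresponds to the 50 repetitions mentioned in the text.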
For the whole-genome cfDNA data, our analysis encompassed two classification tasks: differentiating stage I from stage IV samples and distinguishing stage I/II from stage III/IV samples. For the targeted panel data, we added a third task, distinguishing stage I/II from stage IV samples. The whole-genome data were analyzed at a bin width of 5 Mb (resulting in 518 bins), while the targeted panel data were analyzed at a bin width of 20 kb (56 bins). Table 3 summarizes the prediction performance with three metrics: AUC (area under the curve) of ROC (receiver operating characteristic), precision (\(\text{precision}=\frac{\text{true positives}}{\text{true positives} + \text{false positives}}\)) and F1-score (\(\text{F1}=2\times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\)). Figure 5A illustrates the classification outcomes using ROC curves. In detail, when applied to the whole-genome data, the frag2stage model differentiates stages I/II vs. III/IV and stages I vs. IV with AUC values of 0.73 and 0.77, respectively. For the targeted panel data, the resulting AUC values were 0.68, 0.79, and 0.99 for the classifications of stages I/II vs. III/IV, stages I/II vs. IV, and stages I vs. IV, respectively. For simplicity, we use only the ROC-AUC values in the downstream analyses.
Table 3
Performance of the frag2stage method on the whole-genome and targeted panel datasets
Classification | ROC-AUC | Precision | F1-score |
Whole-genome cfDNA data |
Stage I/II vs. III/IV | 0.73 (0.70–0.78) | 0.71 (0.68–0.76) | 0.69 (0.68–0.72) |
Stage I vs. IV | 0.77 (0.70–0.83) | 0.72 (0.69–0.76) | 0.72 (0.70–0.73) |
Targeted panel cfDNA data |
Stage I/II vs. III/IV | 0.68 (0.64–0.73) | 0.78 (0.75–0.81) | 0.66 (0.63–0.68) |
Stage I/II vs. IV | 0.79 (0.74–0.84) | 0.76 (0.72–0.80) | 0.71 (0.69–0.74) |
Stage I vs. IV | 0.99 (0.97–1.00) | 0.95 (0.93–0.96) | 0.88 (0.86–0.89) |
Values are given as mean (95% confidence interval).
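The three metrics reported in Table 3 can be computed from predictions as in the following Python sketch. The function names and the 0/1 label encoding are assumptions made for illustration. Recall, which appears in the F1 formula, is taken here as true positives divided by the sum of true positives and false negatives, and the ROC-AUC is computed through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking. Both classes must be present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Which stage group is treated as the positive class affects precision and F1 but not the ROC-AUC of a correctly oriented score; the text does not state this choice, so it is left to the caller here.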
Evaluation of the robustness of frag2stage
Frag2stage is based upon cfDNA fragmentation profiles, and therefore, we assessed the robustness of the model against two variables, the bin width of fragmentation profiles and the coverage of genomic regions that are used in the classification.
To understand how the performance of our model varies with genomic bin width, we applied frag2stage across a range of widths. As illustrated in Figure, for both the whole-genome and targeted panel datasets, the prediction performance changes only modestly, with AUC values varying by no more than 0.06.
Next, we tested how frag2stage performs when only a fraction of the data points in the cfDNA fragmentation profile is used. We randomly selected varying percentages of data points from the profiles and applied frag2stage to each selection. To reduce evaluation bias, this random sampling was repeated 50 times at each percentage (Fig. 5C). For the whole-genome data, the AUC values of both stage classifications decrease gradually, by no more than 0.05. In contrast, the AUC values of the targeted panel data remain consistent for all classifications, with only minor fluctuations.
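The subsampling experiment can be outlined as below. This is a sketch rather than the authors' procedure in detail: `subsample_auc` and its `evaluate` argument are hypothetical names, and `evaluate` stands in for training and scoring frag2stage on the selected profile data points and returning an AUC.

```python
import random
import statistics


def subsample_auc(bin_indices, evaluate, fraction, repeats=50, seed=0):
    """Mean and standard deviation of a score over random subsets of bins.

    `bin_indices` lists the data points (bins) of the fragmentation profile,
    `evaluate` is a caller-supplied function mapping a list of selected bin
    indices to an AUC (hypothetical), and `fraction` is the share of bins
    kept in each of the `repeats` random draws (without replacement).
    """
    rng = random.Random(seed)
    size = max(1, round(fraction * len(bin_indices)))
    aucs = [evaluate(rng.sample(bin_indices, size)) for _ in range(repeats)]
    return statistics.mean(aucs), statistics.stdev(aucs)
```

Calling this function over a grid of fractions gives one mean AUC per percentage, which is the kind of curve summarized in Fig. 5C.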
Overall, these analyses indicate that cfDNA fragmentation patterns are a reliable biomarker for distinguishing stages of colorectal cancer. Notably, even with small subsets of the genomic data, frag2stage can classify cancer stages with satisfactory performance.