3.1. Reshaping the mutational spectrum of CH in peripheral blood by error-corrected sequencing
As a step towards developing a non-invasive method for early-stage CRN screening, we characterized mutations, including CH-related variants, in peripheral blood mononuclear cells (PBMCs) using the ECS approach (Supplementary Fig. S1). Initially, 63 early-stage CRN patients and 32 risk-matched healthy individuals were enrolled for feature discovery (Table 1). Before further analysis of the variants, we applied a quality-control step to ensure that the unique depth for each individual was greater than 3,000X (Supplementary Fig. S2). Using the ECS method, 1,446 variants were identified in the discovery cohort (Fig. 1). Of the 1,171 variants identified in the CRN patient group, 753 (64.3%) were unique to that group and not shared with the risk-matched controls. Of the 693 variants identified in the risk-matched control group, 275 (39.7%) were unique to that group and not shared with the CRN patient group. The average number of variants per individual was 128.2 (92-167) in CRN patients and 127.2 (101-166) in risk-matched controls (Supplementary Fig. S3). Under the traditional definition of clonal hematopoiesis of indeterminate potential (CHIP), that is, somatic mutations with a VAF greater than 0.02, CHIP was found in only 7 of 63 CRN patients (11.11%) and 3 of 32 risk-matched controls (9.38%) (Supplementary Table 1).
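The unique and shared counts above follow from simple set arithmetic. The sketch below is illustrative only: the variant keys are made-up placeholders, and `partition_variants` is a hypothetical helper, not part of the study's pipeline. The final lines check that the reported totals are mutually consistent.

```python
def partition_variants(case_variants, control_variants):
    """Split two variant collections into case-unique, control-unique, and shared sets."""
    case_set, control_set = set(case_variants), set(control_variants)
    return {
        "case_unique": case_set - control_set,
        "control_unique": control_set - case_set,
        "shared": case_set & control_set,
    }

# Placeholder variant keys (chromosome, position, ref, alt); not real study data.
crn_calls = {("chrA", 100, "C", "T"), ("chrB", 200, "G", "A"), ("chrC", 300, "C", "T")}
control_calls = {("chrB", 200, "G", "A"), ("chrD", 400, "A", "G")}
parts = partition_variants(crn_calls, control_calls)

# Consistency check using the counts reported in the text.
crn_total, crn_unique = 1171, 753
control_total, control_unique = 693, 275
shared_reported = crn_total - crn_unique  # variants seen in both groups
assert shared_reported == control_total - control_unique
total_reported = crn_unique + control_unique + shared_reported  # equals 1,446
```

Both groups imply the same number of shared variants (418), and the three partitions sum to the 1,446 variants reported for the cohort.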
Not surprisingly, rare hematopoietic clones (defined as variants with a VAF lower than 0.02, also referred to as "low-allele-fraction variants" in this study) were detected in all samples (Fig. 2). The average proportion of rare hematopoietic clones was 39.05% (12.63%-53.33%) in CRN patients and 37.07% (8.6%-54.3%) in risk-matched controls. Notably, the mean proportion of variants with a VAF lower than 0.001 was 15.88% (3.16%-27.5%) in CRN patients and 16.1% (3.31%-26.53%) in risk-matched controls. The number of variants did not differ significantly between CRN patients and risk-matched controls.
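As a minimal sketch of how the per-sample proportions above could be tallied, the snippet below bins a sample's variants by VAF using the two thresholds named in the text (0.02 and 0.001). It assumes the denominator is every variant called in that sample, which the text does not state explicitly; the function name and the example VAFs are hypothetical.

```python
RARE_CLONE_VAF = 0.02   # rare hematopoietic clones: VAF lower than 0.02 (from the text)
ULTRA_LOW_VAF = 0.001   # second threshold reported in the text

def vaf_profile(vafs):
    """Return the fraction of a sample's variants below each VAF threshold.

    Assumption: the denominator is all variants called in the sample.
    """
    n = len(vafs)
    if n == 0:
        raise ValueError("sample has no variants")
    rare = sum(1 for v in vafs if v < RARE_CLONE_VAF)
    ultra_low = sum(1 for v in vafs if v < ULTRA_LOW_VAF)
    return {
        "n_variants": n,
        "rare_fraction": rare / n,
        "ultra_low_fraction": ultra_low / n,
    }

# Made-up VAFs for one sample (not study data).
example_vafs = [0.48, 0.0005, 0.012, 0.51, 0.0008, 0.03, 0.99, 0.015]
profile = vaf_profile(example_vafs)
```

In this toy sample, four of eight variants fall below 0.02 and two fall below 0.001. Averaging `rare_fraction` over the samples in a group would give a group-level mean of the kind reported above.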
3.2. Machine learning constructs the CH-based CRN detection model for early-stage CRN identification
In this case-control study, the variants were scattered, and such statistical extremes were difficult to handle with traditional analysis methods. Therefore, we used machine learning (ML) methods to discriminate early-stage CRN patients from risk-matched controls. First, the 1,446 variants from the discovery cohort were pre-filtered as follows: (1) shared variants present in both CRN patients and risk-matched controls were removed if their Wilcoxon rank-sum test p-value was >0.1; (2) unique variants (present only in CRN patients or only in controls) were removed if they occurred in fewer than two cases in the discovery cohort; and (3) filtered unique variants with reported pathogenic significance were rescued. After filtering, the 108 remaining variants were used as variables for model training (Fig. 3A). Next, we trained the model with the ML package within a leave-one-out cross-validation framework. The trained model achieved an AUC of 0.988, a sensitivity of 94.2%, and a specificity of 99.3% (Fig. 3B). It is worth noting that, in this cohort, the sensitivity of FIT was 48.9% in the AA group and 72.2% in the stage I cancer group.
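The three pre-processing filters can be read as a single keep-or-drop rule per variant. The sketch below is one possible reading, not the study's actual code: the `Variant` record, its field names, and the example entries are hypothetical; the rank-sum p-value is assumed to be precomputed elsewhere; and "fewer than two cases" is interpreted as fewer than two carriers in the discovery cohort.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variant:
    name: str
    n_cases: int                      # CRN patients carrying the variant
    n_controls: int                   # risk-matched controls carrying the variant
    p_value: Optional[float] = None   # precomputed rank-sum p-value (shared variants only)
    pathogenic: bool = False          # reported pathogenic significance

def keep_variant(v, p_cutoff=0.1, min_carriers=2):
    """Apply filters (1)-(3) from the text to one variant."""
    shared = v.n_cases > 0 and v.n_controls > 0
    if shared:
        # Filter 1: shared variants survive only if p-value <= cutoff.
        return v.p_value is not None and v.p_value <= p_cutoff
    # Filter 2: unique variants need at least `min_carriers` carriers.
    if v.n_cases + v.n_controls >= min_carriers:
        return True
    # Filter 3: rescue unique variants with reported pathogenicity.
    return v.pathogenic

# Hypothetical example records (not study data).
candidates = [
    Variant("shared_low_p", 10, 3, p_value=0.04),
    Variant("shared_high_p", 8, 7, p_value=0.60),
    Variant("unique_recurrent", 3, 0),
    Variant("unique_singleton", 1, 0),
    Variant("unique_singleton_pathogenic", 0, 1, pathogenic=True),
]
selected = [v.name for v in candidates if keep_variant(v)]
```

Under this reading, a singleton variant enters the feature set only through the pathogenicity rescue, which keeps the model's inputs to variants that are either recurrent, group-discriminating, or clinically annotated.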
A separate validation cohort of 20 early-stage CRN patients and 10 risk-matched controls was used to validate the trained model. In the validation cohort, the accuracy was 0.933 (95% CI, 0.779-0.982; p = 0.00065), the sensitivity was 95.0%, and the specificity was 90.0% (Table 2).
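The validation metrics can be reproduced from a 2x2 confusion table. The counts below (19 of 20 patients and 9 of 10 controls classified correctly) are inferred from the reported percentages rather than taken from Table 2. Treating the p-value as a one-sided binomial test of accuracy against the no-information rate (the majority-class fraction, 20/30) is likewise an assumption about how it was obtained.

```python
from math import comb

def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-table counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Counts inferred from the reported 95.0% sensitivity and 90.0% specificity.
tp, fn = 19, 1   # CRN patients: correctly / incorrectly classified
tn, fp = 9, 1    # controls: correctly / incorrectly classified
n = tp + fn + tn + fp

metrics = confusion_metrics(tp, fn, tn, fp)

# Assumed null hypothesis: accuracy equals the no-information rate.
no_information_rate = max(tp + fn, tn + fp) / n
p_value = binom_upper_tail(tp + tn, n, no_information_rate)
```

With these inferred counts the accuracy is 28/30, and the binomial tail probability is approximately 0.00065, in line with the values reported above.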
3.3. Mutational signature analysis reveals the influence of genetic architecture on DNA damage
To understand the mechanism and influence of the identified variants, we performed mutational signature analysis using COSMIC mutational signatures v2, which reflect the activity of 30 specific mutational processes. First, we compared the signature contributions derived from the 753 variants unique to the early-stage CRN group with those derived from the 693 variants observed in the risk-matched control group (Fig. 4A). As shown in Fig. 4C, signatures 1, 3, 12, 13, 20, 22, and 28 contributed to both the CRN-unique group and the risk-matched control group. In contrast, signatures 15, 18, and 21 contributed only to the CRN-unique group, with relative contribution scores of 0.024, 0.007, and 0.013, respectively. Next, we analyzed the signature contributions of the 86 CRN-unique variants among the 108 variants used in the ML approach (Fig. 4B). Signatures 3, 10, 12, 15, 20, 21, 22, and 30 contributed to these 86 variants, and signatures 10, 15, 21, and 30 contributed to these 86 variants but not to the risk-matched control group. The cosine similarity between the 86 CRN-unique ML model variants and the 753 CRN-unique variants was 0.855, whereas the cosine similarity between the 86 CRN-unique ML model variants and the 693 variants in the control group was 0.558.
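Cosine similarity compares two non-negative profiles by the angle between them, so it is insensitive to the total number of variants in each set. The sketch below shows the calculation on toy vectors; which vectors were compared in the study (for example, mutation-context profiles or signature contribution vectors) is not stated here, so the inputs are purely illustrative.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy profiles (not study data).
partly_overlapping = cosine_similarity([1, 0, 1], [1, 1, 0])
# Rescaling a profile does not change the similarity.
same_shape = cosine_similarity([0.2, 0.3, 0.5], [0.4, 0.6, 1.0])
```

Values close to 1 indicate profiles with nearly the same shape, which is how the higher similarity to the CRN-unique set than to the control set is read above.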
Regarding the characteristics of these four signatures, signature 10 has been found in six cancer types, including CRC. The proposed etiology of signature 10 is altered activity of the error-prone polymerase POLE, and samples exhibiting signature 10 have been termed "ultra-hypermutators". Signature 15, which has been found in gastric cancer and small cell lung cancer, is associated with defective DNA mismatch repair (MMR) and with high numbers of small indels (shorter than 3 bp) at mono/polynucleotide repeats. The etiology of signature 21 remains unknown, but it has been found only in samples that also have signature 15 and signature 20. Signature 21 is therefore probably related to tumors with microsatellite instability (MSI). The etiology of signature 30 remains unknown. Furthermore, we found that signature 1 did not contribute to the group of 86 variants, which may indicate that age-related variants are common to both groups [28] and may explain why the ML method did not select variants related to signature 1.