Identification of differentially methylated markers
The workflow showing the step-by-step procedure for this analysis and the demographic characteristics of participants are presented in Figure 1 and Table 1 respectively. We analysed the microarray methylation profile of 166 (87 males and 79 females) CRC and 424 (84 males and 340 females) healthy normal subjects. The average age of CRC subjects was 55 years old whereas, the normal subjects had a mean age of 53. CRC and healthy normal subjects were statistically significantly different with respect to gender, but did not differ with respect to age. We adjusted for age and gender in our models. The average time-to-diagnosis for cases was 6.2 years (range = 0-14.3). The Illumina Human Methylation 450 Beadchip contained DNA methylation status of 485,512 CpG sites. Pre-processing and quality control were performed and the poor performing probes were filtered out. A total of 399,934 CpG sites (Additional file 1: Figure S1) were yielded, and their methylation data were used for further analysis. A total of 49,299 CpGs (corresponding with 11, 786 unique genes) were differentially methylated (FDR < 0.05) between the CRC and healthy normal subjects.
Gene Ontology (GO) terms and KEGG pathway enrichment analysis for genes associated with the 49,299 differentially methylated CpGs were performed. The GO analysis showed the molecular functions, cellular components and biological functions of differentially methylated genes under the criterion FDR < 0.05 (Additional file 2: Table S1). In the KEGG pathway genes showed enrichments in the metabolic pathway (FDR = 1.19e-03), cancer- pathways (FDR = 6.58e-03), human papillomavirus infection (FDR = 1.61e-02), Rap1 signaling pathway (FDR = 4.36e-04) and Axon guidance (FDR = 2.12e-03) (Additional file 3: Table S2).
Of the 49,299 CpGs differentially methylated, 48 CpGs (corresponding with 29 unique genes) which had absolute mean β-value difference (|∆β| ≥ 0.05) were selected and denoted DMPs (Additional file 4: Table S3). Among the DMPs, a total of 15 CpGs (corresponding with 8 unique genes) were hypermethylated and 33 CpGs (corresponding with 21 unique genes) were hypomethylated. Hierarchical clustering was implemented to determine whether the identified DMPs could distinguish CRC from healthy normal subjects. The results showed significant difference in methylation between CRC and healthy normal subjects (Figure 2).
Methylation risk score construction
The entire sample of 590 was randomly split into training (117 CRC subjects and 297 healthy normal subjects) and testing (49 CRC subjects and 127 healthy normal subjects) sets (Table 1). Differentially methylated markers associated with CRC risk were screened on the training dataset using LASSO selection and stepwise logistic regression analysis. The sixteen markers mapped to nine genes including LGR6, PTPN12, PPFIA3, LOC399959, PCDHGA1, RNF39, ESYT3, MRGPRG and ATHL1 overlapping between the two methods were selected (Additional file 5: Figure S2). The associations of the sixteen individual markers with CRC by univariate and multivariate logistic regression analysis are presented in Additional file 6: Table S4 and Table 2 respectively.
Furthermore, using the sixteen-CpG panel we calculated a methylation risk score (MRS) for each subject on the training dataset using the formula:
MRS = (-0.4100*cg06551493) + (0.4332*cg01419670) + (0.2895*cg16530981) + (-0.5172*cg18022036) + (-0.3915*cg12691488) + (-0.3246*cg17292758) + (-0.2886*cg16170495) + (0.2451*cg11240062) + (-0.5651*cg21585512) + (0.3615*cg24702253) + (-0.2445*cg17187762) + (-0.3951*cg05983326) + (-0.5089*cg06825163) + (-0.2504*cg11885357) + (-0.2357*cg08829299) + (-0.3607*cg07044115). The methylation levels of 4 CpG (cg01419670, cg16530981, cg11240062, cg24702253) sites were hypermethylated, and 12 CpG (cg06551493, cg18022036, cg12691488, cg17292758, cg16170495, cg21585512, cg17187762, cg05983326, cg06825163, cg11885357, cg08829299, cg07044115) sites were hypomethylated.
The MRS (range, -5.59 to 4.35) was significantly higher for CRC subjects than in healthy normal subjects (P < 0.000), with a median MRS of 1.68 (IQR, 1.43) in CRC subjects and -0.430 (IQR, 2.89) in healthy normal subjects (Additional file 7: Figure S3a) in the training dataset. The MRS was associated with a 2.68-fold increased risk of CRC (OR = 2.68, 95% CI: 2.13, 3.38, P < 0.0001) Table 2. The MRS showed good predictive ability for discriminating between CRC and healthy normal subjects (AUC, 0.85; 95% CI: 0.81, 0.89) Figure 3a.
Validation of the sixteen-CpG panel MRS for CRC prediction in the testing dataset.
In order to validate the predictive performance of the sixteen-CpG panel MRS for the prediction of CRC risk, the predictive model was applied to the testing dataset. The MRS (range, -5.73 to 3.89) was also significantly higher for CRC subjects than in healthy normal subjects (P < 0.0001), with median MRS of 1.83 (IQR, 1.80) in CRC subjects and -0.45 (IQR, 2.64) in healthy normal subjects (Additional file 7: Figure S3b). Consistent with the training dataset, the MRS was associated with a 2.02-fold increased risk of CRC (OR = 2.02, 95% CI: 1.48, 2.74, P < 0.0001) Table 2. Similar to the training dataset, the MRS showed good predictive ability for discriminating between CRC and healthy normal subjects (AUC, 0.82; 95% C: 0.76, 0.88) Figure 3b.
Subgroup analysis for the association between MRS and CRC risk
When the study subjects were stratified according gender, age and time-to-diagnosis, the MRS still demonstrated an increased risk of CRC among both male and female subjects, younger (< 60 years) and the older (≥ 60 years) subjects as well as short and long time to diagnosis in the training and testing datasets (Table 3). Also, the case-only analysis demonstrated no correlation between methylation levels time-to-diagnosis (Additional file 8: Table S5).
Independent validation of the sixteen-CpG panel MRS for CRC prediction in TCGA dataset
We used TCGA dataset of 391 CRC and 45 controls for independent validation of our sixteen-CpG panel MRS. Only thirteen CpGs of the panel were differentially methylated in the TCGA dataset. The beta values of the thirteen CpGs were extracted and a univariate logistic regression models were constructed (Additional file 9: Table S6). We identified nine CpGs (cg06551493, cg12691488, cg17292758, cg16170495, cg21585512, cg24702253, cg17187762, cg05983326, cg11885357) that were associated with CRC and the MRS for each sample was calculated. The MRS (range, -4.05 to 2.92) was significantly higher for CRC subjects than in controls subjects (P < 0.0001), with a median MRS of 0.16 (IQR, 1.59) in CRC subjects and -0.712 (IQR, 0.95) in controls (Additional file 10: Figure S4). The MRS was associated 1.96-fold increased risk in CRC (OR = 2.06, 95% CI: 1.55, 2.78, P < 1.08e-06) (Additional file 9: Table S6). The MRS showed good predictive ability for discriminating between CRC and control subjects (AUC, 0.73; 95% CI: 0.66- 0.79) Additional file 11: Figure S5