Data preparation
Thirteen methylation datasets, which were GSE77954, GSE101764, GSE131013, GSE199057, GSE164811, GSE193535, GSE129364, GSE139404, GSE107352, GSE75546, GSE77965, GSE68060 and EMTAB6450 from public databases were selected based on three criteria: 1) were generated by Illumina HumanMethylation 450k BeadChip and had the raw IDAT files, 2) the sample size was greater than 10, and 3) consisted of CRC or adenoma or normal adjacent tissue (NAT). All the IDAT files were then processed using the minfi tool [16] to obtain methylation β values. They were integrated as a single dataset (n= 1165), which we defined as Phase I discovery set for candidate marker identification.
Level 3 methylation data for 31 cancer types were retrieved from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). Data for 8258 primary tumor tissues corresponding to 31 cancer types and their 710 NATs were retained. The 31 cancer types included ACC (n=79), BLCA (n=412), BRCA (n=778), CESC (n=306), CHOL (n=36), CRC (n=379), DLBC (n=48), ESCA (n=183), GBM (n=137), HNSC (n=523), KICH (n=65), KIRC (n=312), KIRP (n=271), LGG (n=513), LIHC (n=374), LUAD (n=456), LUSC (n=364), MESO (n=87), OV (n=10), PAAD (n=183), PCPG (n=178), PRAD (n=495), SARC (n=257), SKCM (n=104), STAD (n=393), TGCT (n=133), THCA (n=503), THYM (n=124), UCEC (n=418), UCS (n=57), UVM (n=80). The TCGA dataset was defined as Phase II discovery dataset. DMCs were deemed eligible if they exhibited methylation levels <0.2 on other cancer types (non-CRC), >=0.55 on CRC, and <0.15 on 710 NATs.
The other three datasets, GSE48684 [17], GSE40279 [18], and GSE122126 [19], generated by the same platform of Illumina HumanMethylation BeadChip in Gene Expression Omnibus (GEO) were used as validation sets to verify the methylation status of candidate differentially methylated CpGs (DMCs). The GSE48684 cohort consisted of 105 qualified samples, of which 41 were NAT and 64 were CRC. The GSE40279 cohort comprised whole blood cell (WBC) samples collected from 656 healthy individuals. The GSE122126 cohort included three CRC and 12 normal plasma samples.
Sample collection
This case-control study enrolled 772 cases, including 60 tissue samples (30 CRC and 30 NATs) and 712 plasma samples from the First Affiliated Hospital of Zhengzhou University between April 2022 and June 2022. Tissue samples were obtained from the preserved formalin-fixed paraffin-embedded (FFPE) sections collected from surgical patients. The plasma cohort was randomly divided into a training set and a validation set in a 2:1 ratio. The training set comprised 474 participants, including 115 healthy blood donors, 123 non-digestive disease patients (NDD), 65 intestinal disease patients (ID), 14 polyps patients, 10 non-advanced adenoma patients (Non-AA), 16 AA, 125 CRC patients, and 6 patients with other cancers. In the validation set, there were 235 participants, including 57 healthy blood donors, 60 NDD, 30 intestinal ID, 6 polyps patients, 5 Non-AA, 9 AA, 66 CRC patients, and 2 patients with other cancers (Supplemental Table 1). This study was approved by the ethics committee. Each participant signed an informed consent form. Patients with polyps, adenomas, CRC and other cancers were further confirmed by histopathological examinations. The included CRC patients were required to meet the following criteria: 1) did not receive any radiotherapy, chemotherapy or surgery, 2) ages older than 18. All CRC patients were classified as I, II, III, and IV stages according to the American Joint Cancer Committee (AJCC) staging system.
Differential methylation analysis
We performed differential methylation analysis using the rank-sum test on three groups in the discovery set: cancer vs. normal (NAT), adenoma vs. normal, and cancer vs. adenoma. DMCs were defined as significant if they meet two criteria: an FDR < 0.05 and a fold change ranking in the top 1% of all probes. The DMCs were then classified as hyper- or hypo- DMCs based on whether they exhibited high or low methylation levels in the tumor or adenoma samples compared to the normal samples. LASSO regression was implemented in the R package 'glmnet' to reduce the number of features. LASSO regression was repeated 100 times, and the frequencies of probes with non-zero coefficients in the regressions were counted.
Tissue and plasma DNA extraction
Genomic DNA from tissue samples and plasma cfDNA was isolated using the FFPE DNA rapid extraction kit (TianGen, Beijing) and the plasma/serum cfDNA extraction kit (Ammunition Life-tech, Wuhan), respectively, according to the protocols. Briefly, cfDNA extraction was divided into two steps. First, approximately 10 ml of blood was drawn from participants at room temperature and then centrifuged twice with 1300g and 14000g for 10 min at 4°C within 2 hours. Then, about 2 ml of plasma was retained for cfDNA purification. Purified cfDNA was eluted in 45 ul TE buffer and then treated with bisulfite using a DNA bisulfite modification kit (Ammunition Life-tech, Wuhan) in accordance with the instructions described in the aforementioned study [20], or stored at -80 °C until required.
Sanger sequencing
Sanger sequencing (ABI 3730XL, GENEWIZ, Suzhou) was performed for the target regions of NTMT1 and MAP3K14-AS1 after bisulfite treatment to confirm their methylation status on 30 CRC tissue samples and 30 NATs. The sequencing primers are shown in Supplemental Table 2. Following bisulfite treatment, cytosine in methylated CpG sites remained cytosine, while unmethylated cytosine was converted to thymine. Sanger sequencing results thus allowed for the direct determination of the methylation status of target regions.
Developing the SADMP and methylation-specific PCR (MSP) technique on plasma
The location and sequences of MSP primers/ probes are displayed in Supplemental Table 3. For NTMT1, two pairs of methylation primers targeted the sense and antisense strands, and their corresponding MGB probes were designed. For MAP3K14-AS1, the MGB-probe 1 and MGB-probe 2 were designed according to the antisense strand sequence and the reverse complementary sequence of the antisense strand.
Gradient dilutions of synthetic plasmids or plasmid mixtures are employed as templates for the determination of test parameters. Specifically, standard curve experiments were carried out to assess the amplification efficiency of each pair of primers. The experiments consisted of five ten-fold dilutions of fully methylated plasmid DNA templates with five replicates at each dilution (1, 10, 100, 1000, and 10000 copies per run). Primer tolerance experiments were designed to assess the specific amplification ability of methylated primers against methylated templates. Eight concentration dilutions of fully methylated DNA templates (0, 1, 5, 10, 50, 100, 200, and 400 copies/μL) mixed with a high concentration of unmethylated plasmid (107 copies/μL) were prepared to evaluate the ability of methylated primers selectively amplifying methylated templates under the background of unmethylated templates. Each concentration experiment was repeated five times.
The amplification system of MSP was shown in Supplemental Table S4. The test employed a triplex PCR with a FAM channel for the internal reference gene ACTB, a ROX channel for NTMT1 (dual-strands), and VIC channel for MAP3K14-AS1 (dual-MGB probes). The PCR procedure was pre-denaturation at 95°C for 5 min (step 1), denaturation at 95°C for 15s (step 2), annealing at 60°C for 30s (step 3), repeating step 2 and 3 forty-five times in plasmid samples and fifity times in plasma samples on the 7500 Fast Real-Time PCR System (Applied Biosystems, USA), the cycle threshold (Ct) was assigned to 50 for the target without amplification.
Statistical analysis
The data processing and analysis in this study were conducted using R software (version 4.1.0). The ‘glm’ function was employed to fit the logistic regression model with a parameter of ‘family=binomial’. Receiver operating characteristic (ROC) curve analysis was performed using the ‘pROC’ package, and the area under the ROC curve (AUC) was subsequently calculated to assess the test classification performance. The optimal sensitivity and specificity of the test were estimated when Youden’s index reached its maximal value. Sensitivity and specificity were calculated using the following formulas:
And the Youden index = sensitivity+specificity-1.
The rank-sum test and Kruskal test were employed for the comparisons between two groups and the comparisons between multiple groups, respectively. The Chi-square test was utilized for the comparisons between categorical variables. The Other statistical methods used in this study were described in the corresponding results.