Identification of the predictive signature for CIMP status of RCC
Fig. 1 describes the flowchart of this study. The GSE39582 dataset including the largest sample size of stage II and III RCC with CIMP status was used as the training data for selecting an REOs-based signature. Firstly, we identified 2209 DE genes between the 64 CIMP+ RCC samples and the 117 CIMP- RCC samples (limma test, FDR < 0.01). From all gene pairs consisted of at least one DE gene, we extracted 383,591 CIMP-related gene pairs whose specific REOs patterns occurred more frequently in the CIMP+ than in the CIMP- samples (Fisher’s exact test, FDR < 0.01). Then, 53 panels of gene pairs were got within different range of FD value. After a redundancy removal process for each panel of gene pairs, we calculated the largest F-score with the optimal vote rule (Fig. 2a, see Methods). Finally, the 19 gene pairs, which obtained the largest F-score within the range of FD more than 0.58, were denoted as 19 gene pairs signatures (19-GPS) for predicting CIMP status of stage II and III RCC (Fig. 2b).
A sample was predicted as CIMP+ if the REOs of at least 12 gene pairs in 19-GPS voted for CIMP+; otherwise the CIMP-. According to the classification rule, the F-score of the signature in the training data was 0.91 (Table 1) with a sensitivity of 0.91 and a specificity of 0.90, and the AUC of ROC curve was 0.95 (95% CI: 92.08%-97.83%) (Fig. 3a).
Table 1. The performance of 19-GPS for right-sided colon cancer (RCC) samples in the training and validation datasets
|
pre-CIMP+
(CIMP+:CIMP-)
|
pre-CIMP-
(CIMP+:CIMP-)
|
sensitivity
|
specificity
|
F-score
|
GSE39582
|
70(58:12)
|
111(6:105)
|
0.91
|
0.90
|
0.95
|
GSE39084
|
11(6:5)
|
8(0:8)
|
1
|
0.62
|
0.76
|
GSE25070
|
6(5:1)
|
7(1:6)
|
0.83
|
0.86
|
0.85
|
E-TABM-328
|
13(10:3)
|
9(8:1)
|
0.91
|
0.73
|
0.81
|
Total RCC
|
100(79:21)
|
135(15:120)
|
0.84
|
0.85
|
0.85
|
Note: The CIMP+ and CIMP- represented the original CIMP status; pre-CIMP+ and pre-CIMP- represented the CIMP status predicted by 19-GPS.
Based on the knowledge that stage II and III CIMP+ RCC patients treated with surgery alone have better prognoses than CIMP- RCC patients [4, 5], we evaluated the reliability of 19-GPS through survival analysis. In the training dataset containing 31 samples of stage III RCC patients treated with surgery alone, one of the 16 original CIMP- samples was reclassified as CIMP+ by 19-GPS (Supplementary Table 1, Additional File 1). The survival analysis showed that the RFS of the 16 predicted CIMP+ patients was significantly longer than the 15 predicted CIMP- patients (log-rank P = 4.90e-3, HR = 0.14, 95% CI = 0.03-0.68, Fig. 4a), which was more significant than the difference between patients with the original CIMP status due to the reclassified sample (log-rank P = 5.24e-3, HR = 0.15, 95% CI = 0.03-0.69, Fig. 4b). It is also known that stage III CIMP- RCC patients treated with 5-Fu-based ACT have better outcomes than patients treated with surgery alone [4]. In the 41 stage III RCC samples of training data for patients receiving 5-Fu-based ACT, 2 of the 29 original CIMP- samples were reclassified as CIMP+ by 19-GPS, and 2 of the 12 original CIMP+ samples were reclassified as CIMP- (Supplementary Table 1, Additional File 1). The survival analysis showed that the RFS of the 29 predicted CIMP- patients receiving 5-Fu-based ACT was significantly longer than the 15 predicted CIMP- patients treated with surgery alone (log-rank P = 5.97e-3, HR = 0.27, 95% CI = 0.10~0.73, Fig. 4c), which was more significant than the different between original CIMP- patients treated with 5-FU-based ACT and surgery alone (log-rank P = 1.69e-2, HR = 0.33, 95% CI = 0.13~0.85, Fig. 4d). The survival analysis validated that 19-GPS could perform better for predicting CIMP status of stage II and III RCC patients than current methods.
There were 12 CIMP- and 6 CIMP+ samples reclassified by 19-GPS in the total of stage II and III RCC of training dataset. We contrasted the gene expression patterns of the 18 signature-disconfirmed samples with the 163 signature-confirmed samples through hierarchical clustering analysis. Firstly, we identified 4685 DE genes between the 58 signature-confirmed CIMP+ samples and the 105 signature-confirmed CIMP- samples in the training dataset (limma test, FDR < 0.01). Secondly, using the expression measurements of the top 100 significant DE genes, the samples were classified into two subgroups using the complete linkage hierarchical clustering analysis based on the Euclidean distance (Fig. 5a). The results showed that all of the samples reclassified as CIMP+ and CIMP- were clustered with the group of signature-confirmed CIMP+ and CIMP- samples, respectively. The gene expression patterns validated the correctness of 19-GPS in training dataset.
Validation of 19-GPS in independent datasets
In three validation datasets (GSE39084, GSE25070 and E-TABM-328) of RCC samples, the CIMP status of samples was predicted based on 19-GPS. In GSE25070, TMEM150C and CCDC170 included in 19-GPS were not detected by Illumina Human Ref-8v3.0 expression beadchip, which resulted in 17 gene pairs available for classification. Then we observed that the classifier of 17 gene pairs achieved the largest F-score when requiring that at least 10 of 17 gene pairs voted for CIMP+ determination in the training dataset, so the vote rule was regarded as the optimal vote rule in GSE25070. Similarly, in E-TABM-328, 18 gene pairs were detected by Whole Human Genome Microarray 4x44K, and CIMP+ determination could be voted by at least 11 of 18 gene pairs as the optimal vote rule. The F-score of the signature were 0.76, 0.85 and 0.81 in GSE39084, GSE25070 and E-TABM-328. The AUC of ROC were 97.44% (95% CI: 91.37%-100%), 91.67% (95% CI: 61. 68%-100%) and 82.23% (95% CI: 70.59%-100%) (Fig. 3b, c, d).
Because the therapeutic and survival information was unavailable in three validation datasets, we compared the gene expression patterns of the signature-disconfirmed samples with the signature-confirmed samples through hierarchical clustering analysis in the validation datasets. Using the expression levels of the top 100 significant DE genes between the signature-confirmed CIMP+ and CIMP- samples (limma test, FDR < 0.01), the samples were classified into two subgroups using the hierarchical clustering analysis (Fig. 5b, c, d). In GSE39084, the result showed 4 of 5 CIMP- samples reclassified as CIMP+ by our signature were clustered with the group of signature-confirmed CIMP+ samples. The similar results were observed in GSE25070 and E-TABM-328 that all of the samples reclassified as CIMP+ and CIMP- were clustered with the group of signature-confirmed CIMP+ and CIMP- samples, respectively. These results provided transcriptional evidence of the correctness of the prediction of 19-GPS.
The differentially methylated CpG sites and expressed genes between CIMP+ and CIMP- samples
The CIMP+ status is characterized by high frequency of promoter hypermethylation whose regions almost locate in tumor suppressor genes [30]. We used the datasets detected both gene expression and DNA methylation profiles to select the differentially methylated CpG sites between predicted CIMP+ and CIMP- samples (match GSE25070 to GSE25062 and match GSE79793 to GSE79794). The CIMP status predicted by 19-GPS in GSE25070 was used in GSE25062. Then, the 1581 hypermethylated CpG sites were selected between the predicted CIMP+ and CIMP- samples in GSE25062 (limma test, P < 0.05, Fig. 6a). The hypermethylated CpG sites located in the regions of 26 tumor suppressor genes which were downloaded from The Cancer Gene Census containing 316 tumor suppressor genes (https://cancer.sanger.ac.uk/census). Meanwhile, the 1147 hypermethylated CpG sites were selected between original CIMP status samples in GSE25062, and they located in the regions of 15 tumor suppressor genes (limma test, P < 0.05, Fig. 6b). The results showed that the predicted CIMP+ samples had much more hypermethylated CpG sites and tumor suppressor genes than the original CIMP+ samples.
And then, we calculated the number of hypermethylated CpG sites and tumor suppressor genes of predicted CIMP+ samples based on the same method in GSE79793 and GSE79740. Compared with the predicted CIMP- samples, the predicted CIMP+ samples had 3124 hypermethylated CpG sites which located in the regions of 57 tumor suppressor genes, (limma test, P < 0.05, Fig. 6c). Because the samples in above datasets had no original CIMP labels, so we could not compare the difference of the number of hypermethylated CpG sites and tumor suppressor genes between the predicted and original CIMP status. Moreover, the 552 hypermethylated CpG sites between predicted CIMP+ and CIMP- samples were identified in both GSE25062 and GSE79740, which were not randomly distributed among all of the hypermethylated CpG sites (P < 2.2e-16, Hypergeometric test).
Besides, we selected 4771 DE genes between the predicted CIMP+ and CIMP- samples, which were more than 2209 DE genes among the original samples in the training dataset (limma test, FDR <0.05). This indicated that the differences in methylation and gene expression patterns between the predicted CIMP+ and CIMP- samples were more significant than the original samples. In conclusion, the differentially methylated CpG sites and expressed genes analysis provided the evidence that the characteristic of predicted CIMP status of samples conformed to the truly biological property.
The robustness against varied proportions of tumor epithelial cell
Some reports show the qualitative signatures based on REOs of gene pairs are robust against to varied proportions of tumor epithelial cell [20]. To validate the robustness of 19-GPS, our laboratory collected 13 fresh-frozen primary tumor tissue samples through surgical excision. Fresh-frozen primary tumor tissue samples were retrospectively collected at Union Hospital of Fujian Medical University. And the 13 solid tumor tissue samples were from 5 patients whose excisions were from different sampling positions with different information of ‘percentage of tumor cells’ as shown in Table 2. The institutional ethical review boards of Union Hospital of Fujian Medical University approved the protocol, and all patients signed informed consents before sample collection. And we used the fragments per kilobase of exon model per million mapped fragments to quantify the gene expression level from RNA sequencing data. Then, we used 16 gene pairs available for 19-GPS to predict the CIMP status of 13 samples. And the gene expression levels of 19-GPS were detailed in (Supplementary Table 1, Additional File 2). There were 4 of 5 patients containing samples with different percentage of tumor cells predicted the same CIMP status, and 2 of 3 samples of the one remaining patient were also predicted the same CIMP status (Table 2). The results indicated that the performance of 19-GPS was not effected by varied proportions of tumor epithelial cell.
Table 2. The predicted CIMP status of samples with different percentage of tumor cells
Sample ID
|
Percentage of tumor cells(%)
|
Predicted CIMP status
|
HCF1
|
40
|
Negative
|
HCF2
|
100
|
Negative
|
HCF3
|
100
|
Negative
|
LGL1
|
50
|
Negative
|
LGL2
|
90
|
Positive
|
LGL3
|
90
|
Positive
|
SDL1
|
100
|
Negative
|
SDL2
|
100
|
Negative
|
WCY1
|
60
|
Negative
|
WCY2
|
100
|
Negative
|
WCY3
|
100
|
Negative
|
ZCH1
|
70
|
Negative
|
ZCH3
|
40
|
Negative
|