3.1 Identification of differential genes between normal tissues and colon cancer
In the TCGA-COAD dataset, there are 398 colon tumor samples and 39 normal samples. The gene differential expression profile of normal tissues and colon cancer of patients were calculated using the R project. A total of 1593 genes were identified as DEGs with FDR value < 0.05 and |log2FC| > 2.
3.2 Construction of a 13-genes signature prognostic model in the training set
Among the 1593 genes, the DEGs in each patient was displayed. The patients were randomly divided into training set and testing set. According to clinical information, we combined the survival time of the training set with DEGs expression to analyze the genes that may affect the patients’ clinical prognosis. Through Cox regression analysis of the training set, the single factor significance was defined as P value < 0.01, and prognosis-related genes were selected. The hazard ratio (HR) and the 95% confidence interval (CI) of each variable were shown in Table 1.There were a total of 48 genes related to prognosis. The candidate genes were further reduced to 15 genes through the lasso regression method to reduce the overfitting of the model (Fig. 1). Finally, through the following stepwise variable selection procedure, we established an optimal Cox proportional hazard model containing 13 key genes: MIR6835(microRNA 6835),TYRO3(TYRO3 protein tyrosine kinase),SRCIN1(SRC kinase signaling inhibitor 1),HOXC11(homeobox C11),CLDN23(claudin 23),UPK2(uroplakin 2),TNNI3(troponin I3),KLHL35(kelch like family member 35),KRT6B(keratin 6B),HAND1(heart and neural crest derivatives expressed 1),IL23A(interleukin 23 subunit alpha),LINC02474(long intergenic non-protein coding RNA 2474),SIX2(SIX homeobox 2)(Table 2). Thepredicted risk score can be calculated with the formula: 0.325257866*MIR6835 + 0.367503693*TYRO3 + 0.418341346*SRCIN1 + 0.134589791*HOXC11 + 0.123203974*CLDN23 + 0.205044971*UPK2 + 0.236533736*TNNI3 + 0.15943496*KLHL35 + 0.009257475*KRT6B + 0.083391046*HAND1 + 0.09208856*IL23A + 0.255146739*LINC02474 + 0.079143843*SIX2. The forest graph of risk genes was shown (Fig. 2A).The samples of the training set were separated into high-risk and low-risk subgroups according to the median value of risk scores. In order to evaluate the diagnosis prediction value of the model, the time-dependent survival ROC curves were also established. At the training set, the predictive capability of the genes remains well, with an AUC of 0.823(Fig. 2B).The Kaplan-Meier survival curve also showed that the survival rate of the high-risk group was significantly reduced (P < 0.01,Fig. 2C).The risk distribution plot and heat map of gene expression in the training set are displayed (Fig. 2D–2F).
Table 1
The hazard ratio (HR) and the 95% confidence interval (CI) of 48 genes related to prognosis
Id
|
HR
|
HR.95L
|
HR.95H
|
P value
|
KLHL35
|
1.126807
|
1.071441
|
1.185034
|
3.41E-06
|
FOXD1
|
1.189012
|
1.095268
|
1.290778
|
3.60E-05
|
TYRO3
|
1.577604
|
1.243019
|
2.00225
|
0.000178
|
FGF19
|
1.092626
|
1.042115
|
1.145587
|
0.000244
|
HOXC11
|
1.290783
|
1.122404
|
1.484421
|
0.000345
|
SIX2
|
1.129378
|
1.056494
|
1.20729
|
0.000351
|
ZIC5
|
1.441737
|
1.174138
|
1.770324
|
0.000479
|
MIR6835
|
1.367411
|
1.144605
|
1.633587
|
0.000564
|
FXYD4
|
1.103671
|
1.041967
|
1.169029
|
0.000778
|
SCG2
|
1.144913
|
1.057974
|
1.238996
|
0.000783
|
PKIB
|
1.09021
|
1.035758
|
1.147526
|
0.000954
|
KREMEN2
|
1.316959
|
1.117901
|
1.551461
|
0.000991
|
CCDC78
|
1.310785
|
1.112871
|
1.543897
|
0.001193
|
IL23A
|
1.072684
|
1.028007
|
1.119302
|
0.001227
|
HAND1
|
1.064342
|
1.024794
|
1.105416
|
0.001248
|
UPK2
|
1.298285
|
1.107812
|
1.521508
|
0.001261
|
OTX1
|
1.62325
|
1.205182
|
2.186342
|
0.001431
|
LINC02474
|
1.290539
|
1.098895
|
1.515606
|
0.001872
|
FEZF1-AS1
|
1.219718
|
1.07338
|
1.386006
|
0.00232
|
LINC00941
|
1.481484
|
1.15003
|
1.908467
|
0.002352
|
AC022144.1
|
1.281687
|
1.09177
|
1.504641
|
0.002422
|
KRT6B
|
1.010332
|
1.003592
|
1.017117
|
0.002614
|
CXCL5
|
1.004519
|
1.001571
|
1.007475
|
0.002642
|
SLC7A11
|
1.122323
|
1.040905
|
1.210109
|
0.00267
|
KCNJ14
|
3.356977
|
1.513859
|
7.444087
|
0.002878
|
PTMAP4
|
1.019718
|
1.006701
|
1.032904
|
0.002894
|
ERFE
|
1.2022
|
1.064423
|
1.357811
|
0.003024
|
TNNI3
|
1.350007
|
1.105976
|
1.647882
|
0.003176
|
NTN1
|
1.437744
|
1.127424
|
1.83348
|
0.003425
|
TP73
|
1.416373
|
1.11812
|
1.794183
|
0.003909
|
CLDN23
|
0.869051
|
0.789565
|
0.956539
|
0.004132
|
GNG8
|
1.095393
|
1.028951
|
1.166126
|
0.004319
|
TLX1
|
1.381873
|
1.098554
|
1.738259
|
0.005729
|
AC092112.1
|
1.085882
|
1.024198
|
1.151281
|
0.005758
|
SCARNA22
|
0.306391
|
0.131481
|
0.713983
|
0.006135
|
MNX1-AS1
|
1.15757
|
1.042099
|
1.285837
|
0.006351
|
PRAME
|
1.100824
|
1.027258
|
1.179658
|
0.006488
|
AC004080.1
|
1.096815
|
1.025262
|
1.173362
|
0.007258
|
LRMP
|
1.262019
|
1.064141
|
1.496693
|
0.007486
|
MDFI
|
1.093243
|
1.023913
|
1.167267
|
0.007655
|
NPTX1
|
1.232221
|
1.056334
|
1.437396
|
0.007875
|
AJUBA
|
1.178079
|
1.042487
|
1.331306
|
0.008616
|
SRCIN1
|
1.387746
|
1.084955
|
1.77504
|
0.009075
|
MYC
|
0.991024
|
0.984326
|
0.997768
|
0.009164
|
AC037487.3
|
1.433162
|
1.093202
|
1.878842
|
0.009188
|
VGF
|
1.023501
|
1.005659
|
1.04166
|
0.009628
|
TNNT1
|
1.052653
|
1.012541
|
1.094354
|
0.009634
|
Table 2
13 key genes were established by Cox proportional hazard model
id
|
coef
|
HR
|
HR.95L
|
HR.95H
|
MIR6835
|
0.325258
|
1.384388
|
1.109466
|
1.727434
|
TYRO3
|
0.367504
|
1.444125
|
1.053595
|
1.979411
|
SRCIN1
|
0.418341
|
1.519439
|
1.082098
|
2.133536
|
HOXC11
|
0.13459
|
1.144067
|
0.963941
|
1.357853
|
CLDN23
|
-0.1232
|
0.884083
|
0.786796
|
0.993401
|
UPK2
|
0.205045
|
1.22758
|
0.991952
|
1.51918
|
TNNI3
|
0.236534
|
1.26685
|
1.007453
|
1.593037
|
KLHL35
|
0.159435
|
1.172848
|
1.099408
|
1.251194
|
KRT6B
|
0.009257
|
1.0093
|
1.001587
|
1.017073
|
HAND1
|
0.083391
|
1.086967
|
1.03609
|
1.140342
|
IL23A
|
0.092089
|
1.096462
|
1.041242
|
1.15461
|
LINC02474
|
0.255147
|
1.290651
|
1.044759
|
1.594415
|
SIX2
|
0.079144
|
1.08236
|
1.005181
|
1.165465
|
3.3 External validation in testing set
The signature’s predictive performance was further validated in the testing set. Using the same formula established in the training set, the risk scores of each patient in the testing set was calculated based on the relative expression levels of 13 genes. In the testing set, the AUC of the 13-gene signature used to predict patient survival was 0.702 (Fig. 3A). The Kaplan-Meier curve generated from the testing set showed that the survival outcome of low risk group patients was significantly better than that of high risk group patients(Fig. 3B), which indicated that it has moderate diagnostic performance. The risk distribution plot and heat map of gene expression in the testing set were displayed(Fig. 3C-E). These results suggested that the prognostic model has better sensitivity and specificity.
3.4 External comparing of gene expression in Oncomine datasets
However, HOXC11, KRT6B, SRCIN1, TNNI3, TYRO3 were relatively highly expressed in COAD tissue. Besides, the protein expression of CLDN23, HAND1, IL23A, KLHL35, SIX2, UPK2 was not significantly different between COAD tissue and normal lung tissue (Fig. 4).
3.5 Survival analysis of the prognostic 13 candidate genes signature in COAD
In order to further investigate the specific relationship between the 13 individual genes with OS in colon cancer patients, a comprehensive survival analysis was performed using the Kaplan-Meier method. The results showed that five genes (SIX2 MIR6835 LINC02474 CLDN23 HOXC11) were significantly associated with OS, while the other candidate genes were no significantly correlated with OS in COAD (Fig. 5).
3.6 Pathway enrichment analysis
To further investigate the correlated pathway of each candidate gene in COAD, KEGG pathway enrichment analysis of these genes was obtained with GSEA. The upregulated and downregulated pathways of each single gene were shown in GESA multiple pathway graph (Fig. 6).The GESA pathway results suggested that most pathways are related to the occurrence, metabolism, proliferation and invasion of the tumor cells.
3.7 Identifying the mutation status of candidate genes in colon cancer
The genetic alterations of 13 candidate genes were examined to determine their role in colon cancer genes in GEPIA database. Among the 110 COAD patients, a total of 30 (27.27%) patients had changes (cbioportalCPTAC-2 Prospective, Cell 2019 cohort) (Fig. 7). The alteration rates of CLDN23, TYRO3 and SRCIN1 in this dataset were 9%, 9% and 10% respectively. Frequent genetic alterations indicated that these genes play a vital role in the development of colon cancer.