Summary of datasets
There were 20,501 genes in 454 tumor and 41 normal samples for the dataset of gene expression of COAD in TCGA. The genes were removed if they did not express in more than 50% samples, and then 17,914 genes were left for further study. The first validation set (GSE21510) included 123 tumor samples and 25 normal samples. The second validation set (GSE39582) included 585 tumors among which 579 tumors had survival information and 19 patients had adjacent nontumor tissues.
The clinical data were matched to the gene expression profile. Among the 454 patients, 450 patients had clinical information with 395 patients alive and 45 patients dead. However, due to missing data of our selected confounders, only 258 samples were kept for the multivariate Cox regression analysis. The overall information of the 258 patients was listed in Table S1.
The summary of the mutation was drawn by maftools (Figure S1). There were 9 types of mutations in the MAF file. The number distribution for each type of mutation was shown by a bar plot and the one with the maximum frequency was Missense_Mutation; SNP was the most common variant type; the most common SNV type was C>T; variants per sample distribution were presented by a stacked barplot; variant types were displayed as a boxplot summarized by Variant_Classification; the top 10 genes (TTN, APC, MUC16, SYNE1, TP53, FAT4, KRAS, RYR2, PIK3CA, and ZFHX4) with the most mutations were shown by a stacked barplot (Figure S1).
STRING PPI data with lab score>300 includes 15,436 genes and 217,626 interactions. After mapping 17,914 genes with expression information on PPI, a background network was constructed for the study, which involved 13,235 genes and 164,115 interactions.
SSN analysis reveals a functional network for cancer
Through the sample-specific network (SSN) method, SSNs were constructed for every tumor and normal sample with all 41 normal samples as reference samples (Figure 1). Then, specific SSN for each tumor sample was constructed by deleting edges presented in SSNs of any normal samples. Finally, a functional network involving 1063 genes and 1440 edges was formed for COAD by collecting edges that appeared in more than 90% specific SSNs of tumor samples (Figure 2A). The 1063 genes and 1440 edges in the functional network were regarded as functional genes and interactions, respectively. A histogram plot was drawn to show the distribution of node degrees in the functional network (Figure S2). Particularly, 185 genes with node degree over 3 were chosen as core functional genes (Figure 2A).
Enrichment analysis of functional genes
Functional genes were submitted for further enrichment analysis of GO and KEGG pathways with DAVID, respectively. The GO analysis of functional genes suggested that they were significantly enriched in rRNA processing, negative regulation of transcription from RNA polymerase II promoter, positive regulation of transcription from RNA polymerase II promoter, positive regulation of transcription, DNA-templated, G1/S transition of mitotic cell cycle, canonical Wnt signaling pathway and so on (Figure 2B and Supplementary Table S2). In the KEGG pathway analysis, functional genes were significantly enriched in pathways in cancer, proteoglycans in cancer, cell cycle, Hippo signaling pathway, and Wnt signaling pathway (Figure 2C and Supplementary Table S3). From both functional and pathway enrichment analysis, we can see that our identified functional genes are related to cancer.
GO and KEGG pathway analysis were also carried out for the 185 core functional genes. And it was shown for GO analysis that they were significantly enriched in rRNA processing, positive regulation of telomerase RNA localization to Cajal body, positive regulation of telomere maintenance via telomerase, positive regulation of protein localization to Cajal body, positive regulation of transcription from RNA polymerase II promoter and so on (Supplementary Table S4). KEGG pathway enrichment analysis suggested that the core functional genes were mainly related to ribosome biogenesis in eukaryotes, cell cycle, HTLV-I infection, Wnt signaling pathway, progesterone-mediated oocyte maturation and so on (Supplementary Table S5). The results suggested that the 185 core functional genes are likely to play important roles in COAD.
Furthermore, the p-values were calculated using the hypergeometric distribution (1) for different top N ranked functional gene sets enrichment analysis with the five known cancer gene sets including the curated gene sets in pathway in cancer, colorectal cancer, cancer gene census, pan-caner driver genes, and cancer driver genes. The different functional gene sets which were ranked by node degree include: top 11 functional genes with node degree over 19; top 52 functional genes with node degree over 9; top 79 functional genes with node degree over 7; top 109 functional genes with node degree over 5; top 185 functional genes with node degree over 3; top 457 functional genes with node degree over 1; and all 1063 functional genes. The results showed that almost every functional gene set was enriched in all the five known cancer gene sets except that the top 52 functional genes set was not enriched in colorectal cancer and pan-caner driver genes sets; top 108 functional gene set was not enriched in colorectal cancer set (Figure 3A). Particularly, the 1063 functional genes were enriched in pathway in cancer (p-value=0), colorectal cancer (p-value=3.26×10−5), cancer gene census (p-value=3.17×10−10), pan-caner driver genes (p-value=1.16×10−7), and cancer driver genes (p-value=3.94×10−10) (Figure 3A). A Venn diagram was drawn to show the comparison of 1063 functional genes and the five known cancer gene sets (Figure 3B). The results showed that 526 genes in pathway in cancer were obtained from KEGG, 92 of which appeared in functional genes; eighty-six genes in colorectal cancer were obtained from KEGG, 15 of which appeared in functional genes; six hundred and ninety-nine known cancer genes were obtained from the Cancer Gene Census database, 77 of which appeared in functional genes; four hundred and thirty-five pan-cancer driver genes were obtained from a pan-cancer study, 50 of which showed in functional genes; two hundred and ninety-nine driver genes were obtained from a comprehensive study of driver genes, 44 of which displayed in functional genes. And the 185 core functional genes were also enriched in pathway in cancer, colorectal cancer, cancer gene census, pan-caner driver genes, and cancer driver genes with p-value of 1.42×10−6, 7.60×10−3 1.47×10−6, 8.92×10−8, 1.84×10−5, respectively. A Venn diagram was drawn to show the comparison of 185 core functional genes and known cancer gene sets (Figure 3C). From the results, we are confident to conclude that our identified functional genes are indeed correlated with cancers.
Classification between tumor and normal samples by functional genes
To investigate the ability of the 185 core functional genes to classify normal and tumor samples, we used an SVM model with 5-fold cross-validation to discriminate tumor samples from normal samples based on the expression profile of 185 core functional genes for both the training dataset (TCGA dataset) and independent validation dataset (GSE21510 dataset). ROC curves were drawn to show the prediction accuracy of the 185 core functional genes to discriminate tumor samples from normal samples (Figure 4A-B). The results showed that the 185 core functional genes had a high area under the curve (AUC) for both the training dataset (AUC = 0.99994) and validation dataset (AUC = 1), indicating that the 185 core functional genes were potential biomarker candidates for COAD diagnosis.
Furthermore, hierarchical clustering was also performed using gene expression data of the 185 core functional genes for both the TCGA and validation datasets. The clustering results showed a high degree of separation of the tumor and normal samples by using the 185 core functional genes (Figure 4C-D) in both the training and validation datasets. And the heatmap for 1063 functional genes in TCGA dataset was also drawn in Figure S3, which showed similar results with the 185 core functional genes. The classification results further confirmed that functional genes could be used as diagnostic biomarkers for COAD.
SSN analysis uncovers major subtypes of COAD
Using gene expression of the core functional genes, consensus clustering method obtained 2 to 10 clusters. Then the k=6 clustering solution was selected for further investigation. For the k=6 clustering solution formed six different subtypes, referred to here as “c1” through “c6” (Table 1). The six subtypes of COAD included: c1 subtype with 38 cases (comprising 8.37% of tumor samples); c2 subtype with 138 cases (30.40%); c3 subtype with 99 cases (21.81%); c4 subtype with 85 cases (18.72%); c5 subtype with 38 cases (8.37%) and c6 subtype with 56 cases (12.33%) of COAD cases. The six subtypes could provide useful information about personalized medicine.
The above subtypes were each characterized by different molecular patterns. For each of the six subtypes, the top 100 upregulated DEGs and top 100 downregulated DEGs were identified by comparing each subtype with the rest subtypes. For these detected top up- and down-regulated DEGs, the one which appeared in only one subtype was kept for each subtype. Finally, there were in all 1003 DEGs, including 161 up- and 157 down-regulated DEGs for subtype c1, 101 up- and 19 down-regulated DEGs for subtype c2, 108 up- and 31 down-regulated DEGs for subtype c3, 6 up- and 76 down-regulated DEGs for subtype c4, 51 up- and 130 down-regulated DEGs for subtype c5, 22 up- and 141 down-regulated DEGs for subtype c6. The heat map of the 1003 DEGs displayed different expression patterns for different subtypes (Figure 5A). Finally, we used R packages clusterProfiler to compare these representative DEGs for each subtype by their enriched biological processes and KEGG pathways, with the cutoff of p-value<0.05. As illustrated in Figure S4, representative DEGs for different subtypes related to different biological processes, such as representative DEGs for subtype c1 related to cell cycle, subtype c2 related to the regulation of GTPase activity, subtype c3 related to the regulation of cell division, subtype c4 related to the regulation of mRNA polyadenylation, subtype c5 related to autophagy and subtype c6 related to development. Furthermore, as shown in Figure 5B, representative DEGs for different subtypes related to different KEGG pathways. Representative DEGs for subtype c1 were enriched in DNA replication, cell cycle, mismatch repair, and p53 signaling pathway and so on; representative DEGs for subtype c2 were enriched in mTOR signing pathway and MAPK signaling pathway; representative DEGs for subtype c3 were enriched in spliceosome, antigen processing and presentation, estrogen signaling pathway and mRNA surveillance pathway; representative DEGs for subtype c4 were enriched in viral carcinogenesis and spliceosome, and so on; representative DEGs for subtype c5 were enriched in Rig-I-like receptor signaling pathway, FoxO signaling pathway, autophagy-animal, insulin signaling pathway, toxoplasmosis, and focal adhesion, and so on; and representative DEGs for subtype c6 were enriched in glycosaminoglycan biosynthesis-heparan sulfate, tight junction, circadian rhythm, ECM-receptor interaction and so on. Therefore, the six subtypes showed different pathological mechanisms, which implied that they should be treated with different methods.
Most samples in subtype c4-c6 were retrospective samples, while many samples in subtype c2 were not retrospective samples (Figure 5C). Most patients in subtype c1 showed mss (MicroSatellite stability), while many patients in subtype c3 showed msi-h (MicroSatellite Instability-High) feature (Figure 5C). Many patients in both subtype c2 and c3 were white people (Figure 5C). The six subtypes had a totally different meaning of tumor stages (Figure 5C). The mutations of OBSCN (Obscurin), MYCBP2 (MYC Binding Protein 2), RYR2 (Ryanodine Receptor 2) and TTN (Titin) were most frequent in subtype c3 (Figure 5C). Copy loss of TCF7L2 (Transcription Factor 7 Like 2) and RPL18P1 (Ribosomal Protein L18 Pseudogene 1) were frequent in subtype c1, while copy loss of ARHGEF28 (Rho Guanine Nucleotide Exchange Factor 28), BIN2P2 (Bridging Integrator 2 Pseudogene 2) and SLC25A5P9 (Solute Carrier Family 25 Member 5 Pseudogene 9) were frequent in subtype c6 (Figure 5C). It will provide recommendations for the treatment of the six subtypes of COAD with these identified subtype-specific clinical and somatic alteration features.
Survival analysis was performed on 450 tumor samples with clinical data (Figure 6A). Significant survival differences between the six subtypes were observed (Figure 6A, p-value = 3.05×10−3, log-rank), suggesting that the classification showed biological significance. To further validate the results, survival analysis was further performed in an external dataset (GSE39582), which also showed survival differences between six subtypes (Figure 6B, p-value = 2.21×10−4, log-rank). Both results suggested that patients in different subtypes had different survival rates, which may help doctors develop rational treatments for patients based on the subtypes to which they belong.
Specific functional genes and edges associated with survival
To further investigate the potential of functional genes and interactions as prognosis biomarkers for COAD. All 1063 functional genes and 1440 functional interactions were analyzed for their prognostic significance of overall survival. For each functional gene, all COAD patients were classified into the low- or high- expression group, according to the median expression level. For each functional edge, patients were also classified into the low- or high- group based on the median level for the functional edge. Functional genes and interactions with p-value less than 0.05 in the log-rank test for both univariate analysis and multivariate analysis were selected as prognosis biomarkers for COAD. Survival analysis suggested that 12 functional genes and 13 functional interactions were associated with the overall survival of patients of COAD (Figure 7 and Figure S5, Table S6-S7), which demonstrates that they could be prognosis biomarkers for COAD.