Overview of prognostic genes. To study prognostic genes systematically, we obtained 25 small prognostic gene sets (ranging from 3 to 330 genes) from 23 high-quality publications (Table 1). Consistent with a previous study [19], these gene sets overlapped very little and shared few network connections. Only 14 genes appeared three times across these small gene sets (see Supplementary Table 1 for details). Taking into account the number of gene sets and cancer types, we merged the gene sets that contained fewer genes and finally obtained nine large prognostic gene sets (PGS), which together comprised 1,439 prognostic genes (PG) after gene names were normalized and duplicates removed. For comparison, we also selected four other gene sets: the cancer gene set (CA), essential gene set (ES), housekeeping gene set (HK), and metastasis-angiogenesis gene set (MA) (Supplementary Table 2). To investigate their network properties, we employed two protein-protein interaction (PPI) networks, HPRD and String, both of which exhibited power-law node-degree distributions (Figure 1A and B) [23]. Figure 1C shows that cancer prognostic genes are discretely distributed in the HPRD network. Only three of the 14 genes that appeared three times were directly connected by edges in the HPRD network.
Four network centralities of prognostic genes. For prognostic genes, we first investigated four network centralities: degree, betweenness, closeness, and eigenvector. Each measures the importance of a node in a given network from a different perspective, and larger values indicate greater importance in the network [12]. Based on the HPRD and String networks, we calculated the four centralities for all 1,439 prognostic genes, for the background (the mean over all nodes in the network), and for the four other gene sets. The results are shown in Figure 2. Like those of ES, the degree and betweenness of PG were lower than the background, whereas those of CA and MA were clearly higher than the background in both PPI networks (Figure 2A-D). In Figure 2E-H, however, the closeness of PG and of the four other gene sets was considerably higher than the background, while the eigenvector centrality of PG differed from that of CA and MA and was always lower than the background in both the HPRD and String networks. The eigenvector centrality of CA and MA, as well as the degree and betweenness of HK, were inconsistent between the two networks, probably because the String network contains more nodes and edges [24].
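As an illustration of the comparison described above, the following minimal sketch computes two of the four centralities (degree and closeness) on a toy undirected graph and compares the mean over a gene set with the background mean over all nodes. The graph, the node names, and the helper functions are hypothetical and are not taken from the HPRD or String data; betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj):
    """Number of neighbours divided by the maximum possible (n - 1)."""
    n = len(adj)
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}

def closeness_centrality(adj):
    """Reachable nodes divided by the sum of distances to them."""
    result = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        total = sum(dist.values())
        result[u] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

def mean_over(values, nodes):
    """Mean centrality over a collection of nodes."""
    nodes = list(nodes)
    return sum(values[n] for n in nodes) / len(nodes)

# Toy network (hypothetical): hub H with leaves A, B, C, and D attached to C.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

gene_set = {"A", "D"}  # hypothetical gene set made of peripheral nodes
deg = degree_centrality(adj)
clo = closeness_centrality(adj)

set_degree = mean_over(deg, gene_set)
background_degree = mean_over(deg, adj)  # background = mean over all nodes
set_closeness = mean_over(clo, gene_set)
background_closeness = mean_over(clo, adj)
```

In this toy case the gene set consists of peripheral nodes, so its mean degree falls below the background, which mirrors the kind of set-versus-background comparison reported in Figure 2.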
Overall, the results clearly showed three points. (1) CA had centralities very similar to those of MA, and both gene sets had significantly higher centralities than the other three gene sets, including PG (except for eigenvector in the String network; in all other cases the FDR-adjusted p-values of t tests were much smaller than 0.001). This illustrated that PG differed significantly from cancer-related genes in terms of the four network centralities. (2) Except for closeness, the centralities of PG were below the average of the whole network. This showed that, unlike cancer genes, prognostic genes did not occupy key positions in the network [18, 25]. (3) The four centralities of PG were not significantly different from those of ES (FDR-adjusted p-values of t tests were greater than 0.1 for all four centralities).
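The text reports FDR-adjusted p-values without naming the adjustment procedure. The sketch below shows the Benjamini-Hochberg step-up adjustment, which is one common choice; that this particular procedure was used here is an assumption, and the input p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw)
```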
Four network measures of prognostic gene sets. Most cancer prognostic biomarkers act as functional units within a gene set. It was therefore necessary to examine the network topological properties of gene sets. To this end, we first calculated the clustering coefficient (CC) for the nine PGS, the four other gene sets, and random gene sets. CC measures the tendency of the nodes in a graph to cluster together; larger CC values indicate that the nodes are more likely to form clusters in a network [26]. Figure 3A and B show the CC distributions. In the HPRD network, the nine PGS had slightly larger CC than the random gene sets (the p-value of the KS test was not significant). CA and MA also had larger CC than HK and ES. In the String network, which has a higher density, the nine PGS had significantly smaller CC than the random gene sets (KS test, p-value < 0.05), which showed that genes within the nine PGS were more sparsely connected than those in random gene sets. In contrast, HK had significantly larger CC than all the other gene sets (p-value of permutation test smaller than 0.001). This was probably because edges were more likely to form between HK genes in the String network, since HK genes have consistent expression patterns [27]. This is also evident from a comparison of the degree of HK in the two networks (Figure 2A and B).
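To make the CC comparison concrete, the following sketch computes the local clustering coefficient of each node and averages it over a gene set. Summarizing a gene set by the mean local CC of its members is an assumption made for illustration; the toy graph and gene names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair can form a triangle
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def mean_clustering(adj, nodes):
    """Mean local clustering coefficient over the given nodes."""
    nodes = [n for n in nodes if n in adj]
    return sum(local_clustering(adj, n) for n in nodes) / len(nodes)

# Toy network (hypothetical): triangle A-B-C plus a pendant node D on C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cc_set = mean_clustering(adj, {"A", "B"})  # hypothetical gene set
cc_background = mean_clustering(adj, adj)  # all nodes in the network
```

In practice the set-level value would be compared with the distribution obtained from many random gene sets of the same size, as done with the KS test in the text.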
The investigation of CC did not reveal a significant property common to the PGS in the network. We therefore proposed three other measures, intraset distance (IAD), interset distance (IED), and genset-distribution in modules (GDM), to examine the network properties of gene sets. IAD and IED describe the network distance within a gene set and between two gene sets, respectively (see the Methods section for details). Both are based on the shortest path (SP), which reflects the capacity for information transfer in a network [28]. Figure 3C and D show that the IAD of the nine PGS was significantly smaller than that of the random gene sets in both networks, indicating a more compact network structure within the PGS. Among the four other gene sets, CA and MA, compared with HK and ES, had clearly smaller IAD than the PGS; considering the two networks together, ES was not the gene set closest to the PGS in the IAD distribution.
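The exact definitions of IAD and IED are given in the Methods section and are not reproduced here. As a rough sketch only, the code below assumes that IAD is the mean shortest-path length over pairs of genes within one set and that IED is the mean shortest-path length over pairs drawn from two different sets; both assumptions, the function names, and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_pair_distance(adj, sources, targets):
    """Mean shortest-path length over connected (source, target) pairs."""
    lengths = []
    for s in sources:
        dist = bfs_distances(adj, s)
        lengths.extend(dist[t] for t in targets if t != s and t in dist)
    return sum(lengths) / len(lengths) if lengths else float("nan")

def intraset_distance(adj, genes):
    """Assumed IAD: mean shortest path between genes of the same set."""
    genes = [g for g in genes if g in adj]
    return mean_pair_distance(adj, genes, genes)

def interset_distance(adj, set_a, set_b):
    """Assumed IED: mean shortest path between genes of two sets."""
    a = [g for g in set_a if g in adj]
    b = [g for g in set_b if g in adj]
    return mean_pair_distance(adj, a, b)

# Toy network (hypothetical): a simple path A - B - C - D - E.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

iad_compact = intraset_distance(adj, {"A", "B"})
iad_spread = intraset_distance(adj, {"A", "E"})
ied = interset_distance(adj, {"A", "B"}, {"D", "E"})
```

Under these assumptions a smaller IAD corresponds to a more compact gene set, and a smaller IED corresponds to two gene sets lying closer together in the network.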
Similarly, we found that the IED among the PGS themselves was significantly smaller than that between the PGS and the random gene sets (Figure 3E and F). This result indicates that the nine PGS are not spatially loose but rather closely connected. We also found that the IED between the PGS and the four other gene sets was significantly smaller than that among the PGS themselves. One possible reason is that the PGS were derived from different cancer types. Moreover, the IED between PGS and CA or MA was smaller than that between PGS and the other two gene sets, which may suggest that the PGS are more closely related to cancer (Figure S1A and B). In addition, for both IAD and IED, the distances in the String network were smaller than those in the HPRD network.
Next, we used genset-distribution in modules (GDM) to investigate how the PGS are distributed within and between network modules. Figure 3G and H show that the GDM of the nine PGS was significantly larger than random in both networks, demonstrating that their genes are more likely to fall within the same module. We also found that the GDM of MA was the largest among the four other gene sets, and that the relative positions of CA and HK changed between the two networks. This may be due to the different modules derived from the two networks and to the complexity of the gene sets themselves.
Functional analysis of prognostic gene sets. We performed functional enrichment analysis of the nine PGS based on GO terms and the KEGG pathway database using Fisher's exact test. However, more than half of the gene sets were not significantly enriched for any functional term. Because genes with the same or similar functions tend to lie in the same module of a network [29], we then examined the functions of network modules containing two or more PGS. Interestingly, when we compared the functions of such modules from the two networks, we found that the functions they shared were mostly related to cancer. Figure 4 shows the intersection of the functions of module #4 of the String network and modules #5 and #7 of the HPRD network. Most of the functional terms could be attributed to hallmarks of cancer [30]. They included “Extracellular matrix organization”, “Leukocyte migration”, “Collagen metabolic process”, “Transforming growth factor beta receptor signaling pathway”, etc. Among them, “Extracellular matrix organization” was the most significant GO term. Remodeling of the extracellular matrix has been found to directly affect tumor growth, development, and progression [31]. In addition, transforming growth factor beta (TGF-β), the main pathway among the other functional terms, has been evaluated as a prognostic or predictive marker for cancer patients [32] (lower half of Figure 4).
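For reference, the enrichment test mentioned at the start of this paragraph can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The one-sided form, the function name, and the counts used below are assumptions for illustration; no multiple-testing correction is shown.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: number of genes in the background
    K: background genes annotated with the functional term
    n: number of genes in the gene set
    k: gene-set genes annotated with the term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 10 background genes, 4 annotated with the term,
# a gene set of 3 genes of which 2 carry the annotation.
p_value = enrichment_pvalue(k=2, n=3, K=4, N=10)
```

A term would then be called enriched when its p-value, after correction across all tested terms, falls below the chosen threshold.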