Significant somatic mutations in LUAD
According to the gene mutation data of 515 LUAD patients in The Cancer Genome Atlas (TCGA) (29), we identified 20 significant somatic mutated genes (SMGs) by MutSigCV (q-value < 0.1, Figure 1). Most of these SMGs like TP53, KRAS, KEAP1, STK11, EGFR, NF1, BRAF, SMARCA4, etc. have already been identified in the other studies, especially the study conducted on the previously collected 230 TCGA-LUAD patients (10). The significant mutations of COL11A1 (q-value = 8.8e-06, mutation rate = 21%) is rarely identified in the previous LUAD studies. However, expressions of COL11A1 have shown associations with prognostic factors, pathological stages, and lymph node metastasis in non-small cell lung cancer (NSCLC) in some previous studies (30-33). The mutations of COL11A1 might also be one of the driven factors for LUAD.
To evaluate the most direct mutational effects, we compared the expressions of these SMGs in the mutated and wild type samples. We found that some of the SMGs can lead to significant expressional alterations. For instance, the mutations of RBM10 will significantly reduce its expressions while mutations of EGFR might improve the expression levels (Figure S1A). However, only three of the expressional alterations (SMAD4, CDKN2A and TCEAL5) can lead to significant prognosis influence (Figure S1B).
Genome-wide identification of prognosis relevant genes for LUAD
The genomic mutations can not only influence the functions of the mutated genes, but may also generate remarkable effects on down-stream cascades, thus leading to the final impacts on clinical phenotypes, like the prognostic outcomes. To gain a comprehensive understanding on the prognosis impacts, we attempted to get a genome-wide estimation of the prognosis relevant genes for LUAD based on a Cox proportional hazards (Coxph) model (see Materials and methods). Through this analysis, we observed that a great number of genes were associated with LUAD prognosis (Figure 2A). The high expression levels of some genes like USP4, DTNBP1 may contribute to better survival rates (these genes were termed as favourable ones), while some of the others (termed as un-favourable ones) such as AHSA1 and MESDC2 may lead to worse survivals (Figure 2A and 2B). AHSA1 is a co-chaperone of HSP90AA1, and a previous study has revealed that it is involved in the proliferation, migration, and invasion processes of osteosarcoma (34). Here, we observed that higher expression of AHSA1 was associated with a worse survival rate (Figure 2B). USP4 is a deubiquitinating enzyme which may inhibit p53 by deubiquitinating an important p53 ubiquitin ligase ARF-BP1 (35). Correspondingly, many studies have identified the oncogene effects of USP4 (36). Here, on the contrary, we found that higher expressions of USP4 in LUAD may lead to better prognosis. This may be caused by the highly heterogeneity of cancers and the alternative deubiquitinating substrates for USP4. Consequently, the prognosis effects of these identified genes may be conditional.
Meanwhile, although more favourable genes were observed than un-favourable ones, both kinds of genes exhibited a large exploring space (Figure 2C). This indicates the complicated molecular patterns underlying LUAD pathology and implies the importance to identify the most meaningful genes which may play the master roles in regulating the disease progression.
Construction of a potential causal regulatory network for prognosis-relevant genes
To understand the relations among all of the prognosis relevant genes and to help identify the master prognosis-influencing genes, a potential causal regulatory network was estimated (see Materials and methods). Nodes in the network were all with significant prognostic impacts (P<0.01 and | coef | > 0.2 based on Cox survival analysis) (Figure 3A). The causal structure of this network was inferred from the transcriptional expressions of genes across TCGA-LUAD patients, where directed edges describe the identified direct causal effects, and bi-directed edges represent uncertainty in the constructed network (23). According to this network, we found that most of the nodes with the same type of prognosis impacts (either favourable or un-favourable) gathered together while nodes with reversed impacts were relatively separated in the network. Since the network described the potential causal structure, the survival impacts of many nodes may be in-directly generated from their up-stream or down-stream items. Consequently, it is more important to identify which prognosis-relevant genes play the master roles rather than just examining the prognosis relevance.
To distinguish the master regulators, we further investigated on the node importance in the causal network. Here we calculated the summarized node degree by subtracting the number of in-coming edges from the number of out-coming edges for certain node. We found that the summarized degrees for a large portion of nodes were within the range of -5 to 5 (Figure 3B). The other nodes which were with larger absolute values of summarized degrees were hub nodes of the network (only names of the hub nodes were displayed on Figure 3A). These nodes are more likely to play master roles in influencing LUAD prognosis, since the expression alterations of multiple prognosis relevant genes were highly associated with these hub nodes. For instance, GAPDH, CCNA2, WWP2 and PSMD2 were all hub nodes (summarized degree >5 or summarized degree <-5), and they may lead to remarkable prognosis influences by either passing its alteration to a series of down-stream partners (see sub-graphs for GAPDH and CCNA2, Figure 3C) or gathering the influences from a collection of up-stream elements (see sub-graphs for WWP2 and PSMD2, Figure 3C). The network structure may also indicate potential regulatory relationships between nodes. GAPDH, ALDOA and PKM2 all play functions in glycolysis (37), here, we also observed that they aggregated in the same sub-graph, implying that the data-driven network may suggest meaningful biological associations.
Clustering analysis reveals LUAD-subtypes
To understand the whole molecular and prognostic impact generated by the above identified master prognosis impacting genes, we clustered the LUAD patients into two groups (termed as C1 and C2 respectively) based on the mRNA-level expressions of all of the hub nodes in the causal regulatory network. We further evaluated the importance of each master gene, the top-10 important master genes for the two subtypes included CCNA2, CBX7, TMEM48, SPC25, GAPDH, WDHD1, PSMD2, ERO1L, DDX52, ARNTL2 (Figure 4A, only the top-50 important genes were shown in the central heat map). These two clusters showed significantly different expressional patterns, especially for genes in the mTOR signaling pathway, lysosome, and PPRA signaling pathways (Figure 4B).
Meanwhile, potential SMGs which may be related with the expressional differences between the two identified clusters were also identified. Mutations of SMARCA4, KEAP1, TP53 or COL11A1 were significantly enriched in C1 (Figure 4A). These mutations were related with the significantly differential expressions for specific genes across the two clusters. For instance, GAPDH was highly expressed in C1, and its expression levels in patients with mutations in TP53, KEAP1 or SMARCA4 were significantly higher than wild type patients (Figure 4A). Notably, we also observed that patients in C1 were with significantly worse prognosis than C2 (Figure 5A, P-value = 2.9*10-8), and the significance was more remarkable than those obtained from random gene sets with the same number of genes (Figure 5B). Taken together, the master prognosis regulating genes help identify two meaningful subtypes for LUAD which showed significantly differential patterns in genomic and transcriptional levels.
Validating the prognosis differences between the identified LUAD subtypes
To estimate the reliability and robustness of the identified LUAD subtypes, we utilized additional two GEO datasets to further validate the expressional and prognostic patterns of these two subtypes. Firstly, a subtype predictor was trained based on the expressional profiles of the identified top-10 important master genes in the TCGA-LUAD cohorts (see details in Materials and methods). Based on this predictor, other patients having similar expressional profiles with C1 or C2 will be annotated with the corresponding sub-type labels. Then, this predictor was applied on two independent LUAD cohorts (GSE30219 and GSE31210). Thus, patients in the independent cohort were also annotated into the two subtypes (C1 and C2) based on their expressional profiles, and survival analysis showed these two groups were also with significantly differential survival outcomes just as observed in TCGA-LUAD, where C1 was with significantly worse prognosis than C2 (Figure 5C and 5D). This verifies the robustness of the identified expression-based LUAD subtypes in independent cohorts.
Promising Drugs for the identified LUAD-subtypes
Since the survival differences were highly related with the expressional alterations in the identified clusters, drugs targeting on the most important genes for classifying the two subtypes may generate distinctive effects on the two subtypes. Based on this hypothesis, we attempted to obtain potential drug-gene interactions for the top-30 important genes (see Materials and methods). As results, we found a collection of drugs like Estriol, Ethinyl and Folic acid (28) may be associated with genes like GAPDH, CCNA2 and PSMD2 (Figure 5E, and three of them were among the top-10 important genes for the two subtypes) which may be highly contributable to the survival differences between the two identified subtypes. For LUAD patients, more attention should be paid on these drugs since patients from different subtypes may have distinctive responses to these drugs.