Results of WGCNA
The clinical and genetic features of included AML cases in BeatAML and TCGA database were shown in Table 1. The expression data of 421 non-APL AML patients were inputted into WGCNA. No outliers were detected after all samples were hierarchically clustered using average distance and Pearson’s method, the dendrogram for which was shown in Supplementary Figure 1. The lowest soft threshold power = 6, by which the scale free R2 > 0.85 (Figure 1A). The calculation based on TOM-based dissimilarity divided the whole gene set into 37 gene modules (Figure 1B), after we merged the modules with dissimilarity less than 20% by setting the mergeCutHeight as 0.20. 400 randomly selected genes were grouped into modules and generate the heatmap of topological overlap (Figure 1C), indicating high topological overlap degree of co-expression network in individual modules. Moreover, the correlation of co-expressed modules was demonstrated by module eigengenes adjacency heatmap (Figure 1D). Finally, the relationship of modules and traits was shown in Figure 1E. whereas we noticed that the status of gene mutation combinations will also be of prognostic value rather than the single gene, such as NPM1 and FLT3. Therefore, we performed correlation analysis for modules with different combinations of NPM1/FLT3-ITD mutation status (Figure 1F). 3 pairs of module-trait were identified to be significant correlation, and were studied sequentially, including ‘lightyellow’ module (R2 = 0.41/ p = 2e-18 for FLT3-ITD, R2 = 0.66/ p = 4e-53 for NPM1 mutation), ‘saddlebrown’ module (R2 = 0.6/ p = 2e-43, for RUNX1 mutation) and ‘lightgreen’ module (R2 = 0.41/ p = 2e-10, for TP53 mutation). The ‘violet’ module was initially identified as significantly correlated with NPM1 mutation (R2 = 0.46/ p = 3e-23), while the following detail analysis on NPM1/FLT3-ITD combinations indicated that this module was not significantly related to any combination. So, the ‘violet’ module was excluded from further analysis. According to the criteria of hub genes, 973 hub genes were identified for further analysis.
The ‘lightyellow” co-expressed module
The ‘lightyellow’ module included 143 genes and was positively correlated with FLT3-ITD (R2 = 0.41, p = 2e-18) and NPM1 mutation (R2= 0.66, p = 4e-53) respectively. The further analysis on combinations of NPM1 mutation and FLT-ITD, the ‘lightyellow’ module was only found to be related to NPM1 mutation, regardless of the presence of FLT3-ITD. Despite the correlation coefficients didn’t meet the criteria, the potential negative correlations were detected for the ‘lightyellow’ module with ELN2017 risk stratification, complex karyotype, RUNX1-RUNX1T1, CBFB-MYH11, CEBPA biallelic mutation, ASXL1, RUNX1 and TP53 by p value less than 0.01.
The results of ORA for genes in the ‘lightyellow’ module were shown in Figure 2A. According to GO analysis, the genes were significantly enriched in biological processes like positive regulation of transcription, negative regulation of cell differentiation, etc. And the genes were enriched in molecular functions like RNA polymerase II regulatory region sequence-specific DNA binding, DNA binding, etc. Based on KEGG pathway analysis, the genes were enriched in transcriptional misregulation in cancer. The Reactome analysis indicated the genes were enriched in activation of HOX genes, activation of HOX genes during differentiation, etc.
Moreover, the genes of ‘lightyellow’ module were inputted into STRING analysis, resulting in the PPI network shown in Supplementary Figure 2. Top 10 genes with highest connectivity degrees among the PPI network were MEIS1, HOXA5, HOXA3, HOXA7, HOXA6, HOXA10, HOXB3, HOXA9, PBX3 and HOXB4 (Figure 2B).
The ‘saddlebrown’ co-expressed module
This module included 60 genes and was significantly correlated with RUNX1 mutation (R2 = 0.6/ p = 2e-43). The results of ORA for genes in the ‘saddlebrown’ module were shown in Figure 3A, meanwhile the relationship of enriched Reactome pathways were shown in Figure 3B. GO analysis revealed that genes were significantly enriched in the following molecular functions: cytokine-mediated signaling pathway, positive regulation of immune response, etc. The biological processes analysis indicated the genes were enriched in MHC protein complex binding, MHC class II protein complex binding, etc. Based on cell component analysis, the genes were enriched in endocytic vesicle, lysosomal membrane, etc. KEGG pathway demonstrated the genes were significantly enriched in hematopoietic cell lineage, antigen processing and presentation, phagosome, etc. According to Reactome analysis, the genes were enriched in interferon signaling, PD-1 signaling, etc.
Furthermore, the genes of ‘saddlebrown’ module were inputted into STRING analysis, resulting in the PPI network shown in Supplementary Figure 3. Top 10 genes with highest connectivity degrees among the PPI network (Figure 3C) included HLA-DRB5, HLA-DMA, HLA-DRB1, CD74, HLA-DQA1, HLA-DPB1, CIITA, HLM-DMB, HLA-DRA and HLA-DQB1.
The ‘lightgreen’ co-expressed module
This module included 352 genes and was significantly correlated with TP53 mutation (R2 = 0.41/ p = 2e-10). The results of ORA for genes in the ‘lightgreen’ module were shown in Figure 4A, meanwhile the relationship of enriched Reactome pathways were shown in Figure 4B. By GO analysis on biological processes, the genes were identified to be enriched in cell cycle phase transition, mitotic nuclear division, etc. Molecular function analysis based on GO database, indicated the genes were enriched in RNA binding, protein binding, etc. And cell component analysis according to GO database, revealed the genes were enriched in chromosome, ribosome, etc. KEGG pathway analysis demonstrated the genes were enriched in DNA replication, p53 signaling pathway, etc. In Reactome analysis, the genes were enriched in Mitotic G2-G2/M phases, transcriptional Regulation by TP53, etc.
Furthermore, the genes of ‘lightgreen’ module were inputted into STRING analysis, resulting in the PPI network shown in Supplementary Figure 4. Top 10 genes with highest connectivity degrees among the PPI network (Figure 4C) included CCNA2, CDK1, BUB1, NCAPG, KIF11, BUB1B, CCNB1, TOP2A, CDC20 and AURKA.
Results of LASSO penalized regression analysis
The hub genes were inputted into LASSO penalized regression analysis, to fit OS of AML patients to the prediction model. After 1000 times of iteration between training and testing set in TCGA AML cohort, an optimized model of 6 gene with non-zero coefficient were identified, including NFKB2, NEK9, HOXA7, APRC5L, FAM30A and LOC105371592. The prediction model for AML OS were established by the 6-gene expression signature, the coefficients of which were listed in Table 2.
The association of 6-gene expression signature with traditional risk factors of AML
The risk scores were calculated for 6-gene expression signature for the individual patients in TCGA AML cohort. The risk scores of patients in the adverse ELN2017 risk group, were significantly higher than that of favorable group (t = 2.041, df = 250, p = 0.0423, Figure 5A). meanwhile, AML patients with MDS history, had higher risk scores than that without MDS history (t = 2.268, df = 394, p = 0.0239, Figure 5B). The relapsed AML patients had higher risk scores than de novo AML patients (t = 2.347, df = 394, p = 0.0194, Figure 5C).
The validation of 6-gene signature model
The diagnostic utility was evaluated in the training set, testing set and 2 external independent validation sets. After the cut-off values for risk scores were calculated in each cohort and divided patients into low and high risk groups, the Kaplan-Meier plots were used to compare the OS between groups by log-rank test. The OS in low risk group was significantly longer than that in high risk group in all 4 sets (Figure 6A-D). In the TCGA training set, median OS was not reached in low risk group, and 8.08 months in high risk group (HR = 4.203, 95%CI 2.251-7.847, p < 0.0001). The median OS was also not reached in low risk group, and 9.30 months for high risk group (HR = 3.342, 95%CI 1.817-6.145, p = 1e-4). Similar results were uncovered in GSE12417 (HR = 3.188, 95%CI 1.720-5.910, p < 0.0001) and GSE37642 (HR = 2.489, 95%CI 1.658-3.737, p < 0.0001). The distribution of risk scores, survival time and gene-survival heatmap were shown for the training set (Figure 7) and the testing set (Figure 8). As AUC is a very crucial indicator for utility of a prognostic model, the time dependent ROC analysis was performed for the 4 set (Figure 9A-D). The 5-year AUC for the training and testing set were 0.822 and 0.824 respectively, whereas it is 0.79 and 0.743 in GSE12417 and GSE37642, respectively, which demonstrated the superiority and robustness in expression datasets generated from different platforms. The results of multivariate COX regression analysis were shown in Table 3, indicating the risk factor was an independent risk factor for OS of AML patients.