mRNAsi, EREG-mRNAsi, and mDNAsi in AML
To obtain a stemness index, we first employed a previously existing stemness index model based on the OCLR machine learning method. We then obtained three stemness indices, including mRNAsi, EREG-mRNAsi, and mDNAsi. The mRNA expression based-index was reflective of gene expression, the DNA methylation based-index mDNAsi was reflective of DNA methylation, and the epigenetic regulation based-index EREG-mRNAsi was reflected in the epigenetic features. To explore the role of these three stemness indices in AML overall survival, by the Kaplan-Meier survival analysis, we observed the mRNAsi and mDNAsi were not associated with the overall survival of AML (Fig. 1a, b), but in patients with lower EREG-mRNAsi had better overall survival than that higher EREG-mRNAsi (Fig. 1c). Additionally, we evaluated the correlation between mRNAsi, mDNAsi, EREG-mRNAsi, and the clinical characteristics of AML. In terms of FAB-category, the results showed that there was a statistical correlation between mRNAsi, mDNAsi, and FAB-category (Fig. 1d, g). Moreover, mRNAsi was associated statistically with gender (Fig. 1e), age (Fig. 1f). According to the risk-category of AML, risk-category was statistically significant (Fig. 1h). These results reflected that the stemness index was tightly associated with LSCs in AML.
Relationship of stemness index and somatic mutation in AML/immune microenvironment
Analyzing the somatic mutation data in AML, we found a statistical association with mRNA and somatic mutation in AML, including WT1 mutation and CEBPA mutation (Fig. 2a, b). Moreover, mDNAsi was positively correlated to NPM1 and DNMT3A mutation (Fig. 2c, f); but IDH-1 and IDH2 mutation were negatively associated with the mDNAsi (Fig. 2d, e). In Fig. 2 g, the EREG-mRNAsi of TP53 mutation was higher than those of the TP53 wild type. To identification of the relationship of stemness index and immune microenvironment, we calculated the correlation between the stemness index and individual types of immune cells, PD-L1 expression, and immune score (Fig. 3).
Key modules related to stemness index in AML
We identified 5276 differential expression genes by R package limma, including downregulated 2648 genes and upregulated 2628 genes (Figure S1a, 1b). Then, all DEGs were used to construct a weighted gene co-expression network to identify key modules that contained key genes related to the stemness index. In our research, we choose the soft threshold power of β was 7 (scale-free R2 = 0.88) to establish the scale-free network and the gene cluster tree was acquired to identify the modules related to the stemness index of AML according to the topological overlap value (Fig. 4a). The Module-trait relationships were obtained by the correlation between modules and clinical phenotypes, which made it easier to identify highly correlated modules and stemness index (Fig. 4b). The pink modules had a significantly positive correlation with mRNAsi (MEpink: r = 0.79, p = 7e-28), then was the tan modules (MEtan: r = 0.69, p = 3e-19); EREG-mRNAsi had the highest association with the pink modules (MEpink: r = 0.55, p = 2e-11). Additionally, the red module was the highest negative correlation with the pink modules (MEred: r=-0.45, p = 9e-08), and the scatter diagrams showed the significant genes in modules related to the stemness index (Fig. 4c-f).
Analysis of key genes in key modules
To explore the genes in key modules, the standard of screening key genes in modules was defined as cor.MM > 0.8 and cor.GS > 0.5. Thus, the pink and tan modules were selected for subsequent analyses to find key genes. Finally, we identified 25 key genes (DKC1, TRMT10C, HSP90AB1, CCT2, GART, NPM1, UBA2, NOP58, CCT4, NIFK, and so on) in the pink modules and 29 key genes (CCNA2, RAD51AP1, CLSPN, TOP2A, ORC1, XRCC2, UBE2T, TUBB, and so forth) in the tan modules. We further extracted the key genes expression matrix to plot boxplot (Fig. 5a, b) and heatmap (Figure S2, S3), which showed the key genes were significantly upregulated in AML. We also identified the correlation of the key genes, which demonstrated the relevance of the genes in the pink and tan module (Fig. 5c, d). The protein-protein interaction relationships between key genes were analyzed using STRING, and a wide-ranging and strong relationship between key genes was shown in figure (S4a, b).
Functional annotation of key genes
The Gene Ontology (GO) and KEGG analyses were applied to complete the functional enrichment analysis of key genes. As figure(6a) showed that the key genes in the pink module mainly enriched in ribonucleoprotein complex biogenesis, RNA localization, ribosome biogenesis, regulation of protein localization to nucleus, positive regulation of telomerase activity, and so forth, while in the tan module the key genes enriched in DNA replication, DNA conformation change, nuclear division, organelle fission, and so on (Fig. 6c). In figure(6b), the key genes of the pink module enriched in Ribosome biogenesis in eukaryotes, Purine metabolism, One carbon pool by folate, and so on, while in the tan module the key genes were associated with cell cycle, DNA replication, Mismatch repair, Nucleotide excision repair and so forth (Fig. 6d).
Prognostic values of key genes in AML
To gain an incisive view of key genes in leukemia stem cells, we downloaded a dataset GSE76008  from GEO to analyze their expression levels in LSCs- and LSCs + samples. In figure S5, the majority of key genes were differentially expressed in LSCs- and LSCs + AML samples, which showed that these genes were enriched in the LSCs + group. And we further explored the key genes from the pink (Figure S6a) and tan (Figure S6b) module in the Oncomine database, which revealed that expression levels of the key genes in tumor and normal samples differed greatly in many types of tumor. To investigate the role of these key genes in AML, we conducted a survival analysis of these genes by using the Kaplan–Meier curve in two GEO datasets GSE12417 and GSE37642. In Fig. 8, we can see the genes (DSCC1, HSPD1, LRPPRC, MRPL3, NCAPG, NOCL1, NPM1, RCF3, UBA2) were significantly associated with the overall survival; in figure S7, the key genes (CEBPZ, CLSPN, TYMS, MCM4, DSCC1, GART, HSPD1, TUBB, UBA2, CCT2) were statistically significant with the overall survival.