Machine Learning and Integrative Analysis Reveals Stemness Index-associated Signature in Acute Myeloid Leukemia

Background: The rapidly growing research in leukemia has provided unequivocal proof that cancer stem cells exist in acute myeloid leukemia. However, the mechanisms of the growth and self-renewal of acute myeloid leukemia stem cells (LSCs), which play an important role in the development of AML, were still elusive. Methods: We rst used a trained stemness index model based on a recently described machine learning method one-class logistic regression (OCLR) to score AML samples. We then obtained the mRNA expression based-index (mRNAsi), DNA methylation based-index (mDNAsi), and epigenetic regulation based-index (EREG-mRNAsi). With the acquired three stemness indices, we investigated the relationship between the stemness index and the tumor immune microenvironment/somatic mutation, clinical characteristics of AML patients. Furthermore, a weighted gene co-expression network analysis (WGCNA) and protein-protein interaction (PPI) network were performed to obtain key genes associated with the stemness index. Bedsides, GO and KEGG enrichment analysis was performed to explore the biological functions. Finally, we validated the key genes by applying it to the Oncomine and GEO datasets. Result: The results suggested that EREG-mRNAsi might accurately predict overall survival (OS) in AML patients from TCGA, while the mRNAsi and mDNAsi was not signicantly statistically associated with the OS. Furthermore, the mRNAsi was signicantly correlated with the clinical characteristics of the FAB category, age, gender, immune score, WT1, and CEBPA mutation. Additionally, the EREG-mRNAsi was signicantly associated with the cytogenetics risk category, TP53 mutation. The mDNAsi was signicantly associated with the FAB-category, NPM1, IDH1, IDH2, and DNMT3A mutation. Several key genes related to the stemness index of AML were indicated to be possible and promising therapeutic targets for inhibiting LSCs in AML. Conclusions: Our research based on machine learning and WGCNA analysis revealed a set of key genes contributed to the stemness index of AML. These genes may become therapeutic targets to inhibit the LSCs


Background
Although most patients initially respond to chemotherapy, half of the patients' relapse and some patients succumb to the disease within 5 years of diagnosis. Relapse after the achievement of remission from standard induction chemotherapy is the key barrier to cure acute myeloid leukemia (AML), and some patients have no response to induction therapy [1]. Many AML patients show evidence of a hierarchical cellular organization, with a minor fraction of self-renewing leukemia stem cells (LSCs) at the apex of this hierarchy which supports the disease to outgrow and maintain [2][3][4][5]. Stemness, de ned as the potential for self-renewal and differentiation from the cell of origin, was originally attributed to normal stem cells that possess the ability to give rise to all cell types in the adult organism [6]. LSCs were leukemia cells that possess the ability to give rise to all tumor cell types [7,8]. AML is a heterogeneous disease that resides within a complex microenvironment, which composed a diverse, integrated ecosystem of undifferentiated cells, myeloid cells, erythroid cells, lymphoid cells, and LSCs [4]. LSCs were considered to be responsible for AML growth, maintenance, and clinical outcomes, were often resistant to conventional chemotherapy [9][10][11][12][13]. Therefore, it is necessary to explore the molecular mechanism of LSCs and nd the potential biomarkers for the diagnosis and treatment of AML.
In the present research, we integratively analyzed cancer stemness in a cohort of AML samples (n = 151) from The Cancer Genome Atlas (TCGA) database. First of all, we applied a trained stemness index model based on the previously described one-class logistic regression (OCLR) machine-learning method [6], which contains mRNA expression-based signature, DNA methylation-based signature, and epigeneticbased signature, to quantify AML stemness to acquire three independent stemness indexes. mRNA expression based-index was re ective of gene expression; DNA methylation based-index was re ective of DNA methylation, and the epigenetic regulation based-index was re ected in the epigenetic features. Then, we evaluated correlations between the three stemness index and clinical and molecular features, and we identi ed a stemness signature that might be useful for guiding the prognostic status of AML. Besides, we found the genes related to stemness indexes by weighted gene co-expression network analysis (WGCNA) [14][15][16]. Furthermore, we assessed the key genes by Gene Ontology (GO) enrichment analysis [17] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [18]. Finally, we validated the key genes in GEO and Oncomine database.

Data download and processing
In this research, we download the RNA sequencing data and the corresponding clinical data of AML from the TCGA database (https://portal.gdc.cancer.gov/). Due to the TCGA do not contains normal samples of AML, we download all data of normal whole blood samples from GTEx Analysis Release V8 (https://www.gtexportal.org/home/). A complete description of the donor genders, multiple ethnicity groups, wide age range, the biospecimen procurement methods, and sample xation was described in GTEx o cial annotation. By using data from Ensembl (https://uswest.ensembl.org/index.html) [19], we transferred the ensemble ids to gene symbols. The expression levels of 337 normal whole blood samples and 151 AML samples were presented by fragments per kilobase of transcript per million mapped reads (FPKM) normalized estimation and log2-based transformation. Then, "limma" package version 3.45.0 (http://www.bioconductor.org/) [20] was used in R language, version 4.0.0 (https://cran.r-project.org/) to merge the two datasets into one with normalization and the differentially expressed genes (DEGs) were calculated, and the genes with FDR < 0.05, log|Fold change|>1.4 were selected as the signi cant DEGs. To remove the batch effect, we used the Combat function in R package "sva" [21].
Machine learning based-identi cation of stemness index for AML The stemness index was calculated by building a predictive model using an OCLR algorithm on pluripotent stem cell samples from the Progenitor Cell Biology Consortium dataset (PCBC) [22]. AML samples were strati ed by the mRNAsi, mDNAsi, and EREG-mDNAsi, which is based on the stemness index model, were utilized for the integrative analyses. The detailed methodologic information can be referred to as a previous report by Malta et al. [6].

Relationship of stemness index and clinical characteristic/somatic mutation/immune microenvironment
The prognostic roles of the stemness index were identi ed in the R package survival, and the correlation between the stemness index and clinical characteristic was also analyzed in R. To investigate the relationship of the somatic mutation and stemness index, we used the R package "maftools" [23]. By using CIBERSORT (a gene expression-based deconvolution algorithm) [24], we scored 22 immune cell types for their relative abundance in the AML samples. For the TCGA AML sample, we computed the associations between the stemness index and the estimated fractions of individual immune cell types. By applying ESTIMATE [25], we calculated the individual immune score to predict the level of in ltrating immune cells in the AML sample. We calculated the correlations between the stemness index and immune score or PD-L1 expression.

Construction of weighted gene co-expression network analysis
The co-expression network was constructed based on the DEGs by the "WGCNA" package (version 1.69) [14] in R. First, the normal samples were deleted and the cases were checked for missing data. Second, the cases were clustered according to the gene expression levels, reducing the outlier. Merging the data with the mRNAsi, mDNAsi, and EREG-mRNAsi data, the prepared input was used in the following analysis. For a scale-free gene co-expression network, an appropriate parameter β value was selected as a soft threshold to construct the adjacency matrix. A topological overlap matrix (TOM) was then created from the adjacency matrix to measure the network connection between genes, de ned as the sum of adjacent genes, with a minimum of 50 genes per gene module, and module dissimilarity was calculated to construct a dendrogram. We selected two modules with the highest correlation of mRNAsi, and performed correlation analysis for each merged module, ltering out the key genes correlating with both the module and mRNAsi index.

Pathway enrichment and functional annotation of key genes
We selected the ClusterPro ler package [26] in R to construct Gene Ontology (GO) function annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis to visualize the biological function of key genes. A p-value < 0.05 was considered statistically signi cant.

Construction of protein-protein interaction (PPI) network analysis and Co-expression analysis of key genes
We applied an online tool, Search Tool for the Retrieval of Interacting Genes (STRING), to assess the protein-protein interaction (PPI) among key genes. Then, by using the Corrplot package in R to calculate the Pearson correlations at the transcription level, we identi ed the co-expression relationship between key genes.
Validated key genes in GEO and Oncomine database We selected the GEO datasets (www.ncbi.nlm.nih.gov/geo/), GSE76008, which was used to verify the mRNA expression of key genes. The Oncomine database (http://www.oncomine.org) was selected to validate the key genes in the pan-cancer. GSE12417 and GSE37642 were used to identify the overall survival of key genes.

Statistical analysis
All statistical analysis was carried out in R version 4.0.0 and corresponding packages. The Wilcox test function in R was applied to evaluate the difference in the stemness index between the two groups. The Kruskal test in R was used to test the correlation between the stemness index and two or more independent samples. For all data, *P < 0.05, **P < 0.01, ***P < 0.001.

Results mRNAsi, EREG-mRNAsi, and mDNAsi in AML
To obtain a stemness index, we rst employed a previously existing stemness index model based on the OCLR machine learning method. We then obtained three stemness indices, including mRNAsi, EREG-mRNAsi, and mDNAsi. The mRNA expression based-index was re ective of gene expression, the DNA methylation based-index mDNAsi was re ective of DNA methylation, and the epigenetic regulation basedindex EREG-mRNAsi was re ected in the epigenetic features. To explore the role of these three stemness indices in AML overall survival, by the Kaplan-Meier survival analysis, we observed the mRNAsi and mDNAsi were not associated with the overall survival of AML (Fig. 1a, b), but in patients with lower EREG-mRNAsi had better overall survival than that higher EREG-mRNAsi (Fig. 1c). Additionally, we evaluated the correlation between mRNAsi, mDNAsi, EREG-mRNAsi, and the clinical characteristics of AML. In terms of FAB-category, the results showed that there was a statistical correlation between mRNAsi, mDNAsi, and FAB-category (Fig. 1d, g). Moreover, mRNAsi was associated statistically with gender (Fig. 1e), age (Fig. 1f). According to the risk-category of AML, risk-category was statistically signi cant (Fig. 1h). These results re ected that the stemness index was tightly associated with LSCs in AML.

Relationship of stemness index and somatic mutation in AML/immune microenvironment
Analyzing the somatic mutation data in AML, we found a statistical association with mRNA and somatic mutation in AML, including WT1 mutation and CEBPA mutation (Fig. 2a, b). Moreover, mDNAsi was positively correlated to NPM1 and DNMT3A mutation (Fig. 2c, f); but IDH-1 and IDH2 mutation were negatively associated with the mDNAsi (Fig. 2d, e). In Fig. 2 g, the EREG-mRNAsi of TP53 mutation was higher than those of the TP53 wild type. To identi cation of the relationship of stemness index and immune microenvironment, we calculated the correlation between the stemness index and individual types of immune cells, PD-L1 expression, and immune score (Fig. 3).
Key modules related to stemness index in AML We identi ed 5276 differential expression genes by R package limma, including downregulated 2648 genes and upregulated 2628 genes ( Figure S1a, 1b). Then, all DEGs were used to construct a weighted gene co-expression network to identify key modules that contained key genes related to the stemness index. In our research, we choose the soft threshold power of β was 7 (scale-free R 2 = 0.88) to establish the scale-free network and the gene cluster tree was acquired to identify the modules related to the stemness index of AML according to the topological overlap value (Fig. 4a). The Module-trait relationships were obtained by the correlation between modules and clinical phenotypes, which made it easier to identify highly correlated modules and stemness index (Fig. 4b). The pink modules had a signi cantly positive correlation with mRNAsi (MEpink: r = 0.79, p = 7e-28), then was the tan modules (MEtan: r = 0.69, p = 3e-19); EREG-mRNAsi had the highest association with the pink modules (MEpink: r = 0.55, p = 2e-11). Additionally, the red module was the highest negative correlation with the pink modules (MEred: r=-0.45, p = 9e-08), and the scatter diagrams showed the signi cant genes in modules related to the stemness index (Fig. 4c-f).

Analysis of key genes in key modules
To explore the genes in key modules, the standard of screening key genes in modules was de ned as cor.MM > 0.8 and cor.GS > 0.5. Thus, the pink and tan modules were selected for subsequent analyses to nd key genes. Finally, we identi ed 25 key genes (DKC1, TRMT10C, HSP90AB1, CCT2, GART, NPM1, UBA2, NOP58, CCT4, NIFK, and so on) in the pink modules and 29 key genes (CCNA2, RAD51AP1, CLSPN, TOP2A, ORC1, XRCC2, UBE2T, TUBB, and so forth) in the tan modules. We further extracted the key genes expression matrix to plot boxplot (Fig. 5a, b) and heatmap ( Figure S2, S3), which showed the key genes were signi cantly upregulated in AML. We also identi ed the correlation of the key genes, which demonstrated the relevance of the genes in the pink and tan module (Fig. 5c, d). The protein-protein interaction relationships between key genes were analyzed using STRING, and a wide-ranging and strong relationship between key genes was shown in gure (S4a, b).

Functional annotation of key genes
The Gene Ontology (GO) and KEGG analyses were applied to complete the functional enrichment analysis of key genes. As gure(6a) showed that the key genes in the pink module mainly enriched in ribonucleoprotein complex biogenesis, RNA localization, ribosome biogenesis, regulation of protein localization to nucleus, positive regulation of telomerase activity, and so forth, while in the tan module the key genes enriched in DNA replication, DNA conformation change, nuclear division, organelle ssion, and so on (Fig. 6c). In gure(6b), the key genes of the pink module enriched in Ribosome biogenesis in eukaryotes, Purine metabolism, One carbon pool by folate, and so on, while in the tan module the key genes were associated with cell cycle, DNA replication, Mismatch repair, Nucleotide excision repair and so forth (Fig. 6d).
Prognostic values of key genes in AML To gain an incisive view of key genes in leukemia stem cells, we downloaded a dataset GSE76008 [13] from GEO to analyze their expression levels in LSCs-and LSCs + samples. In gure S5, the majority of key genes were differentially expressed in LSCs-and LSCs + AML samples, which showed that these genes were enriched in the LSCs + group. And we further explored the key genes from the pink ( Figure S6a) and tan ( Figure S6b) module in the Oncomine database, which revealed that expression levels of the key genes in tumor and normal samples differed greatly in many types of tumor. To investigate the role of these key genes in AML, we conducted a survival analysis of these genes by using the Kaplan-Meier curve in two GEO datasets GSE12417 and GSE37642. In Fig. 8, we can see the genes (DSCC1, HSPD1, LRPPRC, MRPL3, NCAPG, NOCL1, NPM1, RCF3, UBA2) were signi cantly associated with the overall survival; in gure S7, the key genes (CEBPZ, CLSPN, TYMS, MCM4, DSCC1, GART, HSPD1, TUBB, UBA2, CCT2) were statistically signi cant with the overall survival.

Discussion
In this study, by applying a newly reported stemness index model-based OCLR machine-learning method to the AML samples, we obtained three distinct stemness indexes and then employed these indexes to assess the stemness features of AML based on their molecular and clinical information. Regarding the relevance of the stemness index in prognosis, there was a statistical correlation between EREG-mRNAsi and the overall survival of AML patients. However, mRNAsi and mDNAsi were no statistical correlation with the overall survival of AML patients. Using CIBERSORT and ESTIMATE analysis, we obtained insight into the interaction of the stemness index and the immune microenvironment. Moreover, we identi ed 16 prognostic genes that could effectively predict the survival of AML patients and revealed the signi cant correlations between the stemness index and the somatic mutation in AML, including TP53, IDH1, IDH2, DNMT3A, NPM1, CEBPA, WT1 mutation. These results imply that AML patients might have a high degree of heterogeneity concerning the stemness phenotype.
In colon and breast cancer, stem cell signature expression correlates inversely with patient survival [27,28]. Similarly, stem cell signatures shared by leukemia and hematopoietic stem cells predict clinical outcomes in AML patients [10]. These studies revealed that for multiple tumors, including AML, patients whose cancer exhibits higher expression levels of stem cell genes experience signi cantly worse clinical outcomes. In this study, we identi ed 5276 DEGs and used WGCNA to classify them into 14 gene modules based on their correlation with the stemness index. GO and KEGG functional enrichment analysis showed that 54 key genes were mainly involved in DNA replication, DNA conformation change, nuclear division, positive regulation of the establishment of protein localization to the telomere, and positive regulation of telomerase activity, which indicated the genes mainly involved in cell proliferation, cell cycle. The result was consistent with the in nite proliferation characteristic of stem cells. Many human tumor cancers, including AML, were characterized by robust telomerase activity relative to the normal cellular counterpart [29,30], and telomerase had been recorded for the development of chronic myeloid leukemia induced by BCR-ABL [31]. These ndings had identi ed telomerase as a potential therapeutic target in AML [32].
To verify the accuracy of our analysis results, stemness-related key genes in multiple tumor tissues were overexpressed in the Oncomine and GEO database. Among all key genes, 9 genes (UBA2, RFC3, NPM1, NOLC1, NCAPG, MRLP3, DSCC1, HSPD1, LRPPRC) from the GEO dataset GSE12417 and 10 genes (CCT2, UAB2, TUBB, HSPD1, GART, DSCC1, MCM4, TYMS, CLSPN, CEPBZ) from GSE37642 were signi cantly correlated with the overall survival of AML patients. After a comprehensive literature review in Pubmed, we found RCF3 may have a role AML by promoting the G1/S progression and was required for chromatin loading of proliferating cell nuclear antigen (PCNA) [33], which was the key genes in the tan module. MCM4 is part of the complex of MCM2-7 that is important for licensing of DNA origins before the S phase and also serves as the core of the replicative helicase that unwinds DNA at replication forks [34]. Thymidylate synthase (TYMS) played an important role in the G1/S transition; knockdown of TYMS reduced proliferation, promoted apoptosis, facilitated cell cycle progression from G1 to S phase [35].
These key genes were mainly involved in DNA replication, mitotic nuclear division and chromosome segregation.
Furthermore, we identi ed the association between somatic mutation, immune microenvironment, and the stemness index. We found mRNAsi had a negative association with the immune score for AML patients, suggesting that LSCs in AML patients may repress immune cells by affecting the activation of immune cells, which was similar to Van Galen's results [29]. The absence of PD-L1 expression in AML might explain partly why the stemness index had no signi cant associations with PD-L1 expression, indicating that the potential of immunotherapy with PD-L1 inhibitors seems limited in AML patients, which is consistent with the clinical data. Analyzing somatic mutation data, we identi ed high mRNAsi was associated positively with mutation in WT1 and CEBPA. High mDNAsi and low mDNAsi were statistically correlated NPM1, DNMT3A mutation, and IDH1 and IDH2 mutation respectively. Besides, high EREG-mRNAsi was associated with TP53 mutation. This result is similar to the previous report that TP53 mutation clones isolated from the AML patients cells also show increased chemoresistance potential and have a higher population of LSCs maker positive cells [36]. However, the mechanisms underlying our ndings have not been elucidated here, further improvements were expected to result in a profound understanding of the mechanism details of the key genes that need various basic experiments to further explore in AML patients.

Conclusions
In summary, a total of 54 genes were identi ed to be closely correlated to the AML stemness index and 16 genes showed signi cant prognostic values of the overall survival. Besides, the main signaling pathway related to stemness in AML was DNA replication, cell cycle, and telomerase activity, which suggests that the key genes in these pathways might be therapeutic targets to eradicate the LSCs in AML. Since the conclusions were based on bioinformatic analysis, further biological and basic researches were needed to validate our results in the future. The data used to support the ndings of this study are included in the article. The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests. Authors' contributions: QX and HZ, drafting and revision of the submitted article.
DC and BF, constructive suggestions.
SY and ZL, critically reviewing.
YH and TG design of the study and revision of the submitted article.