The core genome M5C plays an important role in methylation modi cation and immune in ltration of acute myelocytic leukemia samples

Qiang Wen Cancer Hospital of University of Chinese Academy of Sciences, Zhejiang Province, China ShouJun Wang HangZhou FuYang Hospital of Traditional Chinese Medicine,Zhejiang Province, China Lili Hong The First A liated Hospital of Zhejiang Chinese Medical University, Hang Zhou, China Xianfu Sheng The First A liated Hospital of Zhejiang Chinese Medical University, Hang Zhou, China Xiaofen Zhuang HangZhou FuYang Hospital of Traditional Chinese Medicine,Zhejiang Province, China Yu Chen Hangzhou Medical College, Hang Zhou, China haifeng zhuang (  zhuanghaifeng5@163.com ) The First A liated Hospital of Zhejiang Chinese Medical University, Hang Zhou, China


Introduction
AML is one of the most common hematological malignancies in adults, which is highly heterogeneous, and the biological characteristics of AML originated from different stages of progenitor cells are not the same (1). Physical factors (such as X-ray, gamma ray and other ionizing radiation), chemical factors (such as occupational benzene exposure, long-term use of alkylating agent, etc.), and genetic factors are all the high-risk factors of AML (2). However, how these factors lead to the occurrence of AML,and its speci c pathogenesis are not fully understood.
There are more than 100 chemical modi cations carried by RNA in eukaryotes, about 60% of which are RNA methylation modi cation (3). Among them, m5C is the most common methylation modi cation ,and highly abundant and stable in tRNA and rRNA. Methylation of m5C RNA plays an important role in regulating total protein synthesis and cell fate (4). Therefore, activation of RNA methylation or inhibition of tRNA cleavage is essential for survival of tumor-initiating cells in response to cytotoxic stress (5). Although m5C has been found to be associated with the development of different types of tumors, its relationship with AML is poorly understood. In the process of this study, we hope to investigate the role of m5C methylation-related genes in acute leukemia,by creating a new m5C-score model to de ne high and low risk groups for AML, which shed more lights on the AML prognostic mechanism at the molecular level.

Data extraction and processing
All the data we used in our study are publicly accessible at TCGA and NCBI GEO (accession number: GSE37642, GSE12417) database (search terms: TCGA-LAML). The UCSC cancer browser (https://xenabrowser.net/datapages/? cohort=GDC%20TCGA%20Acute%20Myeloid%20Leukemia%20(LAML)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) was also utilized to download CNV, clinical follow-up, TCGA RNA-Seq and SNP 6.0 chip data. The download date is November 3, 2020. Additionally, mutation annotation le (MAF) was collected based on the GDC client, GSE37642 (6)and GSE12417(7) expression patterns, and clinical follow-up data were obtained from GEO database. 417 AML samples who had su cient follow-up data were screened for TCGA RNA-Seq data, and randomized as two groups, namely, test set (n=209) and training set (n=208); in addition, both GSE37642 (n=417) and GSE12417 (n=163) data sets were adopted to be external veri cation sets Figure1 . Table 1 shows the details of samples of every group. For GEO data processing, we rst download the mini format les from the GEO platform. According to the background le, the probe ID is converted to the gene symbol. The average value of multiple genes corresponding to a single probe was calculated, and the probes correspond to multiple genes were eliminated. Then, the expression spectrum matrix was further normalized.

Univariate Cox proportional hazard regression analysis
Univariate Cox proportional hazard regression analysis was conducted in order to identify genes whose expression levels were markedly correlated with patient overall survival (OS) in training set, at the threshold of P<0.01.
Analysis of CNV data GISTIC has been extensively adopted for detecting focal and broad (probably overlapping) recurrent events (8). Consequently, the GISTIC 2.0 was adopted for CNV data from TCGA, so as to found the signi cantly deleted or ampli ed genes, at the thresholds of p<0.05 and fragments that had >0.1 deletion or ampli cation length.

Analysis of gene mutation
For identifying genes with signi cant mutation, Mutsig 2.0(9) software was used for recognizing those genes with signi cant mutation based on the MAF of TCGA mutation data, at the threshold of P<0.05.

Prognosis-related gene signature construction
First of all, the lasso cox regression was adopted for re ning those above-identi ed prognostic genes using the glmnet function of R package (10). Secondly, the MASS function of R package was utilized to carry out stepwise regression analysis in accordance with the Akaike information criterion for obtaining the eventual 5-gene risk model. Typically, the formula was shown below: RiskScore = 0.2283059*ITGA4 0.1575680*IGLL1 0.2686156*LAPTM4B 0.1220958*HIST1H2AE 0.1472148*HOPX Then, the risk score values were z-score normalized, and samples with the processed z-score value of >0 were classi ed as the high-risk group, while those of <0 were the low-risk group.

Functional enrichment analyses
The cluster Pro ler (v3.8.1)(11) was adopted to perform Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) pathway enrichment analyses on genes, so as to recognize those enriched KEGG pathways and GO terms among the three categories, including cellular component (CC), molecular function (MF), and biological processes (BP). Notably, the false discovery rate (FDR) value of < 0.05 indicated statistical signi cance. The expression matrix of genes between different samples was rst converted into the expression matrix of gene sets, to evaluate which metabolic pathways were enriched. The correlations between the risk score and pathways were further calculated using Pearson correlation analysis. Signaling pathways with correlation coe cient of >0.35 were considered to be related to the risk score.

Statistical methods
The median risk score of every data set was adopted to be the threshold to plot the Kaplan-Meier (KM) curves, and then the survival risks were compared in high-risk group with those in low-risk group. On the other hand, the feasibility of using the gene markers as the factors to independently predict prognosis was examined through multivariate Cox regression analysis. P<0.05 indicated statistical signi cance. The R version 3.6.0 was adopted for all statistical analyses.

Results
Identi cation and functional analysis of m5C modi ed isoforms Determination of m5C modi ed subtypes Firstly, we extracted the expression levels of m5C regulatory factors from the geo expression pro le matrix, where nsun2, nsun4, tet2 and alyref genes did not exist , so we nally extracted the expression pro les of 9 m5C genes for subsequent analysis. Then, consensusclusterplus (v1.48.0; parameters: reps = 100, pitem = 0.8, pfeature = 1, distance = Euclidean ") was used for consistent clustering. Km and Euclidean distance were used as clustering algorithm and distance measure respectively; k = 3 was selected as the optimal clustering according to CDF value and delta area (Figure 2A-C).We further analyzed the prognostic relationship among the three groups, and the results showed that C1, C2, and C3 were signi cantly different ( Figure 2D, log rank, P < 0.0001).

Relationship between m5C modi ed subtypes and clinical features
We compared the distribution of different clinical features in the three molecular subtypes to see whether clinical features differed in the different subtypes. The results showed that: 1) C1, C2, C3 had signi cant difference in survival rate; 2) C2 and C3 had signi cant difference in age group and runx1-runx1t1 fusion group; 3) C1, C2, C3 had no signi cant difference in Runx1 mutation group. (Figure 3A-D) Correlation analysis of immune in ltration of m5C modi ed subtypes To identify the relationship between the immune cell scores of molecular subtypes, we used mcpcounter to evaluate the scores of 10 kinds of immune cells and the ssgsea method of gsva package to evaluate the scores of 28 kinds of immune cells (cell markers from references) pmid:28052254 . And then we compared the difference of immune score in molecular subtypes. The results are shown in Figure 4A-B. Meanwhile, we analyzed the expression of 9 m5c modi cation-related genes in three subtypes ( Figure 4C). It can be seen from the gure that there are signi cant differences in the expression of m5C gene in different subtypes.

Identi cation of differentially expressed genes
The DEGs between C1~C3 C2~C1and C2~C3 were calculated by limma package, and ltered according to the threshold FDR < 0.05 and | log2fc > log2 (2). The volcano diagram of C1~C3 up-regulated and down-regulated differentially expressed genes was shown in Figure 5A, including 3 up-regulated genes and 45 down-regulated genes; There were 133 up-regulated and 28 down-regulated differentially expressed genes in Figure 5B from C2~C1. The results showed that the main up-regulated differential expression genes were between C1~C2; the volcano diagram of C2~C3 up-regulated and down-regulated differentially expressed genes was shown in Figure 5C, of which 21 were upregulated and 52 were down-regulated. The main differential expression genes were down-regulated. The detailed differential expression genes were found in S1.xlsx. Functional analysis of differentially expressed genes among m5C molecular subtypes Furthermore, we used webgestaltr (v0.4.2) to perform the KEGG pathway and GO functional enrichment analysis of 230 differentially expressed genes among C1~C3, C2~C1 and C2~C3 molecular subtypes. For the go functional annotation of AML differential genes, 248 (P< 0.01) with signi cant difference in BP were annotated. The annotation results of the rst 15 genes were shown in Figure 6A; 29 items (P < 0.01) with signi cant difference from CC, and the results of the rst 15 items were shown in Figure 6B; 26 items (P< 0.01) with signi cant difference from MF were noted. The results of the rst 15 items were shown in Figure 6C; See S2,CSV for more details. For AML differential gene KEGG pathway enrichment, annotated to a signi cant pathway (P < 0.05).

Construction of prognostic risk model based on m5C phenotype correlation
Random grouping of training set samples Firstly, 417 samples in GSE37642 data set were divided into training set and validation set. The nal training set data was 208 samples, and the validation set data was 209 samples. Finally, training set and validation set of GSE37642 data were shown in Table 2. Chi square test was used to test the training set and validation set samples. The results showed that our group had no preference and there was no signi cant difference between groups (P > 0.05).
Single factor risk analysis of training set Using training set data, univariate Cox proportional risk regression model was performed for differentially expressed genes between subtypes (230 in total) as well as survival data using the R package survival coxph function. P < 0.01 was selected as the threshold value for ltering, with a total of 18 genes with differences at the end. The univariate Cox analysis results were shown in S3.txt.
Multi factor risk analysis of training set At present, 18 differentially related genes have been identi ed. Lasso regression was used to further compress the 18 genes to reduce the number of genes in the risk model. We used the R software package glmnet to carry out lasso Cox regression analysis. First, we analyze the change track of each independent variable, as shown in Figure 7A. We can see that with the gradual increase of lambda, the number of independent variable coe cient tending to 0 also gradually increased. We used 10-fold cross validation to build the model, and analyzed the con dence interval under each lambda, as shown in Figure 7B, from which we can see that when lambda = 0.04845916, the model is optimal. For this purpose, we selected 9 genes at lambda=0.04845916 as the target genes for the next step.
The step method in stats package starts from the most complex model and removes one variable in turn to reduce AIC. The smaller the value, the better the model . It shows that the model has enough tting degree with fewer parameters. Using this algorithm, we nally reduced 9 genes to 5 genes, which are ITGA4 IGLL1 LAPTM4B HIST1H2AE and HOPX.
The prognostic KM curve of the ve genes is shown in Figure 8. It can be seen that all the ve genes can signi cantly reduce the high and low risk of GSE37642 training set samples (P < 0.05). The nal 5-gene signature formula is as follows: RiskScore = 0.2283059*ITGA4 0.1575680*IGLL1 0.2686156*LAPTM4B 0.1220958*HIST1H2AE 0.1472148*HOPX Construction and evaluation of risk model We calculated the risk score of each sample according to the expression level of the samples, and drew the risk score distribution of the sample, as shown in Figure 9A. It can be seen from the gure that the death rate of the samples with high risk score is signi cantly greater than that of those with low score, which indicates that the high risk score samples have a worse prognosis. The changes in the expression of ve different signature genes with increasing risk values identi ed that high expression of LAPTM4B, HIST1H2AE and HOPX were associated with high risk. They were risk factors, and ITGA4 and IGLL1 were protective factors.
Further, we used the R software package timeroc to analyze the prognosis classi cation of riskscore. We analyzed the classi cation e ciency of one year, three years and ve years respectively, as shown in Figure 9B. Finally, we performed zscale on riskscore, and classi ed the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 9C. It can be seen that there is a signi cant difference between them (P < 0.0001). 106 samples are classi ed as high-risk group and 102 samples as low-risk group.
Veri cation of risk model Internal data set to verify the robustness of 5-gene signature In order to determine the robustness of the model, we used the same model and the same coe cient as the training set in GSE37642, calculate the riskscore of each sample according to the expression level of the samples, and draw the riskscore distribution of the samples.
The riskscore distribution of GSE37642 validation set is shown in Figure 10A. It can be seen from the gure that the proportion of death in samples with high RiskScore is signi cantly higher than that of samples with low RiskScore, which is consistent with the performance of GSE37642 training set. Further, we used the R software package timeroc to analyze the prognosis classi cation of RiskScore, and analyzed the classi cation e ciency of one year, three years and ve years, as shown in Figure 10B. Finally, we performed zscale on riskscore, and classi ed the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group. The KM curve is drawn, as shown in Figure 10C. It can be seen that there is a very signi cant difference between them (P < 0.0001). 105 samples are classi ed as high-risk group and 104 samples as low-risk group.
The RiskScore distribution of GSE37642 data set is shown in Figure 11A. It can be seen from the gure that the proportion of death in the samples with high RiskScore is signi cantly higher than that with low RiskScore, which is consistent with the performance of TCGA training set. Further, we used the R software package timeROC to analyze the prognosis classi cation of high and low risk groups of RiskScore. We analyzed the classi cation e ciency of prognosis prediction in one year, three years and ve years, as shown in Figure 11B. Finally, we performed zscale on RiskScore, and classi ed the samples with RiskScore greater than zero as high-risk group. The KM curve is drawn for samples less than zero, as shown in Figure 11C. It can be seen that there is a very signi cant difference between them (P < 0.0001). 127 samples are classi ed as high-risk group and 130 samples as low-risk group.
External data sets to verify the robustness of 5-gene signature We use the same model and the same coe cient as the training set in the external validation data sets GSE12417 andTCGA-LAML, and then calculate theRiskScore of each sample according to the expression level of the samples, and draw the RiskScore distribution of the samples.
The RiskScore distribution of independent external validation dataset of GSE12417 is shown in Figure 12A. It can be seen from the gure that the proportion of death in the samples with high RiskScore is signi cantly higher than those with low RiskScore, which is consistent with the performance of GSE37642 training set. Further, we used the R software package timeROC to analyze the prognosis classi cation of RiskScore. We analyzed the classi cation e ciency of one year, two years and three years, as shown in Figure 12B. Finally, we performed zscale on RiskScore, and classi ed the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 12C. It can be seen that there is a signi cant difference between them (P < 0.001). 81 samples are classi ed as high-risk group and 82 samples as low-risk group.
The RiskScore distribution of independent external validation data set of TCGA-laml is shown in Figure 13A. It can be seen from the gure that the proportion of death in the samples with high RiskScore is signi cantly higher than that with low RiskScore, which is consistent with the performance of GSE37642 training set. Furthermore, we used the R software package timeROC to analyze the prognosis classi cation of RiskScore. We analyzed the classi cation e ciency of one year, three years and ve years, as shown in Figure 13B. Finally, we performed zscale on RiskScore, and classi ed the samples with RiskScore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 13C, from which we can see that there is a very signi cant difference between them (P < 0.001). Among them, 68 samples are classi ed as high-risk group and 72 samples as low-risk group.

Risk model and prognosis analysis of clinical features
Further, we conducted the RiskScore analysis of 5-gene markers, and found that 5-gene signature could signi cantly distinguish the high and low risk groups by age, RUNX1-RUNX1T1 fusion and RUNX1 mutation ( Figure 14A-F, P < 0.05). This further indicates that our model still has good predictive ability in different clinical signs.
The expression of risk score in different clinical characteristics and different molecular subtypes By comparing the distribution of RiskScore among clinical characteristics groups, we found that there were signi cant differences in age, RUNX1-RUN1T1 fusion and RUNX1 mutation ( Figure 15A-D,P < 0.05). At the same time, we compared the difference of risk scores in molecular subtypes. The results showed that the risk score of C2 subtype with a poorer prognosis was signi cantly higher than that of C3 subtype with a better prognosis.

The relationship between RiskScore and biological function
In order to observe the relationship between the risk score and biological function of different samples, we selected the gene expression pro le corresponding to these samples, used R software package gsva to conduct single sample GSEA analysis, and calculated the ssgsea scores of each sample for each function. We further calculated the correlation between these functions and risk score, and selected the function with correlation greater than 0.3, as shown in Figure 16A-F, from which we can see the junction pathway of KEGG_ ABC_ TRANSPORTERS KEGG_ TIGHT_ was positively related with the sample's risk score,while KEGG_ NON_ SMALL_ CELL_ LUNG_ CANCER KEGG_ Glioma and KEGG_ PYRIMIDINE_ were negatively related.
Single factor and multi factor analysis of 5-gene signature In order to identify the independence of 5-gene signature model in clinical application, we used univariate and multivariate Cox regression to analyze the relevant HR, 95% CI of HR and P value in the clinical information carried by GSE37642 data. We systematically analyzed the clinical information of GSE37642 patients, including age, RUNX-RUNXT1 fusion, RUNX1 mutation and riskscore ( Figure 17).
In the GSE37642 dataset, univariate Cox regression analysis found that riskscore was signi cantly associated with survival, and the corresponding multivariate Cox regression analysis found the same result (HR = 1.57, 95% CI = 1.38 -1.79, P < 1e-5) .
The above results show that our model 5-gene signature has good predictive performance in clinical application value.

Construction of nomogram by riskscore and clinical features
According to the results of univariate and multivariate analysis, we constructed nomographic model with clinical features age, Runx1 mutaion and riskscores. We used all data sets of GSE37642 to construct nomogram ( Figure 18A). From the model results, the risk score feature has the greatest impact on the survival rate prediction, indicating that the 5-gene based risk model can better predict the prognosis. Furthermore, we use the calibration curve to evaluate the prediction accuracy of the model, such as Figure 18B. We can observe that the predicted calibration curves of 1, 3 and 5 years nearly overlapped with the standard curve, which indicates that the model has good predictive performance. In addition, we also used DCA (decision curve) to evaluate the reliability of the model, such as Figure 18C. It can be observed that the bene ts of riskscore and nomogram are signi cantly higher than those of extreme curve, where nomogram is higher than riskscore, and age and Runx1 mutaion are close to the extreme curve. The result indicates that riskscore and nomogram have good reliability.

Discussion
AML is a malignant clonal and proliferative disease derived from myeloid hematopoietic stem / progenitor cells (12). It is characterized by abnormal proliferation of primitive and immature myeloid cells and ineffective hematopoiesis of bone marrow. Its clinical manifestations are anemia, hemorrhage, infection, fever, organ in ltration and metabolic abnormalities (13). Physical factors, chemical factors, and genetic factors are all the high-risk factors of AML (14). However, how these factors lead to the occurrence of AML and its speci c pathogenesis are not fully understood.
Post transcriptional methylated tRNA of nsun2 on cytosine-5 (m5C) inhibited total protein synthesis (17). Methylation of m5C RNA plays an important role in regulating the total protein synthesis and cell fate (18). Therefore, activation of RNA methylation or inhibition of tRNA cleavage is essential for survival of tumor initiating cells in response to cytotoxic stress (19). Although m5C has been found to be associated with the development of different types of tumors, its relationship with AML is poorly understood (20).
In the process of this study, we designed a research protocol. In the rst step, the expression level of m5C regulatory factor was extracted from the GEO expression pro le matrix, and the results showed that the expression level of m5C regulatory factor was higher than that of m5C regulatory gene. The results showed that: 1) C1, C2, C3 had signi cant difference in survival rate; 2) C2 and C3 had signi cant difference in age group and RUNX1-RUNX1T1 fusion group; 3) C1, C2 and C3 were signi cantly different in RUNX1-RUNX1T1 fusion group; 3) C1, C2 and C3 were signi cantly different in RUNX1. There was no signi cant difference in mutation group. It can be seen that the gene expression of C1, C2 and C3 subtypes in m5C molecule is different from each other, and it is related to the clinical characteristics of patients.
In the second step, we used mcpcounter to evaluate the scores of 10 immune cells and GSVA package of ssgsea method for 28 immune cells ;From our research data, we can see that M5c related genes can in ltrate into AML, and then affect the function of immune cells. However, the speci c mechanism needs to be further veri ed by cell experiments and animal experiments. At the same time, we also analyzed the expression of 9 m5c modi cation-related genes in three subtypes. DEGs after interaction between C1-C3, c2-c1 and C2-C3 molecular subtypes were calculated by limma package. The results showed that the major down-regulated differential expression was mainly between C1 / C3, the main upregulation differential expression between C1 / C2, and the main downregulation difference between C2 / C3. Furthermore, we analyzed the KEGG pathway and go function enrichment of 230 differentially expressed genes among C1-C3, c2-c1 and C2-C3 molecular subtypes by R software package webgestaltr (v0.4.2). Among them, 248 genes with signi cant difference in BP (P < 0.01) and 29 genes with signi cant difference in CC (P < 0.01) were annotated < 26 items with signi cant difference in MF (P < 0.01). For AML differential gene KEGG pathway enrichment, annotated to a signi cant pathway (P < 0.05). The results showed that the changes of immune cells of three subtypes of m5C were different from each other, and the target cells of m5C were different.
Finally, 417 samples from GSE37642 dataset were used to construct a prognostic risk model based on m5C phenotype. A total of 18 genes with signi cant differences were selected and further compressed by lasso regression to reduce the number of genes in the risk model. Tibshirani (1996)) method is a kind of compressed estimation. By constructing a penalty function, a more re ned model is obtained, which makes it compress some coe cients and set some coe cients to zero (21). Therefore, it retains the advantage of subset contraction, and it is a biased estimation to deal with multicollinearity data,which can achieve the selection of variables and parameter estimation at the same time, and better solve the multicollinearity problem in regression analysis. With this algorithm, we nally reduce the number of genes to ve, which are ITGA4, IGLL1, LAPTM4B, HIST1H2AE and HOPX. According to the prognosis KM curve of the ve genes, it can be seen that all of the ve genes could signi cantly reduce the high and low risk of GSE37642 training set samples (P < 0.05); the high expression of LAPTM4B, HIST1H2AE and HOPX were identi ed as risk factors, and ITGA4 and IGLL1 were protective factors. Finally, through the internal and external data, the robustness of 5-gene signature was veri ed; further, we conducted the risk score analysis of 5-gene markers, and found that 5-gene signature could signi cantly distinguish the high-risk group from the high-risk group by grouping age, RUNX1-RUNX1T1 fusion and RUNX1 mutation. At the same time, we compared the difference of risk scores in molecular subtypes. The results showed that the risk score of C2 subtype with a poorer prognosis was signi cantly higher than that of C3 subtype with a better prognosis. We have developed a 5-genes signature prognostic strati cation system, which has good AUC in both training set and independent validation set, and it is a model independent of clinical characteristics. Nonetheless, certain limitations must be noted. First of all, some samples did not have complete clinical follow-up data, so it was impossible to examine the feasibility of using these biomarkers in distinguishing patient prognosis according to their additional health status factors. Secondly, bioinformatic analysis results alone were not enough, and experiments must be performed for veri cation. Consequently, more large-scale experimental and genetic studies are warranted for veri cation.

Conclusions
Based on consistent clustering of genes related to m5c modi cation, AML was classi ed into three subtypes, which showed signi cant differences in prognosis between subtypes.
The limma package analysis was used to identify differential genes among subtypes, and nally 230 DEGs were obtained, based on which we constructed a 5-gene prognostic risk model.
The 5-gene signature has strong robustness and can play a stable predictive performance in external validation data sets (GSE12417, TCGA-LAML).
In conclusion, in this study, we developed a 5-genes signature prognostic strati cation system with good AUC in both training set and independent veri cation set, and it is a model with independent clinical characteristics. Therefore, we suggest using this classi er as a molecular diagnostic test to assess the prognosis risk of patients with acute leukemia.   A: BP annotation map of differentially expressed genes of molecular subtypes; B: CC annotation map of differentially expressed genes of molecular subtypes; C: MF annotation map of differentially expressed genes of molecular subtypes.

Figure 7
A: The horizontal axis represents the log value of the independent variable lambda, and the vertical axis represents the coe cient of the independent variable; B: the con dence interval under each lambda.