Identification and functional analysis of m5C modified isoforms
Determination of m5C modified subtypes
Firstly, we extracted the expression levels of m5C regulatory factors from the geo expression profile matrix, where nsun2, nsun4, tet2 and alyref genes did not exist , so we finally extracted the expression profiles of 9 m5C genes for subsequent analysis. Then, consensusclusterplus (v1.48.0; parameters: reps = 100, pitem = 0.8, pfeature = 1, distance = Euclidean ") was used for consistent clustering. Km and Euclidean distance were used as clustering algorithm and distance measure respectively; k = 3 was selected as the optimal clustering according to CDF value and delta area (Figure 2A-C).We further analyzed the prognostic relationship among the three groups, and the results showed that C1, C2, and C3 were significantly different (Figure 2D, log rank, P < 0.0001).
Relationship between m5C modified subtypes and clinical features
We compared the distribution of different clinical features in the three molecular subtypes to see whether clinical features differed in the different subtypes. The results showed that: 1) C1, C2, C3 had significant difference in survival rate; 2) C2 and C3 had significant difference in age group and runx1-runx1t1 fusion group; 3) C1, C2, C3 had no significant difference in Runx1 mutation group.(Figure 3A-D)
Correlation analysis of immune infiltration of m5C modified subtypes
To identify the relationship between the immune cell scores of molecular subtypes, we used mcpcounter to evaluate the scores of 10 kinds of immune cells and the ssgsea method of gsva package to evaluate the scores of 28 kinds of immune cells (cell markers from references) pmid:28052254 ). And then we compared the difference of immune score in molecular subtypes. The results are shown in Figure 4A-B. Meanwhile, we analyzed the expression of 9 m5c modification-related genes in three subtypes (Figure 4C). It can be seen from the figure that there are significant differences in the expression of m5C gene in different subtypes.
Analysis of differentially expressed genes among m5C modified subtypes
Identification of differentially expressed genes
The DEGs between C1~C3、C2~C1and C2~C3 were calculated by limma package, and filtered according to the threshold FDR < 0.05 and | log2fc > log2 (2). The volcano diagram of C1~C3 up-regulated and down-regulated differentially expressed genes was shown in Figure 5A, including 3 up-regulated genes and 45 down-regulated genes; There were 133 up-regulated and 28 down-regulated differentially expressed genes in Figure 5B from C2~C1. The results showed that the main up-regulated differential expression genes were between C1~C2; the volcano diagram of C2~C3 up-regulated and down-regulated differentially expressed genes was shown in Figure 5C, of which 21 were up-regulated and 52 were down-regulated. The main differential expression genes were down-regulated. The detailed differential expression genes were found in S1.xlsx.
Functional analysis of differentially expressed genes among m5C molecular subtypes
Furthermore, we used webgestaltr (v0.4.2) to perform the KEGG pathway and GO functional enrichment analysis of 230 differentially expressed genes among C1~C3, C2~C1 and C2~C3 molecular subtypes. For the go functional annotation of AML differential genes, 248 (P< 0.01) with significant difference in BP were annotated. The annotation results of the first 15 genes were shown in Figure 6A; 29 items (P < 0.01) with significant difference from CC, and the results of the first 15 items were shown in Figure 6B; 26 items (P< 0.01) with significant difference from MF were noted. The results of the first 15 items were shown in Figure 6C; See S2,CSV for more details. For AML differential gene KEGG pathway enrichment, annotated to a significant pathway (P < 0.05).
Construction of prognostic risk model based on m5C phenotype correlation
Random grouping of training set samples
Firstly, 417 samples in GSE37642 data set were divided into training set and validation set. The final training set data was 208 samples, and the validation set data was 209 samples. Finally, training set and validation set of GSE37642 data were shown in Table 2. Chi square test was used to test the training set and validation set samples. The results showed that our group had no preference and there was no significant difference between groups (P > 0.05).
Single factor risk analysis of training set
Using training set data, univariate Cox proportional risk regression model was performed for differentially expressed genes between subtypes (230 in total) as well as survival data using the R package survival coxph function. P < 0.01 was selected as the threshold value for filtering, with a total of 18 genes with differences at the end. The univariate Cox analysis results were shown in S3.txt.
Multi factor risk analysis of training set
At present, 18 differentially related genes have been identified. Lasso regression was used to further compress the 18 genes to reduce the number of genes in the risk model. We used the R software package glmnet to carry out lasso Cox regression analysis. First, we analyze the change track of each independent variable, as shown in Figure 7A. We can see that with the gradual increase of lambda, the number of independent variable coefficient tending to 0 also gradually increased. We used 10-fold cross validation to build the model, and analyzed the confidence interval under each lambda, as shown in Figure 7B, from which we can see that when lambda = 0.04845916, the model is optimal. For this purpose, we selected 9 genes at lambda=0.04845916 as the target genes for the next step.
The step method in stats package starts from the most complex model and removes one variable in turn to reduce AIC. The smaller the value, the better the model . It shows that the model has enough fitting degree with fewer parameters. Using this algorithm, we finally reduced 9 genes to 5 genes, which are ITGA4、IGLL1、LAPTM4B、HIST1H2AE and HOPX.
The prognostic KM curve of the five genes is shown in Figure 8. It can be seen that all the five genes can significantly reduce the high and low risk of GSE37642 training set samples (P < 0.05). The final 5-gene signature formula is as follows:
RiskScore = -0.2283059*ITGA4-0.1575680*IGLL1+0.2686156*LAPTM4B+0.1220958*HIST1H2AE+0.1472148*HOPX
Construction and evaluation of risk model
We calculated the risk score of each sample according to the expression level of the samples, and drew the risk score distribution of the sample, as shown in Figure 9A. It can be seen from the figure that the death rate of the samples with high risk score is significantly greater than that of those with low score, which indicates that the high risk score samples have a worse prognosis. The changes in the expression of five different signature genes with increasing risk values identified that high expression of LAPTM4B, HIST1H2AE and HOPX were associated with high risk. They were risk factors, and ITGA4 and IGLL1 were protective factors.
Further, we used the R software package timeroc to analyze the prognosis classification of riskscore. We analyzed the classification efficiency of one year, three years and five years respectively, as shown in Figure 9B. Finally, we performed zscale on riskscore, and classified the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 9C. It can be seen that there is a significant difference between them (P < 0.0001). 106 samples are classified as high-risk group and 102 samples as low-risk group.
Verification of risk model
Internal data set to verify the robustness of 5-gene signature
In order to determine the robustness of the model, we used the same model and the same coefficient as the training set in GSE37642, calculate the riskscore of each sample according to the expression level of the samples, and draw the riskscore distribution of the samples.
The riskscore distribution of GSE37642 validation set is shown in Figure 10A. It can be seen from the figure that the proportion of death in samples with high RiskScore is significantly higher than that of samples with low RiskScore, which is consistent with the performance of GSE37642 training set. Further, we used the R software package timeroc to analyze the prognosis classification of RiskScore, and analyzed the classification efficiency of one year, three years and five years, as shown in Figure 10B. Finally, we performed zscale on riskscore, and classified the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group. The KM curve is drawn, as shown in Figure 10C. It can be seen that there is a very significant difference between them (P < 0.0001). 105 samples are classified as high-risk group and 104 samples as low-risk group.
The RiskScore distribution of GSE37642 data set is shown in Figure 11A. It can be seen from the figure that the proportion of death in the samples with high RiskScore is significantly higher than that with low RiskScore, which is consistent with the performance of TCGA training set. Further, we used the R software package timeROC to analyze the prognosis classification of high and low risk groups of RiskScore. We analyzed the classification efficiency of prognosis prediction in one year, three years and five years, as shown in Figure 11B. Finally, we performed zscale on RiskScore, and classified the samples with RiskScore greater than zero as high-risk group. The KM curve is drawn for samples less than zero, as shown in Figure 11C. It can be seen that there is a very significant difference between them (P < 0.0001). 127 samples are classified as high-risk group and 130 samples as low-risk group.
External data sets to verify the robustness of 5-gene signature
We use the same model and the same coefficient as the training set in the external validation data sets GSE12417 andTCGA-LAML, and then calculate theRiskScore of each sample according to the expression level of the samples, and draw the RiskScore distribution of the samples.
The RiskScore distribution of independent external validation dataset of GSE12417 is shown in Figure 12A. It can be seen from the figure that the proportion of death in the samples with high RiskScore is significantly higher than those with low RiskScore, which is consistent with the performance of GSE37642 training set. Further, we used the R software package timeROC to analyze the prognosis classification of RiskScore. We analyzed the classification efficiency of one year, two years and three years, as shown in Figure 12B. Finally, we performed zscale on RiskScore, and classified the samples with riskscore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 12C. It can be seen that there is a significant difference between them (P < 0.001). 81 samples are classified as high-risk group and 82 samples as low-risk group.
The RiskScore distribution of independent external validation data set of TCGA-laml is shown in Figure 13A. It can be seen from the figure that the proportion of death in the samples with high RiskScore is significantly higher than that with low RiskScore, which is consistent with the performance of GSE37642 training set. Furthermore, we used the R software package timeROC to analyze the prognosis classification of RiskScore. We analyzed the classification efficiency of one year, three years and five years, as shown in Figure 13B. Finally, we performed zscale on RiskScore, and classified the samples with RiskScore greater than zero as high-risk group and those with less than zero as low-risk group . The KM curve is drawn, as shown in Figure 13C, from which we can see that there is a very significant difference between them (P < 0.001). Among them, 68 samples are classified as high-risk group and 72 samples as low-risk group.
Risk model and prognosis analysis of clinical features
Further, we conducted the RiskScore analysis of 5-gene markers, and found that 5-gene signature could significantly distinguish the high and low risk groups by age, RUNX1-RUNX1T1 fusion and RUNX1 mutation (Figure 14A-F, P < 0.05). This further indicates that our model still has good predictive ability in different clinical signs.
The expression of risk score in different clinical characteristics and different molecular subtypes
By comparing the distribution of RiskScore among clinical characteristics groups, we found that there were significant differences in age, RUNX1-RUN1T1 fusion and RUNX1 mutation (Figure 15A-D,P < 0.05). At the same time, we compared the difference of risk scores in molecular subtypes. The results showed that the risk score of C2 subtype with a poorer prognosis was significantly higher than that of C3 subtype with a better prognosis.
The relationship between RiskScore and biological function
In order to observe the relationship between the risk score and biological function of different samples, we selected the gene expression profile corresponding to these samples, used R software package gsva to conduct single sample GSEA analysis, and calculated the ssgsea scores of each sample for each function. We further calculated the correlation between these functions and risk score, and selected the function with correlation greater than 0.3, as shown in Figure 16A-F, from which we can see the junction pathway of KEGG_ ABC_ TRANSPORTERS、KEGG_ TIGHT_ was positively related with the sample’s risk score,while KEGG_ NON_ SMALL_ CELL_ LUNG_ CANCER、KEGG_ Glioma and KEGG_ PYRIMIDINE_ were negatively related.
Single factor and multi factor analysis of 5-gene signature
In order to identify the independence of 5-gene signature model in clinical application, we used univariate and multivariate Cox regression to analyze the relevant HR, 95% CI of HR and P value in the clinical information carried by GSE37642 data. We systematically analyzed the clinical information of GSE37642 patients, including age, RUNX-RUNXT1 fusion, RUNX1 mutation and riskscore (Figure 17).
In the GSE37642 dataset, univariate Cox regression analysis found that riskscore was significantly associated with survival, and the corresponding multivariate Cox regression analysis found the same result (HR = 1.57, 95% CI = 1.38 – 1.79, P < 1e-5) .
The above results show that our model 5-gene signature has good predictive performance in clinical application value.
Construction of nomogram by riskscore and clinical features
According to the results of univariate and multivariate analysis, we constructed nomographic model with clinical features age, Runx1 mutaion and riskscores. We used all data sets of GSE37642 to construct nomogram (Figure 18A). From the model results, the risk score feature has the greatest impact on the survival rate prediction, indicating that the 5-gene based risk model can better predict the prognosis. Furthermore, we use the calibration curve to evaluate the prediction accuracy of the model, such as Figure 18B. We can observe that the predicted calibration curves of 1, 3 and 5 years nearly overlapped with the standard curve, which indicates that the model has good predictive performance. In addition, we also used DCA (decision curve) to evaluate the reliability of the model, such as Figure 18C. It can be observed that the benefits of riskscore and nomogram are significantly higher than those of extreme curve, where nomogram is higher than riskscore, and age and Runx1 mutaion are close to the extreme curve. The result indicates that riskscore and nomogram have good reliability.