Removal of batch effect
The detailed characteristics of the datasets included in the analysis (GSE17800, GSE79962 and GSE3585) are shown in Table 1. A total of 11,779 genes were detected by both microarray platforms used across the datasets. Principal component analysis (PCA) was performed to verify that the batch effect among the datasets had been removed, and PCA plots were drawn from the top two principal components (PCs). Before batch-effect removal, heart samples from patients with DCM clustered by batch, indicating a substantial batch effect attributable to the different platforms and experimental conditions of the datasets (Figure 1A). The PCA plot of the normalized meta-cohort data showed that the batch effect between GSE17800, GSE79962 and GSE3585 was clearly removed (Figure 1B).
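The PCA check described above can be illustrated with a minimal sketch. It assumes a samples-by-genes matrix and uses simulated data; the per-batch mean-centering step is only a stand-in for batch correction and is not the method used in this study, and the helper names `top_two_pcs` and `batch_separation` are hypothetical.

```python
import numpy as np

def top_two_pcs(expr):
    """Project samples (rows) onto the first two principal components."""
    centered = expr - expr.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def batch_separation(scores, batches):
    """Share of PC1 variance explained by batch (0 = mixed, 1 = fully separated)."""
    pc1 = scores[:, 0]
    grand = pc1.mean()
    between = sum(
        (batches == b).sum() * (pc1[batches == b].mean() - grand) ** 2
        for b in np.unique(batches)
    )
    total = ((pc1 - grand) ** 2).sum()
    return between / total

rng = np.random.default_rng(0)
n_per_batch, n_genes = 20, 200
batches = np.repeat(np.array(["A", "B", "C"]), n_per_batch)

# Simulated (hypothetical) data: common noise plus a batch-specific offset per gene.
expr = rng.normal(size=(3 * n_per_batch, n_genes))
for b in "ABC":
    expr[batches == b] += rng.normal(scale=2.0, size=n_genes)

# Stand-in correction (an assumption, NOT the authors' method):
# center every gene within each batch.
corrected = expr.copy()
for b in "ABC":
    corrected[batches == b] -= corrected[batches == b].mean(axis=0)

before = batch_separation(top_two_pcs(expr), batches)
after = batch_separation(top_two_pcs(corrected), batches)
print(f"PC1 variance explained by batch: before={before:.2f}, after={after:.2f}")
```

In this toy setting the samples separate by batch along PC1 before correction and mix afterwards, which is the pattern reported for Figure 1A and Figure 1B.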
Consensus clustering of DCM cases
After the batch effect was removed, the merged dataset was used for molecular subgroup analysis by consensus clustering. The cluster consensus score of every subgroup exceeded 0.7 only when the samples were divided into three categories (Figure 2A). In addition, the cumulative distribution function (CDF) curve showed that the CDF score was largest for three categories (Figure 2B). Both lines of evidence suggested that a three-subgroup solution was more robust than the alternatives in DCM patients. Heart tissue samples were therefore clustered into three molecular subgroups according to the consensus score and the CDF curve. The consensus matrix showed high similarity of gene expression patterns within each molecular subgroup (Figure 2C). Ultimately, the consensus clustering algorithm divided 56 heart tissue samples from patients with DCM into three molecular subgroups based on their gene expression patterns.
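The idea behind the cluster consensus score can be sketched as follows. This is an illustration on simulated data under stated assumptions: the base clustering algorithm (k-means with a deterministic farthest-point initialization), the 80% subsampling rate and the number of resamples are all choices made for this example, since the settings used in the study are not given here.

```python
import numpy as np

def kmeans_labels(x, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def consensus_matrix(x, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of samples co-clusters."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans_labels(x[idx], k)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(sampled, 1.0)

def cluster_consensus(cons, labels):
    """Mean pairwise consensus among the members of each final cluster."""
    scores = {}
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        block = cons[np.ix_(members, members)]
        upper = block[np.triu_indices(len(members), k=1)]
        scores[int(c)] = float(upper.mean()) if upper.size else float("nan")
    return scores

# Simulated (hypothetical) data: three well-separated groups of 20 samples.
rng = np.random.default_rng(1)
n_genes = 10
true_centers = 20.0 * np.eye(3, n_genes)
x = np.vstack([c + rng.normal(size=(20, n_genes)) for c in true_centers])

for k in (2, 3, 4):
    cons = consensus_matrix(x, k)
    final = kmeans_labels(x, k)
    print(k, cluster_consensus(cons, final))
```

A value of k for which every cluster keeps a high consensus score is the kind of solution selected above (all scores above 0.7 for three subgroups).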
Differences in clinical characteristics among the three molecular subgroups
DCM cases in subgroup 1, subgroup 2 and subgroup 3 had different gene expression patterns. To further investigate the clinical characteristics of the three subgroups, age, BMI, LVEF and LVIDD were analyzed in detail in DCM cases from the GSE17800 dataset. Patients in subgroup 2 had significantly lower LVEF than patients in subgroup 1 and subgroup 3 (Figure 3D). In contrast, age, BMI and LVIDD did not differ significantly among the three subgroups (Figure 3E). Thus, not only gene expression but also disease severity varied among the three subgroups of DCM cases. As shown in Table 2, an analysis of variance (ANOVA) including age and our molecular classification indicated that the molecular classification in the present study was an age-independent indicator of DCM severity.
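As a simplified illustration of how a clinical variable can be compared across the three subgroups, the sketch below computes a one-way ANOVA F statistic. The LVEF values are invented for the example, and the one-way design is an assumption; the model reported in Table 2 also includes age and may differ.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # Variation of group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return float(f_stat), df_between, df_within

# Hypothetical LVEF values (%) for the three subgroups, NOT data from this study.
lvef = {
    "subgroup 1": [30, 32, 28, 31],
    "subgroup 2": [20, 22, 19, 21],
    "subgroup 3": [29, 33, 30, 28],
}
f_stat, df_b, df_w = one_way_anova_f(lvef.values())
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The p-value follows from the F distribution with the reported degrees of freedom; a statistics library such as SciPy (`scipy.stats.f_oneway`) returns it directly.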
Based on pairwise differential expression analysis, we identified 605, 697 and 1557 specific differentially expressed genes in subgroup 1, subgroup 2 and subgroup 3, respectively, compared with the other subgroups (Benjamini-Hochberg adjusted p < 0.05, absolute difference of means > 0.2) (Table 3). We also compared the gene expression profile of each molecular subgroup with that of control individuals: there were 1236, 1388 and 2617 differentially expressed genes in subgroup 1, subgroup 2 and subgroup 3, respectively, compared with controls (Table 3). To further reveal the differences in gene expression patterns and the resulting functional differences among the molecular subgroups of DCM, WGCNA was performed on the specific differentially expressed genes of each subgroup. The analysis was based on topological overlap and a scale-free network, and a hierarchical clustering tree was created with the dynamic hybrid cut (Figure 4A). According to the scale-free topology criterion, we selected 8 as the soft-thresholding power (R² = 0.89; Figure 4B). Ultimately, nine co-expressed modules were identified for further research. Figure 4C shows the cluster dendrogram of the modules, and Figure 4D shows the clustering of module eigengenes. Figure 5 shows the nine identified WGCNA modules; their corresponding subgroups are listed in Table 3.
To study the relationship between WGCNA modules and the clinical features of patients with DCM, correlation coefficients between the modules and clinical features were calculated. As shown in Figure 5, age was positively correlated with the blue module and negatively correlated with the brown, black, turquoise, red and pink modules. LVEF was positively correlated with the pink module and negatively correlated with the black and grey modules. BMI was positively correlated with the grey, blue, brown and black modules and negatively correlated with the pink and yellow modules. These results show that the WGCNA modules were associated with the clinical features of patients with DCM.
Moreover, we performed GO functional enrichment analysis on the genes in each WGCNA module. Figure 6 shows the biological process terms enriched in the different modules; the enriched cellular component and molecular function terms are shown in Figure S1 and Figure S2. We also conducted KEGG pathway analysis and identified pathways enriched in the different WGCNA modules (Figure 7). Overall, these enrichment results indicated that each molecular subgroup had specific functional gene modules that could modulate DCM onset or progression.
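The selection criterion for differentially expressed genes used in this section (Benjamini-Hochberg adjusted p < 0.05 and absolute difference of means > 0.2) can be sketched as below. The p-values and mean differences are invented for illustration, and the function names are hypothetical; the statistical test that produced the raw p-values in the study is not assumed here.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def select_degs(pvals, mean_diff, alpha=0.05, min_abs_diff=0.2):
    """Indices of genes passing both the adjusted p-value and effect-size cutoffs."""
    adj = benjamini_hochberg(pvals)
    keep = (adj < alpha) & (np.abs(np.asarray(mean_diff)) > min_abs_diff)
    return np.where(keep)[0]

# Hypothetical per-gene results, NOT values from this study.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
mean_diff = [0.5, -0.1, 0.4, -0.3, 0.9]

print(benjamini_hochberg(pvals))      # adjusted p-values
print(select_degs(pvals, mean_diff))  # indices of selected genes
```

In this example only the first gene is retained: the second passes the adjusted p-value cutoff but its mean difference is too small, and the third and fourth have adjusted p-values of 0.05125, just above the threshold.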
Identification of biomarkers based on machine learning algorithms
Because patients in subgroup 2 had a more severe condition, two machine learning algorithms, LASSO regression and SVM-RFE, were used to screen for biomarkers. From the specific differentially expressed genes of subgroup 2, the LASSO algorithm identified 28 key genes significantly related to the molecular classification (Figure 8A). In addition, 28 genes were identified as biomarkers by the SVM-RFE algorithm (Figure 8B). The seven overlapping genes, TCEAL4, ISG15, RWDD1, ALG5, MRPL20, JTB and LITAF, were finally selected as biomarkers (Figure 9A).
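The overlap strategy can be sketched with scikit-learn on simulated data. Everything specific in this example is an assumption made for illustration: the gene names are placeholders, the LASSO penalty (`alpha=0.05`) is fixed rather than tuned, and the number of genes kept by SVM-RFE is arbitrary. The software and settings used in the study may differ.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 40
genes = [f"gene_{i}" for i in range(n_genes)]  # hypothetical gene names
y = np.repeat([0, 1], n_samples // 2)          # 1 = subgroup 2, 0 = other cases

# Simulated expression: the first five genes carry the class signal.
x = rng.normal(size=(n_samples, n_genes))
x[y == 1, :5] += 2.0

# LASSO: genes with a non-zero coefficient are retained.
lasso = Lasso(alpha=0.05).fit(x, y)
lasso_genes = {g for g, coef in zip(genes, lasso.coef_) if coef != 0}

# SVM-RFE: repeatedly fit a linear SVM and drop the lowest-weighted gene.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1).fit(x, y)
rfe_genes = {g for g, keep in zip(genes, rfe.support_) if keep}

# Candidate biomarkers are the genes selected by both algorithms.
biomarkers = sorted(lasso_genes & rfe_genes)
print(biomarkers)
```

Taking the intersection keeps only genes that both a sparse linear model and a margin-based ranking consider informative, which is the rationale for the seven-gene overlap reported above.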
Diagnostic effectiveness of biomarkers
ROC curves were used to evaluate the diagnostic effectiveness of the subgroup 2 biomarkers. All of the biomarkers showed favorable diagnostic effectiveness in discriminating DCM cases in subgroup 2, with an AUC of 0.979 (95% CI 0.932–1.000) for TCEAL4, 0.869 (95% CI 0.750–0.968) for ISG15, 0.939 (95% CI 0.850–0.996) for RWDD1, 0.955 (95% CI 0.888–1.000) for ALG5, 0.874 (95% CI 0.701–1.000) for MRPL20, 0.966 (95% CI 0.917–0.998) for JTB and 0.953 (95% CI 0.888–0.996) for LITAF (Figure 9B-H).
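For readers unfamiliar with the metric, the sketch below shows one way to compute an AUC with a 95% confidence interval for a single gene. The expression values are invented, and the percentile bootstrap is an assumption made for illustration; the method used to derive the intervals reported above is not stated in this section.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def bootstrap_ci(scores, labels, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = labels.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue  # resample lacks one class, so the AUC is undefined
        stats.append(auc(scores[idx], labels[idx]))
    lower, upper = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lower), float(upper)

# Hypothetical expression of one biomarker, NOT data from this study.
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = subgroup 2, 0 = other cases
scores = [2.1, 1.8, 1.5, 0.9, 1.0, 0.7, 0.4, 0.2]

point = auc(scores, labels)
lower, upper = bootstrap_ci(scores, labels)
print(f"AUC = {point:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

Here 15 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.9375. With so few samples the bootstrap interval is wide, which mirrors the broad intervals reported for some of the biomarkers.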