Supervised classification
Each individual has 57,820 gene expression values that could serve as classification features. Using all of these genes is impractical: the resulting high-dimensional data degrade the performance of conventional machine learning approaches. To reduce dimensionality and improve classification accuracy, we identified differentially expressed genes (DEGs) between the two states and used them as the primary feature vector in each binary classifier. Comparing healthy and T2DM samples yielded 247 DEGs. These 247 genes were used as classification features, and classifier accuracy was evaluated on this subset. SVM, KNN, NN, NB, and RF classifiers were tested, and SVM showed the best performance in our analysis (Table 1).
To obtain a near-optimal feature subset and further improve classification accuracy, feature selection was performed with a combined GA-SVM method, in which a genetic algorithm (GA) acts as the feature selector and the SVM algorithm as the classifier. The GA terminated either when the fitness score reached at least 95% accuracy or when the maximum number of generations was reached. Each run therefore yields a feature subset with which the SVM classifier can distinguish T2DM from normoglycemic subjects with approximately 90% accuracy. The GA-SVM procedure was repeated 100 times, producing 100 feature subsets with prediction accuracy of around 95%. Features were then ranked by the frequency with which each gene appeared in these 100 subsets. Our analysis revealed that 26 genes with a frequency of at least 80% could improve classification accuracy to 94%. These top-ranked genes were extracted, and classification performance was assessed with the classifiers (Table 2).
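The selection loop and the frequency ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ga_select` and `rank_by_frequency`, the population size, mutation rate, crossover scheme, and generation limit are all hypothetical, and the fitness function is a toy stand-in. In the study, the fitness would be the SVM classification accuracy obtained with the selected genes.

```python
import random
from collections import Counter

def ga_select(fitness, n_features, pop_size=20, max_generations=50,
              target=0.95, mutation_rate=0.02, seed=None):
    """Search for a boolean feature mask that maximises fitness(mask).

    Stops when the best score reaches `target` or after `max_generations`.
    All default values are illustrative assumptions.
    Requires n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")
    for _ in range(max_generations):
        scored = sorted(((fitness(m), m) for m in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_mask = scored[0]
        if best_score >= target:            # stop: accuracy target reached
            break
        parents = [m for _, m in scored[:pop_size // 2]]  # keep the top half
        population = []
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]                    # bit-flip mutation
            population.append(child)
    return best_mask, best_score

def rank_by_frequency(masks, names, min_fraction=0.8):
    """Return (name, fraction) for features selected in >= min_fraction of runs."""
    counts = Counter(name for mask in masks
                     for name, keep in zip(names, mask) if keep)
    n_runs = len(masks)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, c / n_runs) for name, c in ranked if c / n_runs >= min_fraction]

# Toy stand-in for SVM accuracy: fraction of positions that agree with a
# hidden "informative" mask (illustrative only, not real expression data).
names = [f"gene_{i}" for i in range(10)]
hidden = [i < 3 for i in range(10)]

def toy_fitness(mask):
    return sum(m == h for m, h in zip(mask, hidden)) / len(hidden)

# Five independent runs here; the study used 100.
runs = [ga_select(toy_fitness, n_features=10, seed=s) for s in range(5)]
masks = [mask for mask, _ in runs]
top = rank_by_frequency(masks, names, min_fraction=0.8)
```

Keeping the fitness function as a caller-supplied argument means the same loop could wrap any classifier; the 0.8 threshold in `rank_by_frequency` mirrors the 80% frequency cut-off reported above.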
In addition, to evaluate the SVM classifier with the top-ranked genes, classification was repeated 100 times with 10-fold cross-validation, and accuracy, sensitivity, and specificity were calculated. Figure S1 in Additional file 1 shows the box plot of this evaluation.
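The repeated cross-validation described above can be illustrated with a short sketch. Several details are assumptions rather than facts from the study: the `fit_predict` interface is hypothetical, predictions are pooled over the 10 folds so that each repeat gives one value per metric, T2DM is coded as the positive class (label 1), and a nearest-centroid rule on toy data stands in for the SVM.

```python
import random

def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

def repeated_kfold(X, y, fit_predict, k=10, repeats=100, seed=0):
    """Run k-fold cross-validation `repeats` times with fresh shuffles.

    fit_predict(X_train, y_train, X_test) must return predicted labels
    (hypothetical interface). Returns one metrics dict per repeat.
    """
    rng = random.Random(seed)
    n = len(y)
    results = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]   # k disjoint folds
        y_pred = [None] * n
        for fold in folds:
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in fold])
            for i, p in zip(fold, preds):
                y_pred[i] = p
        results.append(confusion_metrics(y, y_pred))
    return results

def centroid_classifier(X_train, y_train, X_test):
    """Stand-in for the SVM: assign each sample to the nearest class mean."""
    means = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(means, key=lambda lab: dist2(x, means[lab])) for x in X_test]

# Toy, well-separated data (illustrative only): 20 samples per class.
X = [[0.1 * i] for i in range(20)] + [[10.0 + 0.1 * i] for i in range(20)]
y = [0] * 20 + [1] * 20
results = repeated_kfold(X, y, centroid_classifier, k=10, repeats=3)
```

The per-repeat values in `results` are the kind of data a box plot of accuracy, sensitivity, and specificity would summarise.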
The top-ranked list comprises important genes, including CERK, FGFBP3, ETV5, E2F8, MAFB, and 10 non-coding genes, which were explored for their function and importance in disease. The complete list of genes with Ensembl IDs can be found in Additional file 2.
Unsupervised classification
T2DM is a complex multifactorial disease with diverse abnormalities, so distinct gene expression patterns may all lead to the insulin-resistance phenotype. We therefore asked two questions: 1) do the diabetic participants show different patterns of gene expression, and 2) can T2DM samples be categorized into distinct subgroups with specific gene expression abnormalities? To answer these questions, unsupervised hierarchical clustering was applied using Euclidean distance and the complete linkage method. The top three clusters were isolated and studied (Figure 3). Clusters 1 to 3 consist of 18, 18, and 27 individuals, respectively.
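As an illustration of the clustering step, the sketch below implements agglomerative clustering with Euclidean distance and complete linkage in plain Python and stops when three clusters remain. It is a naive version written for clarity, and it assumes that "top three clusters" means cutting the dendrogram into three groups. The function name and the toy two-dimensional points are hypothetical. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="complete"` would be the usual choice.

```python
import math

def complete_linkage_clusters(samples, n_clusters):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    Repeatedly merges the two closest clusters until `n_clusters` remain.
    The distance between two clusters is the largest pairwise distance
    between their members. Returns clusters as sorted lists of indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Pairwise Euclidean distances between all samples.
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy expression profiles (illustrative only): three obvious groups.
toy_samples = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [21, 0]]
toy_clusters = complete_linkage_clusters(toy_samples, n_clusters=3)
```

Complete linkage tends to produce compact clusters of similar diameter, because a merge is only as cheap as the most distant pair of samples it brings together.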
To study biological differences between clusters, we explored metabolic modeling of each cluster as well as DEGs between each cluster and normal samples. Differences in gene expression and pathways between healthy individuals and all newly diagnosed diabetic patients taken together were small, whereas clustering the patients and comparing each cluster with healthy individuals revealed more DEGs and more perturbed pathways. Each cluster had specific dysregulated genes and pathways that were absent from the other two clusters. A heat map of gene expression in the three clusters is shown in Figure 4. In addition, pathway enrichment analysis was performed on DEGs with an absolute log2 fold change greater than 0.9 in each cluster. The results can be found in Tables S1-S3 of Additional file 1.
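The fold-change filter applied before pathway enrichment can be written in a few lines. The sketch assumes that log2 fold changes have already been computed by whatever differential expression method was used; the function name and the example values are hypothetical. An absolute log2 fold change above 0.9 corresponds to roughly a 1.87-fold change in either direction, since 2^0.9 is approximately 1.87.

```python
def filter_by_log2fc(log2fc_by_gene, threshold=0.9):
    """Keep genes whose absolute log2 fold change is strictly above threshold."""
    return {gene: value for gene, value in log2fc_by_gene.items()
            if abs(value) > threshold}

# Illustrative values only, not results from the study.
example = {"gene_a": 1.2, "gene_b": -0.95, "gene_c": 0.4, "gene_d": -0.9}
selected = filter_by_log2fc(example)
```

Taking the absolute value keeps both up-regulated and down-regulated genes, and the strict inequality excludes a gene that sits exactly at the threshold.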
The analysis demonstrated that cluster 1 had the largest number of perturbed pathways and dysregulated genes among the three clusters. Dysregulation in cluster 1 included down-regulation of DDIT4L, subunits of cytochrome c oxidase (COX), several mitochondrial genes, and ADIPOQ (the adiponectin gene), and up-regulation of several inflammatory genes such as GADD45G, TGFB1, CARD9, IGHA2, IGHG2, IGHA1, IGHD, and MIF. Down-regulation of several mitochondrial genes and COX subunits can reflect mitochondrial dysfunction and oxidative stress. At the metabolic modeling level, perturbations were observed in pathways related to inositol phosphate metabolism, the pentose phosphate pathway, tyrosine metabolism, folate metabolism, acylglyceride metabolism, glutathione metabolism, ROS detoxification, glycerolipid metabolism, acyl-CoA hydrolysis, fatty acid activation, beta-oxidation of fatty acids, sphingolipid metabolism, glycerophospholipid metabolism, chondroitin/heparan sulfate metabolism, purine and pyrimidine metabolism, the carnitine shuttle, the TCA cycle, oxidative phosphorylation, omega-3 and omega-6 fatty acid metabolism, and glycosphingolipid metabolism.
Cluster 2 displayed no significantly perturbed pathway in metabolic modeling. However, expression changes were found in various genes, such as overexpression of SPP1, TNFRSF11B, and FRK and down-regulation of PRKAG3 and ATP2A1. We compared the phenotypic features of individuals in each cluster with those of healthy individuals. Table 3 shows the average value of each feature in the different clusters. In addition, box plots of fasting glucose and fasting insulin values in each diabetic cluster and the normoglycemic group are shown in Figures S2 and S3 of Additional file 1. More information about differences in clinical features between each pair of clusters is provided in Tables S4 and S5 of Additional file 1. This analysis revealed that cluster 2 is very close to the healthy state in terms of blood glucose and insulin levels.
Cluster 3 had the fewest DEGs relative to healthy individuals, although perturbed expression of various important genes, such as MSTN, ErbB3, EGR1, CIDEC, and HK2, was found in this cluster. At the metabolic level, glucose metabolism was perturbed. Dysregulation of branched-chain amino acid (BCAA) metabolism, glycolysis, pyruvate metabolism, the tricarboxylic acid cycle, and glyoxylate/dicarboxylate metabolism, as well as several exchange and transport reactions, was observed. The complete list of DEGs and perturbed reactions in each cluster can be found in Additional file 2.