Supervised classification
Each individual has 57,820 gene expression values that can be regarded as features for classification. Using all of these genes as features is impractical: the resulting high-dimensional data reduce the performance of conventional machine learning approaches. To reduce dimensionality and improve classification accuracy, we identified differentially expressed genes (DEGs) between the two states and used them as the primary feature vector in each binary classification. Comparing healthy and T2DM samples yielded 247 DEGs. These 247 genes were used as classification features, and classifier performance was evaluated on this subset. Support Vector Machine (SVM), k-Nearest Neighbour (KNN), Neural Network (NN), Naïve Bayes (NB), and Random Forest (RF) classifiers were tested, and SVM showed the best performance in our analysis, as shown in Table 1.
Table 1
Discriminative performance of different classifiers between healthy individuals and diabetic patients when the 247 differentially expressed genes were used as features
Method | AUC | ACC | F1 | Precision | Recall |
SVM | 0.889 | 0.838 | 0.806 | 0.788 | 0.825 |
NN | 0.877 | 0.812 | 0.772 | 0.766 | 0.778 |
RF | 0.837 | 0.766 | 0.71 | 0.721 | 0.698 |
NB | 0.801 | 0.734 | 0.717 | 0.634 | 0.825 |
KNN | 0.758 | 0.727 | 0.58 | 0.784 | 0.46 |
Area under the ROC curve (AUC), accuracy (ACC), F1 score, precision, and recall are reported.
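To make the reported metrics concrete, the following minimal Python sketch shows how accuracy, precision, recall, and F1 relate to confusion-matrix counts. The function names and the example counts are illustrative assumptions and are not taken from the study; only the precision and recall of the SVM row in Table 1 come from the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return ACC, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"ACC": accuracy, "Precision": precision, "Recall": recall, "F1": f1}


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration (not study data).
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)

# Consistency check on the SVM row of Table 1 (precision 0.788, recall 0.825).
svm_f1 = f1_score(0.788, 0.825)
print(round(svm_f1, 3))
```

Because F1 is fully determined by precision and recall, the same check can be applied to any row of the table.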
To obtain a near-optimal feature subset and improve classification accuracy, feature selection was performed using a combination of a genetic algorithm (GA) and SVM. This method employs the GA as the feature selector and the SVM algorithm as the classifier. The GA terminated either when the fitness score reached at least 95% accuracy or when the maximum number of generations was reached. Each run therefore yields a subset of features with which the SVM classifier can distinguish T2DM from normoglycemic subjects with approximately 90% accuracy. The GA-SVM procedure was repeated 100 times, producing 100 feature subsets with prediction accuracy around 95%. Features were then ranked by the frequency with which each gene appeared in these 100 subsets. Our analysis revealed that the 26 genes with a frequency of at least 80% improved classification accuracy to 94%. These top-ranked genes were extracted, and classification performance was assessed with the same classifiers (Table 2).
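The selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GA operators (truncation selection, one-point crossover, bit-flip mutation, elitism), the population size, and the mutation rate are all hypothetical choices, and `toy_fitness` is a stand-in for the cross-validated SVM accuracy that would serve as the fitness score in practice. Only the stopping rule (fitness of at least 0.95 or a maximum number of generations), the 100 repetitions, and the 80% frequency cut-off come from the text.

```python
import random
from collections import Counter


def run_ga(n_features, fitness, target=0.95, max_generations=30,
           pop_size=20, mutation_rate=0.05, rng=None):
    """Return the best boolean feature mask found by a simple elitist GA."""
    rng = rng or random.Random()
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        if fitness(best) >= target:  # stop once the fitness target is met
            break
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [best]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]  # bit-flip mutation
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


def rank_by_frequency(masks, min_fraction=0.8):
    """Return indices of features selected in at least min_fraction of masks."""
    counts = Counter(i for mask in masks for i, keep in enumerate(mask) if keep)
    return sorted(i for i, c in counts.items() if c / len(masks) >= min_fraction)


INFORMATIVE = {0, 1, 2, 3, 4}  # hypothetical "useful" feature indices


def toy_fitness(mask):
    """Stand-in for SVM accuracy: reward informative features, penalise noise."""
    selected = {i for i, keep in enumerate(mask) if keep}
    hits = len(selected & INFORMATIVE)
    noise = len(selected - INFORMATIVE)
    return hits / len(INFORMATIVE) - 0.01 * noise


rng = random.Random(0)
masks = [run_ga(30, toy_fitness, rng=rng) for _ in range(100)]  # 100 GA runs
top_features = rank_by_frequency(masks)  # kept in >= 80% of the runs
print(top_features)
```

Ranking by selection frequency across independent runs favours features that the stochastic search finds repeatedly, which is why a frequency threshold is applied rather than taking the subset from any single run.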
Table 2
Performance of different classifiers when the 26 top-ranked genes were used as the feature subset
Method | AUC | ACC | F1 | Precision | Recall |
SVM | 0.958 | 0.942 | 0.927 | 0.950 | 0.905 |
NN | 0.966 | 0.903 | 0.878 | 0.900 | 0.857 |
NB | 0.896 | 0.818 | 0.791 | 0.746 | 0.841 |
RF | 0.836 | 0.799 | 0.735 | 0.796 | 0.683 |
KNN | 0.829 | 0.721 | 0.538 | 0.833 | 0.397 |
In addition, to evaluate the SVM classifier with the top-ranked genes, classification was repeated 100 times with 10-fold cross-validation, and accuracy, sensitivity, and specificity were calculated. Figure 1 shows the box plot of this evaluation.
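The repeated cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the splitter is a plain (unstratified) shuffle, `nearest_centroid` is a deliberately simple stand-in for the SVM, and the toy data are hypothetical. The sketch pools predictions over the 10 folds of each repeat and returns one accuracy, sensitivity, and specificity value per repeat, which is the kind of per-repeat distribution a box plot summarises.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for f in range(k):
            test = order[f::k]  # every k-th sample forms one fold
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield r, train, test


def evaluate(X, y, fit_predict, k=10, repeats=100, seed=0):
    """Return per-repeat accuracy, sensitivity and specificity (label 1 = positive)."""
    counts = [dict(tp=0, fp=0, tn=0, fn=0) for _ in range(repeats)]
    for r, train, test in repeated_kfold(len(y), k, repeats, seed):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        for i, p in zip(test, preds):
            key = ("t" if p == y[i] else "f") + ("p" if p == 1 else "n")
            counts[r][key] += 1
    return [{"accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
             "sensitivity": c["tp"] / (c["tp"] + c["fn"]),
             "specificity": c["tn"] / (c["tn"] + c["fp"])} for c in counts]


def nearest_centroid(train_X, train_y, test_X):
    """Stand-in classifier (not an SVM): assign the class with the closer mean."""
    def centroid(label):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        return [mean(col) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c0, c1 = centroid(0), centroid(1)
    return [1 if sqdist(x, c1) < sqdist(x, c0) else 0 for x in test_X]


# Hypothetical, well-separated toy data: 20 samples per class, 2 features.
X = [[0.1 * i, 0.05 * i] for i in range(20)] + \
    [[5.0 + 0.1 * i, 3.0 + 0.05 * i] for i in range(20)]
y = [0] * 20 + [1] * 20
results = evaluate(X, y, nearest_centroid)
print(mean(r["accuracy"] for r in results))
```

In practice, a library routine such as scikit-learn's `RepeatedStratifiedKFold` could replace the hand-written splitter and would also keep class proportions balanced within each fold.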
This top-ranked list comprises important genes, including CERK, FGFBP3, ETV5, E2F8, and MAFB, as well as 10 non-coding genes, which were explored for their functionality and importance in disease. The complete list of genes with Ensembl IDs can be found in Additional File 1.
Unsupervised classification
T2DM is a multifactorial, complex disease with diverse abnormalities; it may therefore be possible to find distinct groups of gene expression patterns that all lead to the insulin-resistant phenotype. In this study, we asked two questions: 1) do the diabetic participants show different patterns of gene expression, and 2) is it possible to categorize T2DM samples into distinct subgroups with specific gene expression abnormalities? To answer these questions, unsupervised hierarchical clustering was performed using Euclidean distance and the complete linkage method. The top three clusters were isolated and studied (Fig. 2).
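As an illustration of the clustering step, the following minimal Python sketch implements agglomerative clustering with Euclidean distance and complete linkage, stopping when three clusters remain. It is a naive reference version written for clarity, not the software used in the study, and the two-dimensional points are hypothetical stand-ins for per-sample expression profiles.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with Euclidean distance and complete linkage."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest linkage distance.
        _, i, j = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[i] += clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical toy profiles forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
clusters = complete_linkage(points, n_clusters=3)
print(clusters)
```

For real expression matrices, SciPy's `scipy.cluster.hierarchy.linkage(X, method="complete", metric="euclidean")` followed by `fcluster(Z, t=3, criterion="maxclust")` would be a more efficient way to obtain an equivalent three-cluster cut.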
To study biological differences between clusters, we explored metabolic modeling of each cluster as well as differentially expressed genes between each cluster and normal samples. Differences in gene expression and pathways between healthy individuals and all newly diagnosed diabetic patients taken together were small, whereas clustering the patients and comparing each cluster with healthy individuals revealed more differentially expressed genes and more perturbed pathways. Each cluster had specific dysregulated genes and pathways that were not found in the other two clusters.
The analysis demonstrated that cluster 1 had the largest number of perturbed pathways and dysregulated genes among the three clusters. Dysregulation in cluster 1 included down-regulation of DDIT4L, subunits of cytochrome c oxidase (COX), several mitochondrial genes, and ADIPOQ (the adiponectin gene), and up-regulation of several inflammatory genes such as GADD45G, TGFB1, CARD9, IGHA2, IGHG2, IGHA1, IGHD, and MIF. Down-regulation of several mitochondrial genes and COX subunits can reflect mitochondrial dysfunction and oxidative stress. At the metabolic modeling level, perturbations were observed in pathways related to inositol phosphate metabolism, the pentose phosphate pathway, tyrosine metabolism, folate metabolism, acylglyceride metabolism, glutathione metabolism, ROS detoxification, glycerolipid metabolism, acyl-CoA hydrolysis, fatty acid activation, beta oxidation of fatty acids, sphingolipid metabolism, glycerophospholipid metabolism, chondroitin/heparan sulfate metabolism, purine and pyrimidine metabolism, the carnitine shuttle, the TCA cycle, oxidative phosphorylation, omega-3 and omega-6 fatty acid metabolism, and glycosphingolipid metabolism.
Cluster 2 displayed no significantly perturbed pathways in metabolic modeling. However, changes in the expression of various genes were found, such as overexpression of SPP1, TNFRSF11B, and FRK and down-regulation of PRKAG3 and ATP2A1. We compared the phenotypic features of individuals in each cluster with those of healthy individuals. Table 3 shows the average value of each feature in the different clusters. In addition, box plots of fasting glucose and fasting insulin values in each diabetic cluster and the normoglycemic group are shown in Figs. 3 and 4. This analysis revealed that cluster 2 is very close to the healthy state in terms of blood glucose and insulin levels.
Table 3
Average fasting plasma glucose, fasting serum insulin, BMI, and waist/hip ratio (WHR) in each diabetic cluster and the healthy group
Group | Glucose (mmol/L) | Insulin (mU/L) | BMI | WHR |
Healthy | 5.62 | 6.87 | 26.35 | 0.92 |
Cluster 1 | 7.17 | 10.19 | 29.03 | 0.99 |
Cluster 2 | 6.86 | 7.79 | 28.58 | 0.95 |
Cluster 3 | 7.39 | 12.93 | 30.13 | 1.02 |
Cluster 3 had the fewest DEGs relative to healthy individuals, although perturbed expression of various important genes, such as MSTN, ErbB3, EGR1, CIDEC, and HK2, was found in this cluster. At the metabolic level, glucose metabolism was perturbed. Dysregulation of branched-chain amino acid (BCAA) metabolism, glycolysis, pyruvate metabolism, the tricarboxylic acid cycle, and glyoxylate/dicarboxylate metabolism, as well as of several exchange and transport reactions, was observed. The complete list of differentially expressed genes in each cluster with associated GO terms can be found in Additional File 1.