A model was proposed to predict Type-2 diabetes using machine learning algorithms (MLAs). Type-2 diabetes is a chronic form of diabetes that is traditionally classed as non-insulin-dependent [6]. The dataset was collected through an online survey of 952 participants, 372 female and 580 male, all older than 18; the PIDD dataset was also used for comparison. Six MLAs were applied: Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Decision Tree and Random Forest. The dataset was split 75:25 into training and testing sets for binary classification, and the highest accuracy, 94.10%, was obtained with Random Forest (RF).
Amelie Viloria et al. [7] proposed a methodology for diagnosing diabetes with an SVM, using patient attributes such as DM, age, BMI and BG. The dataset was collected from a hospital in Colombia and contained 500 patient records, both male and female. The output variable takes three values: diabetes, no diabetes, and predisposition to diabetes. 80% of the dataset was used to train a non-linear SVM classifier and the remaining 20% was used for validation. 10-fold cross-validation was used to validate the computational model, and a confusion matrix was used to measure performance. The highest accuracy obtained with the SVM was 92.2%.
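The evaluation protocol described for the SVM study [7], an 80/20 split followed by 10-fold cross-validation, can be sketched in plain Python. This is a minimal illustration and not the authors' code: the function names, the fixed seed, and the choice to run the folds over the training portion only are assumptions. Only the record count (500) and the split ratio come from the description above.

```python
import random


def split_indices(n_records, train_fraction, seed=0):
    """Shuffle record indices and split them into training and validation parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed (assumed) for repeatability
    cut = int(round(n_records * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold(indices, k):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out


# 500 patient records: 80% for training, 20% for validation.
train, validation = split_indices(500, 0.80)
folds = list(k_fold(train, 10))

print(len(train), len(validation))                      # 400 100
print(len(folds), len(folds[0][0]), len(folds[0][1]))   # 10 360 40
```

In each of the ten rounds a classifier would be fitted on the 360 `fit` records and scored on the 40 held-out records, so every training record is held out exactly once.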
Deepti Sisodia and Dilip Singh Sisodia proposed a model to predict diabetes using classification algorithms [8], with the aim of maximizing accuracy. Three machine learning classification algorithms, Decision Tree (DT), Support Vector Machine (SVM) and Naïve Bayes (NB), were used to classify the data. The data came from the Pima Indians Diabetes Dataset (PIDD), with 768 instances and 8 attributes. The class label is binary, with “1” denoting a positive and “0” a negative test result, and the data were analyzed with the WEKA tool. All algorithms were evaluated after pre-processing, using measures that included F-measure and recall. Naïve Bayes achieved the highest accuracy, 76.30%.
Swapna G et al. [9] proposed a methodology for diabetes detection using deep learning algorithms (DLAs). The model used three DLAs: a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and a hybrid combination of the two (CNN-LSTM). The input data were Heart Rate Variability (HRV) signals derived from electrocardiograms (ECG). A Support Vector Machine (SVM) was used for classification. The reported performance gains for CNN and CNN-LSTM were 0.03% and 0.06%, respectively, relative to the results before the SVM was applied. The maximum accuracy of this system in diagnosing diabetes from ECG signals was 95.7%.
Soman KP et al. [10] proposed an architecture for automated detection of diabetes using CNN and LSTM networks on heart rate variability (HRV) signals. Deep learning techniques were used to extract features and classify the data. The input data were split into a training set and a testing set. 5-fold cross-validation was used, with results summarized in a confusion matrix. The cross-validation accuracy was 93.6%, and the detailed accuracy reported was 95.1%.
In the output of this model, 0 denotes diabetes and 1 denotes no diabetes.
Aishwarya Mujumdar and Dr. Vaidehi V proposed a diabetes prediction model (DPM) using MLAs. The DPM has five modules: Dataset Collection, Data Pre-Processing, Clustering, Model Building, and Evaluation [11]. The dataset was the Pima Indian Diabetic Dataset, with 800 records and 8 attributes. Pre-processing handled all missing values, after which the clustering module applied the K-means clustering algorithm to the dataset. The model-building module included the following algorithms: Support Vector Classifier, AdaBoost Classifier, Decision Tree Classifier, Random Forest Classifier, Extra Tree Classifier, Logistic Regression, K-Nearest Neighbor, Gaussian Naïve Bayes, and Gradient Boost Classifier. Finally, classification performance was evaluated using a confusion matrix and the F1-score. The highest accuracy, 98.8%, was obtained with the AdaBoost Classifier.
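The confusion matrix and F1-score used in the evaluation module can be computed as in the following sketch. It is written in plain Python for illustration only: the function name and the ten toy labels are hypothetical and are not data from [11], and treating label 1 as the positive class is an assumption that can be changed through the `positive` argument.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return confusion-matrix counts and derived scores for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (hypothetical): 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

scores = binary_metrics(y_true, y_pred)
print(scores["confusion_matrix"])        # [[5, 1], [1, 3]]
print(scores["accuracy"], scores["f1"])  # 0.8 0.75
```

Accuracy and F1 differ here because F1 ignores true negatives, which is why studies on class-imbalanced data often report both.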
Han Wu et al. [12] addressed the prediction of type-2 diabetes mellitus (T2DM) using data mining techniques. T2DM is the non-insulin-dependent form of diabetes. The study developed a novel prediction model for T2DM using two data mining algorithms, Logistic Regression and K-means clustering. The Pima Indian Diabetes Dataset (PIDD) was used for classification and prediction [13]. The dataset contains 768 patient records, of which 268 tested positive and 500 tested negative. The WEKA toolkit was used for data pre-processing and analysis, and K-fold cross-validation was used to verify model performance. The model was further evaluated on data collected through an online questionnaire, which contained 384 instances: 68 positive and 316 negative. The model's accuracy was 3.04% higher than results reported by other researchers, and the maximum accuracy obtained was 94%.
Quan Zou et al. [14] studied the prediction of diabetes mellitus with machine learning techniques. Two datasets were used. The first was collected from a hospital in Luzhou, China, and is divided into healthy people and people with diabetes. The second was the PIDD, in which all patients are female and at least 21 years old. Samples were randomly selected from these datasets [15]. Three classification algorithms, Decision Tree, Random Forest and Neural Network, were applied, and their results were compared across the Luzhou hospital dataset and the PIDD. The highest accuracy for predicting diabetes mellitus, 0.8084, was obtained with RF.
Tanha Mehboob Alam et al. aimed to build a model for early prediction of diabetes using data mining techniques [16]. The dataset was originally obtained from the National Institute of Diabetes and Digestive and Kidney Diseases and is publicly available. Inconsistency, noise and missing values were removed from the dataset in three steps: data preprocessing, data cleaning and data reduction. Three methods were used: Artificial Neural Network (ANN), Random Forest (RF) and K-means clustering. The highest prediction accuracy, 75.7%, was obtained with the ANN.
Changsheng Zhu et al. [17] set out to design a data mining-based model for early diagnosis and prediction of diabetes using Principal Component Analysis (PCA) and K-means clustering. The data came from the PIDD and comprised 768 samples from female patients, of which 500 tested negative and 268 tested positive. The dataset has 8 attributes and one class label. Pre-processing was applied to remove noisy, inconsistent and missing values. K-means clustering and LR were then applied for classification [18]. The best accuracy of this model was 89.0%.
Tejas and M. Chawan review diabetes as a chronic disease with a significant impact on global healthcare. They highlight statistics from the International Diabetes Federation on the growing number of people living with diabetes worldwide [19]. The review acknowledges that early prediction of diabetes is difficult because the disease depends on many interrelated factors and adversely affects multiple organs. It then introduces the application of data science methods, particularly machine learning, in the medical field to improve prediction and diagnosis.
The review outlines the aim of the project, which is to develop a system that combines several machine learning techniques, including SVM, logistic regression, and ANN [20], to achieve accurate early prediction of diabetes. The goal is to contribute to the development of effective techniques for earlier detection of the disease.
Jobeda Jamal Khanam and Simon Y. Foo studied diabetes prediction using data mining, machine learning algorithms, and a neural network method [21]. The study used the Pima Indian Diabetes dataset from the UCI Machine Learning Repository, which includes 768 patients and nine attributes. Of the seven machine learning algorithms applied, Logistic Regression (LR) and Support Vector Machine (SVM) performed well in predicting diabetes. In addition, a neural network model with two hidden layers achieved an accuracy of 88.6% after different numbers of epochs were tested.
Diabetes is a pervasive global health concern that results in elevated blood sugar levels and various complications. Despite efforts to develop accurate diabetes prediction models, researchers face significant challenges due to the scarcity of suitable data sets and prediction approaches [22]. To address these issues, the study employs big data analytics and machine learning methods to explore predictive analytics in healthcare.
The primary objective is to construct an intelligent framework specifically tailored for diabetes prediction. By utilizing decision tree-based random forest and support vector machine models, the researchers propose the Intelligent Diabetes Mellitus Prediction Framework (IDMPF) [23]. This framework is the result of a comprehensive literature review, offering promising results with an accuracy rate of 83%. The findings of this study have implications for healthcare professionals, stakeholders, and researchers involved in diabetes prediction research and development [24].
Table 1
Background of Relevant Literature
Year | Techniques | Dataset | Reported accuracy |
2018 | DT, SVM, NB | PIDD | 76.30% |
2018 | RNN, LSTM, CNN, CNN-LSTM, SVM | ECG Signals | 95.7% |
2018 | CNN, LSTM, CNN-LSTM | ECG Signals | 95.1% |
2018 | K-means, LR | Private + PIDD | 95.42% |
2018 | DT, RF, NN | Private + PIDD | 80.84% |
2019 | SVM, ABC, DT, RF, ET, LR, KNN, GNB, GB | Private + PIDD | 98% |
2019 | ANN, RF, K-means | NIDD | 75.7% |
2019 | PCA, K-means, LR | PIDD | 97.40% |
2020 | LR, KNN, SVM, NB, DT, RF | Private + PIDD | 94.4% |
2020 | SVM | Private | 95.36% |
2021 | LR, SVM | PIDD | 88.6% |
2022 | RF, SVM, KNN, LR | PIDD | 86% |
The findings summarized in Table 1 contribute to a broader understanding of the subject and provide a foundation for further analysis. Synthesizing and comparing the results of the different studies makes it possible to identify patterns, trends, and correlations across techniques and datasets. This gives a comprehensive view of the topic and lets researchers draw conclusions and make informed decisions based on the collective knowledge from these sources.