3.1 Dataset used for this proposed work.
To prepare the database presented in this report, the literature on diabetes was surveyed, together with ongoing research aimed at improving diagnostic performance in this field [11]. To carry this research forward, we consulted a diabetologist and discussed our work. After detailed study, we concluded that 10 parameters play a vital role in detecting the disease. Based on these attributes, a dataset of around four hundred people drawn from a wide geographical area was prepared, with attention paid to both the quality and the diversity of the data. The dataset is divided into two parts: diabetic and non-diabetic. It was collected from urban and rural communities, mainly from the upper and lower classes, and covers people of various age groups with varied eating habits, heavy smokers and non-smokers, drinkers and non-drinkers, and so on. Ages range from a minimum of 5 years to a maximum of 78 years. In addition, discrete values of 0 and 1 are assigned to the categorical attributes so that the results can be analyzed and the records kept uniform.
After a detailed literature study of the disease, followed by consultation with the medical expert, ten physiological parameters were identified as significant factors pertaining to diabetes: Age, Family history, Weight, Gender, Drinking, Smoking, Thirst, Frequency of urination, Height and Fatigue. The primary data were obtained from multiple healthcare centres and by door-to-door collection using questionnaires, and were then used in the learning and cross-validation phases of the expert system.
The selected parameters are given in Table 1 below, which also reports the interval-frequency distribution of the parameter values.
Table 1: Details of the various parameters used and their analysis
Parameter | Description | Range of values | Analysis of data
Age | Age of the person | 5 to 78 | Age 5 to 20: 30; Age 21 to 35: 131; Age 36 to 50: 142; Age 51 to 78: 97
Gender | Gender of the person | 0 or 1 | Male: 190 (represented by 1); Female: 210 (represented by 0)
Smoking | Whether the person is a smoker or not | 0 or 1 | Smokers: 68; Non-Smokers: 332
Drinking | Drinker or non-drinker | 0 or 1 | Drinkers: 79; Non-Drinkers: 321
Urination | How many times the person urinates in a day | 1-15 times | 1-5: 195; 6-10: 153; 11-15: 52
Thirst | How many times the person drinks | 1-15 times | 1-5: 112; 6-10: 196; 11-15: 92
Height | Height of the person | 60-185 cm | 60-95: 7; 96-125: 9; 126-155: 119; 156-185: 265
Fatigue | Healthy levels of fat mass for a fit person | 0 or 1 | Fatigue (Yes): 276; Fatigue (No): 124; Min: 5% in men, 12% in women; Max: 25% in men, 32% in women; Average: 15 to 18% in men, 22 to 25% in women
Weight | Weight of the person | 15 to 96 kg | 15-36: 13; 37-56: 110; 57-76: 244; 77-96: 33; Average weight: 62 kg; Overweight: 34.7%
Family History | Whether any person in the family is diabetic | 0 or 1 | Family History (Yes): 116; Family History (No): 284
Diabetic | Whether the person is diabetic | 0 or 1 | Diabetic: 149; Non-Diabetic: 251
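As a quick consistency check on Table 1, the sketch below copies the interval-frequency counts from the table and verifies that every parameter accounts for the same 400 records. The dictionary layout and the function names are illustrative assumptions, not part of the original study.

```python
# Interval-frequency counts copied from Table 1.
# The dictionary layout and category labels are illustrative assumptions.
TABLE_1 = {
    "Age": {"5-20": 30, "21-35": 131, "36-50": 142, "51-78": 97},
    "Gender": {"male": 190, "female": 210},
    "Smoking": {"yes": 68, "no": 332},
    "Drinking": {"yes": 79, "no": 321},
    "Urination": {"1-5": 195, "6-10": 153, "11-15": 52},
    "Thirst": {"1-5": 112, "6-10": 196, "11-15": 92},
    "Height": {"60-95": 7, "96-125": 9, "126-155": 119, "156-185": 265},
    "Fatigue": {"yes": 276, "no": 124},
    "Weight": {"15-36": 13, "37-56": 110, "57-76": 244, "77-96": 33},
    "Family History": {"yes": 116, "no": 284},
    "Diabetic": {"yes": 149, "no": 251},
}


def check_totals(table, expected=400):
    """Return the parameters whose counts do not add up to `expected`."""
    return [name for name, counts in table.items()
            if sum(counts.values()) != expected]


def class_share(table, parameter="Diabetic", category="yes"):
    """Fraction of records that fall into one category of a parameter."""
    counts = table[parameter]
    return counts[category] / sum(counts.values())


print(check_totals(TABLE_1))           # [] (every row sums to 400)
print(round(class_share(TABLE_1), 4))  # 0.3725 (149 of 400 are diabetic)
```

The diabetic share of 37.25% shows that the two classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures.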
Of all the attributes considered in this research, the age of the person plays the foremost role in the analysis. Past records indicate that people aged 30-35 are at high risk of type II diabetes, while people above 35 are prone to type I diabetes. Furthermore, owing to the unhealthy lifestyle of today's generation, the disease has been seen to be spreading more among children than among adults. Family history also plays an important role in detecting the disease: a person with a family background of diabetes, whether a child or an adult, is more prone to developing it. People suffering from the disease are generally unable to maintain a healthy lifestyle, and as a result their pancreas becomes incapable of producing enough insulin to regulate the body's glucose level. Tiredness, change in weight and frequent thirst are therefore common symptoms in a diabetic person. Research in this medical domain reports that a person who eats much more than required and has increased hunger or appetite generally has a high level of ammonia in valproic acid therapy [12]. Studies also show that being obese or excessively overweight is clear evidence of diabetes. Researchers have observed, however, that people who undergo bariatric surgery not only lose weight but also keep it off for three years or more, so this method can also be used in treating diabetic patients [13].
3.2 Considered algorithms for study.
The diabetes diagnosis design is powered by four algorithms, namely:
- Artificial Neural Networks
- K-nearest neighbor
- Naïve Bayes
- Support vector machine
Artificial Intelligence algorithms can be used in various domains, such as the medical domain, AI-based games such as ‘AlphaGo’, AI-based vehicles such as ‘Spirit’, robotic surgeons, and the AI-based chess player ‘Deep Blue’. In all these areas AI has shown remarkable performance, which reflects that a machine can think and take intelligent decisions like humans [14]. The four algorithms listed above are briefly described below.
3.2.1 Artificial Neural Networks
An artificial neural network is inspired by the biological network of neurons. The algorithm has proven effective in modelling data: it is an efficient way of capturing, representing and mimicking complicated relationships between the inputs and the possible outcomes, acting as a collection of parallel estimators. In an artificial neural network, the neurons are combined in various layers, of which the first is the input layer, the second is the hidden layer and the third yields the outcome. The neuron weights are adjusted repeatedly until the desired outcome is achieved. Figure 3 represents the structure of a typical artificial neuron.
A neural network can be single-layer or multilayer; a multilayer network contains one or more hidden layers. Figure 4 depicts a neural network with two hidden layers. The input-layer nodes are passive: they do nothing but forward the input values to multiple outputs, whereas the hidden-layer and output-layer nodes are active and do the actual processing.
The inputs move in the forward direction, from the input layer to the hidden layer, where they are scaled by the adjustable weights, and then yield the output. ANN performance is improved by using the error back-propagation algorithm, in which learning is applied in two phases.
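The repeated weight adjustment described above can be illustrated with a single artificial neuron. The following is a minimal sketch, not the network used in this work: it trains one sigmoid neuron by gradient descent on toy data (logical OR), and the learning rate, epoch count and function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation that squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by the sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, targets, lr=0.5, epochs=5000):
    """Adjust the weights repeatedly by gradient descent on squared error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            out = neuron_output(x, weights, bias)
            # Error signal passed back through the sigmoid derivative.
            delta = (t - out) * out * (1.0 - out)
            weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
            bias += lr * delta
    return weights, bias


# Toy data (logical OR), not the diabetes dataset.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
weights, bias = train_neuron(samples, targets)
predictions = [round(neuron_output(x, weights, bias)) for x in samples]
print(predictions)
```

A multilayer network applies the same idea layer by layer: the forward pass computes the outputs, and the backward pass propagates the error signal to the hidden-layer weights.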
3.2.2 Naïve Bayes
The Naive Bayes algorithm is based on three well-known concepts: the prior, the likelihood and the posterior (the prediction). The prior is the probability of an incidence estimated from the record of past data, the likelihood is the probability of the observed evidence given that incidence, and the posterior indicates whether the same incidence can occur in future.
Posterior = Prior × Likelihood / Evidence
Mathematically, this is interpreted as
$$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(A)}$$
In this algorithm, learning is done by presuming that the attributes are independent of one another given the class [25]. The algorithm is founded on this independence assumption between the parameters, which is why it is called naive. In accordance with Bayes' theorem, the hypothesis N can be evaluated on the basis of the prior for N and the evidence provided by K, as given by the formula
$$P(N \mid K) = \frac{P(K \mid N)\,P(N)}{P(K)}$$
P(N): prior probability of the hypothesis N.
P(K): prior probability of the unclassified data K.
P(N|K): probability of N given K.
P(K|N): probability of K given N.
The estimate obtained for the given data is a probability. This classifier has been reported to compare favourably with other classification methods in terms of feasibility, efficiency, accuracy and the numbers of correctly and incorrectly classified instances. Naive Bayes is a supervised learning algorithm, i.e. learning with a teacher: future predictions are made on the basis of the record of past incidences [15].
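To make the formula concrete, the sketch below estimates P(N) and P(K|N) from a handful of binary records and applies Bayes' rule under the independence assumption. The six toy records, the choice of two attributes (family history, fatigue) and the Laplace smoothing are assumptions for illustration only; they are not taken from the study's dataset.

```python
def fit(rows, labels):
    """Estimate priors P(N) and per-attribute likelihoods P(x_i = 1 | N)."""
    priors, likelihoods = {}, {}
    n_attrs = len(rows[0])
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        priors[c] = len(members) / len(rows)
        # Laplace smoothing (+1 / +2) avoids zero probabilities.
        likelihoods[c] = [
            (sum(r[i] for r in members) + 1) / (len(members) + 2)
            for i in range(n_attrs)
        ]
    return priors, likelihoods


def posterior(x, priors, likelihoods):
    """Return P(N | K) for every class N, given the binary record K = x."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        # Naive assumption: attributes are independent given the class.
        for xi, theta in zip(x, likelihoods[c]):
            p *= theta if xi == 1 else 1.0 - theta
        scores[c] = p
    evidence = sum(scores.values())  # P(K)
    return {c: s / evidence for c, s in scores.items()}


# Hypothetical records: (family_history, fatigue) -> diabetic (1) or not (0).
rows = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
labels = [1, 1, 1, 0, 0, 0]
priors, likelihoods = fit(rows, labels)
print(posterior((1, 1), priors, likelihoods))
```

For the record (1, 1) the unnormalised scores are 0.5 × 0.8 × 0.6 = 0.24 for the diabetic class and 0.5 × 0.2 × 0.4 = 0.04 for the non-diabetic class, so the posterior for the diabetic class is 0.24 / 0.28, or about 0.857.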
3.2.3 K-nearest neighbor
KNN is easy to understand and works remarkably well in practice. In this method, groups of neighbouring points are formed and respond to the input vector. In one or two dimensions it is simple to handle an unknown dataset with self-organizing maps, which group the input data into clusters. The algorithm is used for classifying unknown data and can also be applied to regression problems. For example, given N training vectors and a chosen value of k (say k = 3), the algorithm selects the 3 training vectors closest to the unclassified record and predicts the class of the test record by majority vote. The algorithm is also known as a lazy learner: it stores the training data and, when queried, compares the test record with the stored training records for similarity. The KNN model can classify the data using only a few attributes. A recent study reports that the mean rate was decreased by 92.34% by using this model [16]. Figure 1 depicts the flow diagram of k-nearest neighbor, where ‘k’ is the number of nearest neighbours considered. Commonly used distance metrics are Euclidean, Manhattan and Minkowski; in this work, Euclidean distance is mainly used. The higher the value of k, the harder it becomes to distinguish between the classes.
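The voting procedure and the distance metrics mentioned above can be sketched in a few lines. This is a minimal illustration on made-up two-dimensional points, not the implementation used in the study; the function names and the toy data are assumptions.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train, labels, query, k=3, p=2):
    """Predict the class of `query` by majority vote of its k nearest rows."""
    # Lazy learning: nothing is fitted; distances are computed at query time.
    ranked = sorted((minkowski(row, query, p), y)
                    for row, y in zip(train, labels))
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-attribute records with class labels 0 and 1.
train = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train, labels, (2, 2)))      # 0
print(knn_predict(train, labels, (8.5, 8.5)))  # 1
```

With binary classes, an odd value of k such as 3 avoids tied votes.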
3.2.4 Support vector machine
Support Vector Machine is a supervised machine learning algorithm used for classifying different types of datasets. It is founded on the structural risk minimization principle from statistical learning theory, and uses a hyperplane to divide two classes of data. SVM revolves around the notion of a “margin”, i.e. the region on either side of the line that separates the two classes. Maximizing the margin creates the largest possible distance between the separating hyperplane and the instances on either side of it, which reduces an upper bound on the error. Data are of two types: linearly separable and linearly non-separable. In the former case a single hyperplane is sufficient to separate the data, whereas in the latter case more than one hyperplane is needed. Although SVMs can produce nonlinear boundaries of arbitrary complexity, we limit ourselves in this paper to the linear SVM. The study presented in this paper surveys machine learning algorithms and diagnoses the disease with a support vector machine. The research reviewed shows that this algorithm performs well in the medical domain and can provide higher accuracy than other artificial intelligence algorithms used there. Before 2005, the algorithm was not used in most fields because it was considered incapable of predicting accurate results. Recent studies, however, show that the support vector machine has seen tremendous growth in fields such as medicine, automation, imaging, sensors, game design and aerospace. Owing to its accurate predictions, it now plays an important role across research fields and should remain useful for future research [17].
Instances are plotted on either side of the hyperplane, which reduces the upper bound on the classification error. Fig. 5 depicts the hyperplane that divides the two classes.
The algorithm involves two elements: the hyperplane and the margin line. As a rule of thumb, the correct hyperplane is the one that separates the two classes (shown as stars and circles) while maximizing the distance to the nearest data points; this distance is called the margin. To improve the performance of the SVM model, parameters such as “Kernel”, “Gamma” and “C” are used.
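The margin criterion can be illustrated without training a full SVM. The sketch below computes the geometric margin of two candidate hyperplanes on a made-up, linearly separable dataset and keeps the one with the larger margin. It only demonstrates the selection rule; an actual SVM solves an optimization problem to find the maximum-margin hyperplane, and all names and values here are assumptions.

```python
import math


def decision(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one point
    lies on the wrong side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in zip(points, labels))


# Hypothetical linearly separable data with labels +1 and -1.
points = [(2, 2), (3, 3), (0, 0), (1, 0)]
labels = [1, 1, -1, -1]

# Two candidate separating hyperplanes, each given as (w, b).
candidates = {
    "A": ((1.0, 1.0), -2.5),
    "B": ((1.0, 0.0), -1.5),
}
margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates classify every point correctly, but hyperplane A leaves a margin of about 1.06 against 0.5 for B, so A is preferred under the maximum-margin rule.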
3.3 PROPOSED MODEL
The proposed model implements the ensemble methodology, a novel method used to improve and predict the performance of an expert system for the diagnosis of diabetes.
Ensemble Method
In the ensemble method, the outcome of each algorithm is taken and the final prediction is made by majority vote. This technique improves on the accuracy and efficiency of the individual classifiers, so that it can serve as an efficient tool for the prognosis of diabetes. The results of the different algorithms are clubbed together to predict the result shown in the interface. If a particular record is incorrectly classified by one algorithm, the error is easily rectified by the other algorithms through the majority of the votes cast by the individual classifiers. The ensemble classification task is completed by constructing a number of classifiers during the training phase and predicting the outcome as the mode of the predictions of the algorithms used. The technique enhances the results by combining the outputs of the four algorithms and predicting the output by maximum voting [18].
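The voting step can be sketched as follows. The four predictions stand for the outputs of the ANN, KNN, Naive Bayes and SVM classifiers on one record. With an even number of voters a 2-2 tie is possible; the source does not say how ties are resolved, so the `tie_break` parameter (defaulting to the diabetic class, a cautious choice for screening) is an assumption.

```python
from collections import Counter


def majority_vote(predictions, tie_break=1):
    """Return the mode of the class labels predicted by the classifiers.

    `tie_break` is returned when the two most frequent labels receive the
    same number of votes (an assumption; the source does not specify this).
    """
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Hypothetical outputs of the four classifiers for three records.
print(majority_vote([1, 1, 0, 1]))  # 1: one wrong vote is outvoted
print(majority_vote([0, 0, 1, 0]))  # 0
print(majority_vote([1, 0, 1, 0]))  # 1: tie resolved by tie_break
```

The first call shows the error-correcting effect described above: a single classifier that disagrees with the other three does not change the ensemble output.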
OBJECTIVES & THEIR DESIRED RESULTS
The objectives of the proposed study are as under:
1. To study the existing methods and algorithms in machine learning and understand their working, advantages, disadvantages and applications, so that we can analyze which of them help in recognizing the attributes and decide which is more suitable in the field of medical diagnosis.
DESIRED RESULT
The proposed expert-system-based ensemble model is used for the prognosis of diabetes. In comparison, it works efficiently alongside all the AI algorithms considered. The accuracy and efficiency of the proposed model are tested using 10-fold cross validation. In addition, the WEKA tool is used to obtain the numbers of correctly and incorrectly classified instances.
Table 2: Obtained outcome from various algorithms of performance metrics
The performance of the expert systems is analyzed in the testing phase by calculating the average error between the output data and the desired output data. Of the classifiers evaluated in this manuscript, Artificial Neural Networks gave the most accurate results, at approximately 96.00%, followed in order by Naïve Bayes (95.00%), SVM (94.00%), J-48 Graft (91.49%), KNN (91.23%), END (91.23%), Decorate (91.23%), Random Forest (90.97%), Bagging (89.69%), Multiple Class Classifier (89.69%), Multi-boosted Classifier (88.65%), User Classifiers (88.65%), Decision Stump (88.65%) and Random Tree (88.40%). The performance of these algorithms can be improved by increasing the number of instances in the dataset and by including further attributes that play an important role in diagnosing a disease like diabetes.
2. To perform a detailed study of diabetes and its types, to determine which algorithm is best suited to the prognosis of diabetes after collecting relevant data for training and testing, and to analyse, understand and compare these algorithms on parameters such as performance, reliability and validity across different datasets.
DESIRED RESULT
Following the survey carried out in the medical domain and the related work in this domain, we sought advice from a diabetologist and discussed the problem with them. We concluded that there are ten parameters which play the main role in the detection of diabetes and are important for the management of the disease.
3. To develop a prognostic framework that could aid a medical doctor in the diagnosis of diabetes, and to test the proposed framework and verify its authenticity and validity using 10-fold cross validation.
DESIRED RESULT
The desired results, such as efficiency and accuracy, have been achieved using this methodology, and it is observed that this is a beneficial tool which can be used for an initial level of screening.
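The 10-fold cross validation referred to in the objectives can be sketched as follows: the records are split into ten folds, each fold serves once as the test set while the other nine are used for training, and the accuracies are averaged. This is a generic illustration, not the WEKA procedure used in the study; the function names and the `fit`/`predict` interface are assumptions, and in practice the records would be shuffled before splitting.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_records // k + (1 if i < n_records % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(records, labels, fit, predict, k=10):
    """Average accuracy over k folds for any fit/predict pair."""
    accuracies = []
    for train, test in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, records[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# With 400 records, each fold holds out 40 and trains on the other 360.
folds = list(k_fold_indices(400, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 360 40
```

Any of the classifiers described earlier can be plugged in through `fit` and `predict`, so that all of them are evaluated on the same folds.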
Future Scope
This work can be further enhanced by including clinical and genetic features as parameters. Relevant data on clinical and genetic features can be collected, and an optimized database can be used for training, testing and validating an intelligent system whose performance may be better than that of the system proposed here. Expert systems based on AI, such as the one proposed in this thesis, should be encouraged and made easily available, so that people who have symptoms relevant to diabetes can carry out an initial self-diagnosis.