A Hybrid Analytic Model for the Effective Prediction of Different Stages in Chronic Kidney Ailments

Chronic kidney disease (CKD) is a gradual loss of kidney function over time, and it is irreversible once functionality reaches a critical state. Detecting the various stages of CKD helps to slow the progression of the disease. Accurate prediction of CKD stages is an urgent need in the medical field, and it can be done effectively by adopting machine learning (ML) techniques. The primary objective of the present research is to develop an effective classification model for the accurate prediction of CKD stages based on the patient's health profile as well as the clinical test reports. Here, a hybrid ML strategy is employed that integrates random forest (RF) and AdaBoost (AB) techniques through a voting classifier (VC). The standard CKD dataset with 400 tuples and 25 parameters is used for the proposed investigation. The Modification of Diet in Renal Disease (MDRD) equation is used to extract an additional feature, the estimated Glomerular Filtration Rate (eGFR), for the prediction of the CKD stage. Pre-processing is carried out on the CKD dataset to fill the missing values by considering the skewness of the parameters, and the issue of data leakage is also addressed. Medically important features are considered, and correlation analysis is carried out to select the appropriate features for the model building process. The proposed Hybrid Ensemble Model (HEM) helps to lower both bias and variance. The efficiency of the HEM is assessed using performance metrics such as cross-validation score (CVS), accuracy, precision, recall, F1 measure, Mean Squared Error (MSE), bias and variance, and it is compared with state-of-the-art classification schemes.
The outcomes of the analysis reveal that the proposed HEM predicts CKD stages with accuracies of 99.16%, 100% and 100% on reduced feature sets I, II and III, and with cross-validation scores of 97.85%, 99.28% and 99.64% on reduced feature sets I, II and III, respectively.


Introduction
Chronic kidney disease (CKD) is a progressive loss of kidney function over a period of time [1]. The kidney is an organ that maintains the balance of minerals and electrolytes in the human body. CKD develops gradually; its state cannot be reversed to the initial condition, but it can be controlled [2]. The diagnosis of CKD considers ethnicity, heredity, blood pressure (bp), diabetes level (su), potassium level (pot), blood urea nitrogen (bun), serum creatinine (sc) and Glomerular Filtration Rate (gfr) [3]. In general, blood and urine samples are preferred for testing purposes.
The stages of CKD can be identified through accurate measurement of the various levels of kidney function. The patient's creatinine level, i.e., the amount of this waste substance in the blood, determines the eGFR value; a lower eGFR implies a greater risk of kidney disease. Factors such as age, sc, race and gender have a great impact on the eGFR value [4]. The rate of kidney ailments varies among demographic groups, and a higher creatinine level lowers the eGFR value. Ranges of eGFR values are used to fix the stages of kidney ailments. In general, an eGFR level < 60 mL/min that persists continuously for more than three months, or an Albumin Creatinine Ratio (ACR) > 30 mg/g, is considered a key symptom of CKD. The five stages of CKD can be identified by calculating the eGFR levels using the MDRD study equation. Stage 1 CKD is classified as minimal kidney disease, for which the eGFR is 90 mL/min or above. A mild decrease in kidney function is classified as Stage 2 CKD, with an eGFR in the range of 60-89 mL/min. Stage 3A is moderate CKD with an eGFR of 45-59 mL/min, and Stage 3B is moderate CKD with an eGFR of 30-44 mL/min. Stage 4 is severe CKD with an eGFR of 15-29 mL/min, and Stage 5 is end-stage renal disease with an eGFR < 15 mL/min.
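The thresholds above amount to a simple lookup; a minimal sketch (the function name is illustrative, not from the study):

```python
def ckd_stage(egfr):
    """Map an eGFR value (mL/min/1.73 m^2) to a CKD stage label,
    following the ranges listed in the text."""
    if egfr >= 90:
        return "Stage 1"
    elif egfr >= 60:
        return "Stage 2"
    elif egfr >= 45:
        return "Stage 3A"
    elif egfr >= 30:
        return "Stage 3B"
    elif egfr >= 15:
        return "Stage 4"
    else:
        return "Stage 5"
```

Note that, clinically, Stages 1-2 also require other evidence of kidney damage (such as elevated ACR); the lookup only encodes the eGFR bands given above.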
Most of the research works reported so far have utilized the benchmark dataset available in the University of California Irvine (UCI) repository for the effective prediction of CKD using statistical as well as ML algorithms [5]. It should be noted that very few works have reported the prediction of the severity levels of CKD [6] [7]. In this article, a multi-classifier prediction model is proposed for the effective prediction of CKD as well as its severity, by extracting eGFR as a main feature of the dataset [8]. Further, the data leakage problem is addressed and correlation analysis is carried out to select the appropriate features to enhance the performance of the proposed HEM [9] [10]. In addition, certain medically important features are selected manually to form a new subset to train the HEM [1] [2].

Related Works
For the past two decades, several research works have been reported on the effective classification and prediction of various diseases at early stages. Many data mining techniques and ML algorithms have been utilized for prediction in healthcare applications; for example, the Support Vector Machine (SVM) algorithm has been used by many researchers to detect the presence of diabetes [11], Alzheimer's disease [12], etc., based on the clinical reports and laboratory test records of patients. A Probabilistic Neural Network (PNN) algorithm was used by Dessai et al. (2013) for heart disease prediction [13]. Also, the digitization of medical records, i.e., Electronic Health Records (EHR), enables accurate prediction of diseases using Artificial Intelligence (AI) techniques, which in turn helps medical practitioners provide effective treatment to their patients. It is inferred that a significant research gap exists in the accurate prediction of CKD and its severity that also considers data leakage aspects [14][15][16][17][18].
Accurate prediction of the progression of CKD stages is a challenging task because the disease shows no symptoms in its early stages. The medical treatment and diet prescription depend entirely on the severity of CKD and its rate of progression. Physicians rely on the clinical and laboratory test records of CKD patients for the identification of the different stages. The patient's demographic information and clinical test reports such as blood pressure, blood sugar, serum creatinine, coronary artery disease, race, albumin creatinine ratio and especially the eGFR play a vital role in the prediction of the various stages of CKD [4]. Currently, eGFR is used to estimate the level of kidney function, and by continuously estimating this measure over a stipulated period of time the stage of the disease is determined.
The benchmark dataset for CKD analysis is usually large, with missing values and hidden features that may be essential for accurate prediction [5]. A preliminary study on the prediction of the various stages of CKD can be carried out by generating the complete dataset from the above-stated attributes, including the race, gender and eGFR of the patients obtained using an online GFR calculator [19]. El-Houssainy et al. (2019) used efficient data mining techniques to extract hidden information from patient data and thereby improved the accuracy of CKD stage prediction [6]. Here, four classifiers were applied for the accurate prediction of CKD stages, and it was observed that PNN yields the best results for predicting the severity of CKD.
Elhoseny et al. (2019) introduced a hybrid density-based feature selection with an Ant Colony Optimization (ACO) algorithm to eliminate the redundant features prior to classification in a benchmark CKD dataset [20]. The proposed intelligent framework consists of pre-processing, optimal feature selection and classification of CKD. Here, the application of ACO in the classification process involves a structural scheme, generation and pruning of rules, and a heuristic function with pheromone updates to enhance the predictive results. The proposed hybrid algorithm is simulated in the MATLAB environment by considering the clinical features that influence CKD, and performance metrics such as false positive rate, false negative rate, accuracy, specificity, sensitivity, F-score and Kappa value are evaluated and compared with other existing classifiers. This algorithm is efficient at identifying CKD, with a significant improvement in classification accuracy using fewer features.
Gabriel et al. (2019) introduced a Neural Network (NN) based classifier to predict the risk of developing CKD in the Colombian population by considering two population groups (people diagnosed with and without CKD) [21]. This model predicted the likely course of the medical condition of CKD on the test dataset with high accuracy, and the accuracy of prediction is justified by an example of Case-Based Reasoning (CBR) with an appropriate explanation. The demographic data and medical care information obtained through previous diagnoses of about 20,000 people with CKD and 20,000 people without CKD are used for training and validating the proposed NN-CBR twin system model. Considering this dataset with its large feature set, the proposed NN model with 5 layers (including 3 hidden layers) predicts the risk of developing CKD with an accuracy of 95%.
Bilal Khan et al. (2020) carried out an experimental analysis of various ML techniques with the objective of determining the classifier that most accurately predicts CKD or NOTCKD from the dataset of kidney patients acquired from the UCI repository [22]. The dataset was pre-processed to replace the missing values with the mean of the existing values. Seven ML techniques, namely Naïve Bayes (NB), Logistic Regression (LOG), Multi-Layer Perceptron (MLP), J48 Decision Tree (DT), SVM, NB Tree and Composite Hypercube on Iterated Random Projection (CHIRP), were employed on the UCI dataset. The outcomes were examined by an N-fold cross-validation procedure using the classification error rates, Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error and Root Relative Squared Error. From the overall analysis, it is observed that CHIRP outperforms the others in terms of reduced error rates and improved accuracy (99.75%).
Qin et al. (2020) proposed an integrated classifier model that combines LOG and RF through a perceptron over the CKD dataset obtained from the UCI ML repository to improve the prediction accuracy [23]. Initially, the dataset was tuned using K-Nearest Neighbour (KNN) imputation, where the numerical missing values are filled with the median and the categorical missing values with the mode of the K samples, to maintain similar physiological measurements among people with similar physical conditions. Further, LOG, RF, SVM, KNN, NB and a Feed Forward Neural Network (FFNN) are evaluated using the complete and tuned CKD dataset for optimal feature selection. The independent model, or the integration of one model with another, that yields better performance was identified, and misjudgement analysis was carried out on such models. As LOG optimizes the adjusted r-squared value while RF contributes to the reduction of the Gini index, the experimental results show that the integrated model with a sigmoid activation function performs well in CKD diagnosis, achieving an average accuracy of 99.83%.
Hosseinzadeh et al. (2020) utilized smart multimedia medical devices and sensors for remote monitoring of kidney function [6]. The authors proposed an IoT-based model for effective prediction of CKD and its severity using the multimedia data acquired from various IoT devices. A DT classifier is adopted for the prediction of CKD and its stages, with the features selected appropriately based on clinical observations and the results of previous studies on CKD prediction. The performance of the J48 classifier is compared with SVM, MLP and NB classifiers, and J48 proved to be more accurate in terms of sensitivity and specificity. Further, the feature sets with the most influential parameters for CKD are selected to reduce the execution time of prediction.
Krishnamurthy et al. (2021) developed an ML model using a dataset of patients with two or more simultaneously present diseases [24]. The dataset, obtained from Taiwan's National Health Insurance Research Database, was analysed using various classifiers including Convolutional Neural Networks (CNN), and it was observed that the CNN performed best under the fivefold cross-validation process used for the assessment of the performance metrics.
Several research works reported in the literature for the prediction of CKD are summarized in Table 1.
It is inferred from the literature that, rather than using a single algorithm, grouping multiple classifiers improves prediction accuracy through the combined effect of the individual classifiers, as shown in Table 1. The prediction becomes better with a proper assignment of weights via a voting process. In this work, a HEM is proposed for CKD stage prediction, and the additionally required features are extracted using the MDRD equation to predict the severity of CKD. Exploratory data analysis is carried out on the training dataset to prevent data leakage. Further, the missing values are filled by evaluating the skewness of the variables. Two reduced subsets of features are formulated using correlation analysis, and medically important features are selected from the training dataset. The ML algorithms RF, AB and voting classifiers are applied for better classification and prediction of the stages of CKD.

Proposed Methodology
In the proposed work, the benchmark dataset from the UCI repository is utilized. The dataset is split into training and test datasets in the ratio of 70:30 in a stratified manner. The training dataset is pre-processed to build a HEM by combining the RF and AB ensemble classifiers, and the test dataset is used to validate the trained model for accurate prediction of CKD stages. Further, the trained model is evaluated statistically with performance metrics such as MSE, bias, variance, precision, recall, F1 measure, accuracy and cross-validation scores. The complete model is built step by step from data collection to performance analysis, as depicted in Fig. 1. The procedure adopted in the building blocks of the model is presented in the following sections.

Data Collection
The CKD dataset has been obtained from the UCI machine learning repository. It has 400 instances and 25 attributes (11 numeric and 14 categorical) for predicting either "ckd" or "notckd". The dataset consists of laboratory test records of real cases; the attributes are either quantitative or qualitative, and each attribute is represented by three entities {description, type, unit}, as shown in Table 2. A few attributes, such as gender, race and estimated glomerular filtration rate, are missing in this dataset, although they are recommended as essential features by the Kidney Disease Improving Global Outcomes (KDIGO) guidelines for predicting the stages of "ckd" patients [4].

Feature Extraction
As per the KDIGO guidelines, CKD and its stages are predicted with the eGFR using the patient's demographic information and clinical reports. The feature eGFR is extracted with the help of the existing attributes 'age' and 'sc', and the demographic attributes 'race' and 'gender' are populated. Uniformly distributed random numbers (U_i) in the range 0 to 1 are generated using the numpy.random.uniform function available in Python, and the gender and race values are populated according to the corresponding ranges. The eGFR is estimated using Eq. (1), known as the MDRD equation [19].

eGFR = 175 × (S_cr)^(−1.154) × (age)^(−0.203) × 0.742 [if female] × 1.212 [if Black]    (1)
where S_cr is the standardized sc. The dataset is enhanced with the newly extracted features and multiclass labels as shown in Table 3. The instances of the enhanced CKD dataset are relabelled from {notckd, ckd} to {notckd, stage 1, stage 2, stage 3, stage 4, stage 5} as per the KDIGO guidelines.
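As a sketch, the four-variable MDRD estimate described above can be computed as follows (the function and parameter names are illustrative, not the authors' code):

```python
def egfr_mdrd(scr, age, female=False, black=False):
    """Four-variable MDRD estimate of GFR in mL/min/1.73 m^2.

    scr: standardized serum creatinine (S_cr) in mg/dL; age in years.
    The female and Black correction factors follow the MDRD study equation.
    """
    egfr = 175.0 * (scr ** -1.154) * (age ** -0.203)
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr
```

Higher serum creatinine lowers the estimate, consistent with the discussion in the introduction.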

Stratified Split of Dataset
The benchmark CKD dataset has to be split into training and test datasets so that the model can learn and then validate itself. Splitting the CKD dataset randomly leads to an imbalanced distribution of classes (i.e., CKD stages), which degrades the performance of the model on the minority classes. Therefore, the dataset is split in a stratified way (in the ratio of 70:30) to maintain an equal distribution of classes. The training (70%) and test (30%) datasets are pre-processed separately to avoid the data leakage problem.
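A minimal sketch of the stratified 70:30 split with scikit-learn; the array contents below merely stand in for the real CKD features and stage labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))          # placeholder for the 400-row CKD feature matrix
y = rng.choice(["notckd", "stage3", "stage5"], size=400, p=[0.5, 0.3, 0.2])

# stratify=y keeps the class proportions equal in the 70% train / 30% test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```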

Data Pre-Processing
Exploratory data analysis is the preliminary investigation of the data to identify the factors that drive the progression of CKD through patterns, anomalies, statistical information and graphical representations of the dataset. The enhanced dataset, with the additional extracted features, is used for data tuning after the stratified split. The pre-processing is carried out on the training and test datasets independently to address the issue of data leakage [9]. The fine-tuned training dataset is used for model building, and the pre-processed test dataset is used to validate the learnt model.

Data Cleaning
Data cleaning handles the missing values in the dataset through univariate data analysis. If data instances with missing values are simply dropped, the number of instances in the training and test datasets decreases, which leads to inaccurate predictions. In this work, the median is used to fill the missing numerical values and the mode is used to fill the missing categorical values, to reduce the impact of outliers.
To quantify the data, the average value, the distribution of the data around this average and the overall degree of asymmetry over the full range of observed values are considered. An estimate of central tendency identifies the midpoint of the data distribution and determines which metric (mean, median or mode) is used to fill the missing values [21, 25]. The mean is the weighted average of all n values, based on the relative frequency, as expressed in Eq. (2),

x̄ = Σ_i f_i x_i,    (2)

where f_i is the relative frequency, here taken to be 1/n. The median is the midpoint of the distribution, as expressed in Eq. (3); for ungrouped data it is the middle value of the ordered observations (the average of the two middle values when n is even). The values of the mean, median and mode are equal for symmetrical distributions; the mean is strongly influenced by extreme values, whereas the median is more robust and less sensitive to outliers. For an asymmetrical distribution, the median lies between the mode and the mean. Skewness is a statistical measure that reveals the asymmetry of a probability distribution; the measure of skewness in Eq. (4) indicates the degree of symmetry in a dataset. The more skewed the distribution, the higher the variability of these measures, which leads to unreliable data:

skewness = (1/n) Σ_i (x_i − x̄)³ / s³,    (4)

where s is the standard deviation, x_i (i = 1, …, n) is the univariate data and x̄ is the mean of the univariate data. For positively skewed data mean > median > mode, and for negatively skewed data mean < median < mode. If the data has outliers or is highly skewed, the median is preferred over the mean to handle the missing data; otherwise the mean is preferred. If the data is categorical, the mode is used. If skewness < −1 or skewness > 1, the distribution is highly skewed. If −1 < skewness < −0.5 or 0.5 < skewness < 1, the distribution is moderately skewed. If −0.5 < skewness < 0.5, the distribution is approximately symmetric. The skewness of an exact normal distribution is zero. Measures of central tendency are adopted to handle the missing values according to the skewness range: if a variable is normally distributed, the mean is used for imputation; if a quantitative variable exhibits skewness, the median is used; and for qualitative variables, the mode is used. A sample data instance before and after the handling of missing values is shown in Tables 4 and 5.
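The skewness-based imputation rule above can be sketched with pandas; the helper below is a hypothetical illustration, not the authors' code:

```python
import pandas as pd

def impute_by_skewness(df):
    """Fill missing values per the rule in the text: mean for roughly
    symmetric numeric columns (|skewness| < 0.5), median for skewed
    numeric columns, and mode for categorical columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # pandas skips NaN when computing skewness.
            stat = out[col].mean() if abs(out[col].skew()) < 0.5 else out[col].median()
            out[col] = out[col].fillna(stat)
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out
```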
After imputation, the attributes are pulled towards a normal distribution to some extent. This leads to lower bias and higher variance, i.e., a tendency towards an overfitting model. In this work, the issue of overfitting is handled by validation and bagging techniques.

Feature Selection
Feature selection evaluates the relationships among the attributes of the enhanced CKD training dataset in order to handle the issue of underfitting and to improve accuracy by reducing the number of features used in model building. The reduced subset of features thus formed is used to train the hybrid model, and the same subset is used with the test dataset to validate the learned model. An unsupervised feature selection process is adopted, in which the class attribute is not considered when finding the relationships among the other variables and removing redundant attributes. The relevant features are identified using Pearson's correlation (r_xy) for numerical attributes and Spearman's correlation (ρ) for categorical attributes, with thresholds of 0.5 and 0.6, respectively. These statistical measures illustrate the degree of correlation between attributes, which may be positive, negative or null [10]. The r_xy summarizes the strength and direction of the linear association between the numerical attributes, as given in Eq. (5):

r_xy = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ).    (5)

The ρ summarizes the relationship among the categorical variables, as shown in Eq. (6):

ρ = 1 − 6 Σ_i d_i² / (n(n² − 1)),    (6)

The training dataset is visualized through a scatter plot (Fig. 2) and heat maps (Fig. 3a, b) to aid in the appropriate selection of features in the two correlation analyses.
where d_i is the difference between paired ranks. The redundant attributes are removed in this statistical analysis by considering the coefficients r_xy and ρ. From Fig. 2, it is observed that as 'hemo' increases, 'pcv' also increases in the enhanced CKD training dataset; these two attributes are highly positively correlated, with a strength of 0.83. Since the two attributes are redundant, 'pcv' is dropped and 'hemo' is retained for model building.
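One way to automate this redundancy check for numeric attributes is a pairwise correlation filter; a sketch under illustrative assumptions (the helper name and the 0.8 threshold are ours, while the text uses 0.5/0.6 cut-offs):

```python
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop the later attribute of each numeric pair whose absolute
    Pearson correlation exceeds the threshold, keeping the first
    (as 'hemo' is kept and 'pcv' dropped in the text)."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[i] not in to_drop and corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return df.drop(columns=sorted(to_drop))
```

Applied to columns where 'pcv' is roughly proportional to 'hemo', only 'pcv' is removed and unrelated attributes survive.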
Many ML algorithms cannot operate directly on categorical data. Therefore, the categorical attributes are subjected to label encoding, which maps each category to an integer, before fitting the ML model. Another set of features is selected manually based on medical importance [1, 2]. As per the clinical reports, patients with diabetes, high blood pressure and coronary artery disease are more likely to develop CKD. Urine and blood samples are tested to track the presence and progression of CKD: the amount of albumin is checked in the urine samples, and the eGFR is calculated from the blood samples. The eGFR is estimated using the attributes age, sc, gender and race of the patients. Therefore, the attributes su, bp, cad, al, age, sc, gender, race and eGFR are considered medically important features for model building.
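For instance, with scikit-learn's LabelEncoder a yes/no attribute such as hypertension maps to integers (an illustrative snippet; the attribute values are ours):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically, so "no" -> 0 and "yes" -> 1.
encoded = le.fit_transform(["yes", "no", "yes", "no"])
```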

Random Forest Classifier
In this work, the RF algorithm is used to identify the stages of CKD by analysing the patient's medical reports, especially the laboratory test reports [18]. RF is a supervised algorithm used here in a classification context to predict the desired results. Overfitting is a major issue when applying traditional statistical models in medical analysis; the RF algorithm not only handles the overfitting issue well but also handles missing and categorical values effectively. In general, RF extracts the relevant features from the training dataset while forming the classifier; prediction is the next stage of the RF algorithm once the classifier is generated. The RF algorithm is applied to the enhanced CKD dataset, from which it generates a number of decision trees using the reduced set of features. The base learner for RF is a DT, generated as a parallel ensemble method as shown in Fig. 4. The motivation for parallel generation is to reduce the variance by aggregation.
A threshold is fixed on the number of trees while generating the DTs, since too many trees may slow down the training process and also decrease performance. At the same time, if more features are considered to increase the depth of a tree, the algorithm faces the challenge of overfitting; conversely, if too few features are considered at each node, the model cannot learn enough for correct prediction, which is an example of underfitting. RF adopts the aggregation principle to overcome the overfitting issue and to improve prediction accuracy. The numerical and categorical features must be handled appropriately to overcome the issues of overfitting and underfitting and hence to improve performance.
The training dataset consists of m data instances {X_1, …, X_m} with F reduced features. RF randomly draws N samples with replacement from the training dataset, where each sample (D_i, i = 1 … N) consists of R rows and K reduced features, as given in Eq. (7), with K = sqrt(F). From each sample a DT is constructed, denoted DT_i (i = 1 … N). The root of each DT is determined by the Gini or entropy criterion (Eqs. 8, 9), where c is the number of classes and p_i is the probability of class c_i. Each DT gives a prediction ŷ_i. The algorithm bootstraps and aggregates the results of the DTs, as given in Eq. (10), and hence the high variance is reduced to a low variance.
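A sketch of this configuration with scikit-learn, using synthetic data in place of the reduced-feature CKD training set (max_features="sqrt" corresponds to K = sqrt(F); the parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 280 rows (the 70% training split) with 9 features.
X, y = make_classification(n_samples=280, n_features=9, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            criterion="gini", random_state=42)
rf.fit(X, y)
pred = rf.predict(X)
```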

AdaBoost
The AB classifier is another ensemble classifier, in which multiple classifiers are grouped and the final prediction is based on their combined effect. AB works on an iterative principle: the accuracy of prediction is improved in successive iterations based on the accuracy achieved in the previous round, with proper selection of the training set and appropriate weighting of each classifier. In AB, the classifiers are learnt sequentially by fitting them and analysing the data for errors; DTs are constructed at every step to improve upon the previous error. An iterative approach is adopted to learn from the mistakes of weak classifiers and turn them into strong ones by adjusting the weights [17]. AB aims to decrease the bias in every successive iteration through better modelling. The motivation for the serial generation of ensemble members is to reduce the bias by adjusting the weights of correctly classified and misclassified labels, as presented in Fig. 5.
In this work, stagewise additive modelling with a multiclass exponential loss function (SAMME) is used by the AB classifier for CKD stage prediction [26]. The iterative process continues until the required number of DTs is constructed, with the weights adjusted at each step using another sub-sample of the dataset in which all the misclassified data instances are emphasized for prediction.
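A minimal AdaBoost sketch on the same kind of synthetic multiclass data (scikit-learn's AdaBoostClassifier implements the multiclass SAMME scheme; the parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=280, n_features=9, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Shallow trees are fitted sequentially; misclassified instances gain
# weight at each iteration, reducing bias over the rounds.
ab = AdaBoostClassifier(n_estimators=50, random_state=42)
ab.fit(X, y)
```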

Voting Classifier
The Hybrid Ensemble Model (HEM) makes its predictions based on the collective decisions obtained from different classifiers. Here, a voting strategy is applied over multiple trained RF and AB classifiers. Voting is the principle of grouping, by weightage, the outcomes of the multiple classifiers used in the model. In the present work, hard voting is used to predict the target based on the majority of the decisions of the different classifiers: the class with the largest sum of votes across the models is predicted. The HEM combines the predictions delivered by the bagging and boosting models with k-fold cross-validation to ensure that the model avoids overfitting the training dataset. A Voting Classifier (VC) thus boosts the performance of the constituent ensemble classifiers to achieve the desired accuracy level of the overall classifier, and in the proposed model it combines the RF and AB classifiers to balance the bias and variance levels.

Hybrid Ensemble Model
A hybrid ensemble VC combines a heterogeneous collection of weak learners into a single model. Ensemble bagging techniques and cross-validation are used to reduce the variance; since the data is imbalanced, an ensemble boosting technique is used while building the model to reduce the bias [17]. The bagging and boosting models are integrated through a voting classifier [27]. The intention is to make the hybrid model more flexible, i.e., with lower bias and variance. The RF classifier trains its constituent classifiers in parallel on random subsets of data, while AB trains its constituent classifiers sequentially, where each classifier learns from the experience of the previous one; in machine learning terminology, these ensemble methods are termed bagging and boosting, respectively. In this research work, AB is combined with RF to make a trade-off between overfitting and underfitting while training on the CKD dataset with a reduced set of features, which includes the relevant new features, to predict the stages of CKD accurately. The proposed hybrid ensemble classification model combines the predictions from the bagging and boosting models with k-fold cross-validation to correctly classify new data instances while reducing bias and variance. Multiple RF and AB classifiers with different parameters are generated and fed to the ensemble voting classifier, which predicts the class of a new instance with the largest sum of votes from all RF and AB models. The steps followed to predict CKD stages using the hybrid ensemble model are described below. Metrics such as CVS, accuracy, precision, recall and F1 measure are considered for performance measurement.
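The overall combination can be sketched as follows with scikit-learn (the estimator parameters are illustrative, not the tuned values of the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=280, n_features=9, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Hard voting: the class with the largest sum of votes across the
# bagging (RF) and boosting (AB) members is predicted.
hem = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
                ("ab", AdaBoostClassifier(n_estimators=50, random_state=2))],
    voting="hard")
scores = cross_val_score(hem, X, y, cv=5)   # k-fold cross-validation score (CVS)
hem.fit(X, y)
```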
The proposed hybrid ensemble voting classifier yields better accuracy on all three reduced feature sets, with lower MSE, bias and variance, when compared with RF and AB individually.

Experimental Setting
The experiments are carried out using the National PARAM Supercomputing Facility (NPSF) offered by the Centre for Development of Advanced Computing (C-DAC), India. The experimental setup is installed with Python 3.8 and packages such as scikit-learn for machine learning and numpy, pandas, matplotlib and seaborn for data analysis and visualization.

Experimental Evaluation
In this work, the feature eGFR is extracted from the existing features age and sc and the populated features gender and race, as required by the MDRD equation, and the instances are labelled as per the KDIGO guidelines in accordance with the eGFR value. In the enhanced CKD dataset, the class distribution is as follows: "NOTCKD" instances are 38%, CKD stage 1 instances are 6%, stage 2 instances are 5%, stage 3 instances are 19%, stage 4 instances are 14% and stage 5 instances are 18%. The entire dataset undergoes a stratified split, and hence the same percentage of classes is maintained in the training and test datasets.
The data pre-processing, i.e., filling missing values and feature selection, is done separately on the training dataset to prevent data leakage, so that knowledge of the test dataset does not leak into the training dataset and vice versa. The missing values in the test dataset are handled separately, and the same features selected from the training dataset are used for the test dataset. This results in a correct estimate of the model's performance when unseen data is tested for predictions. Univariate analysis is done on the training dataset to determine the data distribution; the distribution of the attribute "age" is presented in Fig. 6. Moreover, sample training data instances with skewness values are shown in Table 6.
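A small sketch of this independent handling with scikit-learn's SimpleImputer, each partition fitted only on its own rows (the array contents are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 5.0]])
X_test = np.array([[np.nan, 6.0], [5.0, np.nan]])

# Each split gets its own imputer, so no statistic computed from one
# partition leaks into the other.
X_train_clean = SimpleImputer(strategy="median").fit_transform(X_train)
X_test_clean = SimpleImputer(strategy="median").fit_transform(X_test)
```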
The table clearly shows that all the numerical attributes in the training dataset exhibit skewness towards either the left or the right. Therefore, following the rule above, the median of each numerical attribute is calculated and presented in Table 7, and the missing values in the numerical attributes are replaced with the median of the corresponding variables.
The mode for each categorical attribute is calculated and shown in Table 8. The missing values for the categorical attributes are filled with the corresponding mode.
After imputation, the attributes are pulled towards a normal distribution to some extent, which leads to lower bias and higher variance. The change in the distribution of the variable wc, i.e., the white blood cell count, before and after imputation is shown in Fig. 7.
The Pearson and Spearman correlation matrices are used to determine the relationships among the quantitative and qualitative attributes, so as to select the features for model building by dropping the redundant ones. The features selected for medical importance and by the Pearson and Spearman correlation coefficients are shown in Table 9, and the hybrid model is built with these reduced sets of features. RF, a bagging technique, is used to reduce the variance. As the enhanced dataset is itself imbalanced, the bias is found to be invariably high; subsequently, the AB boosting technique is used to reduce the bias. RF and AB classifier models with different parameters are generated, and k-fold cross-validation is applied to each model to verify how accurately new unseen data is classified. With the selected feature subset I, AB gives an accuracy of 94.16% with an MSE of 0.052, a low bias of 0.009 and a variance of 0.043. RF gives an accuracy of 91.7% with an MSE of 0.096, a low variance of about 0.058 and a bias of 0.037. Similarly, the bias, variance and MSE values for the selected feature subsets II and III are estimated along with the other performance metrics, and the results are summarized in Table 10.
The two heterogeneous models are combined into a HEM through a voting classifier to improve the accuracy with a low bias and variance trade-off. This hybrid model yields 100% accuracy in predicting the severity of the CKD stages for feature subsets II and III; the other estimated performance metrics are shown in Table 9. The error rates obtained using all the subsets are represented in Fig. 8.
The models are estimated using the mean of the cross-validation scores, where the proposed HEM predicts the stages of CKD with an accuracy of 97.85% with reduced feature subset I, 99.28% with reduced feature subset II and 99.64% with reduced feature subset III.

Conclusions
For the accurate prediction of CKD stages through ML, the issue of data leakage is eliminated during the data pre-processing stage itself, and the HEM classifier is made to learn with low bias and variance. The bagging approach decreases the variance and the boosting approach reduces the bias. One reduced set of features is obtained through correlation analysis and another is gathered by manual selection of the features that are medically important for CKD analysis; each set is given independently as input to the HEM classifier, enabling the model to learn to predict the stages of CKD accurately. The outcomes of the classifiers are validated using the test dataset, and the learning process of the hybrid model controls the false negative and false positive errors, so that the model detects as many patients with CKD stages as possible.
Since the data leakage issue is handled appropriately, the proposed model is expected to retain the same accuracy and variance in a production environment. Further, the collection of more samples with timely clinical reports would lead to even better prediction accuracy.