Estimating the mortality rate using statistical variance and reduced set of clinical and non-clinical attributes for diagnosing chronic kidney disease

Chronic kidney disease (CKD) is prevalent worldwide, and health-related quality of life (QoL) has become an essential measure for patients with CKD. This paper uses a real-time dataset of CKD patients collected from a reputed medical dialysis unit in Chennai, India. We measure the inter- and intra-class variations between the clinical and non-clinical attributes. Principal component analysis (PCA) is applied to twelve clinical (biomarker) and eight non-clinical (comorbidity) attributes to find the salient ones among them. ANOVA is applied only to the reduced attributes to calculate the correlation with target variables such as mortality, age and gender. The characteristics of the attributes and their discriminating nature are evaluated using various well-known classifiers: logistic regression, K-nearest neighbor, support vector machine, Naive Bayes, decision tree, random forest and artificial neural network. The performance of the classifiers is evaluated using the confusion matrix, accuracy, F-measure, precision and recall. It is found that the covariance of the attributes linearly separates the output space of the target variables considered, and the performance is encouraging.


Introduction
Chronic kidney disease (CKD) is a major health issue affecting around 16% of the world's population (Ifraz and Rashid 2021). As per the survey conducted by the Global Burden of Disease Study (GBDS), CKD is one of the major serious diseases that cause death (Jha et al. 2013). Ene-Iordache et al. (2016) presented a study stating that a higher number of people are affected by CKD in higher-income countries and a lower number in middle-income countries. Age, socioeconomic status, gender and geographic region also influence the distribution of CKD patients. The end stage of CKD is expensive (Bakhshayeshkaram et al. 2019) and leads to ailment (Eckardt et al. 2013; Aljaaf et al. 2018).
According to the National Kidney Foundation (NKF), managing and treating CKD requires a better understanding of its characteristics. One well-known measure is the glomerular filtration rate (GFR), which estimates the filtering function of the kidney in removing waste agents such as creatinine and cystatin C from the body. Similarly, the estimated glomerular filtration rate (eGFR) is one of the best overall indexes of kidney function in stable, non-hospitalized patients. The stages of CKD with respect to eGFR are presented in Table 1.
It is observed from Table 1 that when eGFR is more than 90, the kidney is functioning normally. When eGFR lies between 60 and 89, there is kidney disease. In the early stages (1-3), the kidneys can still filter waste such as creatinine and cystatin C from the blood. In the later stages (4 and 5), the kidneys have to work harder to filter blood, which may lead to kidney failure (Chen et al. 2017). An eGFR of less than 15 indicates severely damaged kidneys that need transplantation.
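The staging rule described above can be sketched as a small lookup. This is a minimal illustration, not the authors' implementation: the function name `ckd_stage` is hypothetical, and because Table 1 is not reproduced here, the cut-offs at 30 and 15 that separate stages 3, 4 and 5 are assumed from the commonly used five-stage eGFR scheme (eGFR in mL/min/1.73 m²).

```python
def ckd_stage(egfr: float) -> int:
    """Map an eGFR value (mL/min/1.73 m^2) to a CKD stage from 1 to 5.

    Assumed thresholds (illustrative; Table 1 is not available here):
    >= 90 -> 1, 60-89 -> 2, 30-59 -> 3, 15-29 -> 4, < 15 -> 5.
    """
    if egfr < 0:
        raise ValueError("eGFR must be non-negative")
    if egfr >= 90:
        return 1
    if egfr >= 60:
        return 2
    if egfr >= 30:
        return 3
    if egfr >= 15:
        return 4
    return 5


if __name__ == "__main__":
    for value in (95, 72, 41, 18, 9):
        print(value, "-> stage", ckd_stage(value))
```

Returning an integer keeps the stage usable as an ordinal feature; finer sub-stages would need additional thresholds.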
Singh et al. (2022) have proposed a CNN and multimodal algorithm for predicting the risk of chronic cerebral infarction disease. The CNN uses data from patients, and the missing data are rebuilt by a latent component. A decision tree (Suzuki 2015) has been constructed, and the performance of ID3 was found to be encouraging compared to an evolutionary algorithm. Garcia and Barlaud (2008) have evaluated various machine learning algorithms such as SVM, KNN and decision tree (Ramalingam et al. 2018; Yadav and Pal 2021; Ifraz et al. 2021; Chittora et al. 2021). The NVIDIA CUDA API was used as the evaluation platform, and the computational load was observed to take polynomial time. Hussain et al. (2019) have worked on a CKD dataset whose dimension is reduced by applying PCA, with the missing values filled using an ANN. This approach helps to predict CKD at an earlier stage.
It is evident from the above discussion that CKD is a serious disease and that early detection can save patients. The analysis and interpretation of clinical and non-clinical attributes using computational techniques have opened an avenue for complementing medical practitioners in early detection (Burgh et al. 2022). Most of the above works have concentrated only on applying artificial intelligence and machine learning techniques. None of them has extracted features from clinical and non-clinical attributes, without which the purpose of the computational algorithms is defeated. We address this issue by computing the inter- and intra-class variance of the attributes, which are measured periodically.
The rest of the paper is organized as follows. Section 2 presents the clinical and non-clinical attributes measured from the patients and explains statistical concepts such as dimensionality reduction and ANOVA. Section 3 presents the experimental results of the proposed approach, and the last section concludes the paper.

Proposed work
The well-known and universally accepted process for diagnosing CKD is based on clinical and non-clinical diagnosis attributes, which are presented in Tables 2 and 3.
Based on the content of Tables 2 and 3, it is observed that the CKD attributes play a crucial role in the early diagnosis of CKD and in interpreting the patients' health. Below we present the theoretical concept of the proposed approach. It is well known that CKD can be diagnosed based on various clinical and non-clinical attributes of a patient, represented mathematically (Eq. 1) in terms of $P_{CP}$, the clinical attributes, and $P_{NCP}$, the non-clinical attributes. Each of them can be represented as
$$P_{CP} = \{P_{CP_1}, P_{CP_2}, P_{CP_3}, \ldots, P_{CP_n}\} \qquad (2)$$
$$P_{NCP} = \{P_{NCP_1}, P_{NCP_2}, P_{NCP_3}, \ldots, P_{NCP_m}\} \qquad (3)$$
In Eqs. 2 and 3, $n$ and $m$ are the dimensions of the clinical and non-clinical attributes, respectively. Eqs. 2 and 3 can be considered as the feature vector for understanding and predicting CKD, and their total dimension is $(n + m)$. It is known that, in most cases, a few attributes of the feature vector may influence the predictions and affect the learning of the classifier. In most cases, features of higher dimension create spatial instability and the curse of dimensionality (Vishnu Priya and Vadivel 2012).
It is known that only some of the clinical and non-clinical attributes may influence the performance. Thus, the non-performing attributes among them have to be pruned. CKD datasets are usually large and complex, which makes interpretation a difficult task. Dimensionality reduction is one of the widely used techniques for handling the above-mentioned issues, thereby reducing the effect of noise, spatial instability, etc. The issue can be handled using principal component analysis (PCA) such that the input attributes remain interpretable with little information loss. PCA creates new, uncorrelated variables that successively maximize the variance. These new uncorrelated variables are called principal components, and finding them reduces to solving an eigenvalue/eigenvector problem. However, the new variables are defined by the dataset at hand rather than a priori, so PCA has to be applied afresh to newer datasets.
PCA reduces the dimension of such datasets, increasing interpretability with little information loss, by creating new uncorrelated variables that successively maximize variance; finding these principal components reduces to solving an eigenvalue/eigenvector problem. Because the new variables are defined by the dataset at hand and not a priori, PCA is an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to different data types and structures. In this paper, we normalize and standardize the clinical and non-clinical attributes so that they are amenable to PCA. Next, the variation of the attributes with respect to the mean has to be measured to understand the importance of each variable and whether it can be removed. The degree to which the attributes vary together about their means captures their correlation and can be calculated using the covariance matrix.
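The standardization expression itself is not reproduced in the text above. A minimal sketch, assuming the usual z-score form z = (x - mean) / standard deviation applied to each attribute separately, might look as follows; the function name `standardize` and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return the z-scores of one attribute (a list of numeric measurements).

    Assumption: population standard deviation is used; a constant column
    (zero spread) is mapped to all zeros instead of dividing by zero.
    """
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


if __name__ == "__main__":
    albumin = [3.1, 3.8, 4.2, 2.9]  # illustrative values only
    print(standardize(albumin))
```

After this step every attribute has zero mean and unit variance, so attributes measured on large scales do not dominate the covariance matrix.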
The covariance matrix for the clinical attributes can be represented as
$$CVM_{CP} = \begin{bmatrix} COV(P_{CP_1}, P_{CP_1}) & \cdots & COV(P_{CP_1}, P_{CP_n}) \\ \vdots & \ddots & \vdots \\ COV(P_{CP_n}, P_{CP_1}) & \cdots & COV(P_{CP_n}, P_{CP_n}) \end{bmatrix}$$
Each diagonal cell of the matrix holds the covariance of an attribute with itself, that is, its variance. Similarly, $COV(P_{CP_i}, P_{CP_j})$ is commutative, hence the symmetric pattern in the covariance matrix. Here $CVM_{CP}$ is an $n \times n$ matrix from which the PCA can be calculated by solving
$$CVM_{CP}\, x = \lambda x$$
Here, $\lambda$ is an eigenvalue of the matrix $CVM_{CP}$ and $x$ is a non-zero vector, called the eigenvector of $CVM_{CP}$ corresponding to the eigenvalue $\lambda$. The final dataset is obtained as the product of the transformed $P_{CP}$ and the eigenvectors obtained from the eigenvalue equation above. The covariance matrix for the non-clinical attributes can be derived in the same way.
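The covariance and eigenvalue steps described above can be illustrated with a dependency-free sketch. It builds the sample covariance matrix and approximates only the leading eigenpair by power iteration, which is a simplification chosen here for brevity and not the authors' stated procedure. A numerical library's symmetric eigen-solver would normally return all components at once. The names `covariance_matrix`, `mat_vec` and `dominant_eigenpair` and the toy data are hypothetical.

```python
import math
from statistics import fmean


def covariance_matrix(rows):
    """Sample covariance matrix; `rows` holds one list of attributes per sample."""
    n_samples, n_attrs = len(rows), len(rows[0])
    means = [fmean([r[j] for r in rows]) for j in range(n_attrs)]
    cov = [[0.0] * n_attrs for _ in range(n_attrs)]
    for i in range(n_attrs):
        for j in range(i, n_attrs):
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            # COV(i, j) == COV(j, i), so both halves are filled at once.
            cov[i][j] = cov[j][i] = s / (n_samples - 1)
    return cov


def mat_vec(matrix, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def dominant_eigenpair(matrix, iterations=1000, tol=1e-12):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    x = [float(i + 1) for i in range(len(matrix))]  # arbitrary non-zero start
    eigenvalue = 0.0
    for _ in range(iterations):
        y = mat_vec(matrix, x)
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:  # start vector fell into the null space
            break
        x = [v / norm for v in y]
        # Rayleigh quotient x^T M x for the unit vector x.
        new_value = sum(a * b for a, b in zip(x, mat_vec(matrix, x)))
        converged = abs(new_value - eigenvalue) < tol
        eigenvalue = new_value
        if converged:
            break
    return eigenvalue, x


if __name__ == "__main__":
    samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # toy data, two attributes
    lam, vec = dominant_eigenpair(covariance_matrix(samples))
    print(lam, vec)
```

Projecting each standardized sample onto the returned eigenvector gives its score on the first principal component; further components would require deflation or a full eigen-decomposition.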
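As a result, the dimensions of the feature vectors in Eqs. 2 and 3 are reduced to $p$ and $q$, respectively, and the vectors can be written as
$$P_{CP} = \{P_{CP_1}, P_{CP_2}, \ldots, P_{CP_p}\}, \qquad P_{NCP} = \{P_{NCP_1}, P_{NCP_2}, \ldots, P_{NCP_q}\}$$
In Table 3, the clinical and non-clinical attributes after applying PCA are presented. In this work, we consider only six principal components (PCs), i.e., PC0 to PC5, for further analysis.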
The clinical attributes Albumin, Alk_phos, Bicarbonate, Bun, Calcium and Creatinine are considered as principal components. Similarly, Ckf, Ckd, liver_disease, chronic_respiratory_disease, Dysrhythmia and Pvd are considered as principal components from the non-clinical attributes. The reduced clinical and non-clinical attributes are presented in Table 4.
The process flow of the proposed approach is shown in Fig. 1. It consists of three phases: data collection and preprocessing, analysis of variance using ANOVA, and classification. During the first phase, real-time CKD data are collected and pre-processed using various statistical techniques. Preprocessing replaces missing values with the mean of the previous three months' data, which helps to reduce the noise and spatial instability of the samples. PCA is then applied to find the salient clinical and non-clinical attributes (Qin et al. 2020).
In the second phase, analysis of variance (ANOVA), a statistical approach, is used to calculate the inter- and intra-class variance. ANOVA is applied under certain assumptions, such as random sampling, independent errors and normally distributed data. In this paper, the clinical and non-clinical attributes of a patient are measured periodically (at 3-month intervals). The frequency of measurement is considered as the factor level, and the samples in each interval are considered as the replicates. Since the ANOVA is balanced, the number of samples in each factor level is equal. The first step is to calculate the sum-of-squares terms.
A multidisciplinary approach is required to accomplish this goal, and the types of treatment given to the patients are not considered. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, which refers to the number of groups into which a given data sample is split; as such, it is often called k-fold cross-validation. This experiment uses fourfold cross-validation for model evaluation.
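The fourfold cross-validation described above can be sketched as an index splitter. This is an illustrative, standard-library-only version under stated assumptions: the function name `k_fold_indices`, the shuffling seed and the sample count are hypothetical, and the text does not say whether the authors shuffled or stratified their samples.

```python
import random


def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once with a fixed seed (an assumption made here for
    reproducibility), then dealt round-robin into k folds. Each fold serves
    as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(10, k=4):
        print(len(train), "train /", len(test), "test:", sorted(test))
```

A classifier would be fitted on each `train` subset and scored on the matching `test` subset, with the four scores averaged to give the reported figure.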
As mentioned in Sect. 2, we use variance as the measure for interpreting the outcome of the experiments. We have used ANOVA with mortality, age and gender as dependent variables. Below, we present the experimental results and their interpretation with mortality as the dependent variable.
In Fig. 2, the number of samples and the F-values are given on the x and y axes, respectively. The F-values for mortality = 1 and mortality = 0 are presented, and the F-value is calculated as
$$F\text{-value} = \frac{\text{between-group variance}}{\text{within-group variance}}$$
In our dataset, the groups are the various clinical and non-clinical attributes, for example calcium, BUN, hemoglobin, etc. The F-value can be interpreted as "the between-group variation is F times the size of the within-group variation". Mortality = 1 indicates that the patient is deceased, and mortality = 0 indicates that the patient is alive. It is observed from Fig. 2 that the F-value for mortality = 0 is lower than the F-value for mortality = 1, and that it increases linearly in both cases. The output space is therefore linearly separable, with a clear boundary between the points of mortality = 0 and mortality = 1. The interpretation is that patients having a variance of more than 700 belong to the risky class. The rate of increase of the variance is steep after 1000. As a result, the type of treatment can be devised accordingly to save the patients.
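The ratio of between-group to within-group variance can be made concrete with a short sketch. The sum-of-squares expressions are not reproduced in the text, so this assumes the standard one-way ANOVA decomposition (between-group and within-group sums of squares, each divided by its degrees of freedom). The function name `one_way_anova_f` and the toy groups are hypothetical.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F-value for a list of groups of measurements.

    F = (SS_between / (k - 1)) / (SS_within / (N - k)), where k is the number
    of groups and N the total number of samples. Raises ZeroDivisionError if
    there is no within-group variation.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


if __name__ == "__main__":
    # Three balanced groups (equal replicates per factor level), toy values.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

For these toy groups the between-group mean square is 13 and the within-group mean square is 1, so the between-group variation is 13 times the within-group variation.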
In Fig. 3, we present the result with gender as the target variable. It is observed from Fig. 3 that the output space is linearly separated based on the gender of the patients. This is because the clinical attributes of female patients may be influenced by estrogen (Suzuki 2015); thus, the between-group variations of the clinical attributes are low. In contrast, the F-values for male patients are on the higher side, which implies that the between-group variance of the clinical attributes is high. Also, the protective effects of estrogens in women or the damaging effects of testosterone, along with an unhealthier lifestyle, may cause kidney function to decline faster in men. As a result, the F-value of the male patients tends to be higher, as depicted in Fig. 3.
It is well known that age is also an important attribute in addition to mortality and gender, since it influences the values of the clinical and non-clinical attributes. In this work, we have considered age as one of the dependent variables and measured the variance, as shown in Fig. 4. The patients are categorized into two groups: age ≥ 50 and age < 50. It is observed that the derivative of the variance is almost zero for patients younger than 50. In contrast, for patients aged 50 and above, the change in variance is notable and the variance increases almost linearly. This is because younger patients respond well to treatment, and thus the changes in their samples are negligible. In aged patients, however, the changes in variance are notable (Fig. 4).
In addition to the above results, the clinical and non-clinical attributes are classified against the target attributes. The performance of the machine learning algorithms is measured using metrics such as precision, recall, accuracy and F-measure. Precision is the ratio of the number of patients correctly classified as CKD to the total number of patients classified as CKD. Recall is the ratio of the number of patients correctly classified as CKD to the total number of actual CKD patients. Accuracy is the ratio of the number of correctly classified patients to the total number of patients. F-measure is the harmonic mean of precision and recall. The performance metrics play an essential role in consolidating and identifying the best-performing classifiers for CKD classification.
In Table 5, we present results for various well-known classifiers: logistic regression, K-NN, SVM, Naïve Bayes, DT, RF and ANN. The accuracy is measured for each measurement interval, i.e., 3 months, 6 months, etc. The precision, recall, F1-score and accuracy considered as performance metrics are given in Eqs. 15-18.
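Eqs. 15-18 are not reproduced in the text, so the following sketch assumes the standard definitions of the four metrics in terms of the binary confusion matrix (true/false positives and negatives). The function name `classification_metrics` and the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1-score and accuracy from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Ratios with a zero denominator
    are reported as 0.0 (a convention assumed here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Illustrative counts only.
    print(classification_metrics(tp=8, fp=2, fn=4, tn=6))
```

Reporting the F1-score alongside accuracy is useful when the two classes are imbalanced, since accuracy alone can then look high for a classifier that rarely predicts the minority class.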
It is observed from Table 5 that both the random forest and ANN classifiers perform well on these datasets. For example, RF performs well on the attributes collected in the 3rd, 18th, 24th and 48th months, and the ANN's performance is encouraging on the data collected in the 6th, 12th and 72nd months. The accuracy is not uniform across the entire sample because the mode of treatment given to the patients is not uniform.

Conclusion
In this paper, we have used clinical and non-clinical attributes of patients having CKD. The attributes were measured at 3-month intervals for six years. The contributions of the various attributes to understanding their impact on mortality, age and gender are confirmed with PCA, and the dimension is reduced accordingly. The inter- and intra-class variance is considered for each target variable, and it is found that the output space is linearly separable, which enables classification. The experiment is further consolidated using performance measures such as precision, recall, F1-score and classification accuracy. It is found that the attributes have discriminating power, in terms of their variance, on the target variables.
… most common protein found in blood plasma | Low albumin levels in the blood indicate serious kidney problems
alk_phos | Alkaline phosphatase (ALP) indicates the measurement of protein in body tissues. When the liver is damaged, ALP may leak into the bloodstream | Alkaline phosphatase is significantly associated with several comorbid conditions such as fractures, parathyroidectomy, etc.
Bicarbonate | Bicarbonate affects the function of the kidney; vascular endothelial function is significantly improved in patients having CKD | The kidney's ability to synthesize ammonia, excrete hydrogen ions and regenerate bicarbonate degrades
Bun | Blood urea nitrogen (BUN) is measured along with serum creatinine levels in blood | BUN is inversely associated with hemoglobin level
Calcium | Indicates the blood calcium level | A negative calcium balance increases the risk of osteoporosis, and a positive balance increases the risk of vascular calcification
Hemoglobin | CKD patients are affected by anemia
Phosphorous | High phosphorus damages bone and kidney | Higher phosphorus in the body affects its ability to control other minerals
Potassium | Nerve and muscle function is generally controlled by potassium | High potassium in the blood is called hyperkalemia; it causes nausea, weakness, etc.
Pth | PTH levels increase for patients in stages 3-5 of CKD who are not taking dialysis regularly (Lysaght 2002) | Parathyroid hormone (PTH) levels are associated with an increased cardiovascular risk

Fig. 1 The process flow of the proposed approach. DV dependent variable

Table 1 Various stages of CKD with respect to eGFR

Table 3 Non-clinical attributes of CKD

Table 4 Clinical and non-clinical attributes after dimensionality reduction

Table 5 Performance evaluation of various classifiers