Deep Learning Identifies Intelligible Predictors of Poor Prognosis in Chronic Kidney Disease

Early diagnosis and prediction of chronic kidney disease (CKD) progression within a given duration are critical to ensure personalized treatment, which could improve patients' quality of life and prolong survival time. In this study, we explore the intelligibility of machine-learning and deep-learning models for end-stage renal disease (ESRD) prediction, based on readily accessible clinical and laboratory features of patients suffering from CKD. Eight machine learning models were used to predict whether a patient suffering from CKD would progress to ESRD within three years, based on demographic, clinical, and comorbidity information. LASSO, random forest, and XGBoost were used to identify the most significant markers. In addition, we introduced four advanced attribution methods to the deep learning model to enhance model intelligibility. The deep learning model achieved an AUC-ROC of 0.8991, which was significantly higher than that of the baseline models. The interpretations generated by deep learning with attribution methods, random forest, and XGBoost were consistent with clinical knowledge, whereas the LASSO-based interpretation was not. Hematuria, proteinuria, potassium, and the urine albumin to creatinine ratio were positively associated with the progression of CKD, while eGFR and urine creatinine were negatively associated. In conclusion, deep learning with attribution algorithms can identify intelligible features of CKD progression. Our model identified a number of critical but under-reported features, which may be novel markers for CKD progression. This study provides physicians with solid data-driven evidence for using machine learning in CKD clinical management and treatment.


I. INTRODUCTION
Approximately 10.8% (10.2-11.3%) of the population in China suffers from chronic kidney disease (CKD) [1]. With population aging and the rising prevalence of chronic diseases such as diabetes, hypertension, and obesity, the number of people suffering from CKD is anticipated to increase in the next few years [2]. CKD carries a high risk of complications (such as cardiovascular events), as well as death. In early-stage CKD, non-drug therapies (such as diet and lifestyle adjustments) and specific drugs (such as angiotensin-converting enzyme inhibitors or angiotensin II receptor blockers) are commonly introduced to preserve kidney function [3]. However, due to the robust compensatory ability of the kidney, most people have no apparent symptoms in the early stages of the disease [4]. Once CKD progresses into end-stage renal disease (ESRD), sufferers develop typical renal insufficiency symptoms. By this stage, the available treatments are largely limited to hemodialysis, peritoneal dialysis, or kidney transplantation [5].
Early diagnosis and prediction of CKD progression within a given duration are critical to ensure personalized treatment, which could improve patients' quality of life and prolong survival time. However, due to the heterogeneity of CKD sufferers and the impact of confounding factors, it is difficult to predict when CKD will progress into renal failure [6]. Inaccurate prediction of CKD progression may lead to delays in treatment for people who are likely to progress to renal failure, and unnecessary treatment for people whose condition may not progress.
Percutaneous kidney biopsy is helpful to determine the pathological types of CKD, guide treatment, and identify the degree of fibrosis, and it remains the gold standard for defining prognosis [7]. However, percutaneous kidney biopsy is an invasive procedure that may induce bleeding, infection, or other damage. Noninvasive biomarkers such as the estimated glomerular filtration rate (eGFR) are used to detect the progression of CKD and provide an individualized prognosis, which in turn gives clinicians a roadmap for early intervention [8], [9].
Logistic regression and Cox proportional hazards regression models are the most commonly used clinical methods to predict CKD progression using non-invasive biomarkers [10], [11]. Tangri et al. used Cox proportional hazards regression models to establish the Kidney Failure Risk Equation based on age, sex, eGFR, and urine albumin/creatinine ratio (UACR) to predict CKD progression [12]. However, these studies were based on linear assumptions, and these models performed relatively poorly in the validation cohort. Furthermore, these studies mainly focused on people with advanced CKD, while ignoring the much larger group of people with early-stage CKD. Thus, establishing methods that provide more accurate predictions for people with earlier-stage disease is critical for personalized treatment.
In the past few decades, machine learning and deep learning technologies have been widely used in many fields, such as translation and face recognition [13]. Some progress has been made in medical research [14], [15], especially for the progression prediction of CKD. For example, deep learning models can be used to identify CKD and type 2 diabetes from fundus images combined with clinical data [16]. They can also be applied to predict the risk of diseases while still at an early stage [17]. Specifically, Thomas et al. [18] developed a random forest model for the progression of CKD and achieved an AUC-ROC of 0.88 (95% CI 0.87-0.89). Francesco et al. [19] developed an artificial neural network to predict ESRD in patients with IgAN (Immunoglobulin A Nephropathy) and achieved an AUC-ROC of 0.82. However, although deep learning models can significantly improve prediction performance, they are nearly all black-box models, meaning humans cannot understand how the models organize the input features to make predictions.
In this paper, we aimed to apply a deep neural network (DNN) and compare it with classic machine learning models to predict CKD progression for people at different stages of the disease, based on demographic variables, laboratory and blood biochemical indicators, and comorbidity information. In addition, we introduced advanced attribution algorithms to enhance the intelligibility of the DNN, and compared their outputs with those from other intelligible machine learning models. Our models and intelligibility analysis may assist clinicians in formulating more appropriate management and treatment plans to delay the progression of CKD and reduce patient burden.

A. Study Population and Data Processing
This research was approved by the Ethics Committee of Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology (No. TJ-IRB20210517) in accordance with the Declaration of Helsinki, and informed consent was waived. We retrospectively analyzed data from 2382 people diagnosed with CKD between January 2009 and December 2020. The database includes basic demographic and clinical characteristics such as liver and kidney function, routine blood test results, and comorbidity information. We excluded patients based on the following criteria: (1) those missing more than 30% of feature values; (2) those younger than 18 years; (3) those with only one admission record, an observation period of less than six months, or who were lost to follow-up; (4) those with acute renal insufficiency or congenital kidney diseases. Finally, 1765 people were included in this study. The flowchart of the study cohort is presented in Fig. 1.
The factors (input features) affecting the progression of CKD can be roughly divided into three categories: (1) basic demographics such as sex and age; (2) systemic comorbidities such as hypertension, diabetes, urolithiasis, and hyperlipidemia; and (3) basic laboratory biochemical tests. For each subject, comorbidity information was collected from diagnosis records between the first diagnosis date and the date 30 days after the first kidney-related diagnosis record. For each laboratory biochemical test, we kept the earliest record of each test. Missing values in a subject's biochemical tests were replaced by the mean value calculated from the subjects belonging to the same CKD stage. The basic statistics of the features considered in this study are shown in Table I; data are presented as mean ± standard deviation (SD) or n (percentage).
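For illustration, a minimal sketch of this stage-wise mean imputation is shown below, assuming the cohort is held in a pandas DataFrame with a CKD-stage column; the column names are hypothetical, since the dataset schema is not published.

```python
import pandas as pd

# Hypothetical column names; the actual laboratory feature names are not published.
lab_cols = ["serum_creatinine", "hemoglobin", "potassium"]

def impute_by_stage(df: pd.DataFrame, lab_cols, stage_col="ckd_stage") -> pd.DataFrame:
    """Replace missing lab values with the mean of subjects in the same CKD stage."""
    df = df.copy()
    for col in lab_cols:
        # Per-row mean of the subject's own CKD stage group.
        stage_means = df.groupby(stage_col)[col].transform("mean")
        df[col] = df[col].fillna(stage_means)
    return df
```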

B. ESRD Definitions
In this study, the endpoint of CKD progression within three years (the positive label, denoted as 1) was end-stage renal disease (ESRD). We considered the possibility of false negative results caused by follow-up windows that were either too short or too long. A window that is too short might miss patients who will eventually progress to ESRD (only 20% of subjects in our dataset progressed to ESRD within one year), while a window that is too long might mostly include patients who have already progressed (89% of subjects progressed to ESRD within five years, and more than 77% of those subjects had already progressed two years earlier). ESRD was defined as the initiation of renal dialysis treatment (including peritoneal dialysis and hemodialysis) or kidney transplantation, or a 50% reduction in eGFR over an individual's observation period relative to the first recorded value. The eGFR values were calculated using the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation based on creatinine clearance [20].
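For reference, a minimal sketch of the 2009 CKD-EPI creatinine equation is given below. The coefficients follow the published equation, while the function interface and unit handling are assumptions, since the study does not publish its implementation.

```python
def egfr_ckd_epi(scr_mg_dl: float, age: float, female: bool, black: bool = False) -> float:
    """2009 CKD-EPI creatinine equation, returning eGFR in mL/min/1.73 m^2.

    Note: this reproduces the published 2009 equation; the study's exact
    implementation (units, race coefficient) is an assumption.
    """
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** (-1.209) * 0.993 ** age
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr
```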

D. Deep Learning Model With Intelligible Mechanisms
Deep neural networks (DNN) [22] are artificial neural networks with multiple layers between the input features and the output predictions. Each linear layer is connected by non-linear activation functions to learn non-linear relationships between the input features. In this study, we utilized a four-layer neural network with batch normalization [23] and Dropout [24] modules for better performance. Each layer is defined as

$$O_l = \mathrm{ReLU}(W_l I_l + b_l), \qquad (1)$$

where $I_l$ and $O_l$ are the input and output of layer $l$, respectively, $W_l$ and $b_l$ are the weights and bias of layer $l$, and ReLU denotes the Rectified Linear Unit activation function. The last layer is a sigmoid layer for binary prediction, as shown in (2):

$$O_{\mathrm{last}} = \sigma(W_{\mathrm{last}} I_{\mathrm{last}} + b_{\mathrm{last}}), \qquad (2)$$

where $I_{\mathrm{last}}$ and $O_{\mathrm{last}}$ are the input and output of the last layer, respectively, and $\sigma$ denotes the sigmoid activation function.
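A minimal PyTorch sketch of such a network is shown below; the hidden-layer width and dropout rate are assumptions, as the exact architecture hyper-parameters are not reported here.

```python
import torch
import torch.nn as nn

class CKDNet(nn.Module):
    """Four-layer feed-forward network as in (1)-(2); hidden size and dropout rate are assumptions."""
    def __init__(self, n_features: int, hidden: int = 64, p_drop: float = 0.5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_features, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # last sigmoid layer for the binary prediction, as in (2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x).squeeze(-1)
```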
Since deep learning models are mostly black-box models that lack interpretability, we introduce several attribution algorithms [25] to enhance the intelligibility of the DNN. These algorithms compute the gradient of the model's prediction with respect to each feature, showing how the output value changes under small perturbations of the input features. Here we applied four attribution algorithms: Integrated Gradients [25], DeepLIFT [26], Gradient SHAP [27], and Feature Ablation [28]. All four algorithms generate a score for each feature based on the trained model, denoting the contribution of that feature to the positive label. Specifically, a high positive score can be interpreted as a positive association with ESRD, whereas a high negative score indicates that a higher value of the feature corresponds to a lower probability of progressing to ESRD.

a) Integrated Gradients: Integrated Gradients is an axiomatic attribution method proposed by Mukund et al. [25]. It represents the path integral of the gradients along the path from the baseline $x'$ to the input $x$. In this study, the baseline is the average or mode value of each feature over the subjects with label 0, and the integral is approximated by a Riemann sum or the Gauss-Legendre quadrature rule as follows:

$$\mathrm{IntegratedGrads}_i(x) = (x_i - x'_i) \int_{\alpha=0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha,$$

where $x$ denotes the input and $x'$ is the baseline; $i$ is the $i$-th feature; $F$ is the neural network; and $\alpha$ is the scaling coefficient.

b) DeepLIFT: DeepLIFT [26] is similar to Integrated Gradients in that it is a back-propagation-based approach that attributes changes in the output to the differences between the inputs and baselines. DeepLIFT uses multipliers $m$ to "blame" specific neurons for the difference in output as follows:

$$m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x},$$

where $\Delta x$ is the difference between the input neuron $x$ and its baseline, $\Delta t$ is the difference between the target neuron and its reference, and $C_{\Delta x \Delta t}$ is the contribution of $\Delta x$ to $\Delta t$.

c) Gradient SHAP: Gradient SHAP [27] is a SHAP-based method that adds Gaussian noise to the inputs multiple times and then selects a random point along the path between the baseline and the input. The final SHAP values are defined as follows:

$$\phi_i = \mathbb{E}\left[(x_i - x'_i)\,\frac{\partial F(\tilde{x})}{\partial \tilde{x}_i}\right],$$

where the gradient is computed at the randomly selected points $\tilde{x}$.

d) Feature Ablation: Feature Ablation [28] is a perturbation-based approach. It replaces each input feature with a given baseline value and computes the resulting difference in the output.
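The sketch below illustrates how these four attributions could be computed with the Captum library (an assumption; the paper does not name its implementation), reusing the `CKDNet` sketch above and a label-0 baseline tensor.

```python
import torch
from captum.attr import IntegratedGradients, DeepLift, GradientShap, FeatureAblation

def attribute_all(model: torch.nn.Module, x: torch.Tensor, baseline: torch.Tensor) -> dict:
    """Compute feature attributions with the four algorithms used in the paper.

    `model` is a trained network such as the CKDNet sketch above; `x` holds the
    standardized features of the subjects to explain, and `baseline` the reference
    values (e.g., average/mode of label-0 subjects), both shaped (n, n_features).
    """
    model.eval()  # disable dropout/batch-norm updates during attribution
    methods = {
        "integrated_gradients": IntegratedGradients(model),
        "deeplift": DeepLift(model),
        "gradient_shap": GradientShap(model),
        "feature_ablation": FeatureAblation(model),
    }
    return {name: attr.attribute(x, baselines=baseline).detach()
            for name, attr in methods.items()}
```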

E. Experimental Settings
Suppose the features of an individual $i$ are denoted as $x_i$, with label $y_i$, where $y_i = 1$ indicates that this person progressed to ESRD within three years and $y_i = 0$ otherwise. All our models aimed to learn a function $\hat{y}_i = f(x_i \mid \Theta)$, where $\hat{y}_i$ denotes the predicted probability that the person will progress to ESRD within three years. We formulated the following loss function for our DNN model:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big] + \lambda_1 \lVert\Theta\rVert_1 + \lambda_2 \lVert\Theta\rVert_2^2,$$

where the first term is the binary cross-entropy loss, $\hat{y}_i$ is the predicted label, and $\Theta$ denotes the model parameters to be learned. The second and third terms are the $l_1$ and $l_2$ regularizers used to prevent over-fitting, where $\lambda_1$ and $\lambda_2$ are the hyper-parameters for the $l_1$ and $l_2$ regularizations, respectively.
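A minimal sketch of this regularized loss in PyTorch is given below; the $\lambda_1$ and $\lambda_2$ values shown are placeholders, not the tuned hyper-parameters.

```python
import torch
import torch.nn.functional as F

def regularized_bce_loss(y_hat: torch.Tensor, y: torch.Tensor,
                         model: torch.nn.Module,
                         lam1: float = 1e-4, lam2: float = 1e-4) -> torch.Tensor:
    """Binary cross-entropy plus l1 and l2 penalties on the model parameters.

    lam1/lam2 are placeholder values; the tuned hyper-parameters are not published.
    """
    bce = F.binary_cross_entropy(y_hat, y.float())
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return bce + lam1 * l1 + lam2 * l2
```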

F. Baselines
We applied seven machine learning models, which were classified into three categories. a) Linear model: We selected Logistic Regression (LR) [29], Ridge Regression Classification (RRC) [30], and the Least Absolute Shrinkage and Selection Operator (LASSO) [31]. These are all linear models but with different regularizations on the parameters. Specifically, LR applies a logistic function to model binary dependent variables without any regularization and is widely used in medical research [32], [33]. LASSO introduces an $l_1$ regularization to perform variable selection and avoid over-fitting [31]. In contrast, RRC utilizes an $l_2$ regularization, which improves the robustness of the model but lacks variable-selection ability [30].
b) Support Vector Machine (SVM) model: SVM [34] aims to learn a non-linear relationship in the kernel space to make a classification. In detail, the SVM model depends on a kernel trick to implicitly map the features into high-dimensional feature spaces. In this study, we introduce two kernel tricks, a Gaussian kernel (SVM-RBF) and a linear kernel (SVM-Linear).
c) Decision Tree model: Here we introduce two widely-used decision tree models, Random Forest (RF) [35] and XGBoost [36]. XGBoost is a scalable end-to-end tree boosting system that is fast and accurate and has been used in many medical tasks.
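The sketch below shows how these seven baselines might be instantiated with scikit-learn and xgboost. The hyper-parameter values are placeholders (the tuned settings came from grid search), and LASSO is approximated here by an $l_1$-penalized logistic regression so that it can be used for classification.

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Placeholder settings; the tuned hyper-parameters are not published.
baselines = {
    "LR": LogisticRegression(penalty=None, max_iter=1000),          # no regularization (scikit-learn >= 1.2)
    "LASSO": LogisticRegression(penalty="l1", solver="liblinear"),  # l1-penalized stand-in for LASSO
    "RRC": RidgeClassifier(alpha=1.0),                              # l2 regularization
    "SVM-Linear": SVC(kernel="linear", probability=True),
    "SVM-RBF": SVC(kernel="rbf", probability=True),
    "RF": RandomForestClassifier(n_estimators=500),
    "XGBoost": XGBClassifier(n_estimators=500, learning_rate=0.05, eval_metric="logloss"),
}

# Example usage (X_train and y_train are the training features and labels):
# for name, clf in baselines.items():
#     clf.fit(X_train, y_train)
```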

G. Tuning of Parameters
We utilized grid search to find the best setting for each model, which is conducted by optimizing the area under the receiver operating characteristic curve (AUC-ROC) metric on the validation set. The ratio of training, validation, and test set
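A minimal sketch of such a validation-set grid search is shown below; the parameter grid and model factory are hypothetical, and only the selection criterion (validation AUC-ROC) follows the text.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical grid for a random forest baseline; the actual search grids are not published.
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [4, 8, None]}

def grid_search_auc(make_model, param_grid, X_tr, y_tr, X_val, y_val):
    """Return the configuration with the highest AUC-ROC on the validation set."""
    best_auc, best_params = -1.0, None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = make_model(**params).fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_auc, best_params = auc, params
    return best_params, best_auc
```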

A. Study Cohort
In total, 1765 people suffering from CKD were recruited into this study cohort. Detailed demographic characteristics, comorbid conditions, and laboratory data are shown in Table I. There were significant differences among the four groups in age, follow-up duration, and 17 biochemical test values such as serum creatinine and hemoglobin.

B. Performance of Deep Learning
Performance was evaluated by six metrics: accuracy, precision, recall, AUC-ROC, the area under the precision-recall curve (AUC-PR), and F1 score. Each model was trained ten times, and the average performance and standard deviation are reported. In general, as shown in Table II, the DNN model outperformed all baselines on every metric except precision, reaching a mean AUC-ROC of 0.8991 (1.2% higher than the second-best model). Furthermore, the DNN model also achieved much higher recall and AUC-PR than the other models, indicating that it identifies the patients who will progress to ESRD within three years more precisely and sensitively. Taken together, these findings show that the DNN model better captures the non-linear relationships among the input features and therefore generates better predictions. Among the other models, the two decision tree models, XGBoost and RF, performed second only to the DNN model. All the linear models (LASSO, LR, RRC, and SVM-Linear) performed more poorly than the non-linear models, which indicates that non-linear functions better describe the relationships between the input features and the outcome. Among the three linear regression classification models, RRC performed best, followed by LASSO, with LR last, in line with their use of regularization. SVM-RBF, which utilizes a Gaussian kernel, achieved the higher AUC-ROC of the two SVM-based models.
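For completeness, a sketch of how the six reported metrics can be computed with scikit-learn is given below; the 0.5 decision threshold is an assumption, and `y_prob` is assumed to be a NumPy array of predicted probabilities.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, average_precision_score, f1_score)

def evaluate(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Compute the six reported metrics from predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_prob),
        "auc_pr": average_precision_score(y_true, y_prob),  # average precision approximates AUC-PR
        "f1": f1_score(y_true, y_pred),
    }
```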
To test the robustness of each model, we trained all the models ten times with different random seeds; the box plots of AUC-ROC values are shown in Fig. 2(c). The DNN model is the most robust of all the models, with the smallest standard deviation (0.0100). Except for the two decision tree models, the other machine learning models show similarly large standard deviations.

C. The Features of the ESRD Driver
Of all the models used in this study, the DNN model with attribution algorithms, LASSO, Random Forest, and XGBoost can generate a score for each feature that denotes its importance to the positive prediction (ESRD). The score is shown as the normalized contribution weight in our study. As shown in Fig. 3, the DNN with attribution algorithms and LASSO can generate both positive and negative contribution weights, where a positive weight denotes that the higher the value of this feature, the higher the risk of ESRD. In contrast, the two decision tree models can only generate contribution weights without direction, which indicate how much a feature contributes to the positive label regardless of sign.
In this study, we define "critical features" as the features with the top 20 highest contribution weights (in absolute value) for ESRD. The critical features identified by the DNN model are consistent with those of the other intelligible/explainable machine learning models. For example, 14 features such as hematuria and eGFR are captured by both DNN-DeepLIFT and the machine learning models. In addition, we performed a Cox proportional hazards regression analysis on our dataset, which revealed that four crucial features, potassium, Ucr (urine creatinine), proteinuria, and eGFR (CKD-EPI), exhibit the same logical relationships as those identified by DNN-DeepLIFT (as detailed in the Supplementary Information). The two decision tree models generate almost the same top 20 critical features (14 in common), differing mainly in the order of these features. For example, eGFR is identified as the second most important feature by RF, while it is the most important in XGBoost. Hematuria and Ucr (urine creatinine) are identified as the most significant positive and negative critical features by DNN-DeepLIFT, respectively, whereas for LASSO, monocytes (%) and red cell distribution width (RDW) are the most significant positive and negative critical features, respectively. Serum creatinine and eGFR are identified as the most critical features by RF and XGBoost, respectively. Comparing the four attribution algorithms in Fig. 3(e), except for the GradientSHAP algorithm, the other three algorithms generate similar results. For example, all three identify Ucr as the most significant negative critical feature for ESRD. GradientSHAP, however, assigns Ucr a positive contribution, which is inconsistent with clinical knowledge [37].
Combining the critical features identified by all these models, we conclude that the critical features positively associated with the progression of CKD are hematuria, potassium, proteinuria, the urine albumin to creatinine ratio (ACR), and cystatin C. The critical features negatively associated with the progression of CKD are eGFR and Ucr.
To further validate the interpretability of the DNN model for individuals with different etiologies, we divided all individuals into four categories: CKD caused by hypertension, diabetes, urolithiasis, or chronic glomerulonephritis. The identified critical features are shown in Fig. 4. We found that hematuria is the most important independent risk predictor for the progression of diabetic nephropathy (DN) and urolithiasis. Bicarbonate was the most important independent risk factor for predicting the deterioration of renal function in hypertensive patients with renal insufficiency. Furthermore, bicarbonate, hematuria, and proteinuria are the most important independent risk factors for the progression of primary glomerulonephritis.

D. Performance for People At Different Stages
We divided all subjects into four groups according to their first eGFR record to test the predictive ability of all the models for people at different stages of CKD. The prediction accuracy and recall of all the models in this study are reported in Fig. 5. All the machine learning models achieved similar performance except for people at stage 4. These models achieved their lowest accuracy for people at stage 3 or stage 4 and their highest accuracy for people at stages 1 and 2, while recall increased from stages 1&2 to stage 5. In contrast, the DNN model generally outperformed the other machine learning models in terms of both accuracy and recall, except for slightly lower accuracy for people at stage 3 and lower recall for people at stage 5. More specifically, for people at stage 4, the DNN model achieved both the highest accuracy (0.8853) and the highest recall (0.9846), indicating that the DNN model not only accurately predicts whether these people might progress to ESRD, but also identifies them more comprehensively.

IV. DISCUSSION
Hematuria, potassium, and proteinuria were screened as important independent risk predictors for the progression of CKD based on the machine learning and deep learning models in this study. More importantly, hematuria was the most important risk factor for the progression of DN and urolithiasis in this study, which is inconsistent with clinicians' inherent knowledge. Even though the deep learning model used in our study is not specifically designed for this task, it still achieves superior performance compared with the other machine-learning-based models (Table II and Figs. 2 and 5). The reason behind its superior performance is its ability to capture the complex non-linear relationships between the input features and the output predictions. We also observed that, among the machine-learning-based models, the non-linear models better describe the ESRD prediction task, as shown by the better performance of SVM-RBF compared with SVM-Linear. Taken together, these findings indicate that the relationship between the features of CKD patients and their impact on ESRD cannot be described by the simple linear equations commonly used in previous studies [38].
LASSO, widely adopted in many medical studies [39], [40], [41], [42], did not show reliable predictive power or intelligibility in this study. The critical features identified by LASSO include more comorbidities while ignoring markers widely used in clinical treatment, such as eGFR [43]. Previous studies have shown that eGFR and the reduction of eGFR are solid markers of CKD progression [43]. In contrast, the critical features identified by DNN-DeepLIFT, RF, and XGBoost are more consistent with clinical knowledge. For example, they all identify eGFR as a critical feature of the progression of CKD. Furthermore, RF and XGBoost also identify serum creatinine and cystatin C (both within the top three critical features), which are used to compute the value of eGFR and show a high correlation with ESRD [44]. ACR is a commonly used marker for the clinical evaluation of CKD progression and for guiding treatment [45], and it is also identified by RF and XGBoost. This further provides confidence in these models.
In previous studies, deep learning technologies were usually regarded as black-box models [46], lacking interpretability despite achieving outstanding performance. We addressed this challenge by introducing novel attribution algorithms, such as Integrated Gradients and DeepLIFT. Three of the four attribution algorithms produce similar patterns of feature contribution weights, which are more consistent with clinical studies [37]; GradientSHAP is the exception. In addition to common features such as eGFR and proteinuria, DNN-DeepLIFT also found markers that are less reported in clinical studies than eGFR, such as hematuria, and identified hematuria as the most important predictor of the progression of DN and kidney stones, as well as an important predictor of primary glomerular disease. However, eGFR and proteinuria are clinically recognized as the most important independent risk factors for CKD progression, and hematuria is often ignored [47], [48], [49]. The reason for these differences may be that previous studies did not distinguish the etiology of CKD when evaluating the prognosis of CKD. Glomerular disease, one of the pathological types of CKD, can cause visible or invisible red blood cells in the urine (hematuria) [50]. DN is the main microvascular complication of diabetes; it can easily cause glomerulosclerosis and damage the glomerular filtration barrier, which may increase the risk of erythrocyte leakage in the glomerulus [51]. In addition, bicarbonate and potassium are the most critical features for individuals with CKD caused by hypertension. Hypertension not only leads to decreased nephron mass, but also increases sodium retention and extracellular volume expansion [52], which may be why electrolyte levels were found to be important factors affecting the progression of CKD due to hypertension. Our study may provide clinicians with a different perspective: hematuria may need to be considered a more important factor in the progression of CKD in patients with diabetes and kidney stones, although this requires larger cohort studies and external validation.
Potassium and Ucr were identified as critical indicators by three of the models in our study. Previous studies have shown that low urinary potassium excretion is related to CKD progression [53]. Impaired function of the "sodium-potassium pump" in the renal tubules leads to decreased urinary potassium excretion. Wilson et al. found that low Ucr is an important risk marker for the adverse consequences of CKD, which is also consistent with our findings [37].
Timely and accurate prediction of whether an individual may progress to ESRD within a given duration is critical to determine the most appropriate treatment plan. Thus, for people in the early stages of CKD (stages 1-3), we should identify those who will progress to ESRD as soon as possible, to achieve early intervention and early treatment. Meanwhile, for people in the more advanced stages (stages 4 and 5), it is important to recommend developing good lifestyle and eating habits and to start using drugs such as angiotensin-converting enzyme inhibitors to slow the progression of CKD, in order to avoid premature initiation of hemodialysis treatment or kidney transplantation [54]. Among all the models used in this study, we found that people with stage 3 CKD were the most difficult to predict accurately (Fig. 5(a)), while all the models achieved their best performance for people with stage 4 disease. This phenomenon may be due to the strong compensatory ability of the kidneys [55]. Some people suffering from stage 3 CKD show no significant abnormalities in some clinical indicators. However, once CKD reaches stage 4, the compensatory mechanisms are overwhelmed, and multiple test indicators show significant changes. The DNN model continues to demonstrate better performance than the other machine-learning-based models, except for people at stage 5. A careful look at the detailed predictions of these models showed that all the machine-learning-based models predicted that every person at stage 5 would progress to ESRD within three years, achieving lower accuracy but near-perfect recall (nearly 1.0), which conflicts with clinical reality. In contrast, the DNN model was better trained and made more accurate predictions for each individual (a much higher accuracy of 0.90). Its relatively lower recall is possibly due to the limited number of people at stage 5 CKD in the test dataset (n = 34).

V. CONCLUSION
This study concluded that the DNN model performs better than other machine-learning-based models in predicting the progression of CKD patients to ESRD. Furthermore, it provides a potentially important and different perspective for clinicians' understanding of CKD: hematuria may be an important predictor of the progression of DN and urolithiasis. Comparing the DNN model with the other machine-learning-based models, the DNN model performed best across nearly all CKD stages.