Predicting deceased donor kidney transplant outcomes: Comparing KDRI/KDPI with machine learning

Background: Kidney transplantation is a cost-effective treatment for end-stage renal failure patients that provides a significant survival benefit and improves their quality of life compared to other forms of renal replacement. The predominant method used for donor kidney quality assessment is the Cox regression-based, piecewise linear kidney donor risk index (KDRI). A machine learning method (random forest) was compared to KDRI for predicting graft failure at 12, 24, and 36 months after transplantation. Methods: Random forest was trained and evaluated with the same deceased donor kidney transplant data (n=70,242) initially used to develop KDRI (1995-2005) and included four readily available recipient variables from the estimated post-transplant survival score. Results: At a matched type II error rate of 10%, random forest predicted 2,148 (26%) more successful grafts at 36 months after transplant than KDRI. Many high-KDRI kidneys, at risk of discard, were correctly predicted by random forest to be successful transplants. Random forest performed significantly better than KDRI in Kaplan-Meier graft survival analysis from 0-240 months (log-rank test p<0.001). Conclusions: Machine learning methods can provide a significant improvement over KDRI for the assessment of kidney offers. This work lays the foundation for the use of machine learning methodologies in transplantation and describes the steps to measure, analyze, and validate future models.

transplantation were 4,830 and 4,411 respectively in 2016. 6 The unacceptably high number of deaths on the waitlist could be prevented by increasing deceased donor kidney transplantation (DDKT).
Despite evidence-based clinical and cost advantages of DDKT, nearly 1 in 5 viable deceased-donor kidneys recovered for transplantation are discarded (~4,000 per year). 6 Many studies have demonstrated a significant survival benefit for wait-listed patients to accept (or for centers to accept on their behalf) any kidney available for transplant, regardless of the current acceptance metrics. [9][10][11][12] The reasons for these discards are multifactorial and include, but are not limited to, recipient variability, donor variability, governmental regulations, local health provider biases, and financial pressures. 13 A comprehensive discussion of how these factors impact health provider decision making at the time of organ offer and acceptance is beyond the scope of this manuscript.
Healthcare professionals can better understand the risk of transplantation with models that capture an individual's health state more completely in the context of a specific prospective organ. Our paper uses MLM to recreate the Kidney Donor Risk Index (KDRI) to determine whether MLM can produce a better predictive model than the Cox regression originally used. The Cox regression employed to develop the KDRI resulted in a piecewise linear formula used both in donor allocation and distribution (recipient acceptance criteria). We also question why only donor variables are considered at this stage for allocation purposes. MLM may provide additional modeling opportunities that incorporate more variables and answer new questions.
The KDRI is adjusted annually and implemented as a derivative metric, the Kidney Donor Profile Index (KDPI). The KDRI was developed in 2009 and has been the industry and regulatory standard for kidney quality since 2011. 14 Despite the need for and the industry's enthusiasm in adopting this metric, the KDRI has many limitations. The KDRI has a c-statistic of 0.600, a measure of predictive quality equivalent to the area under the receiver operating characteristic (ROC) curve (AUC). The KDRI was developed using a Cox proportional hazard regression method. The final KDRI model included 15 variables measured from the donor, resulting in a piecewise linear model that suffers from the biases introduced by arbitrarily categorized variables. For example, in the KDRI model the risk coefficients change based on age over or under 18 and 50, weight over or under 80, creatinine over or under 1.5, etc. 14 By design, the KDRI model incorporates only variables measured from the donor to assess the risk of transplantation for a recipient.
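To make the piecewise-linear structure concrete, the sketch below implements an age term in the style of KDRI. The knot locations (18 and 50) come from the text above, but the coefficients are hypothetical placeholders for illustration, not the published KDRI values.

```python
def piecewise_age_term(age: float) -> float:
    """Illustrative piecewise-linear age term in the style of KDRI.

    The coefficients below are hypothetical placeholders, not the
    published KDRI values: the slope of the risk term changes at the
    knots age 18 and age 50, which is the kind of arbitrary
    categorization discussed in the text.
    """
    term = 0.013 * (age - 40)        # baseline slope around a reference age of 40
    if age < 18:
        term += -0.019 * (age - 18)  # extra slope below the age-18 knot
    if age > 50:
        term += 0.011 * (age - 50)   # extra slope above the age-50 knot
    return term
```

The discontinuous slope changes at ages 18 and 50 are exactly the arbitrary categorization the text identifies as a source of bias.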

The identified limitations of KDRI lead to several questions that this work explores: Can more successful transplants be predicted using MLM? Might more accurate predictions of graft failure (GF) at 12, 24, and 36 months post-transplantation encourage the utilization of kidneys that are otherwise at risk of discard? Further, it seems logical that predictive measures of transplant outcomes should incorporate recipient variables, like those present in the estimated post-transplant survival score (EPTS), to further improve the modeling. The EPTS was also developed as a piecewise linear model (age over or under 25), using four recipient variables, and is used separately in allocation algorithms. Because EPTS and KDRI were constructed independently of one another, the simultaneous use of these separate models, as in the current allocation system, cannot capture interactions among the donor and recipient variables. A combined model including readily available variables from the donor and recipient, utilizing machine learning, may improve the predictive capabilities for longer-term graft survival.
Machine learning methods are routinely used to capture non-linear relationships in the presence of outliers and noise without compromising predictive quality, adding significant variance, or overtraining. [15][16][17] Complicated relationships among many variables, potentially involving independently insignificant variables, cannot be modeled adequately with linear methods. Variables with known collection challenges (i.e., apparent outliers or high noise) are routinely excluded from linear modeling but may provide enhanced knowledge to MLM, allowing better predictive performance. Retrospective transplant data are sparse, with many missing and miscoded values. MLM have demonstrated success in these dirty-data conditions in other areas of medicine. [18][19][20] The specific aim and purpose of this retrospective quantitative research study was to build an MLM using national DDKT and patient follow-up data from 1995-2005, the same data and variables used to develop KDRI, and compare predictive performance using 10-fold cross-validation for GF at 12, 24, and 36 months after transplant, both with and without recipient information.

Data
We obtained Standard Transplant Analysis and Research (STAR) data from the United Network for Organ Sharing (UNOS) inclusive of September 29, 1987, through March 31, 2016. This experiment complied with the data use agreement in accordance with applicable regulating bodies. The UNOS STAR data contained many common free-text entry errors; values such as unknown, blank spaces, empty strings, "NA", "N/A", "na", 999, and 998 were re-coded to a consistent value of NA. We re-coded predictive variables as categorical or numeric values based on our interpretation of the STAR data documentation. The UNOS STAR data are incomplete and mutable, being corrected and backfilled as updates become available. The minor discrepancies between the data used by Rao and in the present study are due to this feature and are presented in Table 1.
We reconstructed, as closely as possible, the original data set used by Rao for the development of KDRI. 14 First, the data were filtered by transplantation date within the acceptable date range (01/01/1995 to 12/31/2005) for only DDKT, Initial Data n i (see data reduction in the first row of Table 1). Excluded in sequence from the analysis were: recipients aged less than 18 years, recipients with a previous transplant, multi-organ transplant recipients, and ABO-incompatible patients, consistent with Rao. We also removed observations with invalid and/or missing data: donor height (<50cm, >213cm), weight (<10kg, >175kg), and creatinine (<0.1mg/dL, >8.0mg/dL). Finally, we removed observations without a valid entry for the KDRI_RAO variable. Table 1 compares the number of removed observations from Rao's study and our study for each missing or invalid variable. 14 The clinical setting for our experiment was to predict kidney transplant outcomes with the data present at the time of organ allocation; the same data a clinician would have or could approximate (such as an estimate of the expected cold ischemia time from the current data) at the time of organ offer. Donor variables were those used in the final model of KDRI: age, race, history of hypertension, history of diabetes, serum creatinine, cerebrovascular cause of death, height, weight, donation after cardiac death, hepatitis C virus status, HLA-B and HLA-DR mismatching, en-bloc transplant and double kidney transplant indicators (two for one), and known cold ischemia time at the time of offer. Additional transplant recipient variables from the EPTS score were used in our MLM: recipient age, diabetes, and time on dialysis; recipients with a prior transplant and multi-organ transplants were already excluded by the filter criteria from Rao, et al. 2009. 14
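The re-coding and range-filtering steps above can be sketched in plain Python. The field names (`donor_height_cm`, etc.) and the `MISSING_CODES` set are our own illustrative assumptions, not the actual STAR schema.

```python
# Minimal sketch of the cleaning steps, using plain dicts in place of
# the STAR data tables. Field names and MISSING_CODES are assumptions
# for illustration, not the actual STAR documentation values.
MISSING_CODES = {"", "NA", "N/A", "na", "unknown", "999", "998"}

def recode_missing(record: dict) -> dict:
    """Map the many free-text 'missing' sentinels onto a single None."""
    return {k: (None if v is None or str(v).strip() in MISSING_CODES else v)
            for k, v in record.items()}

def passes_filters(record: dict) -> bool:
    """Apply the validity ranges used when reconstructing Rao's cohort:
    height 50-213 cm, weight 10-175 kg, creatinine 0.1-8.0 mg/dL,
    and a valid KDRI_RAO entry."""
    height = record.get("donor_height_cm")
    weight = record.get("donor_weight_kg")
    creat = record.get("donor_creatinine")
    if height is None or not (50 <= height <= 213):
        return False
    if weight is None or not (10 <= weight <= 175):
        return False
    if creat is None or not (0.1 <= creat <= 8.0):
        return False
    return record.get("kdri_rao") is not None
```

A record would be passed through `recode_missing` first, then dropped if `passes_filters` returns False, mirroring the sequential exclusions in Table 1.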

Machine Learning Models
Machine learning models are predictive models that are induced semi-automatically from labeled data.
The input consists of instances (e.g., organ characteristics, or donor/recipient pair characteristics) with ground-truth labels (e.g., success or failure). The supervised learning algorithm (e.g., random forests) optimizes a mapping from inputs to outputs, such that future unlabeled cases can be predicted correctly with high probability. Such techniques are often designed to handle large datasets with nonlinear effects, multiple dependencies, missing values, etc. This flexibility requires careful parameter tuning to avoid overly complex models that generalize poorly to new, unseen cases. Oftentimes, MLM involve simple algorithms that are repeated thousands of times, making small changes as they converge toward an optimal performance value (e.g., minimal Type II error rate). The general limitations for these processes are computing power and parallel processing capabilities, two resources that continue to grow exponentially (e.g., Moore's Law).
In contrast to Rao's Cox proportional hazard regression method, we explored a supervised random forest (RF) MLM classification model. The RF algorithm constructs and combines the predictions of thousands of machine-generated decision trees to model the probability of graft failure. We created response variables based on patient follow-up data after transplant. This study considered three binary outcome predictions: graft failure (GF) at 12, 24, and 36 months after DDKT, abbreviated as GF12, GF24, and GF36 respectively. For example, the positive (+) outcome for GF12 meant that we observed graft loss within 12 months after DDKT, and the negative (-) outcome for GF12 meant that the graft was successful at the 12-month follow-up. The three models tested for these outcomes were: 1. RF using only donor variables from KDRI (RFD), 2. RF using donor variables from KDRI and recipient variables from EPTS (RFDR), and 3. Rao's KDRI.
The RF algorithm involved the automated creation of thousands of decision trees trained on cases with known outcomes. 21 The algorithm created each decision tree using a subset of the training data called a bootstrap sample (BSS). Each BSS balances the training sample by resampling the minority outcome observations (positives) and undersampling the majority outcome cases (negatives) such that the resulting ratio was 1:1. Each tree consists of multiple decision nodes constructed by randomly selecting a subset of the predictive variables and choosing the one that maximizes the Gini index, a measure of the information gained from the use of each variable. Training continued until "exhaustion": each tree completely fits the training BSS. When classifying new examples, all the trees make a prediction; the output of the RF is the percentage of votes for each outcome, one vote per tree, and the predicted outcome is assigned by majority vote.
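The balanced bootstrap sampling and majority-vote aggregation described above can be sketched as follows. The tree-growing step is abstracted away (each tree is just a callable returning 0 or 1), and the function names are ours, not from a specific library.

```python
import random

def balanced_bootstrap(cases, labels, rng):
    """Draw a balanced bootstrap sample (BSS): the minority outcome
    class is resampled with replacement and the majority class is
    undersampled so the resulting ratio is 1:1."""
    pos = [c for c, y in zip(cases, labels) if y == 1]
    neg = [c for c, y in zip(cases, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n = len(minority)
    return [rng.choice(minority) for _ in range(n)] + rng.sample(majority, n)

def forest_predict(trees, case):
    """Each fully grown tree casts one vote; the RF output is the
    fraction of votes for graft failure, thresholded by majority vote."""
    votes = [tree(case) for tree in trees]
    p_fail = sum(votes) / len(trees)
    return (1 if p_fail >= 0.5 else 0), p_fail
```

In the full algorithm, each BSS would be used to grow one tree to exhaustion on a random subset of the predictive variables; `forest_predict` then combines the per-tree votes into the forest's probability estimate.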
The choice of the RF algorithm allowed for mixed data types (binary, categorical, and numerical) without scaling or significant data modification. RF training is computationally efficient, is robust to outliers and co-linearities, has simple tuning parameters, and has demonstrated success in a variety of healthcare data applications. 21,22 RFs have also demonstrated utility in predicting deceased donor organ transplantation success and offer acceptances in simulated organ allocation models. 3,4 The contribution of each tree in an RF is similar to soliciting thousands of opinions, each shaped by a professional colleague's background. The clinical implementation of the RF algorithm works similarly to secondary and tertiary opinions among professionals in multiple specialties convening for the treatment of a complicated case. A random subset of all available variables and clinical observations informed the construction of every decision tree, resulting in a unique perspective represented by each tree. The consensus among the interdisciplinary experts, or decision trees, is the final treatment plan for the complicated case, with thorough consideration of many different perspectives. Figure 1 demonstrates the different numbers of trees used for the RFD and RFDR models, including stratifying and balancing to obtain BSS, and the resulting AUC. Increasing the number of trees improved the AUC attained by RFD and RFDR, and both converged between 1000 and 1500 trees. The number of trees is a significant hyperparameter for the RF algorithm. KDRI results do not depend on the number of trees.
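Figure 1's tree-count sweep can be approximated with the sketch below using scikit-learn (assuming its availability). The synthetic data and the small `n_estimators` grid are illustrative stand-ins for the STAR cohort and for the 1000-1500 trees at which the study's models converged; `class_weight="balanced"` is a crude substitute for the BSS balancing described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: ~10% positives (graft failures).
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Sweep the number of trees and watch held-out AUC stabilize as trees
# are added -- the convergence behavior shown in Figure 1.
for n_trees in (10, 50, 200):
    rf = RandomForestClassifier(n_estimators=n_trees,
                                class_weight="balanced",
                                random_state=0, n_jobs=-1)
    rf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    print(f"{n_trees:4d} trees: AUC = {auc:.3f}")
```

On real data the sweep would extend well past 1000 trees; beyond the convergence point, additional trees add computation without improving AUC.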

Analyses
We evaluated the predictive performance of the RF models with an industry-standard 10-fold cross-validation approach. 23 The completion of the cross-validation process yielded a ranked list of predicted transplant outcome probabilities. In 10-fold cross-validation, the models are evaluated on test data that are never used in training. Thousands of decision trees are generated in the RF algorithm, and each tree is individually overfitted on a subsection of the data. The RF model, as a combination of thousands of trees, avoids overfitting, which is demonstrated by the testing and validation steps on held-out data. The KDRI model yielded a score that was already present in the data as a variable: KDRI_RAO. We used the KDRI_RAO variable to rank the DDKT observations and compare them with the predicted ranked lists from the RF models. We evaluated the models, using the predicted ranked lists, by generating the ROC curve and calculating the AUC, or c-statistic, as standard measures of predictive quality.
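As a sketch of this protocol, the fold construction and the rank-based AUC (c-statistic) can be written in a few lines of plain Python; the function names are ours, not from a specific package.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds; each fold is
    held out once as test data while the other k-1 folds train the model."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def c_statistic(scores, labels):
    """AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive (graft failure) is ranked above a
    randomly chosen negative, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Concatenating the held-out predictions from all ten folds yields the ranked list described above, and `c_statistic` over that list gives the cross-validated AUC.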
Kaplan-Meier (KM) curves provided the basis for survival analysis and comparison of the models' predicted failure vs. success at each GF interval (12, 24, and 36 months). We chose a prediction cut-off at a 10% false negative rate (FNR: Type II error, the rate at which the model predicted graft success but the graft failed) to divide the test cases into two groups. The log-rank test was used to determine statistical significance between the KDRI and MLM models' respective predicted DDKT success and failure groups. Table 2 contains descriptions and definitions of predictions: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The models' predictions are labeled "Predicted GF" and the observed transplant outcomes are "Actual GF." When the Predicted GF and Actual GF are the same, the model predicted correctly. There are two types of model prediction errors. False Positives (Type I errors) happen when a model predicts that the DDKT would fail but the observed data show success; this is a missed opportunity. False Negatives (Type II errors) happen when a model predicts that the DDKT would succeed but the observed data show failure. False Negatives are the worst prediction errors from a clinical perspective: predicting a success, transplanting, and the graft failing or the patient dying.
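The 10% FNR cut-off can be derived from the ranked scores alone, as in the sketch below (plain-Python illustration; `cutoff_at_fnr` is our own name, not from a library).

```python
def cutoff_at_fnr(scores, labels, fnr=0.10):
    """Choose the score threshold so that at most `fnr` of the actual
    graft failures (labels == 1) fall below it. Cases scoring at or
    above the threshold are predicted failures (+); the rest are
    predicted successes (-). FNR is the Type II error: an actual
    failure predicted as a success."""
    fail_scores = sorted((s for s, y in zip(scores, labels) if y == 1),
                         reverse=True)
    # Keep the top (1 - fnr) fraction of actual failures above the cut-off.
    keep = max(1, round((1 - fnr) * len(fail_scores)))
    return fail_scores[keep - 1]
```

Applying the same 10% FNR threshold to every model fixes the Type II error rate, so the models can then be compared on how many successful transplants (true negatives) each identifies.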

Results
RFDR performed better than RFD in all scenarios: adding recipient variables improved the predictive quality of the RFDR model. This was particularly true when predicting longer graft survival outcomes, where RFDR performed increasingly better over longer time periods: GF12 (AUC = 0.636), GF24 (AUC = 0.638), GF36 (AUC = 0.644). RFD and KDRI, both models built without recipient variables, did not predict the longer graft outcomes as well as the RFDR model did. The results in Table 3 are relative to KDRI. RFD and RFDR performed significantly better than KDRI when predicting which kidney transplant matches would succeed (TN). RFD identified 154 (2%) more successful kidney transplants than KDRI. The RFDR, with additional recipient criteria, identified 2,148 (26%) more successful kidney transplant matches than KDRI. Figure 3 (a)-(c) shows KDRI and RFD directly compared at GF12, GF24, and GF36; Figure 3 (d)-(f) shows KDRI and RFDR directly compared at GF12, GF24, and GF36. In all comparisons, FNR was held constant at 10%, making all the predicted failure groups (+) statistically similar. RFD survival groups (-) were statistically similar to KDRI survival groups (-) at GF12, GF24, and GF36, whereas RFDR survival groups (-) were statistically significantly better than KDRI survival groups (-) in all GF classifications. Figure 3 (a-f). Kaplan-Meier (KM) survival analysis comparison for KDRI and RF models using different graft failure criteria to split predicted survival groups for each model. Legends include the label for each line, the population size for that group at time zero, and the p-value calculated from the log-rank test between the respective predicted outcome groups (i.e., -/+).

Discussion And Conclusions
Our objective was to develop MLM prediction tools, better than the currently available methods, to inform decisions for specific DDKT matches. The RFD model performed slightly better than the existing KDRI model. The RFDR model demonstrated significantly higher performance than KDRI, especially for GF36, predicting thousands more successful DDKTs at the same prediction failure rate as KDRI. Safer clinical decisions informed by MLM could empower clinicians to transplant more organs into patients, resulting in better outcomes. The demonstrated results of the RFDR experiments also indicate that using recipient data improves the performance of predicting longer DDKT graft survival. We believe it was appropriate to compare these models directly over the observation time (1995-2005) because KDRI was not implemented clinically or administratively until 2011.
Models with only donor variables, KDRI and RFD, showed weaker longevity predictions, whereas RFDR (both donor and recipient variables) better predicted outcomes at GF24 and GF36. The inclusion of recipient data improved AUC and longevity prediction; this also makes clinical sense and can likely be refined by adding additional data points in the future. Demonstrated improvements in long-term graft survival are associated with increased patient quality-adjusted life years, lower patient health costs, and increased value for all stakeholders. 10 These are the types of decisions at the time of organ offer that can drive real value for the system, as opposed to those that optimize only measures of short-term success.
Studies such as these are extremely timely as professional societies, regulatory agencies, and others seek metrics and strategies to drive overall system performance. The survival trends for the RFDR GF36 (-) group extend past 250 months after transplant and are significantly better than those of the KDRI (-) group; this advantage continues to grow over time. RFDR makes more successful predictions than KDRI, and significantly more of its predicted DDKT recipient grafts survive longer in aggregate.
One limitation of our study was that we used the final KDRI model instead of recalculating the Cox regression. This limitation exists because we used the same coefficients as the final KDRI developed by Rao, et al. 2009 with these same data, instead of recalculating the coefficients for each validation fold. Ultimately, the test data used to evaluate KDRI were the very same data that were used to develop the KDRI originally. In contrast, the MLM were evaluated correctly because they were trained only on training data and tested only on held-out testing data during the 10-fold cross-validation protocol. Despite the wrongly inflated performance of KDRI, the MLM still performed statistically significantly better.
These results are sufficient to justify future exploration of MLM and of leveraging big data applied to the processes of organ transplantation. Despite a small (though statistically significant) improvement in AUC, the margin of improvement by our MLM amounted to 2,148 additional DDKT over 10 years (about 200 per year) that were correctly classified as successful transplants at the same (10% FNR) misclassification risk as KDRI. These results are clinically significant because the tools used by transplant teams will influence life-changing decisions. A clinical team with KDRI and an acceptable 10% graft failure rate at 36 months may be influenced to be more conservative about kidney offers that would have been successful. The same clinical team with MLM may be influenced to be more aggressive about the same kidney offers and capture an additional 26% of successful kidney transplants. Deploying MLM in a live clinical setting may be an avenue to increase the number of DDKT without sacrificing patient outcomes.
Our study design was purposely confined to the same data and variables available for KDRI by Rao, et al. 2009, and did not allow us to leverage the full capabilities of MLM. Our future work will not have these design constraints, and we hypothesize that with the benefit of additional variables, more recent data, and missing data interpolation, the performance will be greater than what we have demonstrated here.
Our future work intends to expand this research by pursuing more aggressive strategies for optimizing the predictive quality of MLM, including additional data sources, with the ultimate goal of providing real-time decision support to clinicians at the time of organ offer. If one of the stated goals of the OPTN is to increase the number of transplants, it has to start with clinicians being able to make the best use of available data at the point of organ offer. This, of course, will need to be coupled with other, larger changes in system dynamics and policy to achieve overall success. MLM will allow us to use variables from the recipient, donor, transplant center (administrative, logistical-temporal turndown data), and perhaps even behavioral data to predict transplant outcomes with higher accuracy and clinically relevant predictive quality.

Table 2. Definitions of prediction outcomes when models were used to predict graft failure within the months following a kidney transplant. Superscript symbols highlight good predictions (*), bad predictions leading to waste (#), and bad predictions leading to graft failure (!).

Figure 2 (a-c). ROC curves of the three predictive models for different graft failure criteria. P-values were calculated with DeLong's comparison method between the ROC curves for RFD and RFDR, respectively, vs. KDRI. The vertical axis, Sensitivity, at 0.9 is equivalent to the 10% FNR cut-off used for comparing the models in Table 3. The diagonal line represents a ROC curve with an AUC of 0.500.