Applying Random Forest Model Algorithm to GFR Estimation

Background The use of GFR-estimating equations is critical for the clinical management of kidney disease. However, the performance of the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation has not improved substantially in the past eight years. Here we hypothesized that the random forest regression (RF) method could outperform revised linear regression, which is used to build the CKD-EPI equation.

Methods A total of 1732 participants were enrolled in this study (1333 in the development data set from Tianhe District and 399 in the external data set from Luogang District). Recursive feature elimination (RFE) was applied to the development data to select important variables and build random forest models. The same variables were then used to develop an estimated GFR equation with linear regression as a comparison. The performance of these equations was measured by bias, 30% accuracy, precision and root mean square error (RMSE).

Results Of all the variables, creatinine, cystatin C, weight, body mass index (BMI), age, uric acid (UA), blood urea nitrogen (BUN), hematocrit (HCT) and apolipoprotein B (APOB) were selected by the RFE method. The overall performance of the random forest regression models exceeded that of the revised regression models based on the same variables. In the 9-variable model, the RF model was better than revised linear regression in terms of bias, precision, 30% accuracy and RMSE (0.78 vs 2.98, 16.90 vs 23.62, 0.84 vs 0.80, 16.88 vs 18.70, all P<0.01). In the 4-variable model, the random forest regression model showed an improvement in precision and RMSE compared with the revised regression model (20.82 vs 25.25, P<0.01; 19.08 vs 20.60, P<0.001). Bias and 30% accuracy were also better, but the differences were not statistically significant (0.34 vs 2.07, P=0.10; 0.8 vs 0.78, P=0.19, respectively).
Conclusions The performance of random forest regression was better than that of revised linear regression.

Abbreviations: LDL-C Low-density lipoprotein cholesterol, HDL-C High-density lipoprotein cholesterol, TRI Triglyceride, CHOL Total cholesterol, BMI Body mass index, CO3− Bicarbonate radical, ALB Albumin, FBS Fasting blood sugar, BUN Blood urea nitrogen, UA Uric acid, PA Prealbumin, APOA Apolipoprotein A, APOB Apolipoprotein B, HCT Hematocrit, LPA Lipoprotein A, EPO Erythropoietin, CCB Calcium channel blocker, ARB Angiotensin receptor blocker, ACEI Angiotensin converting enzyme inhibitor, URI PRO Urine protein, HBP High blood pressure, CHD Chronic heart disease, MCV Mean corpuscular volume, MCHC Mean corpuscular hemoglobin concentration.


Background
The application of GFR-estimating equations in the clinic is critical, especially for the diagnosis and prognosis of nephropathy, therapeutic interventions and drug dosing evaluation (1).
It is impractical to measure true GFR directly, because GFR changes over time. Instead, a relatively reliable index can be obtained by measuring the serum or urine clearance of exogenous and endogenous filtration markers (2,3). However, high cost, difficulty in obtaining the markers and complicated procedures make these methods inconvenient for daily use. As a result, estimated GFR (eGFR) equations are a practical choice for routine GFR evaluation. The 2012 CKD-EPI equation, built by linear regression with creatinine, cystatin C, sex, race and age as variables, has been considered the best equation until now (4). Nevertheless, serum concentrations of creatinine and cystatin C are affected by both GFR and non-GFR determinants, which causes inaccurate GFR prediction (5). Age, race and sex are demographic and clinical characteristics that can offset a proportion of the impact of non-GFR determinants (6), but they may not be sufficient, because numerous conditions, including UA (7), urinary protein (8), anaemia (9) and smoking (10), can affect GFR, and many metabolites are influenced by the glomerular filtration rate (11,12). Thus, more variables may be needed to optimize the CKD equation.
On the other hand, previous research indicates that machine learning algorithms contribute to a specific improvement of eGFR equations (13-16). Random forest regression is a machine learning algorithm that trains on and predicts samples by ensembling regression trees (17). Because of its injected randomness and the bagging ensemble of regression trees, random forest regression is capable of dealing with relatively small samples that have a large number of variables. As a result, this algorithm has been used instead of linear regression in various fields (18-20), especially when the number of variables is relatively large.
In this setting, we hypothesized that random forest regression with more variables included could enhance the performance of the GFR-estimating equation.
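The ensembling idea described above can be illustrated with a minimal sketch. It assumes scikit-learn's `RandomForestRegressor` and uses synthetic data, not the study data; the sample size, number of trees and `max_features` value are arbitrary illustrative choices. Each tree is fitted on a bootstrap sample with a random subset of features considered at each split, and the forest prediction is the average of the individual tree predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic, non-linear toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# bootstrap=True gives bagging; max_features=1 injects extra randomness
# by considering a single random feature at each split.
forest = RandomForestRegressor(
    n_estimators=25, max_features=1, bootstrap=True, random_state=0
)
forest.fit(X, y)

# Predictions of each individual regression tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Averaging the trees reproduces the forest's own prediction.
ensemble_mean = per_tree.mean(axis=0)
print(np.allclose(ensemble_mean, forest.predict(X)))
```

Averaging many decorrelated trees is what reduces the variance of a single deep tree, which is the property the paragraph above relies on for small samples with many variables.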

Data sources
The study was based on data from the Third Affiliated Hospital of Sun Yat-sen University, Guangzhou. To be eligible for the study, participants had to have their GFR measured by radionuclide renal dynamic imaging or by the dual plasma diethylenetriamine pentaacetic acid (DTPA) method and had to be older than 18 years. Exclusion criteria: 1) amputation; 2) acute renal insufficiency; 3) hemodialysis, peritoneal dialysis, renal transplantation, and chronic kidney disease in patients with obstructive nephropathy; 4) severe oedema; 5) skeletal muscle atrophy; 6) hydrothorax or ascitic fluid; 7) dystrophy; 8) under 18 years of age; 9) heart failure; 10) ketoacidosis. Participants who used cimetidine or trimethoprim daily were not included in the study. Among them, data deletion criteria included (21): 1) previous alternative renal replacement treatment experience; 2) mGFR 5 ml/min/1.73 m2. A total of 4504 participants were screened from January 2012 to April 2015, and 1732 individuals were included in the study (details are shown in Table 1). Data collected from 1333 participants (923 participants for the modelling data set and 399 participants for the internal validation data set) in Tianhe District of the Third Affiliated Hospital of Sun Yat-sen University were regarded as the development set. The external validation set used data from 399 individuals in Luogang District of the Third Affiliated Hospital of Sun Yat-sen University. Subjects who had been enrolled since July 25, 2017 signed written informed consent.
For those who were previously enrolled, we contacted the participants or their families by phone or letter and obtained informed consent. For those who could not be contacted, the Institutional Review Board at the Third Affiliated Hospital of Sun Yat-sen University approved an exemption from written informed consent.

Laboratory methods
Standard GFR (sGFR), or measured GFR (mGFR), was determined by radionuclide renal dynamic imaging or 99mTc-DTPA, as described in detail before (13-15,22). Serum creatinine (SC) was measured by the enzymatic method on a Hitachi 7180 auto-analyzer (Hitachi, Tokyo, Japan) using reagents from Roche Diagnostics (Mannheim, Germany). After the year 2010, SC levels were calibrated against the isotope dilution mass spectrometry reference method. Serum cystatin C was determined with reference to standard reference material (SRM 967).

Development of the smooth line revised linear regression equation Model
Revised regression models of four variables (2012 CKD-EPI equation) and nine variables (selected by the RFE approach) were implemented as comparisons. The process of equation development has been described in detail in our previously published papers (14,24) and is shown in the Additional file.

Development of the New Prediction Model by Random Forest
The four-variable and nine-variable sets mentioned above were each used to develop models with the random forest algorithm. Tuning parameters of the random forest were selected by grid search and 5-fold cross validation. The external validation set was then used to evaluate the equation. The construction of the model is described in detail in the Additional file.
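The workflow described above (variable selection by RFE, then tuning by grid search with 5-fold cross validation) can be sketched as follows. This is a minimal illustration assuming scikit-learn, not the authors' actual pipeline, which is given in the Additional file. The data are synthetic, and the number of features to keep and the hyperparameter grid are hypothetical values chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for a development set: 300 rows, 12 candidate
# predictors, of which only the first three drive the outcome.
rng = np.random.default_rng(0)
n_samples, n_features = 300, 12
X = rng.normal(size=(n_samples, n_features))
y = (60 - 15 * X[:, 0] - 10 * X[:, 1] + 5 * X[:, 2] ** 2
     + rng.normal(scale=3, size=n_samples))

# Step 1: recursive feature elimination, ranking variables by random
# forest importance and dropping the weakest one per iteration.
selector = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=4,  # hypothetical target size
    step=1,
)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)
X_sel = X[:, selected]

# Step 2: grid search with 5-fold cross validation over a small,
# illustrative grid (the study's actual grid is not stated here).
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [1, 2],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_sel, y)
best_model = search.best_estimator_  # refitted on all selected data

print("selected columns:", selected)
print("best parameters:", search.best_params_)
```

In a real analysis, the tuned model would be evaluated only once on a held-out external set, so that neither the selected variables nor the tuned parameters have seen the validation data.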

Statistical methods
Accuracy, bias, precision and RMSE were the primary outcomes of the evaluation. Accuracy was the proportion of eGFR values that deviated from mGFR by no more than 30%. Bias was defined as the median difference between mGFR and eGFR; precision was assessed as the interquartile range (IQR) of the difference. RMSE is the square root of the sum of the squared deviations of the estimated values from the measured values divided by the number of observations. The bootstrap method (2000 bootstraps) was used to obtain the 95% confidence intervals (CIs) (25). The independent samples t-test or the Mann-Whitney test was used when comparing quantitative variables between two data sets. Differences and accuracy were compared using the Wilcoxon signed-rank test and the McNemar test, respectively. The statistical significance level was P < 0.05. All calculations and statistics were conducted with SPSS 20.0, R 3.53 and Python 3.7.
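The four outcome measures and the bootstrap intervals defined above can be sketched in a few lines. This is an illustrative implementation, not the authors' code. It assumes the difference is taken as mGFR minus eGFR (the text does not state the sign convention), uses a percentile bootstrap, and the function names `gfr_metrics` and `bootstrap_ci` as well as the five example values are hypothetical.

```python
import numpy as np

def gfr_metrics(mgfr, egfr):
    """Bias, precision, 30% accuracy and RMSE as defined in the text."""
    mgfr = np.asarray(mgfr, dtype=float)
    egfr = np.asarray(egfr, dtype=float)
    diff = mgfr - egfr  # assumed sign convention: measured minus estimated
    q1, q3 = np.percentile(diff, [25, 75])
    return {
        "bias": float(np.median(diff)),          # median difference
        "precision": float(q3 - q1),             # IQR of the difference
        "p30": float(np.mean(np.abs(diff) <= 0.30 * mgfr)),  # within 30%
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
    }

def bootstrap_ci(mgfr, egfr, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling participants with replacement."""
    rng = np.random.default_rng(seed)
    mgfr = np.asarray(mgfr, dtype=float)
    egfr = np.asarray(egfr, dtype=float)
    n = len(mgfr)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # paired resampling of rows
        stats[b] = gfr_metrics(mgfr[idx], egfr[idx])[metric]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Made-up example values in ml/min/1.73 m2 (not study data).
mgfr = np.array([100.0, 80.0, 60.0, 40.0, 20.0])
egfr = np.array([90.0, 85.0, 60.0, 55.0, 18.0])

metrics = gfr_metrics(mgfr, egfr)
rmse_ci = bootstrap_ci(mgfr, egfr, "rmse")
print(metrics)
print("95% CI for RMSE:", rmse_ci)
```

Resampling measured and estimated values together, row by row, keeps each participant's pair intact, which is what a paired comparison of two equations on the same people requires.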

Study population
There were significant differences between the development data sets and the external validation set in some baseline characteristics. As shown in Table 1, blood lipids including CHOL, LDL and APOA were lower in the development and internal validation sets than in the external validation set (4.82±1.44 vs 4.98±1.41, P=0.049; 2.95±1.14 vs 3.09±1.18, P=0.037; 1.28±0.25 vs 1.31±0.25, P=0.026; respectively). In addition, the population in the development and internal validation sets used more glucocorticoids for treatment (P=0.04). mGFR, age, serum creatinine and other characteristics did not differ significantly.

GFR estimation models performance
Of all the variables, creatinine, cystatin C, weight, BMI, age, UA, BUN, HCT and APOB were selected by the RFE approach. The overall performance of the random forest regression models exceeded that of the revised regression models based on the same variables. The 9-variable random forest regression model was optimal. In the 9-variable model, the random forest regression model was better than revised linear regression in terms of bias, precision, 30% accuracy and RMSE (0.78 vs 2.98, 16

Discussion
We performed revised linear regression and RF modelling for four and nine variables and carried out a head-to-head comparison of the same variable combinations. Our results demonstrate that the eGFR equation based on the RF model performed better than the revised linear regression model for GFR estimation. Our findings can help clinicians better understand the patient's condition and optimize the treatment plan.
Advances in computer science and statistics have made machine learning an increasingly vital method. A series of machine learning methods have contributed to a specific improvement of eGFR equations (13-15). However, no prediction equation comprehensively exceeds the CKD-EPI equation. Random forests are an ensemble-based machine learning method that can find non-linear relationships underlying the data and produce better classification or regression results. The random forest method has the advantages of producing a high-precision classifier, dealing with multi-class variables, reducing errors in balanced classification and preventing over-fitting. Many papers have compared linear regression with random forest algorithms in other fields. The random forest algorithm has not performed better in all fields and aspects (26-28), which indicates that the random forest method has an advantage over linear regression only for some data models. Therefore, a larger amount of multi-centre data is needed to validate our outcomes in the field of GFR estimation.
Non-GFR determinants affect the performance of the eGFR equation. In other words, reducing the influence of non-GFR determinants can improve equation performance. It has been reported that multimetabolite panels can improve the performance of the eGFR equation even without age and gender as variables (11,29). Perhaps these factors can reduce the impact of non-GFR determinants. Possibly, we could incorporate more data dimensions to improve the prediction performance of the eGFR equation.
There were significant differences in CHOL, LDL-C, APOA and the use of glucocorticoids between the modelling and validation groups. According to the literature, blood lipid ratios and blood lipid disorders are indeed related to renal function (20,21). Glucocorticoids are well known as a treatment for many kidney diseases. Inevitably, these factors would interfere with the glomerular filtration rate. Therefore, these differences between the modelling and validation groups can have an impact on the performance of all equations. Fortunately, these variables were not included in the prediction equation. Of note, sex was not among the variables selected by RFE; it may not have been weighted heavily enough among these variables. As for race, most of the participants were Han Chinese, which limits the applicability of the equation.
Our study had several limitations. Firstly, data were collected only once. There might be measurement error when modelling on a single measurement, and medical data from the same patients should be collected and modelled repeatedly. Secondly, it may be impossible to obtain "true GFR", and Levey recommends plasma clearance of iohexol and urine clearance of iothalamate as mGFR for GFR equations (30,31). In our study, mGFR was measured and calibrated by 99mTc-DTPA renal dynamic imaging. This measurement method produces relatively obvious deviations at both high and low GFR (32,33). Hence, systematic and measurement errors could not be avoided. Besides, mGFR should be verified to reduce the measurement error. Repeated measurement of serum creatinine is also recommended (31,34). Finally, the samples of the validation set are from a single centre and are small in number. Furthermore, this study did not re-evaluate predictive efficacy in subgroups classified by age, diabetes and CKD stage.

Conclusions
Random forest RFE was used to select variables and to develop eGFR equations. The performance of the random forest regression models was better than that of the revised linear regression models based on the same variables. More attention should be paid to random forest regression because it might contribute to a specific improvement of the eGFR equation.

Availability of data and material
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
Research idea and study design: XL; data acquisition: XL, YTC; data analysis/interpretation: PJL, DY; statistical analysis: DY; supervision or mentorship: XL, WTH, SML, BHW and XL. Each author contributed important intellectual content during manuscript drafting or revision and accepts accountability for the overall work by ensuring that questions pertaining to the accuracy or integrity of any portion of the work are appropriately investigated and resolved. XL takes responsibility that this study has been reported honestly, accurately, and transparently; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained. All authors read and approved the final manuscript.

Competing interests
The authors declare no conflict of interest.

Consent for publication
Yes.

Ethics approval and consent to participate
The study was approved by the Ethics Committee of the Third Affiliated Hospital of Sun Yat-sen University, as described in detail in previously published articles.