Predicting Arteriovenous Fistula Non-Maturation in Hemodialysis Patients: Analytics of Inflammatory Markers and Serum Metabolic Values

Background: Population aging has increased the prevalence of diabetes and hypertension, leading to more cases of renal failure. Hemodialysis, as a method of renal replacement therapy, prevails by far over peritoneal dialysis (93.5% vs. 6.5%). Although an arteriovenous fistula (AVF) is frequently chosen as the vascular access route for chronic hemodialysis, it has limitations, including non-maturation. As maintaining an AVF is much more costly than creating it, foreseeing maturation failure can lead to a wiser allocation of patients to AVF surgery or other alternatives, with potential for significant cost containment.

Methods: We investigated the relationship of routinely available systemic inflammatory markers and baseline metabolic values in 107 end-stage renal disease patients over 35 years of age undergoing their first radio-cephalic AVF access surgery at wrist level for chronic hemodialysis. In this study, for the first time to our knowledge, we applied predictive analytic tools such as Random Forest to the retrospective analysis of data collected prospectively between 2011 and 2018.

Results: Our results showed that a combination of inflammatory markers and serum metabolic values can prognosticate AVF maturation outcomes with an accuracy of 0.723, a 95% confidence interval of (0.715, 0.731), and an AUC of 0.853. Also, a combination of inflammatory markers, including albumin, C-reactive protein, erythrocyte sedimentation rate, hemoglobin, lymphocytes, neutrophils, white blood cells, platelets, and red blood cell distribution width, can prognosticate AVF maturation outcomes with an accuracy of 0.674, a 95% confidence interval of (0.665, 0.684), and an AUC of 0.824.

Conclusion: Risk stratification of patients for AVF non-maturation before attempting the first AVF surgery may help prevent multiple surgical failures and costly endovascular interventions by allowing vascular surgeons to make an individualized choice of vascular access method for new patients.

Inflammation can disturb the compensatory mechanisms of the maturation process and lead to intimal hyperplasia and occlusion [7], [9]. The goal of this study is to explore the relationship between an assortment of routinely available systemic inflammatory and metabolic markers and the outcomes of the AVF maturation process.
Published studies leave some gaps in how the data are handled and interpreted. They do not acknowledge that the statistical analyses they use are parametric and therefore impose conditions on the data, one of which is that the variables should be normally distributed. This research aims to provide an analytical method suited to the actual features of the data so that reliable information can be extracted.
We aimed to assess the risk of AVF non-maturation non-invasively. As systemic inflammatory markers and serum metabolic values are routinely available, we focused on them. In addition to these variables, we also collected data on medications (anti-coagulant, anti-platelet, anti-hypertensive, anti-inflammatory, and hematopoietic drugs), disease histories (diabetes mellitus, heart disease, and hypertension), and systolic and diastolic blood pressures.

Methods
Research participants are ESRD patients dialyzed through a primary AVF. We used statistical and machine learning approaches to explore and analyze the data.

1) Participants and Variables
We retrospectively analyzed the archived data and the electronic charts of ESRD patients at Hasheminejad Kidney Center (HKC), the national tertiary referral center for urology and nephrology in Tehran, Iran. Patients from around the country are referred to this hospital for AVF surgery, so the electronic archives at this center are a reasonable representation of the population. We applied the following inclusion and exclusion criteria to 943 patients who had undergone AVF creation at HKC in 2011-2018 (Fig 1).
Patients with non-primary AVF were excluded to remove the vascular consequences of a preceding failed access. Patients under 35 were excluded to restrict the sample to the intended study population. Records with over 50% missing values were excluded to limit data noise. We limited this study to AVFs created at the wrist to form a more uniform case-mix (assuming that metabolic factors promoting accelerated atherosclerosis are likely to be more severe among failing larger vessels such as brachio-cephalic AVFs). All procedures were sutured with 6-0 Prolene, and the anastomosis technique was end-to-side. All the patients had initially provided informed consent for access to their archived data for research purposes. The Ethics Committee of Hasheminejad Clinical Research Development Center approved the research.
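The exclusion criteria above can be read as a sequence of record filters. The following is a minimal sketch of that idea only: the record layout and every field name (`primary_avf`, `age`, `missing_fraction`, `site`) are hypothetical, since the schema of the HKC archive is not described here, and the toy records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # All field names are hypothetical; the archive schema is not described.
    patient_id: int
    primary_avf: bool        # True if this is the patient's first AVF
    age: int                 # age in years at surgery
    missing_fraction: float  # share of variables with no recorded value
    site: str                # anatomical site of the AVF, e.g. "wrist"


def is_eligible(rec: PatientRecord) -> bool:
    """Apply the exclusion criteria described in the text, one by one."""
    if not rec.primary_avf:          # non-primary AVF excluded
        return False
    if rec.age < 35:                 # patients under 35 excluded
        return False
    if rec.missing_fraction > 0.5:   # over 50% missing values excluded
        return False
    if rec.site != "wrist":          # only wrist-level AVFs retained
        return False
    return True


def select_cohort(records):
    """Return the records that pass every criterion."""
    return [r for r in records if is_eligible(r)]


# Invented toy records, one per exclusion reason plus one eligible patient.
records = [
    PatientRecord(1, True, 62, 0.10, "wrist"),
    PatientRecord(2, False, 58, 0.05, "wrist"),   # non-primary AVF
    PatientRecord(3, True, 30, 0.00, "wrist"),    # under 35
    PatientRecord(4, True, 71, 0.60, "wrist"),    # too many missing values
    PatientRecord(5, True, 66, 0.20, "elbow"),    # not at the wrist
]
cohort = select_cohort(records)
print([r.patient_id for r in cohort])  # [1]
```

Keeping each criterion as a separate check makes it straightforward to count how many records each step removes, which is what an inclusion/exclusion flow diagram reports.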
From 934 patients of the sampling frame, 549 had primary AVFs. Four hundred five of them had a determined maturation outcome (matured or failed). Three hundred thirty-eight of the 405 patients were aged above 35 years; among them, 315 did not experience early failure. Two hundred ninety-four of the 315 patients had fewer than 50% missing values, and 190 of them had radio-cephalic AVFs. Since surgeon skill is a risk factor in AVF maturation, we considered only the 114 patients operated on by one specific vascular surgeon (Fig 2).

Measurements were collected by (version 21.0.0.0, 32-bit edition). We used R (version 3.6.1, 2019-07-05) and CRAN packages for analyzing the data. Table I presents the characteristics of the 114 included patients. Table II presents the explicit and implicit inflammatory markers mentioned in the literature. In this study, we considered inflammatory markers, other serum metabolic values, medications, pre-operation blood pressure, and disease histories that were available in the electronic archives (Table III, Table IV, and Table V). From among the lab tests recorded during the hospitalization interval, we collected the test closest to the operation time (pre-operative, at dialysis initiation).

The target variable is the outcome of the maturation process, which takes the value "matured" or "failed." Our criteria for maturation were (1) easily palpable superficial vein, (2) vein relatively straight, (3) adequate diameter for easy two-needle cannulation (3-4 mm), (4) proper length (≥10 cm, for sufficient distance between the two needles), and (5) uniform thrill to palpation and auscultation [10].

2) Data Analysis
In the pre-processing phase, we removed two variables (creatine phosphokinase and lactic acid dehydrogenase) with more than 80% missing values, seven records with more than 50% missing values, and five medications with zero variance (Plavix-Osvix, Warfarin, Digoxin, Enalapril-Captopril, and Simvastatin). Finally, missing values were imputed, and the data were balanced (Fig 3). In medical studies, we often encounter imbalanced classes of data (Belarouci and Chikh, n.d.). Class imbalance heavily compromises the learning process of predictive models because they tend to focus on the prevalent samples and ignore the rare events (Menardi and Torelli, 2014). Hence, prediction models trained on imbalanced data are highly prone to producing inaccurate results (He and Garcia, 2009).
Prior studies applied class rebalancing techniques to mitigate the risk of imbalanced data. Tantithamthavorn and Matsumoto investigated the impact of rebalancing techniques on the performance and interpretation of prediction models and found that these techniques have little effect on popularly used classification models like logistic regression and random forest. In this study, we used the Synthetic Minority Oversampling Technique (SMOTE) to over-sample the minority class and under-sample the majority class. After rebalancing, we had 34 failed and 34 matured samples.
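The over-sampling half of SMOTE creates each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The following is a simplified sketch of that interpolation step, not the package or settings used in this study: it draws the base sample at random, omits the under-sampling of the majority class, and uses invented toy values; the function name `smote` and the parameters `k` and `seed` are illustrative choices.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Other minority samples, nearest to `base` first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Invented two-feature minority-class samples (not study data).
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (1.5, 2.5)]
new_samples = smote(minority, n_synthetic=10, k=2)
print(len(new_samples))  # 10
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by the minority class rather than duplicating existing records.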
After pre-processing, the data were ready for analysis. To choose an appropriate analysis method, we first performed an exploratory analysis. Based on its results, we concluded that the relationships between variables are non-linear and that most variables are not normally distributed. In medical studies, logistic regression is a commonly used method for classification. However, for a non-linear classification problem with few samples and many features, it is not the right choice: capturing the non-linearity requires many additional feature terms, which may lead to over-fitting and is computationally expensive. The random forest model is a much better way to learn complex non-linear hypotheses, even when the feature space is large and the sample size is small.

This model combines a group of decision trees to achieve better accuracy and stability. Predictive models usually suffer from bias and variance errors. As the complexity of a predictive model increases, its bias decreases and its variance increases. The random forest model balances bias and variance by building every tree in the forest randomly and growing it fully. The trees are at the highest complexity, which reduces the bias. Because each tree is built randomly, the noise it fits is specific to that tree and is not aggregated across the forest. We used this model to identify informative factors in predicting maturation outcomes. We also extracted frequent and accurate rules from the model for use in decision making.

1) Descriptive Analysis
In this section, statistical methods such as a normality check and principal component analysis (PCA) were employed to explore the data.

A. Central Tendency and Dispersion
The mean and standard deviation calculated for the independent variables are given in Tables VI, VII, and VIII. The results of the normality check are given in Tables IX, X, and XI. From these, we can conclude that most of the data do not follow the normal distribution.

C. Principal Component Analysis
Using PCA, we investigated the relationship between variables. The measures showed that there is no dominant direction of variation, so the variables vary almost independently, and there is no redundant dimension (Table XII).

D. Correlation Analysis
Highly correlated independent variables may interfere with each other when a model is being interpreted [11]. Jiarpakdee et al. claimed that removing correlated independent variables improves the consistency of the highest-ranked variables regardless of how a model is specified and negligibly impacts the performance and stability of models [12]. Fig 4 shows the hierarchically clustered correlation matrix of independent variables. Three clusters were formed, which are presented in Table XIII. Since the variables within each cluster are highly correlated, we selected one variable in each cluster and removed cholesterol, neutrophils, and red blood cells.
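The idea of pruning highly correlated variables can be illustrated with a simple greedy filter on pairwise Pearson correlations. This sketch is a simpler stand-in for the hierarchical clustering used above, not a reproduction of it: the 0.8 threshold, the function names, and the toy values are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Greedy filter: keep a variable only if its absolute correlation
    with every already-kept variable is at or below the threshold.
    The threshold value is an assumption, not taken from the study."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Invented toy values (not study data); red_blood_cells tracks hemoglobin.
columns = {
    "hemoglobin": [9.1, 10.4, 11.2, 8.7, 12.0],
    "red_blood_cells": [3.0, 3.5, 3.8, 2.9, 4.1],
    "sodium": [140, 136, 139, 137, 141],
}
print(drop_correlated(columns))  # ['hemoglobin', 'sodium']
```

Which member of a correlated group is kept depends on the order of the columns in this greedy version; a clustering-based approach, as used in the study, lets the analyst choose the representative of each cluster explicitly.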

2) Predictive Analysis
Data distribution is a determinative factor in the choice between parametric and non-parametric methods. The assumption of normality needs to be checked for many statistical procedures, especially parametric tests, because their validity depends on it [13]. If we are not sure, or we suspect, that the distributions are not normal, it is better to use non-parametric methods. Also, considering the results of PCA, the relationships between variables are non-linear, so we should choose a non-parametric method suited to analyzing non-linear relationships. In the data analysis section, we argued that the random forest model is the right choice for non-linear and non-normal data. We therefore modeled the data using this method, as described in the following.

A. Random Forest
A random forest model containing a thousand decision trees was built. The splitting criterion for this model was the Gini decrease, and the number of variables randomly selected at each split was two. The mean accuracy of the model was 0.723, with a 95% confidence interval of (0.715, 0.731) and an AUC of 0.809. Variable importances calculated by the mean Gini decrease criterion are presented in Table XIV. We reduced the feature space to the 33 most important variables and then built a new random forest model (1000 trees, Gini criterion, 2 randomly selected variables at each split). This time, the mean accuracy was 0.71, with a 95% confidence interval of (0.701, 0.719) and an AUC of 0.853.

We extracted the decision rules from all the decision trees and selected a concise set of rules with low length, high frequency, and small error rate (Table XV). For this task, we used the "inTrees" package from CRAN. Given the characteristics of the data (non-linearity and non-normality), the random forest model was the right choice for modeling.

The importances of variables in predicting the maturation outcomes showed that medications do not have a considerable ability to discriminate between the two classes of "failed" and "matured" (Fig 4). In the following, we investigated the effect of highly important predicting variables such as creatinine, red cell distribution width, high-density lipoprotein, sodium, ferritin, serum iron, and total protein on the outcome of the AVF maturation process. Finally, we investigated the ability of inflammatory markers to predict the outcomes of the AVF maturation process.
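Both the Gini splitting criterion and the mean-decrease-in-Gini importance rest on the same quantity: how much a split reduces the Gini impurity of the class labels. The following sketch shows that calculation for a single split on invented labels; the function names are illustrative and this is not the implementation used in the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity
    of the two child nodes produced by a split."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Invented node with four "failed" and four "matured" samples.
left = ["failed", "failed", "failed", "matured"]
right = ["failed", "matured", "matured", "matured"]
parent = left + right

print(gini(parent))                        # 0.5
print(gini_decrease(parent, left, right))  # 0.125
```

A variable's mean decrease in Gini is obtained by accumulating such decreases over every split that uses the variable and averaging over the trees in the forest, so variables that repeatedly produce purer child nodes rank higher.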
Creatinine. Fig 5 illustrates the boxplot of creatinine in the two classes of "failed" and "matured." This figure shows that creatinine spans the same range in both classes, but the mean of creatinine is considerably lower in the "matured" class. There is no rule containing creatinine in Table XV. Considering this figure, we can conclude that creatinine may have a negative effect on the maturation process. However, Duque et al. did not find any association between creatinine level and AVF primary failure [14]. Conz et al. found that uremic patients (higher creatinine and blood urea nitrogen) have a slower maturation process [15], which is consistent with our results.
Red blood cell distribution width. Fig 6 shows that red blood cell distribution width has a considerably lower mean in the "matured" class. Extracted rules containing red blood cell distribution width [...] [16]. [...] [19], [20]. [...] [22]. [...] failure. No study has investigated the effect of iron on the AVF maturation process, but the literature mentions a possible association between high iron marker levels and poor cardiovascular outcomes [23].

Total protein. Fig 11 shows higher values of total protein in the "failed" class. Extracted rules containing total protein (Table XXI) indicate that higher total protein levels are associated with AVF maturation failure. Kaygin et al. found no relationship between total protein and AVF maturation outcomes [24]. Also, Sari et al. found no correlation between total protein level and AVF thrombosis [25].

Inflammatory markers. Two inflammatory markers, red blood cell distribution width and albumin, have considerable importance in predicting the AVF maturation outcome. Other inflammatory markers are also among the predicting variables but with a lower level of significance. We built a random forest model from a combination of inflammatory markers, including albumin, C-reactive protein, erythrocyte sedimentation rate, hemoglobin, lymphocytes, neutrophils, white blood cells, platelets, and red blood cell distribution width. The mean accuracy of the model was 0.674, with a 95% confidence interval of (0.665, 0.684) and an AUC of 0.824. This result shows that a combination of inflammatory markers can predict AVF maturation outcomes with acceptable accuracy.
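The mean accuracies above are reported with 95% confidence intervals, although the way the intervals were obtained is not stated. One common approach, assumed here purely for illustration, is a normal-approximation interval around the mean of accuracies from repeated resampling runs. The per-run accuracies below are invented and the function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Mean of the values with a normal-approximation interval:
    mean +/- z * (sample standard deviation / sqrt(n)).
    z = 1.96 corresponds to a 95% interval."""
    m = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, (m - half_width, m + half_width)


# Invented per-run accuracies (not results from the study).
accuracies = [0.66, 0.68, 0.67, 0.69, 0.65]
mean_acc, (lower, upper) = mean_confidence_interval(accuracies)
print(round(mean_acc, 3), round(lower, 3), round(upper, 3))
```

The interval narrows as the number of runs grows, which is why a tight interval around a mean accuracy reflects the stability of the averaged estimate rather than the accuracy of any single model.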

Conclusion
The present study had several limitations. Our study was retrospective and restricted to one referral center. We had a small sample size, a considerable number of missing values, and imbalanced data. Imputation and sampling added extra noise to the data. We did not control for all the risk factors, which may have biased the results.
Larger data sets help machine learning algorithms find better patterns in the data and improve their accuracy, so this study should be repeated with a much larger sample size. Finally, we conclude that serum values, including inflammatory markers, can predict AVF maturation outcomes with acceptable accuracy. This prediction helps surgeons avoid allocating patients in whom the AVF is destined to fail to this vascular access in the first place, which can potentially enable substantial cost savings.

Consent to publish
All the patients had initially provided informed consent for access to their chart data for research purposes.

Availability of data and material
The data that support the findings of this study are available from Hasheminejad Kidney Center (HKC), but restrictions apply to the availability of these data, which were used under license for the current research, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of HKC.

Figure 2
The inclusion/exclusion process for ESRD patients in this study.

Figure 3
Hierarchically clustered correlation matrix of independent variables.

Figure 4
Variables importance based on mean decrease Gini extracted from the random forest model.

Figure 5
The boxplot of creatinine in two classes of "failed" and "matured."

Figure 6
The boxplot of red blood cell distribution width separated by classes of "failed" and "matured."

Figure 7
The boxplot of high-density lipoprotein in two classes of "failed" and "matured."

Figure 8
The boxplot of sodium in two classes of "failed" and "matured."

Figure 9
The boxplot of ferritin in two classes of "matured" and "failed."

Figure 10
The boxplot of serum iron in two classes of "failed" and "matured."

Figure 11
The boxplot of total protein in two classes of "failed" and "matured."