Risk factors of hyperuricemia calculated by random forest machine learning

Objectives The present study aimed to develop a random forest (RF) based prediction model for hyperuricemia (HUA) and to estimate the associated risk factors. Methods This cross-sectional study recruited 91,690 participants (52,607 males, 39,083 females). The prediction models were derived from training sets using RF machine learning. The performance of the prediction models was evaluated in validation datasets. Significant indicators were identified by comparing the true positive set with the true negative set. Odds ratios were calculated by binary logistic regression models. Results The area under the receiver operating characteristic curve was 0.732 in males and 0.837 in females in the RF prediction models. The sensitivity, specificity and negative predictive value of the models were 0.686, 0.656 and 0.882 in males and 0.786, 0.738 and 0.978 in females, respectively. According to the feature value of each index in RF, a total of 10 explanatory variables were selected for each gender. Triglyceride, creatinine, body mass index, waist circumference, alanine transaminase, age, weight and total cholesterol were high-risk factors for HUA in both genders. Conclusion In the Chinese population, people at risk should actively control the above factors to reduce the risk of HUA.


Introduction
Uric acid (UA) is a metabolite of purines (ATP, GTP, and nucleic acids) circulating in the blood. The excretion of UA plays an important role in removing nitrogenous wastes from the body [1].
Hyperuricemia (HUA) can be caused by overproduction or underexcretion of UA [2]. Increased production may be the consequence of a high-purine diet, alcohol abuse or congenital enzyme deficiency. Reduced excretion can be caused by genetic defects, renal disease or drugs that interfere with UA excretion, such as diuretics and cyclosporine. Over the past 40 years, the level of serum UA (SUA) and the prevalence of HUA have risen sharply worldwide [3]. From 2000 to 2014, the prevalence of HUA in China was 13.3% (19.4% in males and 7.9% in females), and HUA was more common in urban than in rural residents, which may be closely related not only to the consumption of meat, seafood and alcohol but also to work type, commuting method and exercise frequency [4,5].
In addition, many studies have suggested that an elevated SUA level is associated with various diseases, including metabolic syndrome [6], dyslipidemia [7], chronic kidney disease [8] and cardiovascular events [9]. Moreover, many patients with HUA are known to develop gout eventually [10].
In order to elucidate the relationship between SUA and other variables, a previous study developed a prediction model for HUA based on Cox proportional hazards regression using 10 selected variables [11]. Given that the risk factors for HUA remain controversial, in this study we intended to establish an RF model from a large sample from the perspective of data mining, with the aim of effectively predicting the prevalence of HUA, comprehensively exploring its risk factors and guiding high-risk groups to take timely action. Our cross-sectional study did not screen risk factors through prior knowledge but directly established an RF model to analyze and compare the correlations of all included factors.
Data preprocessing comprised the following steps: deleting items containing vacant values, correcting items containing illegal values, normalization processing and data transformation (Appendix 1). The study cohort was divided randomly into training and validation datasets, with 80% of subjects assigned to training and the remaining 20% used for validation. An RF model was trained for predicting HUA using 35 baseline clinical variables, including continuous and categorical variables. RF is an algorithm that integrates multiple decision trees through the idea of ensemble learning; it determines a consensus prediction for each observation by aggregating the results of many individual recursive partitioning tree models [13,14]. Each individual tree is fitted to a randomly selected subset of the observations and uses a random subset of the available predictors at each node as candidates for splitting [13]. Finally, the RF model integrates the classifications of all trees and outputs the result based on a majority vote. The main advantages of RF are as follows. First, RF can handle input samples with high-dimensional features and does not need dimension reduction. Second, RF can assess the importance of each feature in classification. Third, RF can balance the error to a certain extent for an unbalanced dataset. Fourth, the models trained by RF can maintain accuracy even if there are many missing values. Last but not least, interactions or correlations between variables do not adversely affect RF classification, since it is capable of representing high-order interactions [15]. The analysis was performed in Python.
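The mechanics described above (random 80/20 split, bootstrap sampling, random predictor subsets and majority voting) can be illustrated with a small sketch. This is not the authors' implementation: it uses one-split trees (decision stumps) instead of full recursive partitioning, synthetic data instead of the study data, and all function names and parameter values (`fit_forest`, `n_trees=25`, `n_candidates=2`) are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def train_val_split(rows, train_frac=0.8, seed=0):
    """Shuffle a copy of rows and cut it into training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def fit_stump(rows, feature_ids):
    """Fit a one-split tree: choose the feature, threshold and direction
    with the fewest training errors among the candidate features."""
    best = None
    for j in feature_ids:
        for threshold in sorted({x[j] for x, _ in rows}):
            errors = sum((x[j] > threshold) != bool(y) for x, y in rows)
            # direction +1 predicts 1 above the threshold, -1 predicts 1 at or below it
            for err, direction in ((errors, 1), (len(rows) - errors, -1)):
                if best is None or err < best[0]:
                    best = (err, j, threshold, direction)
    return best[1:]


def predict_stump(stump, x):
    """Apply a fitted stump to one feature vector."""
    j, threshold, direction = stump
    above = x[j] > threshold
    return int(above if direction == 1 else not above)


def fit_forest(rows, n_trees=25, n_candidates=2, seed=0):
    """Each tree sees a bootstrap sample and a random subset of the features."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(rows) for _ in rows]
        candidates = rng.sample(range(n_features), n_candidates)
        forest.append(fit_stump(bootstrap, candidates))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual trees."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Synthetic data (not the study data): 4 features, label driven by feature 0.
data_rng = random.Random(42)
rows = []
for _ in range(200):
    x = [data_rng.random() for _ in range(4)]
    rows.append((x, int(x[0] > 0.5)))

train, val = train_val_split(rows)
forest = fit_forest(train)
predictions = [predict_forest(forest, x) for x, _ in val]
accuracy = sum(p == y for p, (_, y) in zip(predictions, val)) / len(val)
print(len(train), len(val), round(accuracy, 3))
```

Because a stump has a single node, drawing the candidate predictors once per tree is equivalent here to drawing them at each node; a full RF would repeat the draw at every split of a deeper tree.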

Evaluation Criteria
The discriminatory power of the models was analyzed by the receiver operating characteristic (ROC) curve. ROC curves were constructed by plotting the true positive fraction against the false positive fraction. Sensitivity (the probability of a positive test given that the individual has the disease), specificity (SPC, the probability of a negative test given that the individual does not have the disease), positive predictive value (PPV, the probability of having the disease given a positive test) and negative predictive value (NPV, the probability of not having the disease given a negative test) were calculated for each cutoff score [16].
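As a minimal sketch of these definitions, the following Python code computes the four measures at one cutoff and builds ROC points with a trapezoidal area under the curve. The labels and scores are made-up illustration values, and the function names are hypothetical rather than taken from the study code.

```python
def confusion_counts(y_true, scores, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        if s >= cutoff:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def roc_points(y_true, scores):
    """(false positive fraction, true positive fraction) for each cutoff."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(y_true, scores, cutoff)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Trapezoidal area under consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustration values only: 4 positives and 4 negatives with distinct scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

metrics = screening_metrics(*confusion_counts(y_true, scores, cutoff=0.5))
area = auc(roc_points(y_true, scores))
print(metrics)
print(area)  # 0.75
```

At the cutoff of 0.5 the toy example gives 3 true positives, 2 false positives, 2 true negatives and 1 false negative, so sensitivity is 0.75, specificity 0.5, PPV 0.6 and NPV 2/3.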

Study characteristics
Of the 91,690 participants, 73,351 were assigned to the training dataset, of whom 42,085 were males (57.4%). The validation dataset consisted of 18,339 subjects, of whom 10,522 were males (57.4%). All baseline variables in Table 1 were included in the RF model. The training and validation datasets were similar in terms of baseline covariates. The samples used for statistical analysis were the individuals in the validation dataset who were correctly classified by the RF model (Figure 1).
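The reported dataset sizes can be checked with simple arithmetic. The sketch below assumes that the 80/20 split was applied within each sex and rounded down; this is an inference from the reported counts, not a procedure stated in the text.

```python
# Participant counts reported in the abstract.
males, females = 52_607, 39_083

# Assumption: 80% of each sex goes to training, rounded down (integer arithmetic).
train_males = males * 8 // 10
train_females = females * 8 // 10
val_males = males - train_males
val_females = females - train_females

train_total = train_males + train_females
val_total = val_males + val_females

print(train_total, train_males, round(100 * train_males / train_total, 1))
# 73351 42085 57.4
print(val_total, val_males, val_females, round(100 * val_males / val_total, 1))
# 18339 10522 7817 57.4
```

Under this assumption the totals, the male counts and the 57.4% male share all match the reported figures, including the 7,817 females in the validation set.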

Evaluation of model predictive ability
Of the 10,522 males in the validation set, 2,306 (21.9%) had HUA, and the detection rate (sensitivity) was 68.6%. Of the 7,817 females in the validation set, 560 (7.2%) had HUA, with a detection rate of 78.6%. The NPV was 0.882 in males and 0.978 in females, which suggested that the model was suitable for initial screening.
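The reported NPVs follow from the sensitivity, specificity and prevalence through Bayes' rule. The short check below uses only figures reported in this paper; the function name `npv` is a hypothetical helper.

```python
def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from sensitivity, specificity and prevalence."""
    true_neg = specificity * (1 - prevalence)    # P(test negative and no disease)
    false_neg = (1 - sensitivity) * prevalence   # P(test negative and disease)
    return true_neg / (true_neg + false_neg)


# Reported values: sensitivity, specificity, and HUA cases / validation set size.
male_npv = npv(0.686, 0.656, 2306 / 10522)
female_npv = npv(0.786, 0.738, 560 / 7817)
print(round(male_npv, 3), round(female_npv, 3))  # 0.882 0.978
```

The higher NPV in females reflects the lower prevalence of HUA in that group (7.2% versus 21.9%) as well as the higher sensitivity and specificity of the female model.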

Participants with or without HUA
In this part, we analyzed variables from the true negative set and the true positive set to obtain a more distinct feature contrast. The ten variables with the highest feature values in each gender were selected. Clinical baseline characteristics classified by gender are shown in Table 2. All parameters showed significant differences except FPG in males (P > 0.05). BMI, weight and WC were significantly lower in individuals without HUA than in individuals with HUA. TG, Cr and ALT showed the same pattern in both genders, but the baseline levels of these characteristics were much higher in males than in females. In addition, the mean age of females with HUA (58 y) was much higher than that of males with HUA (43 y). The level of UA increased significantly with age in females (P < 0.001), whereas it showed a downward trend with increasing age in males.

Risks of developing HUA with different variables
In our study, we used binary logistic regression models to calculate the risk of developing HUA in men and women with different variables (Table 3). The crude OR included no covariates, while the adjusted OR included all 9 other selected variables as covariates.
TG, Cr, weight, BMI, ALT, TC and WC were associated with an increased risk of HUA in both genders. BUN and SBP were associated with an increased prevalence of HUA in females. Age was negatively correlated with prevalence in males and in females under 50 years old, but positively correlated with prevalence in females over 50 years old (Table 3). FPG in males showed no significant effect in the crude model, but a negative effect appeared when the other covariates were included. Furthermore, we calculated the risk of HUA by including the other covariates in model 2. Weight in females showed a positive effect with an OR of 1.182, but the effect disappeared when the other covariates were included.
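To make the notion of a crude OR concrete, the sketch below computes an odds ratio and its 95% Wald confidence interval from a 2x2 table. For a single binary predictor this equals the estimate from a univariable logistic regression; the variables in Table 3 may have been modelled differently, the counts are invented for illustration, and adjusted ORs require fitting a multivariable model, which is not shown.

```python
import math


def crude_odds_ratio(exposed_cases, exposed_noncases,
                     unexposed_cases, unexposed_noncases, z=1.96):
    """Cross-product odds ratio with a Wald (log-scale) confidence interval."""
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / exposed_noncases
                          + 1 / unexposed_cases + 1 / unexposed_noncases)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: HUA cases and non-cases with and without a risk factor.
odds_ratio, lower, upper = crude_odds_ratio(40, 60, 20, 80)
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

An interval that excludes 1, as in this toy table, corresponds to a significant association; an effect that "disappears" after adjustment is one whose adjusted interval includes 1.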

Discussion
HUA is defined by the finding of an abnormally high level of UA in the blood. HUA may be associated with factors from different aspects, such as laboratory variables, lifestyle habits (including food intake, drinking history and smoking history) and education [17,18]. Our previous study demonstrated correlations between lifestyle choices and HUA [4]. In the current study, we developed an RF-based prediction model for HUA and analyzed its associated risk factors. To our knowledge, this is the first study to predict HUA using RF.
One important concern is parameters related to lipid metabolism. In our survey, TG showed the highest weight (feature value) in the judgment of the RF models in both genders (Table 2). Elevated TG leads to the production and utilization of more free fatty acids, accelerating the synthesis of purines.
FPG ranked tenth in males. Previous studies of the relationship between UA levels and diabetes have yielded inconsistent findings, including positive, negative and non-significant relationships [23-25]. Our current study demonstrated an inverse association between FPG and HUA prevalence in males with diabetes (OR = 0.446, CI = 0.311-0.639), and a positive association in individuals with normal glucose tolerance, although the latter disappeared when other covariates were included (OR = 1.073, CI = 0.897-1.284). Similarly, a bell-shaped relation between FPG and SUA levels has been shown in several studies [26]. The possible mechanism for a positive relationship between glucose and UA may be related to the dual biological properties of UA. UA usually has an antioxidative effect; however, it becomes a strong oxidant in the environment of metabolic syndrome [27]. Inflammation and oxidative stress induced by metabolic syndrome and HUA may predispose individuals to a higher risk of diabetes [24]. The biological mechanism underlying the relation between higher FPG and SUA levels is thought to be the uricosuric effect of glycosuria [28]. Glycosuria occurs when glucose in the renal tubules exceeds the maximum absorption capacity, which inhibits the reabsorption of UA at the same site. It should be emphasized that it is glycosuria, rather than FPG itself, that leads to increased UA excretion. Further studies, especially of UA in the normal glucose tolerance group, are urgently needed.
ALT ranked fifth in males and sixth in females. ALT is closely related to intrahepatic fat deposition and has been widely considered a marker of nonalcoholic fatty liver disease (NAFLD) in epidemiological studies [29,30]. Many clinical studies have shown that HUA and NAFLD share similar metabolic disorders, including insulin resistance, dyslipidemia and visceral obesity [6,31,32]. Therefore, there may be a positive correlation between elevated SUA and elevated ALT. Another plausible explanation for the link between HUA and ALT elevation is oxidative stress [33]. The production of UA is accompanied by the production of reactive oxygen species. In patients with NAFLD, increased SUA levels may alter endogenous antioxidant defenses against liver fat peroxidation, thereby promoting the progression of liver injury and leading to elevated ALT [29,34].
The peak age of prevalence in females was opposite to that in males (Fig. 3). However, the curvilinear distribution of HUA prevalence in both sexes indicated that both may be affected by hormonal factors, with only a difference in degree. The uricosuric effect of estrogen has been widely recognized as one of the mechanisms underlying the gender discrepancy in the prevalence of HUA [35]. Multivariate analysis found a negative and a positive association between HUA prevalence and age in the female groups under and over 50 years old (OR = 0.894, CI = 0.852-0.938; OR = 1.046, CI = 1.007-1.086), respectively. However, whether androgen can independently affect SUA levels, as estrogen does, remains controversial. Urate transporter 1, a specific urate transporter, is expressed at higher levels in male mice than in female mice and is positively affected by testosterone (T) [36], and androgen has also been reported to play a role in promoting the catabolism of nucleotides [37]; both findings suggest that T levels are positively correlated with UA levels. Rosen et al. showed no difference in serum T levels between the asymptomatic HUA group and the normouricemic group [38]. However, a few studies found a negative association between T levels and SUA, which is consistent with our findings [36,39]. It has been reported that serum total T and free T concentrations fall by 0.8% and 2% per year in middle-aged men, which may provide an explanation for the high prevalence in older men [40]. Insulin resistance, obesity and alcohol intake may also be associated with the higher prevalence of HUA with age [41-43]. The phenomenon that the prevalence of HUA in males decreases from the third to the seventh decade and then shoots up needs further study. In addition, the higher values among young males may be a secondary consequence of other pathological changes [44].
SBP appeared among the selected variables only in females, where it ranked tenth. Previous studies have shown that SUA helps to maintain blood pressure through both acute renal vasoconstriction (via stimulation of the renin-angiotensin system) and chronic renal microvascular and interstitial disease (by inducing salt sensitivity via activation of the MAP kinase, PDGF and COX-2 systems) [45,46]. When renal microvascular disease continues to progress (a lesion resembling arteriolosclerosis) and sufficient narrowing of the arteriolar lumen occurs, a component of the hypertension becomes salt-driven, renal-dependent and independent of UA levels [47], which may also explain the non-significant association between SUA and hypertension in the older female group [48] and in males. In addition, Cr in male and female HUA patients in our study was significantly higher than that in non-HUA individuals. It has been reported that gout patients have lower Cr clearance and fractional UA excretion [49]. Obermayr et al. found that UA levels of 7-8.9 mg/dl nearly doubled the risk of incident kidney disease (OR = 1.74, CI: 1.45-2.09), while UA levels > 9.0 mg/dl tripled the risk (OR = 3.12, CI: 2.29-4.25) [50].
The study also has some limitations. Firstly, the dataset was based on a cross-sectional, single-center study, which may have selection bias and lack representativeness, and such a study cannot provide information on causality. Secondly, these ostensibly healthy subjects might have undiagnosed diseases that could influence SUA. Thirdly, information is lost when continuous variables are converted to categorical variables (individual SUA levels were labeled as having HUA or not).

Conclusions
In conclusion, we developed an RF-based prediction model for HUA in the general Chinese population using a cross-sectional dataset. The model demonstrated good stability and predictive power, and could be used to identify groups at high risk of HUA at an early stage and to provide early warning and intervention.

Table note: Feature value was rounded to three decimal places.
Figure 3 (caption): The prevalence of HUA by age varied greatly between genders. The peak age of prevalence in females was clearly different from that in males. The prevalence in women increased rapidly after menopause, while that in men was higher in the young and middle-aged years, then decreased, and then increased again after 70 y. In the general population, prevalence showed an increasing trend with age.
