Clinical and Genetic Determinants of Heart Failure: Optimized by Machine Learning and Mendelian Randomization

Background: Identifying unrecognized, potentially modifiable risk factors is essential for heart failure (HF) management. Methods: The Atherosclerosis Risk in Communities (ARIC) study was used for machine learning (ML) to establish the top 20 important variables as potential risk factors for HF. Multivariable Cox regression analysis was performed in an explorative manner to find independent factors for HF, and Mendelian randomization (MR) analysis was used to address causality. Results: Of the 14,842 participants included in the ARIC analysis, 20.4% (3,028) were identified as having HF. The 20 variables with the highest importance selected by ML were creatinine, glucose, age, previous coronary artery disease (CAD), systolic blood pressure, fibrinogen, albumin, income, diabetes, magnesium, insulin, white blood cell count, hemoglobin, sodium, education, phosphorus, diastolic blood pressure, protein C, heart rate and body mass index (BMI). Cox regression analysis demonstrated that 19 of these variables, all except sodium, were independently associated with HF. MR analysis provided evidence that genetically determined BMI, CAD, diabetes and education were causally associated with HF. Conclusions: The ML plus MR framework was useful in identifying important causal factors of HF. BMI, CAD, diabetes, and education not only served as excellent prognostic factors for HF, but therapeutics targeted at these factors were also likely to prevent HF effectively.


Background
Heart failure (HF) is considered an epidemic disease in the modern world, affecting approximately 1-2% of the adult population. 1 It is the most common cardiovascular cause of hospital admission among people older than 60 years. 2 Traditional risk factors for HF mainly include coronary artery disease (CAD), diabetes, hypertension, valvular heart disease, arrhythmia, hypertrophic cardiomyopathy, and inflammatory diseases. 1 Recently, socioeconomic status has been reported to affect health outcomes. 3 A previous Mendelian randomization (MR) study demonstrated that low education is a causal risk factor in the development of CAD, 4 which is the major cause of HF. This implies that some socioeconomic factors may also serve as novel risk factors for HF. Identifying unrecognized, potentially modifiable risk factors is essential for HF management and is likely to improve the outcomes of HF patients.
Machine learning (ML), a field of computer science that uses algorithms to identify patterns with lower variance and bias, is a useful approach for identifying the best predictors among hundreds of complex phenotypic variables and for building more accurate data-driven models. 5 MR analysis, which uses genetic variants as instrumental variables to test for causality, can infer credible causal associations. 6 In this study, the Atherosclerosis Risk in Communities (ARIC) study was used for ML to establish the top 20 variables as potential risk factors for HF. Multivariable Cox regression analysis was performed in an explorative manner to find independent factors for HF, and MR analysis was used to address causality.

Study population
The ARIC study is an ongoing prospective, observational, community-based study of the natural history of atherosclerotic diseases and cardiovascular risk factors. 7 In brief, the original cohort was recruited between 1987 and 1989 using probability sampling of 15,121 middle-aged (age 45 to 64 years) adults from 4 U.S. communities. Follow-up visits were carried out in 1990-1992 (93% return rate), 1993-1995 (86%), 1996-1998 (80%) and 2011-2013 (65%). Institutional review boards of all field centers approved the study protocol, and all participants gave informed consent.
The current study analyzed data from individuals who participated in ARIC visit 1 (1987-1989). HF was assessed after ARIC visit 5 with follow-up through December 31, 2015. The median (maximum) follow-up period for HF was 25.05 (28.12) years. For this analysis, the adaptive tree imputation method was used to impute missing data, and variables available in less than 40% of the population were excluded. Of the 15,121 individuals who completed visit 1, we excluded 279 participants with more than 60% missing data, and the remaining 14,842 participants were included in this analysis.

Definitions of phenotypic variables
We included all phenotypic variables available from questionnaires, physical examination, laboratory biochemistry, electrocardiography (ECG) and ultrasonography tests at visit 1 as potential risk factors for HF. 8 Supplement Table 1 provides a list of the variables used.

Definition of incident HF
In the ARIC study, telephone interviewers contacted participants every 6 to 9 months to inquire about all interim hospital admissions, outpatient diagnoses, and deaths. 9 Two physicians reviewed all medical records for independent endpoint classification and assignment of event dates. Disagreements between discharge coding and the computer algorithm were adjudicated by the ARIC Mortality and Morbidity Classification Committee. Incident HF, including all subtypes, was defined as the first occurrence of a HF diagnosis in hospitalization records or death certificates, identified by ICD-9 code 428 (428.0-428.9) or ICD-10 code I50.

Machine learning for variable selection
A total of 300 phenotypic variables were included after eliminating duplicate and meaningless variables. Variable selection was performed using the Random Survival Forest (RSF) algorithm, an ensemble tree-based method for the analysis of right-censored data. 10 While RSF is typically used for prediction, it is also an efficient variable selection technique. 11 Variable importance was ranked by the mean of the minimal depth of the maximal subtree over the entire forest; variables that split closer to the root of a tree have a higher rank (and hence are more important). 12 Using locally weighted scatterplot smoothing (LOWESS) curves and bar plots in non-parametric regression, we assessed possible nonlinear associations between the survival probability calculated by the RSF method and the values of the top-20 variables.
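
The minimal-depth ranking can be illustrated with a small sketch. This is not the RSF implementation used in the study: the toy trees, the variable names in them, and the helper functions are hypothetical, and the convention of assigning a tree's full depth to a variable that never splits in that tree is an assumption made for illustration.

```python
from statistics import mean

# Hypothetical toy trees: an internal node is (variable, left, right); a leaf is None.

def first_split_depths(tree, depth=0, found=None):
    """Return {variable: depth of its split closest to the root} for one tree."""
    if found is None:
        found = {}
    if tree is None:  # leaf: nothing to record
        return found
    var, left, right = tree
    if var not in found or depth < found[var]:
        found[var] = depth
    first_split_depths(left, depth + 1, found)
    first_split_depths(right, depth + 1, found)
    return found

def tree_depth(tree):
    """Depth of the deepest leaf, with the root at depth 0."""
    if tree is None:
        return 0
    _, left, right = tree
    return 1 + max(tree_depth(left), tree_depth(right))

def rank_by_minimal_depth(forest, variables):
    """Average each variable's minimal depth over the forest; smaller means more important."""
    scores = {}
    for v in variables:
        per_tree = []
        for tree in forest:
            depths = first_split_depths(tree)
            # Assumption: a variable unused in a tree is assigned that tree's depth.
            per_tree.append(depths.get(v, tree_depth(tree)))
        scores[v] = mean(per_tree)
    return sorted(scores.items(), key=lambda kv: kv[1])

forest = [
    ("creatinine", ("age", None, None), ("glucose", None, None)),
    ("creatinine", ("glucose", ("bmi", None, None), None), None),
    ("age", ("creatinine", None, None), None),
]
ranking = rank_by_minimal_depth(forest, ["creatinine", "glucose", "age", "bmi"])
```

In this toy forest, the variable that most often splits at or near the root receives the smallest mean minimal depth and is therefore ranked first.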

Cox regression analysis for independent variables
Multivariable Cox regression analyses were performed in an explorative manner to find independent factors for HF. The top 20 variables selected by ML above were entered into the multivariable analysis. A backward stepwise procedure was performed using P>0.10 of the likelihood ratio test for exclusion.

Mendelian randomization for causal analysis
Based on the Cox regression analysis, the 19 independent variables (all except serum sodium) were selected for causal association analysis by the MR method.
We included summary genome-wide association study (GWAS) data from any array-based analysis, including targeted and untargeted arrays, with or without additional imputation for single nucleotide polymorphisms (SNPs). We also collected published GWAS associations that comprise only the significant hits of a GWAS after applying stringent P value thresholds (e.g., P < 5 × 10^-8, a conventional threshold for declaring statistical significance in GWAS), using the clumping algorithm (r^2 threshold = 0.05 and window size = 1 Mb). Data included in this study were the GWAS summary statistics from the MR-Base platform. 13 Details of the studies and datasets used for analyses are presented in Supplement Table 3.
We then performed MR using a strategy known as two-sample MR, which uses results from GWAS. 13 We explored the associations in the following scenarios: 1) Causal associations between the potential factors and HF: we applied the inverse-variance weighted (IVW) method to derive causal estimates. 15 2) Heterogeneity: we conducted heterogeneity tests in MR analyses using the IVW and MR-Egger approaches. 3) Horizontal pleiotropy: this refers to a genetic variant being associated with traits on separate pathways that are also causal for the disease. 16 It was evaluated by the P value of the MR-Egger intercept.
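
As a rough illustration of the three quantities named above, the following sketch computes a fixed-effect IVW estimate, Cochran's Q for heterogeneity, and the MR-Egger intercept from per-SNP summary statistics. It is not the MR-Base implementation: the function names and the three example SNPs are hypothetical, the SNPs are assumed to be independent and oriented so that the exposure effects are positive, and the weights 1/se_y^2 ignore uncertainty in the SNP-exposure estimates.

```python
import math

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate, its standard error, and Cochran's Q.

    beta_x: SNP-exposure effects; beta_y: SNP-outcome effects;
    se_y: standard errors of the SNP-outcome effects.
    """
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]  # per-SNP Wald ratios
    beta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    # Cochran's Q: compare with a chi-square distribution on len(beta_x) - 1 df.
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    return beta, se, q

def egger_intercept(beta_x, beta_y, se_y):
    """Weighted regression of beta_y on beta_x with an intercept (weights 1/se_y^2).

    Returns (intercept, slope); an intercept far from zero suggests
    directional horizontal pleiotropy.
    """
    w = [1.0 / sy ** 2 for sy in se_y]
    sw = sum(w)
    x_bar = sum(wi * bx for wi, bx in zip(w, beta_x)) / sw
    y_bar = sum(wi * by for wi, by in zip(w, beta_y)) / sw
    sxx = sum(wi * (bx - x_bar) ** 2 for wi, bx in zip(w, beta_x))
    sxy = sum(wi * (bx - x_bar) * (by - y_bar)
              for wi, bx, by in zip(w, beta_x, beta_y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical summary statistics for three SNPs (not data from this study).
bx = [0.1, 0.2, 0.3]
by = [0.05, 0.10, 0.15]
sy = [0.01, 0.01, 0.01]
beta_ivw, se_ivw, q_stat = ivw_estimate(bx, by, sy)
intercept, slope = egger_intercept(bx, by, sy)
```

Because every example SNP has the same Wald ratio, Q and the Egger intercept are both close to zero here, which is the pattern expected when there is neither heterogeneity nor directional pleiotropy.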
The "causal" relationship was considered established if the observed association was supported by the IVW estimate and showed neither heterogeneity nor horizontal pleiotropy. We used two additional MR methods, weighted median and MR-Egger, for methodological sensitivity analysis. 15

Statistical analysis
For all analyses, surviving patients were right censored at their follow-up date. Data transformation, indexing, and imputation were performed as necessary to generate data points to predict outcomes over the follow-up period. Using the imputed dataset, each continuous variable was centered on the mean and scaled by the standard deviation, whereas categorical variables were coded as binary numbers (0 and 1). Descriptive data were presented as the mean ± SD for normally distributed variables and median (25th, 75th percentile) for non-normally distributed variables. P values were 2-sided, and evidence of association was declared at P < 0.05. Analyses were performed using R software (www.r-project.org) and Stata release 13.1 (StataCorp LP).
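
The preprocessing step described above (centering and scaling of continuous variables, binary coding of categorical variables) can be sketched as follows. The helper names and example values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample SD with n - 1 in the denominator
    return [(v - m) / s for v in values]

def binary_code(labels, positive):
    """Code a two-level categorical variable as 1 (positive level) or 0."""
    return [1 if label == positive else 0 for label in labels]

# Hypothetical example values (not ARIC data).
age_z = standardize([50, 54, 58])
diabetes = binary_code(["yes", "no", "no"], positive="yes")
```

After this transformation, a coefficient for a continuous variable is interpreted per one standard deviation increase rather than per original unit.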

Baseline characteristics
Baseline characteristics of the study sample are shown in Table 1. Among the 14,842 participants included in the analysis, the mean age was 54.2 years, 45.2% were male, and 26.2% were black. At visit 5, 20.4% of participants (3,028) were identified as having HF.

Important variables selection and Cox regression analysis
As presented in Figure 1, the top-20 variables with the highest importance selected by RSF for HF were creatinine, glucose, age, previous CAD, systolic blood pressure, fibrinogen, albumin, income, diabetes, magnesium, insulin, white blood cell count, hemoglobin, sodium, education level, phosphorus, diastolic blood pressure, protein C, heart rate, and body mass index (BMI). These 20 variables were then entered into multivariable Cox analysis (Table 2) and non-linear analysis (Supplement Figure 1). Cox regression analysis showed that 19 variables, all except sodium, were independently associated with HF. LOWESS curves revealed nonlinear associations between the survival probability and the values of heart rate, insulin and BMI (Supplement Figure 1).

Mendelian randomization
These 19 variables were then selected for causal association analysis by MR (Table 3). The IVW estimate indicated that the odds ratio (OR) (95% confidence interval [CI]) for HF was 1.001 (1.001-1.002) per kg/m^2 increase in BMI. Results were consistent with the weighted median method (OR, 1.001; 95% CI, 1.000-1.002; P=0.002). Both the IVW and MR-Egger estimates indicated no heterogeneity among the 304 SNPs in the causal effect of BMI on HF (P=0.165 and P=0.158, respectively). Moreover, there was no evidence of directional horizontal pleiotropy (MR-Egger intercept P=0.643). Our two-sample MR analysis thus provided evidence that genetically predicted BMI was causally associated with HF.
Notably, the IVW estimate indicated that the OR (95% CI) for HF was 0.999 (0.998-1.000) per standard deviation increase (3.6 years) in education. Results were consistent with the weighted median method (OR, 0.998; 95% CI, 0.996-0.999; P=0.003), without evidence of heterogeneity or directional horizontal pleiotropy (MR-Egger intercept P=0.999). Thus, genetically predicted education was negatively and causally associated with HF. Similar results were also detected for CAD, diabetes and income.
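
For readers less familiar with how MR estimates are reported, the sketch below shows the usual conversion of a log odds ratio and its standard error into an OR with a 95% CI. The function name and the input values are hypothetical and are not taken from Table 3; a normal approximation with z = 1.96 is assumed.

```python
import math

def odds_ratio_with_ci(log_or, se, z=1.96):
    """Convert a log odds ratio and its standard error to (OR, lower, upper)."""
    return (
        math.exp(log_or),
        math.exp(log_or - z * se),
        math.exp(log_or + z * se),
    )

# Hypothetical inputs chosen only to illustrate the arithmetic.
or_est, ci_low, ci_high = odds_ratio_with_ci(log_or=0.001, se=0.0001)
```

An interval that excludes 1 corresponds to a two-sided P value below 0.05 under the same normal approximation.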

Discussion
In this study, we used an ML approach in a hypothesis-free manner to identify important factors for HF in the ARIC study. This powerful method confirmed several well-established relationships and identified a variety of novel factors that have not been previously reported. We then used Cox regression analysis in an explorative manner to find independent factors and MR analysis to address causality. Our findings revealed that BMI, CAD, diabetes, and education not only served as prognostic factors for HF but are also potential therapeutic targets for the treatment and prevention of HF.

Established and novel prognostic factors for HF
HF prevalence has increased exponentially over the last three decades. This increase is attributable to several factors, including an aging population and recent advances in the treatment of cardiovascular disease that have increased survival following an acute cardiac event. 17 Prior studies have yielded inconsistent predictors of HF. Our ML results indicated that the 20 variables with the highest importance selected by RSF for HF were creatinine, glucose, age, previous CAD, systolic blood pressure, fibrinogen, albumin, income, diabetes, magnesium, insulin, white blood cell count, hemoglobin, sodium, education level, phosphorus, diastolic blood pressure, protein C, heart rate and BMI. These results confirmed several established risk factors reported previously (e.g., age, history of CAD, diabetes, BMI, hemoglobin, total white blood cell count, creatinine 18 and hypertension). 19, 20 Yet, to our surprise, a recent study in 20,254 US male veterans revealed that increased cardiorespiratory fitness was associated with progressively lower HF risk regardless of BMI, challenging the view of BMI as a well-established risk factor for HF. 21 Our result differed from that study, possibly because of the different populations included.
As for glucose impairment, a positive, continuous, and independent association between fasting plasma glucose and risk of HF has been reported. 22 The British Regional Heart Study, carried out in older men, demonstrated that serum magnesium was inversely related to the risk of incident HF after adjustment for conventional CVD risk factors and incident MI. 23 Our results also support a positive association between glucose and HF and a negative association between magnesium and HF. A number of other well-established factors reported in the literature, including smoking, atrial fibrillation, and chronic obstructive pulmonary disease, 1, 19 were not observed among the top 20 predictors by the ML approach. That they were not selected does not mean that these traditional risk factors are unimportant for HF, as most risk factors have adverse effects on cardiac structure, which ultimately result in HF.
It was noteworthy that several novel predictors of HF were identified, including fibrinogen, albumin, income, education, phosphorus and protein C. Fibrinogen has been reported to be suggestively associated with incident HF with preserved ejection fraction (HR 1.12; 95% CI 1.03-1.22; P=0.01). 24 A prospective study of 3,366 men found that fibrinogen was associated with incident HF, but the association was abolished after adjustment for HF risk factors. 25 Thus, the role of fibrinogen as a risk factor for HF remains controversial.
Many epidemiological studies have suggested an inverse association between serum albumin level and HF. In the Health ABC study of 2,907 elderly individuals with 9.4 years of follow-up, a low serum albumin level was associated with the development of new-onset HF, mainly with preserved ejection fraction, regardless of inflammatory markers, BMI and CHD. 25 Thus, a low albumin level might serve as a novel predictor of increased risk of HF. In a very large population (N = 7,638,524) of chronic HF patients with access to universal healthcare, lower income was independently associated with higher mortality. 27 Another study conducted in 54 countries reported that greater income inequality was associated with worse HF outcomes, with an impact similar to those of major comorbidities. 28 More importantly, a previous MR study demonstrated that genetic predisposition towards 3.6 years of additional education was associated with a one-third lower risk of CAD, supporting low education as a causal risk factor in the development of CAD, 29 which is the major cause of HF. These studies suggest that socioeconomic status (income and education) might also affect HF, in addition to the traditional risk factors.
Low serum magnesium and high serum phosphorus were found to be independently associated with a greater risk of incident HF in the ARIC cohort, 30 in accordance with our results. Our multivariable analysis revealed that protein C was slightly negatively associated with HF. In contrast, a prospective case-control study involving 50 children demonstrated a significant increase in plasma levels of cardiac myosin binding protein-C in patients with HF, and this increase was associated with greater severity of HF, indicating a positive association between cardiac myosin binding protein-C and HF. We speculate that the discrepancy may be due to the different HF populations and the different kinds of protein C studied.
In summary, using the ML approach and multivariable analysis, we identified 13 traditional risk factors (creatinine, glucose, age, previous CAD, systolic blood pressure, diabetes, magnesium, insulin, white blood cell count, hemoglobin, diastolic blood pressure, heart rate and BMI) and 6 novel risk factors (fibrinogen, albumin, income, education, phosphorus and protein C).

Causality between the potential risk factors and HF
It would be of clinical value if modifiable risk factors, such as BMI and education, were shown to causally lead to the development of HF.
As for causal factors of HF, hyperhomocysteinemia 30 and elevated lipoprotein(a) levels 32 have been reported to be causally associated with HF. Because these two factors were not among the top 20 variables, we did not analyze their causal estimates with HF. To our surprise, a recent MR study reported that, although CAD was observationally associated with HF, the genetically determined risk of CAD was significantly associated with HF with reduced ejection fraction but not with HF with preserved ejection fraction, 33 indicating that HF with reduced and preserved ejection fraction should be treated differently.
Our MR analysis showed that genetically predicted BMI, CAD and diabetes were positively and causally associated with HF, and that education, as a novel factor, was negatively and causally associated with HF. Among these four factors, education might be the source of some established risk factors, because education reflects socioeconomic circumstances and cognitive level. People with a higher education level usually have more self-management skills to maintain their health and better access to health care. Previous findings also indicated a causal association between low educational attainment and an increased risk of smoking, 34 which is a risk factor for both CAD and diabetes.

Conclusions
BMI, CAD, diabetes and education not only serve as excellent prognostic factors for HF but are also potential therapeutic targets for the treatment and prevention of HF. In other words, these four factors serve as both "markers" and "makers" for HF.

Consent for publication
The authors declare that they consent to the publication of this manuscript.

Availability of data and materials
For more data and materials, please contact Xiao-dong Zhuang (zhuangxd3@mail.sysu.edu.cn).
