Quantile regression forests to identify determinants of stroke: implications for neighborhoods with high prevalence

Background: Stroke exerts a massive burden on U.S. health and the economy. Place-based evidence is increasingly recognized as a critical part of stroke management, but the key determinants of stroke and their underlying effect mechanisms at the neighborhood level have received little attention in the literature. We aim to fill this research gap by developing and applying analytical approaches that address two challenges. First, domain expertise on drivers of neighborhood-level stroke outcomes is limited. Second, commonly used linear regression methods may provide incomplete and biased conclusions. Methods: We created a new neighborhood health data set at the census tract level by pooling information from multiple sources. We developed and applied a machine learning based quantile regression method to uncover crucial neighborhood characteristics for stroke outcomes among vulnerable neighborhoods burdened with a high prevalence of stroke. Results: Neighborhoods with a larger share of non-Hispanic blacks, older adults or people with insufficient sleep tended to have a higher prevalence of stroke, whereas neighborhoods with a higher socio-economic status in terms of income and education had a lower prevalence of stroke. The effects of the five major determinants varied geographically and were significantly stronger among neighborhoods with a high prevalence of stroke. Conclusions: Highly flexible machine learning identifies true drivers of neighborhood cardiovascular health outcomes from wide-ranging information in an agnostic and reproducible way. The identified major determinants and the effect mechanisms can provide important avenues for prioritizing and allocating resources to develop optimal community-level interventions for stroke prevention.


Background
Stroke is the fifth leading cause of death in the United States and a major cause of serious disability for adults. 1 The prevalence of stroke is approximately 3%, and stroke accounts for one of every 20 deaths. With an estimated $45.5 billion in direct and indirect costs, stroke is a chronic disease exerting a massive burden on U.S. health and the economy. Considerable research has been conducted on the risk factors for stroke at the individual level. [2][3][4] These studies have provided accumulating scientific evidence that stroke is associated with modifiable risk factors, such as high blood pressure, obesity and high cholesterol, and with behavioral risk factors such as smoking, sleep deprivation and a sedentary lifestyle. [5][6][7][8] There are also remarkable disparities, with higher stroke incidence and prevalence found among older populations, Blacks and those with low socioeconomic status. 9 In comparison, few studies have examined the mechanisms linking neighborhood characteristics and neighborhood-level prevalence of stroke, despite the growing awareness that individuals' health is closely related to the neighborhood environment they live in. 10,11 The connections between place and health can be seen in the apparent clustering of high prevalence of stroke in the Stroke Belt states and in certain census tracts across major US cities. 12,13 However, there is an insufficient understanding of which neighborhood characteristics drive the neighborhood-level prevalence of stroke and how they do so. Identifying critical predictors is important because it provides an opportunity for policymakers to plan tailored community-based interventions, which have been shown to be more effective and cost-effective than individual-based interventions in reducing the burden of cardiovascular disease and curbing health care costs. 14 This study aims to contribute to neighborhood cardiovascular health research. We address two primary challenges.
First, in public health research, domain expertise is frequently used for variable selection.
However, subject matter expertise on key drivers of neighborhood-level cardiovascular health outcomes and their relative importance is limited. In practice, variable selection is often carried out with a certain degree of arbitrariness (e.g., tests based on a statistical significance level, the order in which variables are entered into a model, the choice of a statistical model). In addition, the relative importance of each variable in relation to the outcome is often unclear.
Second, commonly used linear regression (LR) methods for determining the association between exposures and an outcome assess how the mean of the conditional distribution of the outcome varies with exposures. However, the mean of the neighborhood-level prevalence of stroke may be a poor indicator of central tendency and may convey limited information about how the prevalence of stroke varies across different neighborhoods. The distribution of the neighborhood-level prevalence of stroke is skewed; see Figure 1. The effect of a factor may differ across quantiles. Consequently, using LR methods to estimate only the effects at the mean level may result in incomplete and biased conclusions about the effect.
Research is needed to understand the most important links between neighborhood-level characteristics and a high prevalence of stroke at the neighborhood level, as such knowledge would aid in prioritizing and deploying prevention interventions for the affected communities. Focusing on these vulnerable communities requires an analysis of the tail of a distribution, e.g., the 90th percentile of the distribution of the prevalence of stroke, as it signals "troubled" neighborhoods.
Quantile regression (QR) methods are well suited to estimating how specified quantiles, or percentiles, of the distribution of the outcome variable vary with covariates. They are robust against outliers and are more informative for a skewed distribution than mean-based regression. 15 In this article, we demonstrate the value of a highly flexible machine learning based quantile regression method in studying neighborhood stroke burden.
We first created a large-scale neighborhood health data set by pooling information from multiple sources and considered 24 factors. These factors have been linked to cardiovascular health outcomes at the individual patient level and can be grouped into four major domains: unhealthy behaviors, prevention measures, sociodemographic indicators and environmental measures. 5,6,8,9 We then exploited quantile regression random forests (QRFs), a machine learning modeling technique, to rank the relative importance of the potential predictors, and proposed and implemented an algorithm to identify a set of major determinants for the distribution of neighborhood-level prevalence of stroke. We further compared the performance of our machine learning method with that of regression approaches commonly used in practice. Finally, we quantified the effects of the identified major determinants on stroke prevalence in vulnerable neighborhoods where the stroke prevalence ranked in the 90th percentile, and assessed the bias from mean-based analyses.
Results from this study will provide insights into how to prioritize and incorporate the fabric of neighborhood health and sociodemographic environment into stroke-prevention strategies for communities heavily burdened with stroke.

Methods
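We created a new neighborhood health data set by pooling information from three datasets: the Centers for Disease Control and Prevention (CDC) 500 Cities Project, 16 the 2011-2015 American Community Survey 5-Year Estimates, 17 and the Environmental Protection Agency's (EPA) Environmental Justice Screening (EJSCREEN) database. 19 We did not obtain IRB approval because this ecological study used census tract level data from publicly available data sources.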
We included four types of neighborhood risk factors: i) unhealthy behaviors (e.g., smoking, no leisure-time physical activity, insufficient sleep, and obesity), ii) prevention measures (e.g., lack of health insurance, visits to the dentist, colonoscopy screening, being up to date on a core set of preventive services for males and females), iii) sociodemographic indicators (e.g., age, sex, race/ethnicity, income, and education), and iv) environmental measures (e.g., ambient air pollution). Both the stroke outcome and its predictor variables were measured at the neighborhood level (no person-level data were used). Detailed descriptions of the variables, their data sources and distributions are shown in Table 1. We excluded 1307 census tracts that had missing data on key variables. Among the 1307 census tracts, 975 had missing health measures, 137 had missing socio-demographic measures and 295 had missing environmental data. Our final analytical dataset included 26,697 census tracts.
We first explored a heuristic approach to remove the minimum number of highly correlated predictor variables. Redundant predictors add more complexity to the model than the information they provide.
Using highly correlated predictors in regression models can lead to highly unstable results. The variance inflation factor (VIF) can be used to identify predictors that are affected, but it does not determine which should be removed to resolve the problem. We followed an iterative algorithm to remove the minimum number of variables needed to ensure that all pairwise correlations are below a chosen threshold, which we set to 0.75. 20 Details of the algorithm appear in Figure 2.
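The iterative correlation filter can be illustrated with a short sketch. The exact steps of the paper's algorithm are given in Figure 2 and are not reproduced here; the version below is an assumption modeled on a common greedy heuristic: find the most correlated pair, drop the member of that pair with the larger average absolute correlation with the other variables, and repeat until every pairwise absolute correlation is below the threshold. The function name `drop_correlated` and its arguments are hypothetical, and a greedy rule of this kind is not guaranteed to remove the smallest possible number of variables.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.75):
    """Greedily drop columns of X until all pairwise |correlations| < threshold.

    Illustrative sketch only (assumed heuristic, not the paper's Figure 2).
    X is an (n_samples, n_features) array; names labels its columns.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)  # ignore self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < threshold:
            break  # every remaining pair is below the threshold
        # Of the most correlated pair, drop the variable that is more
        # correlated with the remaining variables on average.
        drop = i if corr[i].mean() >= corr[j].mean() else j
        keep.pop(drop)
    return [names[k] for k in keep]
```

Called on a predictor matrix with two near-duplicate columns and one unrelated column, the function would be expected to return two names, one of which is the unrelated column.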
We then applied a high-performance nonparametric machine learning technique, QRFs, to the reduced data set containing no highly correlated variables. QRFs are a generalization of random forests (RFs). RFs are a machine learning modeling technique that builds an ensemble of regression trees to flexibly capture the relationship between the conditional mean of the response and the predictor variables, and they have gained popularity in medical research for their high prediction accuracy and adaptability. [21][22][23] QRFs use the infrastructure of RFs and give a non-parametric and accurate way of estimating conditional quantiles.
The method has been shown to be consistent and competitive in terms of predictive power. 24 QRFs grow an ensemble of regression trees, employing random node and split-point selection as in the standard RF algorithm. For each node in each tree, however, RFs keep only the mean of the observations that fall into the node, whereas QRFs keep the values of all observations in the node. Thus QRFs can assess the conditional distribution function of the response given the covariates and can provide a fuller picture of the exposure-outcome relationship than mean-based RFs.
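The difference between keeping a leaf mean and keeping all leaf values can be made concrete with a toy sketch. It assumes the usual weighted-empirical-distribution view of quantile regression forests: each training observation receives a weight equal to the average, over trees, of an indicator that it shares a leaf with the query point divided by the size of that leaf, and the conditional quantile is read off the weighted empirical distribution of the responses. The function names and the leaf-id arrays are hypothetical inputs for illustration; this is not the internal code of the "quantregForest" package.

```python
import numpy as np

def leaf_weights(train_leaves, query_leaves):
    """Weights of training points for one query point.

    train_leaves: (n_train, n_trees) leaf id of each training point per tree.
    query_leaves: (n_trees,) leaf id of the query point per tree.
    """
    same = train_leaves == query_leaves   # shares a leaf with the query?
    sizes = same.sum(axis=0)              # size of the query's leaf in each tree
    return (same / sizes).mean(axis=1)    # average over trees; sums to 1

def forest_mean_and_quantile(y, weights, tau):
    """Return the weighted mean (RF-style) and the tau-th weighted quantile (QRF-style)."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)
    cdf = np.cumsum(weights[order])                   # weighted empirical CDF
    idx = np.searchsorted(cdf, tau, side="left")      # first index with CDF >= tau
    idx = min(idx, len(y) - 1)                        # guard against rounding
    return float(np.dot(weights, y)), float(y[order][idx])
```

With a single tree in which four training responses 1, 2, 3 and 10 share the query's leaf, the mean-based prediction is 4 while the 0.9 quantile is 10, which illustrates how a skewed leaf is summarized very differently by the two approaches.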
We developed and implemented a variable selection algorithm based on the variable importance scores generated by QRFs to determine the most critical predictors for the 90th percentile of the neighborhood-level prevalence rate of stroke. The algorithm is described in Figure 2. A similar algorithm was suggested by Dietrich et al. for implementing RFs with survival outcomes, but without assessing the optimal balance between the prediction error and the number of selected variables. 25 The importance score for each variable is computed by randomly permuting the values of that predictor in the out-of-bag (OOB) sample of each tree and measuring the resulting decrease in model accuracy, averaged across the forest. The more important the variable is, the larger the decrease (i.e., importance score) produced by the permutation. We carried out an iterative process for variable selection. At each iteration we removed the least important variable, rebuilt a QRFs model with the remaining variables and recorded the OOB average quantile loss (AQL), until no variable was left. We used AQL to evaluate model performance because the true conditional quantiles of the responses are unobservable. As suggested by Wang et al. and Fang et al., we computed the prediction error of the $\tau$-th conditional quantile by averaging the quantile loss function, $\rho_\tau(y_i - \hat{q}_\tau(x_i))$, over all observations, where $\rho_\tau(u) = u\,(\tau - I(u < 0))$. 26, 27 We then plotted the OOB AQLs against the number of selected variables and set the final model to be the one corresponding to the 'elbow' point, which achieved the best balance between a small OOB AQL and the parsimony of the selected variables.
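As a minimal sketch of the evaluation metric and the elimination loop, the code below implements the standard check (pinball) loss, u * (tau - 1{u < 0}), its average over observations, and a backward-elimination skeleton. The callable `fit_and_score` is a hypothetical placeholder for fitting a QRFs model on a given variable set and returning its OOB AQL together with a dictionary of importance scores; it is an assumption of this illustration, not part of any package, and the choice of the 'elbow' point from the resulting path is left to visual inspection as in the text.

```python
import numpy as np

def quantile_loss(residual, tau):
    """Check (pinball) loss: u * (tau - 1{u < 0}), applied elementwise."""
    residual = np.asarray(residual, dtype=float)
    return residual * (tau - (residual < 0))

def average_quantile_loss(y, q_pred, tau):
    """Average quantile loss (AQL) of predicted tau-th quantiles q_pred."""
    u = np.asarray(y, dtype=float) - np.asarray(q_pred, dtype=float)
    return float(np.mean(quantile_loss(u, tau)))

def backward_elimination(variables, fit_and_score):
    """Repeatedly drop the least important variable, recording the AQL path.

    fit_and_score(vars) -> (aql, {var: importance}) is a hypothetical,
    user-supplied callable standing in for a QRFs fit.
    """
    path = []
    current = list(variables)
    while current:
        aql, importance = fit_and_score(current)
        path.append((list(current), aql))
        least = min(current, key=lambda v: importance[v])
        current.remove(least)
    return path
```

Plotting the recorded AQL values against the number of variables in each entry of `path` gives the kind of curve from which an 'elbow' can be read.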
To empirically evaluate whether our machine learning algorithm selected major determinants, we compared QRFs with classical linear QR including all predictors additively, termed LQR-AllVar, which is frequently used in public health. We compared the methods on AQL and on AQL reduction per predictor, defined as $(\mathrm{AQL}_{\mathrm{null}} - \mathrm{AQL}_{\mathrm{method}}) / \text{Number of Predictors}_{\mathrm{method}}$, where $\mathrm{AQL}_{\mathrm{null}}$ is the AQL from the null (intercept-only) model and $\mathrm{AQL}_{\mathrm{method}}$ is the AQL of each specific method. AQL reduction per predictor answers the question of how much gain is obtained for each predictor variable suggested by a variable selection approach; methods that give a larger AQL reduction per predictor are therefore preferred.
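The metric is simple enough to state as a one-line function. The sketch below follows the definition in the text; the function name is hypothetical, and the numbers in the usage note are made-up values on a normalized scale (null AQL set to 1), not results from the study.

```python
def aql_reduction_per_predictor(aql_null, aql_method, n_predictors):
    """(AQL_null - AQL_method) / number of predictors used by the method."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return (aql_null - aql_method) / n_predictors
```

For illustration, with an assumed null AQL of 1.0, a 5-predictor model that reaches an AQL of 0.3 gains 0.14 per predictor, whereas a 16-predictor model that reaches 0.275 gains about 0.045 per predictor, so the smaller model is favored by this metric even though its total reduction is slightly lower.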
Finally, to "unblackbox" machine learning, we included the major predictors selected by QRFs in a linear QR model to quantify the effects of each predictor on different percentiles of the response, and in an LR model to show how mean-based analysis may provide an incomplete and biased summary of the effect of exposures. All statistical analyses were performed using R version 3.6.1. QRFs models were built using the "quantregForest" R package.

Results
We first applied the iterative algorithm (Steps 1-4 in Figure 2) to identify and remove 8 redundant and highly correlated variables from the 24 candidate predictors. We then built a QRFs model with the remaining 16 predictors and ranked the relative importance of each predictor in relation to the 90th percentile of the neighborhood-level prevalence of stroke; see Figure 3. Sociodemographic indicators related to race, age, income level and education, together with unhealthy sleep behavior, appeared to be the leading neighborhood-level risk factors for high prevalence of stroke, whereas the environmental measures and gender composition were of relatively low importance.
We further identified major determinants of high stroke prevalence using the relative importance scores. Targeting the 90th percentile of the prevalence of stroke at the neighborhood level, our QRFs-based variable selection algorithm (Steps 5-9 in Figure 2) identified five crucial factors that explained the majority of the variability in stroke prevalence among the most vulnerable neighborhoods. They are, in order of relative importance, the share of non-Hispanic blacks, the percentage of the population over 65 years of age, median household income, the percentage of the population with insufficient sleep and the share of the population with higher education. These five predictors correspond to the 'elbow' point in Figure 4, i.e., the variables that remained in the QRFs model at the 11th iteration of our QRFs variable selection algorithm. Together the predictors reduced the AQL from the null model (with no predictors) by 70%, similar to the reduction in AQL (72.5%) delivered by a full model including all 16 available predictors, as suggested in Figure 4 by the curve of OOB AQL gradually reaching a plateau after the 'elbow' point. The AQL reduction per predictor achieved by these five predictors was 0.04, compared with 0.01 for the full model.
An 'unblackboxing' analysis provided interpretable effects of the identified major determinants on the high prevalence of stroke at the neighborhood level. To demonstrate that a risk factor may have different effects on the tails of the outcome distribution than on the outcome on average, we examined the effects on the 90th (upper tail), 50th (median) and 10th (lower tail) percentiles alongside the mean effects. Figure 6 displays the point estimates and 95% confidence intervals for each of the five major factors. First, larger shares of non-Hispanic blacks, older residents over 65 years of age and people who have insufficient sleep were positively associated with the 90th, 50th and 10th percentiles of the neighborhood-level prevalence of stroke. Median household income and the fraction of adults with higher education were inversely associated with all three percentiles. Second, all five major factors disproportionately affected different parts of the outcome distribution. The fractions of non-Hispanic blacks, older adults, highly educated residents and people with insufficient sleep had significantly larger (absolute) effects on the upper tail than on the lower tail. Third, estimates from the mean-based LR analysis rarely covered the QR estimates. These findings suggest that analyses based on the premise that the prevalence of stroke is uniformly or symmetrically distributed across the nation would lead to an incomplete and biased summary of the effect of exposures.
A geographical comparison of the effects on the 90th and 10th percentiles appears in Figure 7. Taking New York City as an example, Manhattan and the Bronx sit at opposite tails of the stroke prevalence distribution (the lower tail (10th percentile) and the upper tail (90th percentile), respectively). The effects of major factors such as the prevalence of insufficient sleep and the age structure are substantially different between these two areas (e.g., the confidence intervals of the effect estimates do not overlap), underscoring the heightened influence of insufficient sleep and an older population in the Bronx relative to Manhattan, which in turn can provide guidance for developing targeted intervention programs.

Discussion
In this study, we developed and applied a robust and reproducible machine learning based approach to identify major factors for the tails of the distribution of a neighborhood-level cardiovascular health outcome, the prevalence of stroke, when that distribution was not normal. We also investigated the underlying effect mechanisms of the major factors, leveraging a high-performance nonparametric quantile regression technique, QRFs. We exploited a large-scale dataset with wide-ranging information, from unhealthy behaviors and prevention measures to sociodemographic status and environmental factors, pooled from more than 20,000 census tracts in 500 cities of the US.
Our approach identified a parsimonious set of predictors for quantiles of the neighborhood-level prevalence of stroke, shedding light on the true drivers of high prevalence of stroke at the neighborhood level. The identified neighborhood characteristics were in good agreement with known individual-level risk factors. Neighborhoods with a larger share of non-Hispanic blacks, older adults or people who have insufficient sleep tended to have a higher prevalence of stroke, whereas neighborhoods with a higher socio-economic status in terms of income and education had a lower prevalence of stroke. All five factors disproportionately affected the prevalence of stroke among neighborhoods with different stroke prevalence profiles. The effects on the 90th percentile (upper tail) were significantly larger than the effects on the 10th percentile (lower tail), and larger than the effects at the mean level. Using mean-based LR methods would have led to a limited and biased conclusion. Our approach offered a "higher-resolution" analysis that can be used to expand and deepen the existing quantitative evidence on stroke prevalence and its risk factors.
Results from our study may help inform public health policies. Establishing key neighborhood characteristics for high neighborhood-level prevalence of stroke allows policymakers to prioritize communities burdened with a high prevalence of stroke when developing and customizing community-based intervention programs to improve cardiovascular health outcomes. For example, resources may be allocated to the boroughs of New York City that have a high prevalence of stroke (e.g., the Bronx) to develop community-level educational interventions that address factors promoting or interfering with sleep, such as exercise, bedroom ambience and sleep disorders. 28 As the share of non-Hispanic blacks and the older population structure are two key components that may drive up the prevalence of stroke, it is critical for communities to make efforts to address avoidable inequalities and to eliminate health and health care disparities. 29
Identifying the most influential and true determinants from wide-ranging information is challenging, especially when the number of relevant predictors is sparse relative to the total number of available predictors and relationships between predictors and outcomes may be nonlinear. The presence of skewness in the outcome adds to the complexity. Previous studies that evaluated the relationships between neighborhood characteristics and cardiovascular health outcomes were typically conducted at the individual level and have limitations in their analytical approaches. 30 The skewness of the outcome is typically ignored because mean-based regression analyses are commonly used. Predictors are often selected a priori or using test procedures based on some arbitrary threshold value. As a result, these studies may not provide specific insights into precise drivers for diverse neighborhoods with varied prevalence of cardiovascular diseases.
Our method is capable of specifying the effect of a predictor on the tail of the outcome distribution in the presence of skewness, an effect that is missed by other methods. We compared our approach with classical QR and classical LR. Our approach achieved nearly the same prediction error reduction with only five predictors as the full QR model. In comparison, implementation of the two-standard-deviation approach within the framework of QRFs proposed in Fang et al. 27 selected only one variable, failing to capture many important predictors. Our "higher-resolution" analysis showed that the major determinants disproportionately affected neighborhood-level stroke outcomes, underscoring the larger effects in the areas with a higher prevalence. In conjunction with the ranking of variable importance, our method can provide valuable guidance for targeted community-based interventions.
There are several limitations in this study. First, some behavioral and health outcome measures available in the 500 Cities Data were estimated by the CDC using a small area estimation approach. Although these estimated measures may not be as accurate as directly measured statistics, they provide the best available data for these small areas and the approach has been well validated. 31 Second, we could not make causal claims about the relationship between neighborhood characteristics and health outcomes because of the cross-sectional data and the ecological study design. However, our results identified important factors of neighborhood cardiovascular health and can potentially stimulate future causal inference research in this area. Finally, given the complexity of neighborhood cardiovascular health, there could be other important variables that were not included in our study, either unmeasured or not collected in our data. Despite the potential omitted variables, by combining data from three different large datasets and using an innovative machine learning approach, we believe the scope and depth of our analysis can provide important insights for policymaking and lead to more innovative investigations in the area of neighborhood population health.

Conclusions
Highly flexible machine learning identifies drivers of neighborhood cardiovascular health outcomes from wide-ranging information in an agnostic and reproducible way. Quantile regression based approaches provide an opportunity to deepen and expand the quantitative evidence gained from mean-based analyses. The identified major determinants and the effect mechanisms can provide important avenues for prioritizing and allocating resources to develop optimal community-level interventions for stroke prevention.

Declarations
Ethics approval and consent to participate
This study used census tract level data (no patient-level data were used) from publicly available data sources. Ethical approval is not applicable for this study.

Consent for publication
Not applicable

Availability of data and materials
We used three datasets during the current study. CDC's 500 Cities Project 2017 data release on 28,004 census tracts is publicly available on its website, https://chronicdata.cdc.gov/browse?category=500+Cities. 16 The 2011-2015 American Community Survey 5-Year Estimates are publicly available on the website, https://www.census.gov/data/developers/data-sets/acs-5year.html. 17 The EPA's Environmental Justice Screening (EJSCREEN) database is also publicly available on the website, https://www.epa.gov/ejscreen/download-ejscreen-data. 19

Competing interests

Figure 2
Variable selection algorithm using Quantile Regression Forests.

Figure 3
Importance ranking of predictors for the neighborhood-level prevalence of stroke based on 10,000 trained trees for the QRFs. Importance is measured as follows. For each tree, the prediction performance (i.e., mean squared error) on the OOB samples is recorded. Then the values of each predictor in the OOB samples are randomly permuted, and the prediction performance based on the shuffled data is recorded. The importance score of that variable is measured as the decrease in prediction performance after permutation, averaged across all trees. QRFs = quantile regression forests; OOB = out-of-bag.

Comparison of QRFs and LQR-AllVar based on AQL, number of predictors selected and AQL reduction per predictor (larger AQL reduction is better). QRFs = quantile regression forests; LQR-AllVar = linear quantile regression including all variables; AQL = average quantile loss.

Figure 6
The effects of the five major determinants on stroke varied across the 90th, 50th and 10th quantiles of the distribution of the neighborhood-level prevalence of stroke, in contrast to the uniform effect from the mean-based LR analysis. The height of the bars corresponds to estimated point effects; error bars represent the associated 95% confidence intervals. Horizontal grey solid and dotted lines represent the effects and confidence intervals on the mean responses. Effect estimates represent changes in the τ-th quantile (bars) or the mean (horizontal grey lines) of the prevalence of stroke per 10% increase in NON_HIS_BLACK, AGE65_OVER, INSUF_SLEEP and COLLEGE_HIGHER and per $100,000 increase in MED_INCOME. LR = linear regression.