Integrating geo-linked electronic health records with neighborhood data to identify place-based barriers for diabetes management in a public healthcare system

We examined geographic clustering of poor diabetes control and corresponding structural and social determinants of health within San Francisco’s safety-net healthcare system. Specifically, we used EHR data to identify individual patient-level addresses and clinical outcomes for diabetes, using hot spot analysis to determine significant hot and cold clusters of diabetes control throughout San Francisco. In addition, by linking patient addresses to public datasets, we described the neighborhood-level conditions associated with hot spots.


Conclusions
We present methodological advantages of hotspot analyses utilizing EHR data, as well as the value of combining these approaches with additional public population census data about neighborhood conditions. Moving forward, healthcare systems must partner with public health agencies to utilize EHR data as a means to evaluate structural interventions and eventually target investment in place-based health-enabling resources.

Background
It is well known that fundamental causes of place-based health disparities are rooted in structural racism and other structural determinants of health. [1][2][3] Numerous factors related to where people live (access to healthy food, [4][5][6][7] access to spaces for recreation and walkability, [5,8,9] built environment, [10,11] neighborhood safety, [12] and neighborhood socioeconomic status [13,14]) impact diabetes-related health outcomes. Furthermore, these fundamental place-based factors also directly influence multiple other domains of health, from social and interpersonal interactions to individual health behaviors. [15] Prior social epidemiological studies have utilized various methods to identify place-based disparities in diabetes prevalence and management, including county-, census tract-, or neighborhood-level prevalence measures, and corresponding hot spot analysis of prevalence. [16][17][18][19] While these studies contribute to an evidence base for prioritizing place-based disparities in diabetes, there is an opportunity to leverage these approaches using clinical data captured in electronic health records (EHRs), including objective health outcomes based on laboratory measures, linked to public population-level datasets that include information on neighborhood conditions. [20]
Beyond facilitating clinical data linkages and affording nuanced objective measures of health outcomes, healthcare systems themselves are increasingly prioritizing social determinants of health (SDOH) and considering new ways to address upstream and structural factors. [21][22][23] Financial incentive structures and value-based payment models have emphasized population and public health. [24,25] Often, this focus on SDOH has increased screening and referral for social resources for patient populations, along with documentation of these actions within EHRs. [23,26,27] However, most health systems have not effectively leveraged EHRs to understand the impact of place, specifically patient residence, on patient health outcomes. [23] Health systems have a specific opportunity to utilize geocoded EHR data at the level of individual patient address to understand the implications of patient neighborhood factors as they relate to healthcare utilization and health outcomes. Therefore, we linked EHR and publicly available population data on neighborhood conditions to examine geographic clustering of poor glycemic control and corresponding place-based structural and social drivers of poor glycemic control within the public safety-net healthcare delivery system in San Francisco. Our ultimate goal was to enable data-driven approaches for future place-based interventions for diabetes management.

Methods
Study Population, Study Setting and Clinical Outcome. The San Francisco Health Network (SFHN), including Zuckerberg San Francisco General Hospital, is an integrated safety-net healthcare system serving publicly insured and underinsured patients. We used SFHN EHR data from Jan 1, 2013 to Dec 31, 2017 to identify our study population and obtain individual-level patient characteristics including sociodemographic (e.g., race/ethnicity, insurance type) and clinical information (e.g., diabetes control). Our study population included SFHN patients who had an outpatient visit in 2016-2017 and at least one additional outpatient visit within the prior two years, an ICD-9-CM or ICD-10-CM diagnosis of diabetes, at least one HbA1c lab result subsequent to diagnosis, and a residential address in San Francisco. [28] We defined patients as having poor diabetes control if they had a glycosylated hemoglobin (HbA1c) level greater than 9% at their most recent lab test during the study period. We only considered HbA1c values recorded after a diabetes diagnosis, to ensure that the lab result reflected control of an active diabetes diagnosis rather than a result that potentially led to the diagnosis, prior to clinical treatment.
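The outcome definition above can be illustrated with a minimal Python sketch. The function name, the record layout (a list of date and value pairs), and the choice to treat "subsequent to diagnosis" as strictly after the diagnosis date are assumptions made for illustration; they are not taken from the study's actual extraction code.

```python
from datetime import date

# The 9% cutoff comes from the text; the constant name is hypothetical.
POOR_CONTROL_THRESHOLD = 9.0


def has_poor_control(diagnosis_date, hba1c_results, study_end):
    """Return True/False for poor control, or None if no eligible result exists.

    hba1c_results is a list of (result_date, hba1c_percent) tuples.
    Assumption: only results strictly after the diagnosis date and on or
    before study_end are eligible.
    """
    eligible = [
        (result_date, value)
        for result_date, value in hba1c_results
        if diagnosis_date < result_date <= study_end
    ]
    if not eligible:
        return None  # such a patient would not meet the inclusion criteria
    # Pick the most recent eligible result by date.
    _, latest_value = max(eligible, key=lambda record: record[0])
    return latest_value > POOR_CONTROL_THRESHOLD


# Hypothetical patient: diagnosed in 2015, three lab results on file.
labs = [(date(2014, 3, 1), 10.2), (date(2016, 5, 10), 9.4), (date(2017, 8, 2), 8.1)]
latest_flag = has_poor_control(date(2015, 1, 15), labs, date(2017, 12, 31))
```

In this hypothetical example the 2014 result is ignored because it precedes the diagnosis, and the most recent eligible value (8.1%) is not above the 9% threshold.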
Ethics and consent to participate. This study was approved by the University of California, San Francisco Institutional Review Board. IRB approval allows for use of clinical patient health data for analysis. Human subjects were not involved in this study and therefore written or verbal informed consent was not required.
Neighborhood characteristics. Structural and social determinants of health across several domains (racial/ethnic and language composition, socioeconomic context including poverty and unemployment, food environment and access, housing) were compared across patient cluster groupings. Data for neighborhood-level characteristics were downloaded from the UCSF Health Atlas, an interactive map with a catalog of characteristics that illustrate social, economic, and built environments in California. [29] Specific data sources for each characteristic are described below.
1. Racial/ethnic and language composition of neighborhoods. [3] At the census tract level, we extracted percent White, percent Black, percent Asian, percent Native Hawaiian or Pacific Islander, percent Native American (alone or in combination with other races), and percent Latinx residents, sourced from 2013-2017 American Community Survey (ACS) data. [30] We also measured limited English proficiency, defined as the percent of the population at the census tract level that speaks English less than "very well," and percent of the population that speaks a language other than English at home, also sourced from ACS. [30]
2. Socioeconomic Context and Neighborhood Built Environment. To measure several additional indicators of structural and social determinants of health at the neighborhood level, we examined:
a. Poverty and unemployment. Percent poverty from ACS data was defined as the percent of the population with income below 100% of the federal poverty level in the past 12 months, [30] which we included along with the Housing and Urban Development's Extremely Low-Income (ELI) measure, defined as below 30% of the area median income (relevant for high-income locations, such as San Francisco). [31] We also included percent unemployment from ACS. [30]
b. Socioeconomic Indices. We used a composite measure of the Healthy Places Index (combining economic, education, housing, healthcare access, neighborhood, clean environment, transportation, and social factors), where a higher percentile indicates less healthy neighborhood conditions. [32]
c. Food environment and access. Percent low-income, low-food access tracts was obtained from the US Department of Agriculture Food Access Research Atlas, [33] defined as low-income tracts where at least 500 people or one-third of the population live more than half a mile away from the nearest supermarket. Percent of the population with Supplemental Nutrition Assistance Program (SNAP) benefits in the past 12 months was sourced from ACS. [30] Finally, food insecurity measures were census-tract level modeled estimates of percent food insecurity sourced from Feeding America. [29]
d. Housing. Housing data were sourced from HUD Comprehensive Housing Affordability Strategy Data. [31] Renter-occupied households were defined as the percent of housing units within a census tract that are lived in by a renter. Severe rent burden was defined as the percentage of renter-occupied households in a census tract for whom housing costs are over 50% of household income.
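Two of the threshold-based definitions above (low-income, low-food access tracts and severe rent burden) can be written out as a short Python sketch. These indicators were downloaded precomputed from public sources, so the functions below only restate the stated definitions; the function names and input fields are hypothetical, and the low-income status of a tract is taken as a given input because its definition is not described here.

```python
def is_low_income_low_access(low_income, population, pop_beyond_half_mile):
    """Flag a tract as low-income and low-food-access per the stated definition.

    low_income: tract-level flag supplied by the data source (assumed input).
    pop_beyond_half_mile: residents living more than half a mile from the
    nearest supermarket.
    """
    if not low_income or population <= 0:
        return False
    # At least 500 people OR at least one-third of the tract population.
    return pop_beyond_half_mile >= 500 or pop_beyond_half_mile >= population / 3


def percent_severe_rent_burden(renter_households, households_over_half_income):
    """Percent of renter-occupied households paying over 50% of income on housing."""
    if renter_households == 0:
        return None  # undefined for tracts with no renter-occupied households
    return 100.0 * households_over_half_income / renter_households
```

For example, a hypothetical low-income tract of 900 residents with 320 living beyond half a mile qualifies through the one-third criterion even though it falls short of 500 people.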
Statistical Analyses. We conducted descriptive analyses of patient characteristics by uncontrolled diabetes, overall and by sociodemographic characteristics.
Geospatial Analyses. We geocoded residential addresses of patients in our study population, using patients' most recently recorded address in the EHR as of June 13, 2019. We used ArcGIS Pro Version 2.6 (Environmental Systems Research Institute, Inc., Redlands, CA, USA) for all geospatial analysis.
Census Tract Prevalence. We calculated the prevalence of poor glycemic control among diabetic patients in our study population by census tract and categorized census tracts into tertiles of high (18% to 47.1%), medium (11.9% to 17.9%), or low prevalence (less than 11.8%). Rates based on small case counts are less reliable; census tracts with fewer than 10 patients were therefore excluded from rate calculations.
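The tract-level calculation can be sketched in Python as follows. This is an illustrative outline only: the input layout and function names are hypothetical, and the rank-based tertile split (with ties broken by sort order) is an assumption, since the exact tertile method is not described.

```python
from collections import defaultdict

MIN_PATIENTS = 10  # tracts with fewer patients are excluded, as stated in the text


def tract_prevalence(patients):
    """patients: iterable of (tract_id, poor_control) pairs, poor_control a bool.

    Returns {tract_id: percent with poor control} for tracts with at least
    MIN_PATIENTS patients.
    """
    counts = defaultdict(lambda: [0, 0])  # tract -> [poor count, total count]
    for tract_id, poor in patients:
        counts[tract_id][0] += int(poor)
        counts[tract_id][1] += 1
    return {
        tract: 100.0 * poor / total
        for tract, (poor, total) in counts.items()
        if total >= MIN_PATIENTS
    }


def assign_tertiles(prevalence):
    """Rank tracts by prevalence and split them into three equal-sized groups."""
    ranked = sorted(prevalence, key=prevalence.get)
    n = len(ranked)
    labels = {}
    for rank, tract in enumerate(ranked):
        if rank < n / 3:
            labels[tract] = "low"
        elif rank < 2 * n / 3:
            labels[tract] = "medium"
        else:
            labels[tract] = "high"
    return labels
```

Applying the minimum-count filter before ranking keeps unstable rates from small tracts out of the tertile boundaries.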
Hot Spot Analysis. We then conducted a hot spot analysis to identify hot and cold spots of poor diabetes control in San Francisco. We used the Getis-Ord Gi* statistic to assess randomness of the spatial distribution of high (poor glycemic control) and low (good glycemic control) values using the "Hot Spot Analysis (Getis-Ord Gi*)" tool in ArcGIS Pro 2.6.
The tool defines a "neighborhood" for each patient as the set of patients within a fixed distance band.
The fixed distance band is determined using an incremental spatial autocorrelation test to assess the likelihood that spatial distribution patterns of high and low values are random. Using the Global Moran's I statistic and z-scores generated from the test for a range of fixed distances, we identified the distance band at which spatial patterns of diabetes control were most likely to be clustered, that is, the band with the greatest likelihood of a non-random spatial distribution. For the incremental autocorrelation test, we used the Euclidean distance method to examine 15 distance bands each 20 meters (about 0.01 miles) apart over a range of 500-800 meters (0.31-0.49 miles). We identified the distance band at 620 meters (0.385 miles) as having the maximum spatial autocorrelation, with a z-score of 5.83 and p-value <0.001.
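As a rough illustration of the statistic behind the incremental test, the Python sketch below computes Global Moran's I with binary fixed-distance weights for a list of candidate bands. It is not the ArcGIS implementation: it returns only the I statistic (the tool additionally reports a z-score and p-value for each band, and the band is chosen from the z-score peak), and the assumption that the first band sits at 500 meters is ours.

```python
import math


def morans_i(points, values, band):
    """Global Moran's I with binary weights: w_ij = 1 if i != j and d_ij <= band.

    points: list of (x, y) coordinates in meters (projected, Euclidean distance).
    values: one numeric value per point (e.g., 1 = poor control, 0 = otherwise).
    """
    n = len(points)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = 0.0
    s0 = 0.0  # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= band:
                num += dev[i] * dev[j]
                s0 += 1.0
    if s0 == 0 or denom == 0:
        return None  # no neighbor pairs, or no variation in the values
    return (n / s0) * (num / denom)


def candidate_bands(start=500, step=20, count=15):
    """Distance bands in meters; starting the sequence at 500 m is an assumption."""
    return [start + step * k for k in range(count)]


def morans_i_by_band(points, values):
    """Evaluate Moran's I at every candidate band."""
    return {band: morans_i(points, values, band) for band in candidate_bands()}
```

The pairwise loop is quadratic in the number of points, which is fine for a demonstration but would call for a spatial index with a study population of this size.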
The Hot Spot Analysis (Getis-Ord Gi*) tool compares the prevalence of poor diabetes control within each patient's "neighborhood" with the value expected from all patients in the study population, and calculates a z-score and p-value for each patient. A patient is classified as a hot spot if there are statistically significantly more high values in the patient's "neighborhood" than in the full study area (San Francisco). A patient is classified as a cold spot if there are statistically significantly more low values in the patient's "neighborhood" than in the full study area. A high positive z-score indicated clustering of higher levels of poor diabetes control and a low negative z-score indicated clustering of lower levels of poor diabetes control, with the magnitude of the z-score indicating the intensity of clustering. We categorized the z-scores into 3 categories (hot spots, cold spots, and not statistically significant) based on a 90% confidence level.
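For readers unfamiliar with the statistic, the following Python sketch computes the standard Getis-Ord Gi* z-score with binary fixed-distance weights and applies a 90% two-sided cutoff. It is a simplified stand-in for the ArcGIS tool, not a reproduction of it: the function names are hypothetical, the cutoff of 1.645 is the usual normal critical value, and any additional options the tool may apply (such as multiple-testing corrections) are not represented.

```python
import math

Z_90 = 1.645  # two-sided normal critical value for a 90% confidence level


def gi_star(points, values, band):
    """Return a Gi* z-score for each point using binary fixed-distance weights."""
    n = len(points)
    mean = sum(values) / n
    # Population standard deviation; max() guards against tiny negative rounding.
    s = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean * mean))
    z_scores = []
    for i in range(n):
        # The focal point is its own neighbor (distance 0): the "*" in Gi*.
        neighbors = [j for j in range(n) if math.dist(points[i], points[j]) <= band]
        w_sum = len(neighbors)  # equals both sum(w_ij) and sum(w_ij^2) for 0/1 weights
        local_sum = sum(values[j] for j in neighbors)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z_scores.append((local_sum - mean * w_sum) / denom if denom > 0 else 0.0)
    return z_scores


def classify(z):
    """Map a z-score to the three categories used in the text."""
    if z >= Z_90:
        return "hot spot"
    if z <= -Z_90:
        return "cold spot"
    return "not significant"
```

With values coded 1 for poor control and 0 otherwise, the local sum in the numerator is simply the number of poorly controlled patients within the distance band, which is compared against the count expected from the study-wide prevalence.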
To mask point locations of patient residences and protect patient privacy, we used inverse distance weighting interpolation to visualize geographic areas of hot and cold spots.
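Inverse distance weighting estimates a value at an unsampled location as a weighted average of nearby observations, with weights that shrink as distance grows. The Python sketch below shows the basic form; the power parameter of 2 and the assumption that the interpolated quantity is the per-patient z-score are ours, as neither is specified in the text.

```python
import math


def idw(query, points, values, power=2.0):
    """Inverse distance weighted estimate at `query`.

    points: list of (x, y) sample locations; values: one value per location.
    power: distance exponent (2.0 is a common default, assumed here).
    """
    num = 0.0
    den = 0.0
    for point, value in zip(points, values):
        d = math.dist(query, point)
        if d == 0:
            return value  # query coincides with a sample location
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den
```

Evaluating such a function over a regular grid yields a continuous surface, so the map conveys areas of clustering without plotting individual residences.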
Associations between patient diabetes control clustering and census tract characteristics. Finally, we summarized structural and social determinants of health indicators by patient cluster groupings (hot spots, cold spots, and not significant) to compare values and describe observed differences across these cluster groups. We assigned the census tract values for all characteristics to each patient and then averaged the census tract values of each characteristic for patients within each cluster classification (hot spot, cold spot, and not significant).
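The summarization step amounts to a patient-weighted average of tract-level values within each cluster group, which can be sketched in Python as follows. The data layout and function name are hypothetical, and skipping patients whose tract has no value for a characteristic is an assumption about missing-data handling.

```python
from collections import defaultdict


def mean_by_cluster(patients, tract_values):
    """Average a tract-level characteristic over patients in each cluster group.

    patients: iterable of (tract_id, cluster_label) pairs, one per patient.
    tract_values: {tract_id: value of one neighborhood characteristic}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tract_id, cluster in patients:
        if tract_id not in tract_values:
            continue  # assumption: patients in tracts without data are skipped
        sums[cluster] += tract_values[tract_id]
        counts[cluster] += 1
    return {cluster: sums[cluster] / counts[cluster] for cluster in sums}
```

Because each patient contributes their tract's value once, tracts with many patients in a given cluster weigh more heavily in that cluster's average than sparsely represented tracts.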

Results
Our study sample included 11,333 SFHN patients with diabetes living in San Francisco. Average patient age was 59.6 years and 51.1% were women.
Hotspot analysis findings. Within the study population, 18.8% of patients (n = 2,126) were clustered in hot spots (high levels of uncontrolled diabetes), 12.8% (n = 1,448) were clustered in cold spots (low levels of uncontrolled diabetes), and 68.5% (n = 7,759) were not significantly spatially clustered (Figure 2). Hot spots were primarily within the southeast quadrant of San Francisco, whereas cold spots were primarily within the western and northeastern neighborhoods of San Francisco. [34] As expected, very few census tracts had both hot and cold spot patients.
Census tracts of residence for patients in hot spot clusters, compared to patients in cold spot clusters, had a higher prevalence of residents receiving SNAP benefits (15.2% vs. 5.1%), residents experiencing food insecurity (19.9% vs. 17.3%), and low-income, low-food access tracts (32.1% vs. 8.6%).
Census tracts of residence for patients in hot spot clusters, compared to patients in cold spot clusters, had a higher prevalence of renter-occupied households (67.7% vs. 55.9%) and households experiencing high rent burden (16.1% vs. 13.3%).
As indicators of overall neighborhood socioeconomic status, census tracts of residence for patients in hot spot clusters, compared to patients in cold spot clusters, had a higher Healthy Places Index percentile score (55.9% vs. 32.1%), indicating less healthy neighborhood conditions.

Discussion
Our study supports the use of geospatial analysis to identify place-based diabetes disparities in a safety-net healthcare system. Hot spot analysis showed that nearly one-fifth of diabetes patients in a safety-net healthcare system were clustered in hot spots, and hot spots were primarily located within the neighborhoods of San Francisco known to be more underserved in terms of social service investment prioritization from the city and county government.
Based on the public availability of neighborhood-level data at the census tract level, we also compared these hot spots to census tract characteristics of patients. This revealed associations between our geospatial analysis and neighborhood variables reflecting structural racism and its health-harming impacts on glycemic control, such as the racial/ethnic composition of neighborhoods and multiple other neighborhood socioeconomic indicators. While many research studies exclusively use either geospatially-driven associations or place-based analyses using census tracts or other fixed boundaries, we feel that it is important to combine these methods, especially since different scientific disciplines and different leaders (e.g., healthcare system executives, policymakers) have different sets of data available to them as they make critical decisions about interventions and programming.
A strength of our study was our ability to link EHR data with information about place-based social determinants of health. Most publicly available data sources do not typically include robust clinical data on uncontrolled diabetes, given that survey data often rely on self-reported diagnosis at a cross-sectional time point and are limited to associations using pre-defined geographic boundaries. Thus, 1) the ability to use patient-level lab test results of clinical control verified within the EHR is a significant improvement on outcome ascertainment, and 2) our primary hotspot analysis was not limited from the outset by administrative geographic boundaries such as census tract definitions. Moving forward, healthcare systems must be at the table when examining place-based disparities in their regions and catchment areas, particularly in the U.S., where the clinical data stored in EHRs can be fragmented across many different healthcare institutions and not available for public health planning or programming. More specifically, with the increased use of clinical outcomes, public datasets, and hot spot analysis in future studies, we may be better able to bridge the health and healthcare priorities across healthcare systems, public health departments, and city agencies to align and guide investment in local health-enabling resources and programs.
Beyond working with other public health partners in their region, healthcare systems can also leverage EHRs to integrate population health informatics to conduct their own programs and interventions, including those at the structural or neighborhood level. [21,35,36] Population-level data infrastructure within healthcare systems can be a tool for surveillance of health disparities, programmatic decision-making, integration of health outcomes and social determinants of health, and evaluation of interventions. [21,23,26,35,36,37,38] Furthermore, this work is underscored by new reimbursement approaches prioritizing value and incentivizing health plans and systems to more holistically address social and medical needs, reduce healthcare disparities, and build infrastructural and organizational capacity for structural change. [24,25,27]
Limitations. Our study population comprised patients from only one healthcare system in San Francisco, yet we chose to focus on the public delivery system, which serves the majority of underserved patients in our city. In addition, the results of a spatial analysis do not capture the lived experiences of residents with uncontrolled diabetes, nor do they reveal what resources, support, and changes patients with uncontrolled diabetes want and need in order to improve their health. [39] Integration of data-driven approaches with patient and community-centered partnerships presents an important opportunity for healthcare systems to meaningfully invest in place-based health interventions and evaluate programs by tracking geospatial health outcomes. Finally, while we compared our geospatial hotspot analysis to census tract level prevalence of key neighborhood characteristics, this comparison was descriptive and intended to show the intersection of these methods.

Conclusions
Combining methods, such as geospatial analyses with neighborhood-level associations, as well as datasets, such as EHR-based demographics and clinical outcomes with publicly available neighborhood indicators, can provide multiple lenses for place-based and structural health and healthcare disparities research. By leveraging these datasets in future strategic and collaborative local partnerships, healthcare systems can be better positioned to provide structural interventions that align with ongoing public health programs.
Human subjects were not involved in this study and therefore written or verbal informed consent was not required. All methods were carried out in accordance with relevant guidelines and regulations.

Abbreviations
Consent for publication: Not applicable.
Availability of data and materials: The electronic health record data that support the findings of this study are not publicly available because they contain protected health information whose release could compromise research participant privacy and violate HIPAA regulations. Publicly available population-level data sources accessed for this study are collated and available from: www.healthatlas.ucsf.edu
Conflicting and Competing Interests: None to report.
Census-tract level prevalence of poor glycemic control among diabetic patients in the San Francisco Health Network between 2013-2017 categorized into tertiles (excluding census tracts with fewer than 10 patients).