Application of Spatial Analysis for Electronic Health Records: Characterizing Patient Phenotypes and Emerging Trends

Background: Electronic health records (EHR) commonly contain patient addresses that provide valuable data for geocoding and spatial analysis, enabling more comprehensive descriptions of individual patients for clinical purposes. Despite the widespread use of EHR in clinical decision support and interventions, no systematic review has examined the extent to which spatial analysis is used to characterize patient phenotypes. Objective: This study reviews advanced spatial analyses that employed individual-level health data from EHR within the US to characterize patient phenotypes. Methods: We systematically evaluated English-language peer-reviewed articles from PubMed/MEDLINE, Scopus, Web of Science, and Google Scholar databases from inception to August 20, 2023, without imposing constraints on time, study design, or specific health domains. Results: Only 49 articles met the eligibility criteria. These articles utilized diverse spatial methods, with a predominant focus on clustering techniques, while spatiotemporal analysis (frequentist and Bayesian) and modeling were relatively underexplored. A noteworthy surge (n = 42, 85.7%) in publications was observed post-2017. The publications investigated a variety of adult and pediatric clinical areas, including infectious disease, endocrinology, and cardiology, using phenotypes defined over a range of data domains, such as demographics, diagnoses, and visits. The primary health outcomes investigated were asthma, hypertension, and diabetes. Notably, patient phenotypes involving genomics, imaging, and notes were rarely utilized. Conclusions: This review underscores the growing interest in spatial analysis of EHR-derived data and highlights knowledge gaps in clinical health, phenotype domains, and spatial methodologies. Additionally, this review proposes guidelines for harnessing the potential of spatial analysis to enhance the context of individual patients for future clinical decision support.


Introduction
Electronic health records (EHR) have significantly enriched clinical decision support by providing relatively cost-effective, time-efficient, and convenient sources of a large population of patient records [1,2]. Because EHR often contain patient addresses, spatial analysis can add value through high-resolution geocoding. The simplest of such analyses may be mapping, which can promote a better understanding of health disparities. Further, patient geocoding can link external data such as environmental, demographic, and socio-economic factors for more refined patient phenotyping and a more profound understanding of patient exposures for targeted interventions [3].
The possibilities for applying spatial analysis to individual-level EHR-derived data extend beyond geocoding, basic mapping, or external data linkage. For instance, spatial network analysis examines proximity to the sources of pollution [4], measures accessibility to healthcare facilities [5], and optimizes resource allocations to mitigate health disparities [6]. Spatial clustering pinpoints statistically significant spatial and spatiotemporal hotspots and cold spots [7], especially when considering longitudinal EHR data. Moreover, spatial and spatiotemporal modeling can identify localized patterns, trends, and relationships within a specific region [8,9]; however, ignoring spatial dependence in modeling can bias parameter estimates [9-11]. Identifying underserved communities through spatial analysis can enhance clinical decision support to implement targeted interventions such as screening, vaccination, or health education campaigns.
While spatial methodologies have the potential to better describe the context of individual patients in biomedical studies, there is a need for improvement in their utilization to derive meaningful insights. To accurately address medical conditions, identify a disease in a patient, and scale that to cohorts of patients, phenotyping is required [12]. Phenotypes are a combination of observable traits, symptoms, and characteristics. They can contain inclusion and exclusion criteria (e.g., diagnoses, procedures, laboratory reports, and medications) and can be used to recruit patients who fit the necessary criteria for clinical trials.
A prior systematic review employed spatially linked EHR data to investigate the effects of social, physical, and built environments on health outcomes [13]. Another study highlighted the need to integrate spatial data related to individual patients into healthcare decision-making and practice [14]. Nonetheless, this is the first comprehensive study that systematically reviews US-based studies that applied spatial analysis to EHR-derived data to characterize patient phenotypes for clinical decision support and interventions. This review collates and synthesizes existing literature that employed individual-level health data from EHR in conjunction with advanced spatial analyses and patient phenotyping. Thus, the main objectives of this review are:
1. To evaluate the degree to which advanced spatial methods are currently being utilized with individual-level data sourced from EHR in the US;
2. To identify areas of spatial analyses most applicable to biomedical studies;
3. To categorize publications concerning their biomedical and clinical areas and the specific patient phenotypes they target; and
4. To highlight knowledge gaps and propose guidelines for harnessing the potential of spatial analysis to enhance the context of individual-level data sourced from EHR for future biomedical studies.

Methods
This systematic review was performed using the protocols outlined by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to identify the articles that satisfy the eligibility criteria for subsequent data extraction and synthesis.

Data Source
A comprehensive search for peer-reviewed articles was carried out by screening abstracts and titles within the PubMed/MEDLINE, Scopus, and Web of Science databases using the search terms in Table 1. The search was conducted on August 29, 2023, without limitations on time, study design, or specific health domains.

Search Strategy
The initial search comprised two main categories. The first category included a broad set of key terms related to spatial analysis. The second category employed the key terms associated with EHR.
Henceforth, our reference to EHR will also encompass electronic medical records (EMR), electronic patient records (EPR), enterprise data warehouses (EDW), and research data warehouses (RDW). The Boolean operator (AND) was applied to combine the two categories.

Study Selection
The retrieved abstracts and titles were imported into Covidence systematic review software, where duplicate records between the original databases were automatically eliminated. Two reviewers (AM and BH) independently assessed the eligibility of the articles based on the following inclusion and exclusion criteria. The articles were eligible for primary inclusion if they (1) were written in English, (2) were original peer-reviewed articles, (3) used individual-level patient data derived from EHR/EMR/EPR/EDW/RDW, and (4) incorporated at least one form of spatial method. Conversely, the articles were excluded if they (1) were not peer-reviewed (e.g., letters, editorials, reviews, case reports, abstracts, and grey literature), (2) solely geocoded addresses or generated basic visualizations (e.g., dot map and choropleth map) without any spatial analysis, or (3) were not based on US EHR data.
The reviewers (AM and BH) independently reviewed the full texts of all remaining articles. Articles were also excluded if they lacked phenotype characteristics. Further, we manually checked the references of all the selected articles for possible inclusion. We also searched the first 20 pages of Google Scholar for potential inclusions. A third reviewer (AVA) was consulted to break ties.

Data Extraction
Upon identifying articles that satisfied all inclusion criteria, two reviewers (AM and BH) extracted the following items for each article: title, publication year, country and region, sample size, study period, spatial methodologies, and key findings from the spatial methods. Moreover, articles were assessed to identify clinical domains (including primary and secondary when applicable), health conditions or problems, and themes (including social determinants of health (SDOH), environmental factors, ecological aspects, climate, microbiome, genomics, and clinical phenotypic characteristics). Previous publications have emphasized the importance of data domain sources in phenotyping, underscoring the need for validating the created phenotype [15] and using multiple data sources. Thus, in cases where the included publications did not provide details of data sources but instead referenced previously published works, the referenced publications were reviewed. We also documented the number of organizations contributing data. Additionally, we cataloged the types of EHR that served as the sources.

Narrative Synthesis
Following data extraction, the articles were categorized into the following spatial methodology classifications: descriptive, clustering, modeling (frequentist), spatiotemporal (frequentist), and Bayesian. The phenotype characteristics were extracted and recorded as free text. It should be noted that the categories were not mutually exclusive. Quality appraisal of the studies was not feasible due to the substantial heterogeneity in spatial methodologies and health domains. The geospatial distribution of the included studies was visualized using ArcGIS Pro 3.0 (ESRI, Redlands, CA, US).

Study Selection
The initial search yielded 1,758 references. After removing duplicate records, we identified 952 articles for abstract and title screening, from which 375 were selected for full-text review. Out of these, 322 articles were excluded as they only contained geocoding or basic mapping without any spatial analysis. Additionally, 15 articles were omitted due to the absence of patient phenotype characteristics (n = 2) or because they were not based on US data (n = 13). We further manually searched references and Google Scholar and found 11 new articles that met the eligibility criteria. Therefore, 49 articles that fulfilled the inclusion criteria were retained for data extraction and synthesis. Figure 1 depicts the PRISMA flowchart for the study selection process.

Temporal and geographic distribution of studies
While no time restrictions were imposed, a limited number of articles (n = 7, 14.3%) were published prior to 2017. The earliest article included in this study was published in 2011, and the publication frequency has experienced a significant upsurge since 2017 (n = 42, 85.7%). There was only one article [16] at the national level. General characteristics of the included articles are presented in Table 2. Most studies were concentrated in North Carolina (n = 8, 16.3%), Pennsylvania (n = 6, 12.2%), California (n = 6, 12.2%), and Illinois (n = 4, 8.2%). Figure 2 illustrates the geospatial distribution of articles at the state level in the US.

Spatial Methodologies
Most studies used frequentist rather than Bayesian methods. Among frequentist methods, the most prevalent category was clustering (n = 29), followed by descriptive (n = 12), modeling (n = 6), and spatiotemporal analyses (n = 2).

Spatial Sampling
A two-standard deviation ellipse method is employed to optimize spatial sampling density. This ellipse contains almost 95% of the locations of patients and is used to ensure that the collected samples reflect the underlying spatial pattern in data, particularly when resources are limited [62]. [38,39] adopted this approach when sampling women who underwent cytomegalovirus antibody testing during pregnancy, especially in peripheral areas with limited subject representation.
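As an illustration of the idea, a deviational ellipse can be sketched from the mean center and covariance of patient coordinates. The sketch below is an assumption-laden simplification: it uses a plain Mahalanobis-distance test, the function name `deviational_ellipse` is hypothetical, and it is not necessarily the procedure used in the cited studies. The share of points that falls inside a given ellipse depends on the scaling convention of the GIS software used.

```python
def deviational_ellipse(points, n_std=2.0):
    """Return the mean center and a membership test for an n_std ellipse.

    points: list of (x, y) tuples in projected coordinates (assumed).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix of the coordinates
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("points are collinear; the ellipse is degenerate")

    def contains(p):
        dx, dy = p[0] - mx, p[1] - my
        # Squared Mahalanobis distance via the inverse covariance matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return d2 <= n_std ** 2

    return (mx, my), contains


pts = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
center, contains = deviational_ellipse(pts)
```

Locations for which `contains` returns `False` lie in the periphery, where sampling is typically sparse.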

Spatial Overlay
Spatial overlay integrates various spatial data sources, often maps, to represent their shared features. [56] overlaid the map of major radiation treatment interruptions based on race onto the map of median household income. Their analysis implied that regions with higher income levels experienced lower rates of radiation treatment interruption. [48] spatially joined patient addresses to the nearest city parcels and computed an estimate of the incidence of emergency department visits for asthma for each parcel.

Proximity Analysis
Proximity analysis includes measuring distances between geographic features to identify nearby features within a defined distance or buffer zone to uncover proximity patterns [63]. [57] created temporal and spatial buffers to assess the correlation between individual exposure to violent crime and blood pressure. [49] evaluated the associations between environmental factors and body mass index (BMI) within a 0.5-mile network buffer from the place of residence. [23] investigated the associations between prenatal residential greenness and birth outcomes within 250 m and 1,250 m buffers. Utilizing a GIS service area network analysis, [35] examined BMI percentile and proximity to fast-food and pizza establishments among adolescents within 0.25-mile Euclidean and network buffer zones.
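A straight-line buffer of the kind described above can be sketched as a count of features within a fixed distance of a residence. The example below is a minimal sketch that assumes latitude/longitude input and great-circle (haversine) distance; the function names and coordinates are hypothetical. Network buffers additionally require a road network and are not covered here.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_within_buffer(home, places, radius_miles):
    """Count places whose straight-line distance to home is within the radius."""
    return sum(
        1
        for lat, lon in places
        if haversine_miles(home[0], home[1], lat, lon) <= radius_miles
    )


home = (40.0, -75.0)  # hypothetical residence
outlets = [(40.0, -75.0), (40.002, -75.0), (40.01, -75.0)]  # hypothetical outlets
n_near = count_within_buffer(home, outlets, 0.25)
```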

Spatial Interpolation
Ordinary Kriging is one of the most widely used spatial interpolation techniques that leverages the spatial autocorrelation structure of observed locations to estimate values at unmeasured locations [64]. [33] applied ordinary Kriging with a spherical semi-variogram model based on observations of the children's elevated blood lead level (BLL) geocoded to the home address to visualize BLL variations before and after water source changes. [44] interpolated the levels of neighborhood physical disorder based on an exponential variogram. [16] demonstrated spatial variations for the incidence rates of each ICD-9 diagnostic code based on an exponential variogram. [54] estimated monthly average concentrations of fine particulate matter to investigate the associations between air pollution exposure during pregnancy and gestational diabetes mellitus (GDM).
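The spherical and exponential semivariogram models mentioned above can be written compactly. The sketch below shows only the model functions, not a full Kriging solver; the parameter names (`nugget`, `partial_sill`, `range_`) are illustrative, and the factor of 3 in the exponential model assumes the "practical range" convention, which differs between software packages.

```python
import math


def spherical(h, nugget, partial_sill, range_):
    """Spherical semivariogram: rises to the sill exactly at the range."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, partial_sill, range_):
    """Exponential semivariogram: approaches the sill asymptotically.

    Assumes the practical-range convention (about 95% of the sill at range_).
    """
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / range_))
```

In ordinary Kriging, a model of this kind is fitted to the empirical semivariogram and then used to derive the interpolation weights.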

Spatial Clustering
Spatial clustering techniques assess whether health outcomes are random, uniform, or clustered and pinpoint the locations of clusters [65]. Spatial clustering was the most widely used category (n = 29) among all studied categories. Moran's I clustering and cluster detection were the most frequent techniques (n = 10), followed by kernel/point density estimation (n = 5), spatial scan statistics (n = 4), and Getis-Ord Gi* statistics (n = 4).

Kernel/point Density Estimation
Kernel density estimation (KDE) generates a smooth surface to visualize areas of the most significant spatial intensity by calculating a distance-weighted count of events within a specified radius per unit area [66]. Several studies adopted KDE to analyze patterns, including cholera hospitalization [59], comparison of the spatial intensity of chronic kidney disease (CKD) with non-CKD patients [30], and comparison of the spatial intensity of breast cancer and non-breast cancer [52]. Using the point density function, [17] pinpointed hotspots of inpatient bed-day rates within a 2-mile radius of a medical center and [36] estimated the number of participants per square mile.
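The distance-weighted count underlying KDE can be sketched for a single evaluation location. The example below assumes a quartic (biweight) kernel and planar coordinates; the kernel choice and the function name `quartic_kde` are assumptions for illustration, and the cited studies may have used different kernels or bandwidths.

```python
import math


def quartic_kde(x, y, events, radius):
    """Kernel density at (x, y): events per unit area within the search radius.

    Each event contributes a weight that decays with distance and integrates
    to 1 over the plane, so the surface sums to the number of events.
    """
    total = 0.0
    for ex, ey in events:
        d2 = (x - ex) ** 2 + (y - ey) ** 2
        if d2 < radius ** 2:
            total += (3.0 / (math.pi * radius ** 2)) * (1.0 - d2 / radius ** 2) ** 2
    return total
```

Evaluating this function over a regular grid yields the smooth intensity surface that is typically mapped.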

Global and Local Moran's I
Global Moran's I (GMI) evaluates the overall pattern of spatial autocorrelation by testing whether a variable is spatially clustered or over-dispersed rather than randomly distributed, as assumed under the null hypothesis [67]. Local Moran's I, often called LISA, is used to locate statistically significant clusters, including hotspots, cold spots, and outliers [68]. GMI has been adopted to analyze spatial clustering of health outcomes, including GDM [54], day-of-surgery cancellation [43], obesity [55], and COVID-19 [51]. All exhibited clustered patterns. [59] analyzed three groups (depression, obesity, and comorbid cases), confirmed clustering for all outcomes, and identified spatial clusters and outliers. [47] found random distributions for dermatomyositis (DM) and its subtypes, classic DM (CDM) and clinically amyopathic DM (CADM). Meanwhile, [25] pinpointed clusters with higher or lower depression prevalence, and [58] identified a cluster of low utilization of acute pediatric mental health interventions in less-densely populated rural border areas.
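For reference, the global statistic can be computed directly from its definition, I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z are deviations from the mean and S0 is the sum of all weights. The sketch below is illustrative only: the helper `chain_weights` builds a hypothetical binary contiguity structure for areas arranged in a line, and no significance test (e.g., permutation) is included.

```python
def global_morans_i(values, weights):
    """Global Moran's I.

    values:  list of observations, one per area.
    weights: dict mapping (i, j) -> w_ij for i != j (spatial weights).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * z[i] * z[j] for (i, j), w in weights.items())
    ss = sum(zi * zi for zi in z)
    return (n / s0) * (cross / ss)


def chain_weights(n):
    """Binary contiguity weights for n areas arranged in a line (hypothetical)."""
    w = {}
    for i in range(n - 1):
        w[(i, i + 1)] = 1.0
        w[(i + 1, i)] = 1.0
    return w


i_trend = global_morans_i([1, 2, 3, 4], chain_weights(4))
i_alternating = global_morans_i([1, -1, 1, -1], chain_weights(4))
```

A smooth gradient yields a positive value (similar neighbors), whereas an alternating pattern yields a negative value (dissimilar neighbors).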
GMI and (semi)variograms can also identify spatial autocorrelation in model residuals. If detected, the models are adjusted accordingly to avoid biased estimates. For example, [42] [69]. Bivariate LISA assesses the relationship at the local level. [47] employed bivariate GMI for the prevalence of DM, CDM, and CADM with airborne toxics but found no overall spatial dependencies. However, bivariate LISA identified local dependencies at the zip code level. [31] applied bivariate GMI and found significant overall associations between longer (average) distances to the nearest supermarket and higher incidence of diabetes, and bivariate LISA identified significant "high-high" relationships at the zip code level. [28] utilized bivariate LISA and found no local association between radiation therapy interruption and social vulnerability index at the zip code level.

Getis-Ord Gi*
The Getis-Ord Gi* statistic identifies high- or low-value clusters (hotspots and cold spots) by assessing deviations of health outcomes at locations from the average within a defined neighborhood [70]. [40] measured racial residential segregation by examining the deviations in the (proportion of) African American residents in each census tract from the mean of neighboring tracts.
Similarly, [45] measured the racial residential segregation for the percentage of non-Hispanic Black residents. [7] identified significant community-onset methicillin-resistant Staphylococcus aureus (CO-MRSA) hotspots with distinct patterns between cases and controls. [37] detected the high- and low-value clusters for the child opportunity index and median household income.
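The Gi* statistic for one location can be sketched as a z-score that compares the local sum (the location plus its neighbors) with what the global mean would predict. The example below assumes binary weights and a hypothetical line of eight areas; the function name and data are illustrative, and multiple-testing corrections are omitted.

```python
import math


def getis_ord_gi_star(values, neighbors, i):
    """Getis-Ord Gi* z-score for location i with binary weights.

    neighbors: dict mapping index -> list of neighboring indices.
    Location i itself is included in the local sum (the "star" variant).
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    idx = set(neighbors[i]) | {i}
    sum_w = float(len(idx))   # sum of binary weights
    sum_w2 = float(len(idx))  # sum of squared binary weights
    local_sum = sum(values[j] for j in idx)
    numerator = local_sum - mean * sum_w
    denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
    return numerator / denominator


# Hypothetical rates for eight areas arranged in a line
rates = [10, 9, 8, 1, 1, 1, 1, 1]
nbrs = {k: [j for j in (k - 1, k + 1) if 0 <= j < len(rates)] for k in range(len(rates))}
z_hot = getis_ord_gi_star(rates, nbrs, 1)
z_cold = getis_ord_gi_star(rates, nbrs, 6)
```

Large positive z-scores indicate hotspots and large negative z-scores indicate cold spots.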

Spatial Scan Statistics
The spatial scan statistics technique identifies high- and low-risk clusters and estimates their relative risks [71]. It also can incorporate covariates to characterize underlying patterns [72]. [42] found that people living in zip codes within the primary cluster had an almost 2.5 times greater risk of NTM disease. [41] identified clusters of under-immunization and vaccine refusal among children, with rates ranging from 18% to 23% inside the clusters compared to 11% outside.
The technique can also pinpoint cold spots. [21] identified areas with significantly lower COVID-19 testing than expected, indicating a need for interventions. [60] observed significantly low rates of up-to-date colorectal cancer screening.
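For a single candidate window, the relative risk and the Poisson log-likelihood ratio used in scan statistics can be sketched as follows. This is a simplified illustration for high-rate clusters only: the function name and inputs are hypothetical, and a complete analysis would evaluate many candidate windows and assess significance by Monte Carlo replication, which is omitted here.

```python
import math


def poisson_scan(cases_in, pop_in, cases_total, pop_total):
    """Relative risk and log-likelihood ratio for one candidate window."""
    c, big_c = cases_in, cases_total
    e = big_c * pop_in / pop_total  # expected cases inside under the null
    # Relative risk: risk inside the window divided by risk outside
    if big_c > c:
        rr = (c / e) / ((big_c - c) / (big_c - e))
    else:
        rr = math.inf
    # Log-likelihood ratio for a high-rate cluster (zero if no excess)
    if c <= e:
        llr = 0.0
    else:
        llr = c * math.log(c / e)
        if big_c > c:
            llr += (big_c - c) * math.log((big_c - c) / (big_c - e))
    return rr, llr


rr_high, llr_high = poisson_scan(50, 2000, 100, 10000)
rr_low, llr_low = poisson_scan(10, 2000, 100, 10000)
```

The window with the largest log-likelihood ratio is reported as the primary cluster.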

Spatial Modeling (Frequentist)
Among the included articles, generalized additive models (GAMs) emerged as the most frequently employed spatial models. GAMs can account for spatial autocorrelation by incorporating smooth functions (such as thin-plate regression splines) of spatial coordinates [73], allowing the estimation of geographic variation with or without covariate adjustments. GAMs identified spatial variabilities in asthma prevalence [3,8] and cytomegalovirus [38,39], although such variations often diminished when adjusted for demographic factors such as race and age. Among less commonly used geospatial models were generalized linear mixed effects [51] and spatial error [43] models.

Bayesian Analysis
The articles employing Bayesian methods were categorized into Empirical Bayes smoothing (n = 5) and Bayesian modeling (n = 6).
Empirical Bayes smoothing was employed in [40,43,55,59] to stabilize estimated rates in areas with limited data points by borrowing information from the overall population [74]. [61] employed nonparametric kernel smoothing to estimate the prevalence of childhood obesity in areas with sparse observations (n < 20 individuals).
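The shrinkage behind Empirical Bayes smoothing can be sketched with a global method-of-moments estimator, in which each raw rate is pulled toward the overall rate with a weight that grows with the area's population. The sketch below is one common formulation and is an assumption; the cited studies may have used local variants or other software defaults, and the counts and populations are hypothetical.

```python
def empirical_bayes_rates(counts, populations):
    """Global Empirical Bayes smoothed rates (method-of-moments sketch)."""
    k = len(counts)
    total_pop = sum(populations)
    m = sum(counts) / total_pop  # overall (prior mean) rate
    raw = [c / p for c, p in zip(counts, populations)]
    # Population-weighted variance of raw rates around the overall rate
    s2 = sum(p * (r - m) ** 2 for r, p in zip(raw, populations)) / total_pop
    # Prior variance estimate, truncated at zero
    a = max(0.0, s2 - m / (total_pop / k))
    smoothed = []
    for r, p in zip(raw, populations):
        w = a / (a + m / p) if a > 0 else 0.0  # shrinkage weight in [0, 1]
        smoothed.append(w * r + (1.0 - w) * m)
    return smoothed


counts = [0, 30, 50]    # hypothetical case counts
pops = [10, 100, 1000]  # hypothetical populations at risk
eb = empirical_bayes_rates(counts, pops)
overall = sum(counts) / sum(pops)
```

In this example the smallest area, whose raw rate of zero rests on only 10 people, moves furthest toward the overall rate, while the largest area is almost unchanged.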
Bayesian modeling can account for spatial and temporal dependencies and quantify uncertainty by specifying prior distributions [75]. Among the articles, the conditional autoregressive (CAR) prior emerged as the most used, with two variants: intrinsic and multivariate CAR. Intrinsic CAR was used to assess the spatial variations in diabetes in relationship with racial isolation [18], hypertension related to racial isolation [19], and type 2 diabetes mellitus with the built environment [20]. Multivariate CAR was employed to identify areas with higher or lower-than-expected prostate cancer while controlling for risk factors [29]. Moreover, hierarchical Bayesian modeling, which can incorporate hierarchical structures for modeling interactions in data with multiple levels [76], was used to investigate spatial distributions of patients admitted for drug-related reasons concerning the area deprivation index [24]. Bayesian negative binomial hurdle models, which can account for excessive zeros and overdispersion, were used by [26] to examine spatial variation between patient responses to the questions concerning unhealthy home environments and the mean number of emergency department visits after screening.
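The intuition behind the intrinsic CAR prior can be illustrated through its full conditional distribution: given its neighbors, the spatial random effect of an area is normally distributed around the neighbors' average, with a variance that shrinks as the number of neighbors grows. The sketch below only evaluates that conditional mean and variance for binary adjacency; it is not a model-fitting routine, and the names and values are hypothetical.

```python
def icar_conditional(phi, neighbors, i, tau2):
    """Conditional mean and variance of phi[i] under an intrinsic CAR prior.

    phi:       current values of the spatial random effects.
    neighbors: dict mapping index -> list of adjacent area indices.
    tau2:      conditional variance parameter of the prior.
    """
    nbrs = neighbors[i]
    mean = sum(phi[j] for j in nbrs) / len(nbrs)  # average of neighboring effects
    var = tau2 / len(nbrs)                        # more neighbors, less variance
    return mean, var


phi = [0.2, 0.4, 0.9]                    # hypothetical random effects
adjacency = {0: [1], 1: [0, 2], 2: [1]}  # three areas in a row
cond_mean, cond_var = icar_conditional(phi, adjacency, 1, 1.0)
```

In practice these conditionals are embedded in a sampler or approximation scheme provided by dedicated Bayesian software.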

Phenotyping Clinical Domain Characteristics and Themes
The largest categories of articles were classified under the infectious disease (n = 7), endocrinology (n = 7), and oncology (n = 6) domains. Additionally, 19 articles had a pediatric domain or focus, as noted with an additional column in Table 3. Maternal and newborn care was classified as its own domain (n = 8), but it overlapped with other domains such as nephrology, endocrinology, and infectious disease.
The relationship between the clinical domains and the "conditions/problems of focus" in each article was examined (Table 3). In some cases, direct correspondence was observed, while in other instances, the "condition/problems of focus" differed from the phenotype of the patient cohort. In many articles, one or more overlapping domains were observed (e.g., rheumatology, neurology, and dermatology for the study of dermatomyositis). Asthma (n = 5), hypertension (n = 5), and diabetes (n = 4) were studied most frequently. Three articles did not focus on any health condition but rather on examining disparities in either a data source or a specific domain or cohort (e.g., disparities in the use of pediatric intensive care units).
Every article was attributed to at least one prominent theme, with the possibility of multiple themes. SDOH themes were prevalent in many articles. To organize and present this information, we utilized the domains defined by the Healthy People 2030 framework [77]. There are five domains in the SDOH framework (Table 4), with the corresponding counts of these domains being seen as themes of the articles. Most articles had one or more SDOH themes (n = 42). Many articles focused either on all the domains or on SDOH holistically without particular focus on any specific domain (n = 32). However, some articles contained prominent themes that were not directly related to SDOH, which were phenotypic features (n = 4), followed by environmental (n = 3) and ecological (n = 2), with climate, genomics, and microbiome each contributing one article.

Clinical Phenotype Features
For each publication, clinical phenotype definitions were extracted (Supplementary Appendix S1). In almost all studies, phenotype definitions included demographic details such as patient age, race, and gender, along with some diagnostic characteristics (e.g., asthma diagnosis). Only a limited number of phenotypes were observed to be validated (n = 8). The most frequently observed method for phenotype validation was a manual chart review of all matches or a sample of matched charts. None of the articles with chart review as a validation method shared information on the match rate. Additionally, only two articles [20,59] were observed to utilize validated eMERGE Network computable phenotypes from the Phenotype Knowledgebase (PheKB) [78-80].

Discussion
This systematic review is the first comprehensive investigation of spatial methodologies within EHR-derived data in the US. Spatial clustering and descriptive analysis were the most used methods, while space-time modeling, either frequentist or Bayesian, remained under-explored. The diverse use of spatial analysis for EHR-derived data in different health domains highlights the potential to incorporate spatial methods to enhance the context of individual patients for future biomedical research. We found limited use of EHR-derived data for spatial analysis, probably due to the challenge of safeguarding patient privacy. Address data, crucial for spatial analysis, is highly confidential and often restricted from sharing.
Researchers and institutions often use geographic masking techniques [53,81] to balance data utility and privacy protection by altering the precise geographic coordinates while preserving the overall spatial characteristics of data.Encouraging the adoption of spatial analysis could promote biomedical knowledge sharing and collaboration.
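One simple form of geographic masking is random perturbation within a ring ("donut" masking), in which each point is displaced by at least a minimum and at most a maximum distance in a random direction. The sketch below is a generic illustration under stated assumptions (projected coordinates in meters, uniformly drawn distance, hypothetical function name and parameters); it is not the specific technique of the cited studies and makes no claim about the privacy guarantee achieved.

```python
import math
import random


def donut_mask(x, y, min_dist, max_dist, rng):
    """Displace a point by a random distance in [min_dist, max_dist].

    x, y are projected coordinates (assumed meters); rng is a random.Random
    instance so that results can be reproduced when needed.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)  # random direction
    dist = rng.uniform(min_dist, max_dist)   # random displacement length
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


rng = random.Random(0)
masked = [donut_mask(1000.0, 2000.0, 50.0, 150.0, rng) for _ in range(100)]
```

The minimum distance prevents a masked point from landing on the true address, while the maximum distance limits the distortion of the overall spatial pattern.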
The application of spatiotemporal analysis of EHR-derived data was mainly limited to exploring spatiotemporal clusters with no spatiotemporal modeling. This might be due to the technical expertise required for analysis, data complexity, availability of longitudinal data, and computational challenges. Bayesian methods offer a more adaptable framework to handle complex spatial and temporal dependencies, control confounding variables [82], and incorporate prior information, such as existing medical literature and expert opinions, resulting in more interpretable results [83,84].
Moreover, spatiotemporal Bayesian modeling can aid in understanding disease trends and progressions, seasonality, and long-term shifts at the local levels [85]. Bayesian modeling can better account for uncertainty in parameter estimates and predictions to assess the reliability of findings before implementing interventions [86]. Future research should delve into spatial and spatiotemporal modeling, focusing on Bayesian approaches.
Among the health conditions studied, chronic and infectious diseases emerged as the most frequently investigated domains compared to others.This disparity may be attributed to the pressing public health concerns posed by diseases with immediate impacts that often attract more funding and resources for research initiatives [87,88].The historically high mortality rates of these conditions likely led to continuous research.Surprisingly, despite the plethora of funding in cancer research, we only found a small number of articles within the cancer domain, which may likewise be attributed to and indicative of the pressing needs of other domains, such as infectious disease.
We observed recurring and prominent themes related to the SDOH. This emphasis may result from the growing maturity and increased awareness within the biomedical informatics community regarding the significant influence of social, economic, and environmental factors on health outcomes. Understanding the roles of SDOH in health disparities will likely lead to the implementation of integrative health interventions that address the needs of individuals affected by these health disparities. These interventions can likewise be enhanced by incorporating spatial perspectives.
Another missed opportunity is the underutilization of computable phenotypes, which are automated algorithms designed for characterizing diseases and enrolling patients in studies. Most studies primarily depended on the manual application of inclusion and exclusion criteria to define phenotypes. While this method may be suitable in certain scenarios, it often necessitates greater depth and granularity to consistently and accurately capture the intended patient cohorts. The accuracy and precision of the manual approach can vary depending on the data sources and clinical domains. Notably, only two of the studies in our review used computable phenotypes, indicating a significant underutilization of this essential and potentially transformative approach and highlighting a noteworthy area for growth. Furthermore, only five articles carried out any form of chart review validation. Validation methods, including chart reviews, genetic markers, and clinical variables, are indispensable in phenotyping to guarantee the accurate characterization of the desired cohorts. This applies even to computable phenotypes within specific medical domains [89].
This study has several main limitations. First, we only considered English articles, possibly introducing language bias. Additionally, selection bias is possible due to database availability. However, we mitigated these limitations by searching Google Scholar and conducting backward reference checking to identify relevant studies that might not have been identified through our initial search strategy. Lastly, we used a query search strategy with limited keywords, which inherently restricted the scope of articles we could retrieve, potentially omitting studies that did not utilize these specific terms in their abstract or title.

Conclusion
This systematic review provided a comprehensive overview of the current utilization of spatial analysis in EHR-based research in the US and underscored the pivotal role that spatial analysis can play in clinical decision support and interventions.The utilization of EHR-derived spatial analysis is on an upward trajectory, parallel with the widespread adoption of EHR systems.The volume of articles on this topic is anticipated to continue to grow.The primary health outcomes investigated were asthma, hypertension, and diabetes.Notably, patient phenotypes involving genomics, imaging, and notes were rarely utilized.This review also highlighted the need for additional exploration of spatial analysis techniques, including but not limited to spatiotemporal Bayesian analysis and modeling, particularly in the cancer domain.

Figure 1.
PRISMA study selection flowchart.

Table 2.
General characteristics of the included studies

Table 3.
Clinical domains and condition/problem of focus for each publication (row). *Publications with more than one clinical domain and those with a pediatric component are noted as such. The condition/problem of focus column displays the general condition of the article and may not directly correspond to the phenotype.

Table 4.
SDOH themes examined within the framework of Healthy People 2030 SDOH domains.