Meaningful Information in the Age of Big Data: A Scoping Review on Social Determinants of Health Data Collection for Electronic Health Records


Background: Electronic Health Records (EHRs) are key tools for integrating patient data into health information systems (IS). Advances in automated data collection methodology, particularly the collection of social determinants of health (SDOH), provide opportunities to advance health promotion and illness prevention through advanced analytics (i.e. “Big Data” techniques). We ask how current data collection processes in EHRs permit SDOH data to flow throughout health systems. Methods: Using a scoping review framework, we searched the medical literature to identify current practices in SDOH data collection within EHR systems. We extracted relevant information on data collection methodology, focusing specifically on uses of automated technology. We discuss our findings in the context of research methodology and potential for health equity. Results: Practitioners collect a variety of SDOH data at the point of care through EHRs, predominantly via embedded screening tools and clinical notes, and primarily capturing data on financial security, housing status, and social support. Health systems are increasingly using digital technology in data collection, including natural language processing algorithms. However, overall use of automated technology is limited to date. End uses of data pertain to improving system efficiency, patient care coordination, and addressing health disparities. Discussion & Conclusion: EHRs can realistically promote collection and meaningful use of SDOH data, although EHRs have not been used extensively to collect and manage this type of information. Future applied research on systems-level application of SDOH data is necessary, and should incorporate a range of stakeholders and interdisciplinary teams of researchers and practitioners in the fields of health, computing, and social sciences.

Nevertheless, the health care sector is increasingly incorporating data-driven approaches into operations. While new health data streams entail governance challenges, the potential for meaningful public health applications is vast (15). "Big Data" is considered to have considerable potential in core functions of Public Health, including surveillance, hypothesis-generating research, causal inference (16), and promoting health equity (17).
While technological changes have contributed to medical research through increased data acquisition in fields such as genomics (18), health profiles remain incomplete without SDOH data. Advanced analytic techniques with large data sets have been shown to accurately identify and categorize SDOH, such as the structure of social networks (19) or risk of food insecurity (20). Unfortunately, and as mentioned above, clinical interactions do not in general include screening for SDOH, meaning that highly influential data pieces are not consistently considered in health analyses.
The reasons for inconsistent SDOH screening are multi-factorial (12,21). There is wide variability in data collection and management standards (14), leading to EHRs often lacking well-designed documentation tools and mechanisms for managing SDOH data (13).
Peer-reviewed knowledge on best practice in SDOH collection through EHRs is also still relatively nascent (7), and there is limited evidence that SDOH data can be used to effectively minimize health harms arising from social conditions (14,22). As such, this review aims to clarify the current potential for Big Data in advancing health equity by focusing on SDOH data collection methodology and use of automated collection techniques in EHRs.

Methods
The following research question guided the scoping review: How are SDOH factors currently being integrated into Electronic Health Records?
The search strategy incorporated the three-step approach recommended by the Joanna Briggs Institute (23), including initial database searching, an iteratively revised comprehensive database search after increased familiarity with search terms, and a final search of the reference lists of the included studies. We reviewed a comprehensive medical database (PubMed) using iterative search strings based on the search terms found in table 1.1. We included the MeSH descriptor data to clarify the search concepts for the purpose of this review. Our final search was conducted in May 2019, and included articles retrievable through PubMed up until that date.

Component: Population
Inclusion:
-Stakeholders in EHR or clinical information system (CIS) data, including patients, health care practitioners, and individuals in health service delivery systems.
Exclusion:
-Not EHR or CIS users.
-Data warehouse subjects for r purposes only.

Component: Concept
Inclusion:
-Research on SDOH data collection, including SDOH variables such as demographic data, social and economic risk factors, cultural/political identity.
Exclusion:
-Not research, such as perspec guidelines.
-Collection process does not in SDOH data, i.e., exclusively bio ecological, psychological, or be data.

Component: Context
Inclusion:
-Data collection processes within an EHR system, including automated or otherwise technologically enhanced processes.
Exclusion:
-Did not use an EHR system, e record only, or screening tool n connected to EHR system.
-Not data collection, e.g. data management or analysis only.

We extracted data from the included articles on key aspects of data collection methodology, such as research design, operational definitions, collection tools, data type, and data use. We also extracted information on current or potential use of automated data collection tools in order to understand the current state of automated technology in SDOH data collection. We also included the EHR and EHR context (country of origin, details of CIS, type of health service organization, etc.). We identified 3 articles in PubMed using all three search concepts. We eliminated the third concept, data collection, from our search strategy (table 1.1), which then yielded 168 results. We retained the concept as part of the selection strategy during full-text review.
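The broadening step described above (dropping the third search concept) can be illustrated with a small sketch. It only assembles Boolean search strings and does not contact PubMed. The term lists are assumptions chosen for illustration, not the review's actual terms, which are given in table 1.1; `build_query` and `CONCEPTS` are hypothetical names.

```python
from typing import Dict, List


def build_query(concepts: Dict[str, List[str]]) -> str:
    """OR the terms within each concept, then AND the concepts together."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts.values()]
    return " AND ".join(groups)


# Hypothetical term lists (assumption): the review's real terms are in table 1.1.
CONCEPTS = {
    "sdoh": ['"Social Determinants of Health"[MeSH]'],
    "ehr": ['"Electronic Health Records"[MeSH]', '"electronic medical record"[tiab]'],
    "collection": ['"Data Collection"[MeSH]'],
}

# All three concepts combined: the narrowest search.
full_query = build_query(CONCEPTS)

# Dropping the third concept (data collection) broadens the search.
broad_query = build_query({k: v for k, v in CONCEPTS.items() if k != "collection"})

print(full_query)
print(broad_query)
```

Because each concept is joined with AND, removing a concept can only keep or enlarge the result set, which is consistent with the jump from 3 to 168 results reported here.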
Search results are presented in figure 1.1. A summary of extracted findings is presented in table 2.1. All extracted data are available in the supplementary data file.

Data input format
Social history or social screening sections of the EHR were commonly used to operationalize SDOH data collection, providing structured and unstructured digital data.
Screening tools included binary screening questions (e.g. "Are you having problems with housing conditions?") (12), Likert scales (e.g. "How would you rate [the client's participation in] social network (family, work, friends)?") (24), and categorical or interval survey items (e.g. "What is your sexual orientation", "What was your total family income before taxes last year?") (25) for self- or practitioner-administered questionnaires accessible through EHR portals (5-7, 9, 21, 24-28). Free-text documentation in clinical notes also contained SDOH data in the form of common social history topics (29,30) such as the impact of "monetary assets, occupational level/security, and housing of housing stability and social support" on health (30). Some studies specified that a combination of both structured and unstructured SDOH data was contained in EHRs (12,31,32).
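As a rough illustration of how the structured item types described above differ, the following sketch models binary, Likert, categorical, and interval items with a simple validity check. It is not drawn from any of the reviewed EHR systems: the class name, field names, the 1-5 Likert range, and the categorical options are all assumptions made for the example. Only the question wording is taken from the reviewed studies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

Answer = Union[bool, int, float, str]


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


@dataclass(frozen=True)
class ScreeningItem:
    prompt: str
    kind: str  # "binary", "likert", "categorical", or "interval"
    options: Optional[Sequence[str]] = None  # used by categorical items only
    scale_max: int = 5  # assumed Likert range 1..scale_max

    def validate(self, answer: Answer) -> bool:
        """Return True if the answer has the right form for this item type."""
        if self.kind == "binary":
            return isinstance(answer, bool)
        if self.kind == "likert":
            return (isinstance(answer, int) and not isinstance(answer, bool)
                    and 1 <= answer <= self.scale_max)
        if self.kind == "categorical":
            return self.options is not None and answer in self.options
        if self.kind == "interval":
            return _is_number(answer) and answer >= 0
        raise ValueError(f"unknown item kind: {self.kind}")


housing = ScreeningItem("Are you having problems with housing conditions?", "binary")
network = ScreeningItem("How would you rate social network (family, work, friends)?", "likert")
income = ScreeningItem("What was your total family income before taxes last year?", "interval")
# Option labels below are placeholders (assumption), not the study's response set.
orientation = ScreeningItem("What is your sexual orientation", "categorical",
                            options=("option A", "option B", "prefer not to say"))

print(housing.validate(True), network.validate(6), income.validate(42000.0))
```

Free-text notes fall outside this model by design, which is one reason unstructured SDOH data call for different tooling.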

Data collection frameworks
Screening questions were developed based on national or professional guidelines. The theoretical underpinnings of data categories were not always provided, but evidence for a basis in local population contexts was present in several descriptions of the collection methods (7, 28).

Automation in data collection tools
In all the cases identified in this review, a health care professional provided the data entry point into the health system. Physicians, social workers, or other health care staff were primarily responsible for identifying and coding relevant social information. Several SDOH data collection tools were self-administered, but only at the point of care (25,28). We identified no studies in which patient-provided free-text data were analysed and screened for SDOH data. Verbal or written forms of collection were used exclusively, and no other input (e.g. visual or audio) was documented in the studies included in our review.
No health system had a fully automated collection process, although automation was relevant to collection processes through the use of digital devices for gathering patient information. While this review did not specifically focus on practices beyond collection, our collected articles addressed health care staff's preference for free-text notes over questionnaires (12), as well as the possibility for Natural Language Processing (NLP) algorithms to efficiently and reliably identify SDOH data (29,32,34). One study revealed that data extracted from clinical notes were more comprehensive than data extracted from screening questionnaires (32).
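To give a sense of what automated identification of SDOH content in free-text notes involves at its very simplest, here is a keyword-matching sketch. It is far cruder than the NLP algorithms discussed in the cited studies and is not a reconstruction of any of them. The domain names and keyword lists are assumptions, and the function name `flag_sdoh` is hypothetical.

```python
import re
from typing import Dict, List

# Illustrative keyword lists (assumption); not taken from any reviewed system.
SDOH_KEYWORDS: Dict[str, List[str]] = {
    "housing": ["housing", "homeless", "mold", "rodent", "group home"],
    "social_support": ["caregiver", "lives alone", "support system"],
    "financial": ["unemployed", "make ends meet"],
}


def flag_sdoh(note: str) -> Dict[str, List[str]]:
    """Return, per SDOH domain, the keywords found in a free-text note."""
    text = note.lower()
    hits: Dict[str, List[str]] = {}
    for domain, keywords in SDOH_KEYWORDS.items():
        # Whole-word match so that e.g. "mold" does not match "molding".
        found = [kw for kw in keywords
                 if re.search(r"\b" + re.escape(kw) + r"\b", text)]
        if found:
            hits[domain] = found
    return hits


note = "Pt lives alone in an apartment with mold; reports trouble trying to make ends meet."
result = flag_sdoh(note)
print(result)
```

A matcher like this ignores negation ("no mold in the home"), misspellings, and context, which is part of why the reviewed studies turn to more capable NLP methods.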
The research articles did not always explicitly mention the end uses of the data. However, the rationale generally pertained to system improvement (7, 21, 24, 34, 36), such as determining efficient practices in collecting information on social support (37); preventative health (27,29,30,36), such as understanding cancer screening practices within a population (9); and promoting health equity (25,27,34), for instance by improving level of assistance with SDOH needs (7). Other studies reported that SDOH data collection at the care site was a response to specific mandates, such as the Veterans Health Administration's (VHA) call to end homelessness among users (21).
This may influence the generalizability of our findings, as unique features of American political economy, such as health policies and specific population diversity and inequalities, may shape SDOH data collection methodology. One of the two articles from Canadian study sites provides an example of how regional information management contexts influence SDOH screening. The authors stated that their survey design did not initially follow the principles of Ownership, Control, Access, and Possession (OCAP®), which govern research concerning Canada's Indigenous populations (38), and that this influenced how they reported study findings (6). This observation points to the need for additional research on contextually driven data collection methods, as well as further investigation into how recommendations from leading health organizations (e.g. the Institute of Medicine) or guidelines (e.g. OCAP®) influence practices in SDOH surveillance.
The search strategy, particularly the selection of key terms, also shaped the article pool. We searched for "Social Determinants of Health", rather than individual terms associated with SDOH. This was considered more feasible because the broad categories of SDOH (e.g. income, education, occupation) each lead to a wide range of variables. For instance, income could be conceptualised as 'family income', 'after tax income', 'accumulated wealth', 'ability to "make ends meet"', etc. Furthermore, compiling a complete set of all social determinants of health is not possible. Such a list could conceivably contain such factors as 'uneven sidewalks', 'political corruption', 'density of green spaces', etc. We therefore limited the search to 'Social Determinants of Health' as the concept of interest.
Although this permitted us to further clarify how this concept is operationalized in contemporary research with EHRs, this search strategy would not have fully captured the breadth of research articles pertaining to collection of SDOH data that may not have been conceptualized as such. For example, information on 'early childhood experiences' was not collected in the EHRs described in this review, although this is a well-established SDOH (11). The full spectrum of SDOH data collection methodologies may therefore not be represented here.

Discussion
This review demonstrates that several health systems have been able to integrate SDOH data into EHRs. As data collection methods have apparently been designed to avoid disrupting workflow (25,27), EHRs with embedded SDOH data collection tools could conceivably be scaled and expanded into other health systems. Further research into the information technology should improve efficiency and accuracy of data entry. However, before technical processes of SDOH data collection are developed, it is worthwhile to consider the meaningfulness of these data points and their potential to impact health equity.

Care access barriers
The first point to consider in population analyses using EHR data is that data collected through EHRs represent only a subset of the population: those with access and ability to navigate electronic health portals, or individuals with access to a health care provider who has access and ability to navigate electronic health portals. The parameters of EHR-derived datasets are therefore limited by technology infrastructure, meaning they are already defined by the privilege of access to health services and information technology.
Evidence further suggests that SDOH (specifically intersections of age, income, and race) influence the likelihood of using internet technology in health contexts (39), creating complex intersections of health inequities. Indeed, inequalities in access to information communication technology remain "one of the biggest hurdles" to enhancing well-being through digital tools (40).
Separate from population-level analyses, the addition of SDOH data into EHRs is expected to advance precision health for individual patients by supporting clinical decision-making (29). However, given the access barriers mentioned above, and our finding that most processes for SDOH data collection for EHRs take place within primary care settings, where "persistent health and access inequities" are still actualized (41), caution must be taken to avoid undermining equity by developing technology which excludes certain subsets of the population from health advancements.

Situated ontologies
Technological advancements in language processing and data management suggest that SDOH data collection methods may become efficient to the point where 'Big Data' analytics are possible. Even so, the ontological nature of SDOH data and social knowledge paradigms shed light on the complexity of data flow. As identified in this review, SDOH variables used in EHRs varied depending on the local context. Some health practitioners may screen for only a handful of loosely defined variables, while others may collect precisely formatted information in as many as 108 domains (31). Decisions around which data to collect are determined by the priorities, and subsequent informational needs, of the health system in question. For example, considerable research is devoted to which data are necessary to improve care continuity and chance of recovery (24), reduce rate of readmission (32), or improve population rates of participation in preventative practices (6).
The landscape of SDOH metrics and valid associations therefore changes across environments based on situated needs. Consequently, individuals and systems derive knowledge on health risks associated with SDOH from subjective perspectives, or the knowledge produced through a specific position in time and place (42). This points to the fact that SDOH are social constructs (43) defined and created through narrative discourse (44). By shaping the representation of the data, text-based tools for identifying and capturing SDOH data also contribute to developing the construct itself.
This has both direct and indirect effects on the patient. For example, a patient may feel that the provided SDOH information may further extend power differentials through stigmatization (40). To clarify, physicians may exert authority over the patient via their professionalization; the act of introducing new perceived differentials, such as through income or education "class", can further distance the patient from their provider. This can lead to discomfort in the interaction and/or low response rates to SDOH screening (25).
Extreme lack of cultural safety, or awareness and deconstruction of cultural power imbalances, can even lead to care avoidance (45), with serious consequences for health.
Indirectly, the act of classifying the patient determines their representation elsewhere in the IS. While the continuous and accurate representation of a patient is an ideal target for health systems, this includes the ability to represent the patient throughout their changing life course. There is therefore a need to engage with SDOH data as dynamic rather than static information, and to recognize that health professionals produce these data via relational discourse.

Relationality in data collection methods to disperse concentration of social power
Prioritizing input and collaboration from various data users throughout the local information network in IS design may clarify necessary and situated methodological considerations for SDOH data collection. Roadmaps created through 'stakeholder engagement' (46) are examples of how health system administrators can incorporate local experience into the design of effective SDOH data collection systems. Further integrating discourse from multiple perspectives into SDOH data collection also "troubles" the current constructive narratives, to borrow a sociological term. By redistributing and sharing control over language, IS design which follows a participatory approach to development can dislodge the privilege of certain positions (e.g. that of medical professionals) and create a more ethical and equitable process of documenting and interpreting social reality.
The reviewed literature showed an apparent gap in input from patients in IS design. The specific collection methods can shape the patient's experience with the health system, which is an SDOH variable in and of itself (47), as well as determine how they are represented through health data. Data collection is far from a socially neutral process: we are cautioned that if the field of data science does not provide space to address how "scientific practices themselves inadvertently legitimate and further disseminate political and cultural values and interest" (for example, 'institutional erasure' of gender non-conforming individuals in the health system (48)), it may "end up complicit" in perpetuating social inequality (49). As methodological decisions for capturing SDOH data require critical thought and direct experience with social power structures, as well as consideration for mechanical and professional feasibility, interdisciplinary and participatory research is a fundamental aspect of future work.

Refocusing data sources and collection technologies for transformational information flow
While EHRs are designed to capture information on individual patients, patients are not the sole source for understanding the nature of a given environment as it pertains to health. Linkage with data from other sources, such as government or other public data sets or direct observations from the care setting, also appears to be necessary to create a picture of the social context surrounding the health delivery system. In addition to permitting entry of secondary data into the IS, these connections would allow health evidence to flow into policy decision-making mechanisms. As policy-level decisions are necessary to effectively modify the social structures which contribute to disease (50), and clinical-level treatment recommendations or service recommendations alone are likely to be inefficient in addressing health disparities (51), an information infrastructure which connects health data to policy decisions could lead to greater impact. We noted that several EHR systems were connected to broader data sets through governmental or academic partnerships (e.g. (13)). While data linkage was beyond the scope of this review, interoperability between data sources should be considered in future research in IS design.
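The kind of linkage discussed above can be pictured, in its most basic form, as joining patient-level records to area-level public data on a shared geographic key. The sketch below is purely illustrative: the records, field names (`area_code`, `median_income`, `walkability`), and values are invented, and real linkage would also have to address consent, privacy, and interoperability standards that are not modelled here.

```python
from typing import Any, Dict, List

# Hypothetical patient-level records from an EHR (all values invented).
patients: List[Dict[str, Any]] = [
    {"patient_id": "p1", "area_code": "A01", "housing_problem": True},
    {"patient_id": "p2", "area_code": "B02", "housing_problem": False},
    {"patient_id": "p3", "area_code": "Z99", "housing_problem": False},
]

# Hypothetical area-level public data keyed by the same geographic code.
area_data: Dict[str, Dict[str, Any]] = {
    "A01": {"median_income": 41000, "walkability": 0.35},
    "B02": {"median_income": 78000, "walkability": 0.80},
}


def link(records: List[Dict[str, Any]],
         context: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach area-level context to each record without mutating the input."""
    linked = []
    for record in records:
        row = dict(record)  # shallow copy so the source record is untouched
        # None marks records whose area has no matching public data.
        row["area_context"] = context.get(record["area_code"])
        linked.append(row)
    return linked


linked = link(patients, area_data)
for row in linked:
    print(row["patient_id"], row["area_context"])
```

Keeping unmatched records with an explicit `None` rather than dropping them makes gaps in the secondary data visible, which matters when the populations missing from public data sets overlap with those facing access barriers.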
In addition to data linkage, emergent technologies permit innovation in the type of SDOH data admissible into an IS. Systems could incorporate non-text based data formats, such as geo-spatial data to determine neighbourhood 'walkability' (52), social network data on levels of social support (53), as well as a variety of input from the clinical setting, where interactions can be considered SDOHs in and of themselves. Although modern technology provides the capacity to "track, synthesize, and visualize" features of a patient's social context, EHRs have largely not capitalized on these capabilities (54). Research on technologically enhanced data collection tools for SDOH is a current gap in the literature and their use in collecting and integrating data into EHRs is a potential avenue for exploration.

Future directions in applied research
We identified multiple health information systems which currently incorporate SDOH data from EHRs into their operations. In spite of similar software platforms and collection tools, SDOH variables were not identical across contexts. Applied research should therefore consider local context in EHR design, namely the unique care pathways and social determinants characterising a given population. Situated knowledge on SDOH also serves to promote accountability and impact by generating locally usable and relevant information. Future contextual research, particularly research using transdisciplinary participatory methodologies, may further refine best practices in data collection as well as clarify (and avoid, in so far as possible) how data streams can perpetuate health inequities.
It is also important to note that 'Big Data' refers not only to the amount of data, but to the proportion of collected data relative to all available information (55). As the data economy has shown, patients, or rather human beings, are nearly unlimited sources of data. While a complete profile of all relevant health information for every individual patient is not possible, the comprehensiveness of SDOH data could be enhanced by incorporating relevant SDOH information from outside the health sector, such as public domain or community data (56-58). Further research on data linkage is a next step in developing SDOH data frameworks in EHRs. Conversely, in comparison to the patient, the health service delivery system may be a more appropriate unit of analysis. In such a scenario, data access barriers are reduced, as the system is monitoring itself, and the potential for impact would be greater, as the health system has a higher degree of control over its own decisions and 'behaviours' compared to the control it exerts over patients. As factors within the health system greatly influence the experience of care, a shift in surveillance towards the system of care, rather than the individual members of a population, may be a promising avenue for impactful research.
Finally, in addition to the equity implications described above, the use of 'Big Data' in the public sector creates additional challenges with respect to management, quality, ethical and privacy concerns (15,59). Key challenges around information dissemination, such as maintaining privacy and information security, directing relevant information to maximize impact and prevent information overload, and fully understanding the ethical implications of data collection and use, should all be further explored in research and policy discussions on data management.

Conclusion
This review clarified methods of collecting SDOH data for EHRs, which are increasingly relevant inputs for effective health planning and promotion. Current practices predominantly involve embedded and structured SDOH screening tools in the EHR, although the use of free-text data may increase as NLP algorithms become available to health systems. As the comprehensive range of SDOH variables tend to be specific to given populations, applying SDOH data collection tools will need to take local context into consideration. This also speaks to the paradigmatic issue of engaging with SDOH data as dynamic constructs in the experience of care. Although there is considerable perceived potential for automating SDOH data collection in order to enhance health analytics, researchers and practitioners must attend to the implications for the stated health equity goals. Evidence-informed systems-level changes based on situated knowledge should be considered an end goal of SDOH data collection methodology, rather than a sole focus on individual or behaviour driven health promotion strategies. In conclusion, mobilizing EHRs to promote SDOH data collection is a step towards facilitating 'Big Data' analytics in health information systems; however, further interdisciplinary and participatory research is necessary in order to capitalize on SDOH data for equity-oriented health promotion.  
Financial ("Are you doing okay making ends meet?"), Food security ("In the past month, did anyone in your family go hungry because there was not enough money?"), Housing ("Are you having problems
Centricity Physician Office
emotional and tangible support systems have on health; (8) addiction, which encompasses the effects of alcohol, nicotine, and drug dependence both as a result of social inequality as well as a means of increasing its impact; (9) food, or how access to healthy foods can influence chronic disease management and progression; and (10) transport, encompassing both the ability to arrive at appointments and walk/exercise in safe environments.
Tan-McGrory 2018: IOM SDOH domains, but adapted to local context and resources: 1) caregivers, 2) race and ethnicity, 3) language, 4) sexual orientation and gender identity, 5) disability, and 6) social determinants of health.
Residence, Living Situation, and Living Conditions. Flowsheet prompts: "home", "house", "housing", "residence", "live", "living", "lives", "people", "mold", "insect", "rodent", "water", "heat", "social", "density", "Stairs", "railings", "safety", "safe", "facility", "group home", "skilled nursing facility", "assisted living facility", "support system", "family", "support", "housing conditions", "caregiver", "bathroom", "community support", "rehab", "assistive device", "social/environment", "equipment", "social support", "household", "transitional care", "social connectedness", "live alone"
No specific EHR
Fairview Health System (FHS) EHR system
Figure 1