Completeness of Social and Behavioral Determinants of Health in Electronic Health Records: A case study on the Patient-Provided Information from a minority cohort with sexually transmitted diseases

Racially and ethnically diverse minorities experience the disease burden of sexually transmitted infections or diseases (STD) more often than their White counterparts. Yet little is known about the connection between STD and systematic discrimination, racism, and social and behavioral determinants. Moreover, few if any details exist on how this information is recorded in their Electronic Health Records (EHRs). The objective of this study is to assess the completeness of social and behavioral determinants of health (SDOH) data in the EHRs of a minority cohort with STD. A total of 2,993 minority patients diagnosed with an STD at the Mayo Clinic were identified for this study. A natural language processing (NLP) algorithm was applied to the Patient-Provided Information (PPI) in their EHRs to extract SDOH information in six domains that are associated with STD, namely alcohol use, substance use, sexual activity, sexual orientation, housing status, and employment status. The completeness of SDOH was assessed in terms of documentation, breadth, and density. Our study indicates that nearly half of the 2,993 patients did not have SDOH-related PPI records in their EHRs, whereas the patients who had SDOH-related PPI records had well-documented records for five out of six SDOH domains (alcohol use, substance use, sexual activity, housing status, and employment status), the exception being sexual orientation. A total of 1,504 patients had PPI in their EHRs for at least one of six SDOH domains, which is about 50.3% of the study cohort. Most SDOH domains have a short time span of 1 year, with up to 18 years of record data. Our analysis also indicated that education and age have a significant impact on the recording of SDOH-related PPI records. Patients who are female, older, and more highly educated tend to have more SDOH information available in their records. We assessed the completeness of SDOH information recorded in the PPI from patients’ EHRs.
Due to large amounts of missing SDOH information in the PPI, future research is needed to integrate accurate and robust SDOH related data for downstream research and the impact of systematic discrimination on how this information is collected and interpreted in the EHRs.


Introduction
The perspective that population health outcomes are notably influenced by complex, integrated, and overlapping social structures and economic systems is gaining recognition among scientists and public health professionals. These social structures and economic systems have been widely recognized as social determinants of health (SDOH). According to the World Health Organization (WHO), SDOH are the conditions in the places where people live, learn, work, and play that affect a wide range of health risks and outcomes. [1] SDOH influence people's access to adequate diet, educational and career opportunities, healthy environmental conditions, and medical care, either positively or negatively. [1] They have been identified as underlying factors for a wide range of diseases and for difficulty in care engagement, both of which are linked directly to health disparities. [2][3][4][5] SDOH are an important factor to consider in order to reduce expenditures due to hospital readmissions. [6][7] Identifying SDOH is key to increasing health equity and quality of life for patients historically impacted by SDOH. [8] Addressing SDOH improves healthcare drastically. [9] Various research studies therefore emphasize the importance of understanding SDOH and the need for interventions that address problems not only at the personal level but also at the population level, through policy changes that bring health equity. [10][11][12][13] Exploring a socioecological perspective with a concentration on SDOH is important for improving not only health but also health equity, by recommending policies that affect the global population, [14] so that people receive equitable and equal opportunities to improve their health. [1] Though a wide range of studies have used SDOH data to improve public health and the economy, [15] the completeness of the SDOH data used to perform these analyses has remained unexplored.
With the wide and rapid adoption of electronic health record (EHR) systems in healthcare organizations, there is increased emphasis on using EHRs to document SDOH information. Healthcare data collection has grown rapidly with the introduction of EHRs, and healthcare data has gained the utmost importance in many areas, such as research and payment, for performing significant analysis. Patient-provided information (PPI) is integrated with the Mayo Clinic EHR system and gathers information from patients systematically and electronically during many aspects of clinical practice, such as patient care, education, research, and business office functions. [16] PPI involves an intelligent system that automatically determines which questionnaire forms or information are needed and either mails paper forms or sends the forms electronically, through the patient portal or on a tablet, to be completed online or during an in-person clinical visit. PPI includes valuable information, for example family history, past medical history, allergies, and depression screening, and most importantly a large amount of information related to SDOH. PPI is a vital part of healthcare data that is critical not only for patient-centered care but also for identifying implicit SDOH factors that determine the health of a community or population. [17] However, the collection of patient-provided data may differ from traditional data collection by health providers with respect to response rate and data quality. It is therefore crucial to verify the data quality of SDOH information captured in PPI from the EHR, especially its completeness, because incomplete data might be useless for diagnosis or research or might lead to improper or inaccurate analysis outcomes. Assessing the completeness of SDOH in EHRs is fundamental to its primary or secondary use, yet it has rarely been studied.
This study aims to assess the completeness of SDOH information collected in PPI from EHRs by using a case study of a minority cohort with sexually transmitted diseases (STD). Racially and ethnically diverse communities experience systematic discrimination, which results in SDOH. Furthermore, this is exacerbated by the lack of prevention and treatment options in these communities. The disease burden is not explained by intrapersonal or community-level concerns but is often deep-rooted in systematic discrimination, lack of cultural awareness, and lack of access to prevention and treatment [18]. The purpose of this study is to recommend that SDOH research account for data completeness, and recognize its importance, when performing analyses. This study has the potential to help the research community focusing on SDOH to improve health equity and public health, as well as to assist the analysis of clinical management for certain diseases such as STD. [19]

Methods
In this case study, the cohort is a minority cohort (≥ 18 years old), including African American or Black, Hispanic, Asian, Native American, Pacific Islander, and non-White patients, who were seen at Mayo Clinic and diagnosed with STD from August 14, [...] tool is a scalable informatics framework that stores and queries the standardized clinical data from the Mayo Clinical Data Warehouse (CDW). We identified a total of 2,993 patients who had been diagnosed with an STD at Mayo Clinic. We then retrieved the PPI of this cohort in the same time range. Most PPI is in the form of questionnaires that were given to patients during their visits. This study was approved by the Mayo Clinic institutional review board (IRB) for human subject research (IRB #20-000908).
Out of a total of 1,647,061 records retrieved, 3,329 unique questions were found. We defined six SDOH domains that are related to STD: alcohol use, substance use, sexual activity, sexual orientation, housing status, and employment status, as shown in Table 1. We developed a rule-based natural language processing (NLP) algorithm to identify questions related to these six SDOH domains. The NLP algorithm used a list of keywords to identify the questions in each SDOH topic. The details of this algorithm can be found in the supplemental materials. With the NLP algorithm, we identified and extracted a total of 125 SDOH-related questions. We then manually excluded 7 unique questions that were not relevant to the SDOH domains, such as "Are you interested in more information about safety (seat belts, smoke detectors, firearms)?" A final dataset containing 118 unique SDOH-related questions (42 for alcohol use, 50 for substance use, 16 for sexual activity, 2 for sexual orientation, 2 for housing status, and 6 for employment status) was used in this study. Because much SDOH information is recorded with no answers or invalid answers, we also evaluated valid SDOH records (i.e., SDOH questions with valid answers). In addition, we assessed the association between SDOH domains and patients' gender, age, education level, and living area using a t-test. The living area is defined as urban, suburban, or rural based on patients' zip codes. This assessment shows whether any of the demographics influence the completeness of the SDOH information in the EHRs.
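The keyword-based question selection described above can be illustrated with a minimal sketch. It is not the study's algorithm, whose details are in the supplemental materials: the keyword lists, the function names `match_domains` and `select_questions`, and all example questions except the quoted safety question are assumptions made for illustration. The sketch also assumes, for illustration only, that the safety question was flagged through a smoking-related keyword, which would explain why a manual exclusion step is needed after keyword matching.

```python
import re

# Hypothetical keyword lists, one per SDOH domain. The study's actual
# keywords are given in its supplemental materials and are not reproduced here.
DOMAIN_KEYWORDS = {
    "alcohol use": ["alcohol", "drink", "beer", "wine"],
    "substance use": ["drug", "tobacco", "smok", "marijuana"],
    "sexual activity": ["sexually active", "sexual partner", "condom"],
    "sexual orientation": ["sexual orientation"],
    "housing status": ["housing", "homeless"],
    "employment status": ["employ", "unemploy", "occupation"],
}


def match_domains(question):
    """Return the domains whose keywords start a word in the question."""
    text = question.lower()
    return [
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(k), text) for k in keywords)
    ]


def select_questions(questions, manually_excluded=()):
    """Keep questions that match at least one domain and were not excluded by hand."""
    selected = {}
    for question in questions:
        domains = match_domains(question)
        if domains and question not in manually_excluded:
            selected[question] = domains
    return selected


# Illustrative questions; only the safety question is quoted from the text.
alcohol_q = "How many drinks containing alcohol do you have on a typical day?"
safety_q = ("Are you interested in more information about safety "
            "(seat belts, smoke detectors, firearms)?")
employment_q = "What is your current employment status?"
allergy_q = "Do you have any allergies?"
questions = [alcohol_q, safety_q, employment_q, allergy_q]

flagged = select_questions(questions)  # keyword matches only
final = select_questions(questions, manually_excluded={safety_q})  # after manual review
```

In this toy run the safety question is flagged because "smoke" begins with a smoking-related keyword, and the manual exclusion step removes it, mirroring the two-stage selection (automatic matching, then manual review) described in the text.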

Results
We first present the demographics of the minority cohort in Table 2. Male and female patients are reasonably balanced in the study cohort, with slightly more female patients. The largest minority group in our cohort is Black or African American. The cohort is relatively young, with 34% aged 18-34 years, which reflects the disease under study. Education status is unknown for more than half of patients (56.6%). Among those with a known education status, most patients (31.8%) are college-educated (e.g., Some College or 2 year degree, Four year college graduate, or Post graduate studies). 36.8% of patients live in suburban areas. The education and living area are consistent with regional statistics in the Rochester area [23].

For individual SDOH domains, all 1,504 patients had recorded SDOH questions on alcohol use, followed by substance use, whereas only 85 patients had recorded questions on sexual orientation. Interestingly, most of the patients who had recorded SDOH questions had answered most of the questions in their records (i.e., valid records).
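The documentation counts reported here can be tallied along the following lines. This is a minimal sketch under stated assumptions, not the study's code: the record layout `(patient_id, domain, answer)`, the rule that a blank or missing answer is not valid, and the function name `documentation_completeness` are all hypothetical. Breadth and density are not covered because their definitions are not given in this section.

```python
from collections import defaultdict


def documentation_completeness(records, cohort_size):
    """Tally patients with recorded and with validly answered SDOH questions.

    records: iterable of (patient_id, domain, answer) tuples (assumed layout).
    An answer of None or an empty/blank string counts as not valid (assumption).
    Returns (share of cohort with any SDOH record, per-domain summary).
    """
    recorded = defaultdict(set)  # domain -> patients with at least one question
    valid = defaultdict(set)     # domain -> patients with at least one valid answer
    any_domain = set()           # patients with a record in any domain
    for patient_id, domain, answer in records:
        recorded[domain].add(patient_id)
        any_domain.add(patient_id)
        if answer is not None and str(answer).strip():
            valid[domain].add(patient_id)
    per_domain = {
        domain: {
            "recorded": len(patients),
            "valid": len(valid[domain]),
            "share_of_cohort": len(patients) / cohort_size,
        }
        for domain, patients in recorded.items()
    }
    return len(any_domain) / cohort_size, per_domain


# Toy data for a hypothetical cohort of 4 patients (not study data).
toy_records = [
    (1, "alcohol use", "Yes"),
    (1, "alcohol use", ""),
    (2, "alcohol use", None),
    (2, "housing status", "Own"),
    (3, "sexual orientation", "  "),
]
overall_share, per_domain = documentation_completeness(toy_records, cohort_size=4)

# The cohort-level figure in the text follows the same arithmetic:
reported_share = round(1504 / 2993 * 100, 1)  # percent of cohort with any SDOH-related PPI
```

Keeping "recorded" and "valid" as separate counts matches the distinction drawn in the text between patients who were asked a question and patients who gave a usable answer.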

Association
The association between the six SDOH domains and patients' demographics is shown in Table 5. Since gender and biological sex were identical for all except one patient, we used only the gender information. We find that all SDOH domains have a significant association with gender, age, education level, and living area, except for sexual orientation and living area. Tables S5, S6, S7, and S8 in the supplemental materials list the number of patients for each SDOH domain in terms of gender, age, education level, and living area, respectively. Female patients have more SDOH information recorded in their EHRs than male patients. This is consistent with the literature [22]. Table S6 shows that the majority of the documentation was found in the 18-34 age group, with a smaller percentage in the 75-84 and > 85 age groups. From Table S7, we see that the majority of the documentation was found in the some college or 2-year degree group, whereas the least documentation was found among patients with an 8th grade or lower education level. Since most patients in our cohort lived in suburban areas, Table S8 shows a limited comparison between living areas. Substance use was well documented in most living areas, whereas sexual orientation was the least documented SDOH topic. This is consistent with the documentation completeness results. However, the impact of geographical location on SDOH documentation must be explored further, since the population size varied greatly among urban, suburban, and rural areas in our study. In conclusion, patients who are female, older, and more highly educated tend to have more SDOH information available in their records.

Discussion
This is the first study, to the best of our knowledge, that assesses the completeness of SDOH records in the PPI from EHRs. We used a minority cohort of 2,993 patients with STDs. Our results show that nearly half of the patients did not have SDOH-related PPI recorded in their EHRs, even though SDOH information is vital to STD.
There are many underlying reasons. One of them might be that clinical guidelines and the questionnaires used during clinical visits have varied over time. The lack of SDOH-related PPI in EHRs indicates that an additional layer of data completeness evaluation should be considered before using PPI in downstream analysis and research for public health. Among the patients who had recorded SDOH-related PPI, however, five out of six SDOH domains were complete, the exception being sexual orientation. This result indicates that alternative data resources in the EHR (e.g., clinical notes) should be considered for extracting sexual orientation and that more sophisticated methods (e.g., NLP) should be used.
SDOH-related PPI records were more frequently recorded among females than males. This might be because females usually have more health issues than males, leading to more hospital visits and a larger number of EHRs, which is consistent with the literature. [24] Most of the SDOH-related PPI was recorded and answered by well-educated people. This might reflect a greater awareness of the medical and social importance of PPI, whereas this awareness might be lower in the less-educated population. It is therefore important to find ways to communicate the importance of PPI to less-educated patients, as well as to the rural population, since the number of patients who recorded PPI in rural areas was also significantly lower than in urban and suburban areas. Another major factor behind missing PPI records might be the quantity of PPI questionnaires and/or the way the questionnaires are displayed to patients, which might make them tiresome and leave patients uninterested in filling them out during the waiting time before their interaction with the physician. This may apply specifically to elderly patients.
This study provides a useful conclusion: the SDOH information stored in structured PPI from EHRs has significant information gaps, which makes its secondary use in population-level epidemiological research less than straightforward. The information gap might also be due to the various formats in which existing SDOH information is held across different resources in EHRs. There is a need for universal data standards to appropriately capture, retrieve, and analyze SDOH data. [25] It is also important to design methods, such as NLP and machine learning (ML), to collect SDOH data from other sources, such as the clinical notes in EHRs.
This study has several limitations related to the data and the level of assessment of SDOH domains. We used PPI data from a single institution. Hence, the results may not hold if generalized to PPI data from other institutions. The assessment was performed at the topic level (i.e., alcohol use related, substance use related, etc.), and we did not examine the intricacies of each question, such as which types of questions patients answer less often and why. We were unable to examine the link between systematic discrimination and SDOH in this study. It is documented that some efforts are in place to address STD health disparities for Black people, but this limited effort has not been evaluated in all minority communities [18].
For future work, this study opens up research in several areas, such as developing a natural language processing (NLP) based approach to derive sexual orientation information from clinical notes, since the availability of structured data is very low. This approach to evaluating data completeness can also be extended to PPI for other conditions, such as depression, beyond the SDOH domains evaluated in this study. The SDOH outcomes could help in designing policies and strategies that remove barriers to health equity and provide better patient care for individuals. [26]

Conclusion
SDOH plays a vital role in research studies that seek ways to bring health equity to the population. Our study indicates that nearly half of the 2,993 patients did not have SDOH-related PPI records in their EHRs, whereas the patients who had SDOH-related PPI records had well-documented records for five out of six SDOH domains (alcohol use, substance use, sexual activity, housing status, and employment status), the exception being sexual orientation. A total of 1,504 patients had PPI in their EHRs for at least one of six SDOH domains, which is about 50.3% of the study cohort. Most SDOH domains have a short time span of 1 year, with up to 18 years of record data. Our analysis also indicated that education and age have a significant impact on the recording of SDOH-related PPI records. Patients who are female, older, and more highly educated tend to have more SDOH information available in their records. The impact of geographical location on SDOH documentation has to be explored further, as our study cohort is imbalanced among urban, suburban, and rural areas. It is also important to consider the ethnoracial factors linked directly to systematic discrimination and lack of healthcare.
Due to large amounts of missing SDOH information in the PPI, future research is needed to collect and integrate accurate and robust SDOH related data for downstream research from other EHR resources, such as the unstructured clinical notes, to have a complete picture of SDOH variables for patients.

Declarations
Ethics approval and consent to participate
This study was a retrospective study of existing records. The study and a waiver of informed consent were approved by Mayo Clinic Institutional Review Board in accordance with 45 CFR 46.116 (Approval #20-000908).

Consent to publish
Not applicable; the manuscript does not contain individual-level data.