DOI: https://doi.org/10.21203/rs.3.rs-2618841/v1
The use of routinely collected health data for secondary research purposes is increasingly recognised as a methodology that advances medical research, improves patient outcomes, and guides policy. This secondary data, as found in electronic medical records (EMRs), can be optimised through conversion into a common data model to enable analysis alongside other comparable health metric datasets. This can be achieved using a model such as the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM). The OMOP-CDM is a data schema that uses a standardised vocabulary for the systematic analysis of multiple distinct observational databases. The concept behind OMOP is the conversion of data into a common format through the harmonisation of terminologies, vocabularies, and coding schemes within a unique repository. The OMOP model enhances independent institutional research capacity through the development of shared analytic and prediction techniques; pharmacovigilance for the active surveillance of drug safety; and ‘validation’ analyses across multiple institutions in Australia, the United States, Europe, and the Asia Pacific. In this research, we aim to investigate the use of the open-source OMOP-CDM in a primary care data repository.
We used structured query language (SQL) to construct extract, transform, and load (ETL) scripts that converted the data into the OMOP common data model. The volume of distinct free-text terms from each unique EMR presented a mapping challenge. Up to 10% of the source terms had an exact text match to the SNOMED CT, RxNorm, and LOINC standard vocabularies. As part of the manual mapping process for terms that did not have an exact match, an a priori decision rule provided a cut-off value for terms that occurred with a low frequency. Based on this frequency threshold, over 95% of the unmapped terms were mapped manually. To assess the data quality of the resultant OMOP dataset, we applied the OHDSI Data Quality Dashboard.
Across three primary care EMR systems, we converted data on 2.3 million active patients to version 5.4 of the OMOP common data model. The Data Quality Dashboard was used to check data plausibility, conformance, and completeness. In all, 3,570 checks were performed, each organised within the Kahn framework. For each check, the result was compared to a threshold whereby a FAIL is any percentage of violating rows falling above a predetermined value. The overall pass rate of the primary care OMOP database described here was 97%.
Given the OMOP-CDM’s wide-scale international usage and the support and training available, it is an opportune way to standardise data for collaborative use. Furthermore, it is easy to share analysis packages between research groups. This allows the rapid and repeatable comparison of data between groups and countries. A full suite of open-source tools is available to support the common data model. For instance, the OHDSI Data Quality Dashboard proved especially useful in examining the quality of our data. The simplicity of the common data model and the standards-based approach make it an easy model to adopt and integrate into existing data acquisition and processing procedures.
The use of routinely collected health data for secondary research purposes is increasingly recognised as a methodology that advances medical research, improves patient outcomes, and guides policy [1–4]. This secondary data, as found in electronic medical records (EMRs), can be optimised through conversion into a common data model (CDM), such as the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM), to enable analysis alongside other comparable health metric datasets [4–6]. The OMOP-CDM, otherwise referred to as OMOP, is a data schema that uses a standardised vocabulary for the systematic analysis of multiple distinct observational databases [7]. The concept behind OMOP is the conversion of data into a common format through the harmonisation of terminologies, vocabularies, and coding schemes within a unique repository (database), as shown in Fig. 1 [7]. The OMOP model enhances independent institutional research capacity through the development of: shared advanced analytic and prediction techniques; pharmacovigilance for the active surveillance of drug safety; and ‘validation’ analyses across multiple institutions in Australia, the United States, Europe, and the Asia Pacific [7]. In this research, we aim to investigate the use of the open-source OMOP-CDM in a primary care data repository [8].
Figure 1. OMOP Common Data Model Architecture (adapted from [7])
The primary purpose of an EMR is to record information related to patient care as it naturally occurs in the clinical setting [9]. The use of this data in medical research is a secondary, albeit useful, function, as it provides the opportunity to establish ‘real world’ evidence on patient outcomes, healthcare quality, comparative effectiveness, and health system policy [10]. Yet the quality of data recorded in an EMR varies in its completeness, conformance, plausibility, and currency, hence it is imperative that a measure of its quality is ascertained to determine its ‘fitness’ for research purposes. Kahn et al. [11] set out a comprehensive set of quality checks that has become the de facto standard and is now widely applied to datasets across the globe. These are the standards used to ascertain data quality in our study.
The dataset was sourced from a data warehouse, the Primary Care Audit, Teaching and Research Open Network (PATRON) program, curated by the University of Melbourne [8]. The database collects de-identified EMR data from over 129 Australian general practices, chiefly in Victoria. The repository comprises over 700 consenting GPs who work in general practices that use the Best Practice™, Medical Director™, and ZedMed™ proprietary EMR systems (Table 1).
EMR data are extracted from these systems using the data extraction tool GRHANITE™ [12], and the data are then sent via encrypted transmission into the repository. The GRHANITE™ tool de-identifies each patient by replacing the patient’s name with a unique patient identifier that links the patient to the individual visit data in each patient table [12]. Identifiers including patient address, date of birth, Medicare number, and general practitioner/staff member are either removed or de-identified prior to extraction to the data repository.
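GRHANITE™'s internal algorithm is proprietary, but the general idea of replacing identifying fields with a stable pseudonymous identifier can be sketched as follows. This is a minimal illustration using a salted hash; the function, field choices, and salt handling are hypothetical, not GRHANITE™'s actual method.

```python
import hashlib

def pseudonymise(name: str, dob: str, salt: str) -> str:
    """Illustrative only: derive a stable pseudonymous patient ID from
    identifying fields plus a secret site salt, so the same patient receives
    the same ID across visit tables without the name leaving the practice."""
    digest = hashlib.sha256(f"{salt}|{name}|{dob}".encode("utf-8")).hexdigest()
    return digest[:16]

# The same inputs always map to the same pseudonym...
a = pseudonymise("Jane Citizen", "1980-01-01", "site-secret")
b = pseudonymise("Jane Citizen", "1980-01-01", "site-secret")
# ...while different patients receive unlinkable identifiers.
c = pseudonymise("John Citizen", "1955-06-30", "site-secret")
```

The key property is that the pseudonym is deterministic within a site (so visit records remain linkable) while the identifying fields themselves are never transmitted.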
Each EMR system held in the primary care repository has data structures that are unique. Hence, to facilitate use of the whole database, the data from each system are harmonised to provide consistency, where possible. For instance, to provide a standardised version to the database, all data pertaining to ‘patient history’ from each EMR are merged into a single table, and likewise information relating to ‘medications prescribed’ are also merged into a table. Whilst data standardisation provides a single unified view to simplify researcher use, no data is lost in this harmonisation process (see Fig. 2).
EMR | URL | Approximate Percentage of Practices |
---|---|---|
Medical Director (MD) | https://www.medicaldirector.com | 40% |
Best Practice (BP) | https://bpsoftware.net | 50% |
Zedmed | https://www.zedmed.com.au | 10% |
Figure 2. EMR Harmonisation Process
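As a minimal sketch of this harmonisation step (the table and column names are hypothetical stand-ins, not the actual PATRON schema), per-EMR tables can be merged into one unified table while tagging each row with its source system:

```python
import sqlite3

# Illustrative harmonisation: separate 'history' tables from two EMR
# systems are merged into a single unified table, with a column recording
# which system each row came from, so no data is lost in the merge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bp_history (patient_id TEXT, item TEXT);
CREATE TABLE md_history (patient_id TEXT, item TEXT);
INSERT INTO bp_history VALUES ('p1', 'Asthma');
INSERT INTO md_history VALUES ('p2', 'Hypertension');

CREATE TABLE history AS
SELECT patient_id, item, 'BP' AS source_emr FROM bp_history
UNION ALL
SELECT patient_id, item, 'MD' AS source_emr FROM md_history;
""")
rows = conn.execute(
    "SELECT source_emr, item FROM history ORDER BY source_emr").fetchall()
```

A `UNION ALL` (rather than `UNION`) preserves every source row, consistent with the statement that no data is lost during harmonisation.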
Primary care EMR systems contain free-text terms and use limited proprietary coding, such that the coding is unique to each EMR vendor. Therefore, one of the challenging aspects of extracting meaningful data is mapping the free-text data to numerical codes in vocabularies such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED) and RxNorm (a United States medication terminology). It is important to note that SNOMED is considered the ‘standard terminology’ for conditions in the OMOP-CDM; similarly, RxNorm is considered the ‘standard’ for medications. An added benefit of using these vocabularies to make data within the repository ‘OMOP ready’ was that the mapping produced a repository aligned to international standards.
The mapping process was facilitated with the use of a tool called USAGI, which was developed by a multi-stakeholder international collaborative body, Observational Health Data Sciences and Informatics (OHDSI) [7]. The USAGI tool converts text terms that have been extracted from the data warehouse into standardised SNOMED and RxNorm terms. Although USAGI is a valuable aid, mapping is still a highly manual process that requires input from data mappers who have specific domain knowledge. We employed three final-year medical students to undertake data mapping, as they provided the required domain knowledge for this work. The student mappings were undertaken independently, and the mappings were matched in pairs using Excel to determine concordance. There was greater than ninety percent agreement in the student mappings; where there was a discrepancy, clinical input from a physician was provided. The volume of distinct free-text terms from each unique EMR presented challenges. For example, after cleaning, there were 96,000 distinct medication terms, consisting of some combination of a drug’s brand name, generic name, form, strength, and packet size. Up to 10% of the source terms had an exact text match to the SNOMED CT, RxNorm, and LOINC standard vocabularies. As part of the manual mapping process for terms that did not have an exact match, an a priori decision rule provided a cut-off value for terms that occurred with a low frequency. Based on this frequency threshold, over 95% of the unmapped terms were mapped manually. There was no direct match of a conditions table in the EMR to the ‘condition occurrence’ table in the OMOP-CDM. Whilst the CDM has one ‘condition occurrence’ table that is expected to contain the matching SNOMED term relating to a patient’s condition, the EMR has ‘reason for visit’ and ‘history’ tables that provide data relating to a patient’s current condition, past condition, and clinical observations.
If an entry was accompanied with a date, we deemed it to be a current condition and therefore recorded it as a ‘condition occurrence’ in the OMOP table. When a condition did not have an associated date, we considered it as a past observation and therefore recorded it as an ‘observation’ in the OMOP-CDM.
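This date-based decision rule can be expressed as a small routing function (an illustrative sketch; the function name is ours, not part of the ETL scripts):

```python
def route_entry(term: str, entry_date):
    """Decision rule described above: a dated entry is treated as a current
    condition and routed to CONDITION_OCCURRENCE; an undated entry is
    treated as a past observation and routed to OBSERVATION."""
    table = "CONDITION_OCCURRENCE" if entry_date else "OBSERVATION"
    return (table, term)

# A dated asthma entry is routed to the conditions table,
# while an undated 'ex-smoker' entry is recorded as an observation.
dated = route_entry("Asthma", "2021-03-02")
undated = route_entry("Ex-smoker", None)
```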
We constructed tables from the mappings produced by the mapping team. These mapping tables were used in the extract, transform, and load (ETL) process to convert the EMR data to the OMOP-CDM. The term ETL explicitly refers to the ‘extraction’ of data from a source system, where it is ‘transformed’ into a coded value as prescribed by the OMOP model, and then ‘loaded’ into the OMOP-CDM database. We created and executed SQL scripts in the ETL process. SQL (structured query language) provides a series of syntactic commands to manage and transform data within a database.
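A minimal illustration of this ETL pattern, using an in-memory SQLite database: the table names, the source term, and the concept ID are hypothetical stand-ins, not the actual PATRON scripts or vocabulary content.

```python
import sqlite3

# Sketch of the ETL step: source free-text terms are joined against a
# mapping table (of the kind built with USAGI) to transform them into
# standard concept IDs, which are then loaded into the CDM table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_history (patient_id INTEGER, term TEXT, entry_date TEXT);
CREATE TABLE term_map (term TEXT, concept_id INTEGER);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER,
    condition_start_date TEXT);

INSERT INTO src_history VALUES (1, 'asthma', '2021-03-02');
INSERT INTO term_map VALUES ('asthma', 317009);  -- hypothetical concept ID

-- Extract from the source table, transform via the mapping table,
-- and load into the OMOP destination table in one statement.
INSERT INTO condition_occurrence
SELECT s.patient_id, m.concept_id, s.entry_date
FROM src_history s JOIN term_map m ON s.term = m.term;
""")
loaded = conn.execute("SELECT * FROM condition_occurrence").fetchall()
```

Because the transformation is a single declarative join, the same script can be re-run unchanged on each new data extraction, which matches our experience of the process being simple to repeat.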
The Royal Australian College of General Practitioners (RACGP) uses active patients as a target group or denominator for reporting. A recognised RACGP definition of active patients was applied to the data set to improve data quality [13]. Under the RACGP definition, an inactive patient is one who has not attended a practice three or more times over the past two years [13]. Initially, this ‘inactive’ definition was applied to the dataset and the inactive patient records were excluded from the analysis. However, we found that adherence to this definition resulted in the exclusion of new patients with only one visit, so we also included patients with at least one visit over the last two years in the data set.
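The resulting active-patient rule (three or more visits within any two-year window, or at least one visit in the last two years, as summarised in Table 3) can be sketched as follows; this is an illustration only, with a fixed reference date assumed for reproducibility.

```python
from datetime import date

def is_active(visit_dates, today=date(2022, 1, 1)):
    """Active-patient rule as described in the text: three or more visits
    within any two-year window, OR at least one visit in the last two years
    (two years approximated as 730 days)."""
    ds = sorted(visit_dates)
    recent = any((today - d).days <= 730 for d in ds)
    # Any three consecutive sorted visits spanning <= 2 years satisfy
    # the 'three visits within a two-year period' criterion.
    three_in_two_years = any(
        (ds[i + 2] - ds[i]).days <= 730 for i in range(len(ds) - 2))
    return recent or three_in_two_years
```

Note how a new patient with a single recent visit is classed as active under the extended rule, even though the strict RACGP definition would exclude them.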
ATLAS is an open-source Java application that allows visualisation and analysis of OMOP datasets; it is used as a standard across the OMOP community [14]. Once the primary care data was converted into an OMOP-compliant format, it was securely connected to the ATLAS application for data visualisation and analysis purposes.
OHDSI has a standard set of checks called ‘Achilles’. These checks run a set of SQL queries that check data compliance with the OMOP-CDM standard [15]. The OHDSI Achilles checks were performed on the OMOP data set prior to connection to ATLAS. We also applied the Kahn Data Quality Framework [11] to the data set using the OHDSI Data Quality Dashboard [16]. The data quality dashboard runs a series of scripts in R Studio and produces its results in R Shiny. It has a complete set of predefined quality checks preconfigured to run on conformant OMOP datasets. The dashboard summarises the data and provides an overall quality metric expressed as a percentage.
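The Data Quality Dashboard itself is an R tool; purely to illustrate how each individual check is scored, the threshold logic it applies can be sketched as (an illustrative function of ours, not the dashboard's actual code):

```python
def run_check(violating_rows: int, total_rows: int, threshold_pct: float):
    """Sketch of how each Data Quality Dashboard check is scored: the
    percentage of rows violating the check is compared to a configurable
    threshold, and exceeding that threshold is a FAIL."""
    pct = 100.0 * violating_rows / total_rows
    return {"pct_violated": pct,
            "status": "FAIL" if pct > threshold_pct else "PASS"}

# 5 violating rows out of 1,000 (0.5%) passes a 1% threshold;
# 50 violating rows (5%) fails the same check.
ok = run_check(5, 1000, 1.0)
bad = run_check(50, 1000, 1.0)
```

This is also why adjusting the thresholds for local circumstances (discussed later) changes pass/fail outcomes without changing the underlying data.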
All governance procedures we use are underpinned by the concept of beneficence, ‘to do no harm’ [17]. For research using the OMOP-CDM, a data governance framework has been advanced, building on the existing comprehensive framework implemented for the primary care repository (Fig. 3), which included topics such as consent, privacy, and risk management [18].
Medical practices provided consent for their practice data to be accessed for research purposes via the primary care repository. Practices are also informed they can change their consent options or withdraw at any time, without prejudice. Regarding individual patient consent for the secondary use of their EMR data, a waiver of consent is applied. Practices inform patients their data is used for research using various communication strategies (i.e., on practice websites and practice posters), they are also informed they can withdraw consent at any point.
Our data repository contains only de-identified patient data. The GRHANITE™ data extraction tool de-identifies patient data in the practice, ensuring only de-identified data is sent to our primary care repository.
We carried out a structured risk assessment of the entire process considering privacy, organisational, and technical risks (Table 2).
Identified Risks | Risk Controls | Risk Rating |
---|---|---|
An authorised user discloses data to a third party. | Development of an access control policy based on researcher compliance with ethical, legal, and regulatory obligations related to privacy, data management, and data security. Access to data on the OMOP platform is not provided if ethics approval is not verified. | Low |
Hackers attack system. | Application of strong passwords and multifactor authentication. System not connected to public network. Immediate notification of all data privacy/security breaches to the University of Melbourne to mitigate cybersecurity attack. | Low |
Individual data is identified in the dataset. | Unique hash ID identifiers cannot be accessed via OMOP-ATLAS interface. Data aggregation is possible at any level so that only required data is exported. Preview data that is due to be transmitted to researchers. | Low |
Data accessible to researchers outside approved institutions. | Currently, only authorised researchers are permitted to analyse OMOP data, and OMOP operates within the University of Melbourne environment. | Low |
Data changes or becomes corrupt. | Version control where the most recent database is always held as back up. System identification of data extract failures or omissions for immediate notification to engineers. | Low |
Researcher uses data for purposes beyond their ethics permissions. | User requires ethical approval and training to access the data. User accepts professional responsibility to adhere to boundaries of permissions. Users restricted by OMOP- database access permissions. External users are restricted by contractual agreement regarding their use of the data. Internal University staff are restricted by employment conditions and Memorandum of Understanding regarding their use of the data. | Medium |
Figure 3. The Governance Model
Across the three EMR systems we collected data on circa 5.6 million patients. The results were harmonised and converted to version 5.4 of the OMOP Common Data Model (Table 3).
Data Metric | Number | Comments |
---|---|---|
Number of Patients in dataset | 5,564,425 | Total database pool (Patients table) |
Number of active patients in dataset | 2,029,961 | Number of patients in database with at least 3 visits within any 2-year period, or with at least one visit in the last 2 years. |
Number of active patients in dataset with record of gender | 2,023,161 | A patient’s gender is one of: • Female (1,086,934, 53.7%) • Male (924,140, 45.7%) • Not Recorded (11,526, 0.6%) • Other (494, < 0.1%) • Unknown (67, < 0.1%) |
Number of clinical tables | 15 clinical tables, but no data recorded in NOTE, NOTE_NLP, VISIT_DETAIL | As per OHDSI CDM definition there are 15 clinical tables, 3 health system tables, 2 health economics tables, 5 derived tables and 2 metadata tables. An additional 12 vocabulary tables exist that are prepopulated in the CDM. |
Source database size | 1.38 terabytes (TB) | Original SNAPSHOT: 1,383,296 megabytes (MB). Relevant views rendered as tables (OMOP_SNAPSHOT_INSTANCE): 778,193 MB |
CDM database size (after conversion) | 0.37 TB | OMOP_CDM 368,512 MB |
If a term appeared in the EMR above a frequency threshold level, it was mapped. The threshold occurrence level was determined on a table-by-table basis (Table 4).
EMR Source table | Frequency threshold | CDM Output data tables Destination table depends on data context |
---|---|---|
Reason for Visit | 50+ | CONDITION_OCCURRENCE (if a condition with a date) OBSERVATION (if an observation) |
History | 100+ | CONDITION_OCCURRENCE (if recorded with a date) OBSERVATION, MEASUREMENT (where no date) PROCEDURE_OCCURRENCE (if data is a procedure) DEVICE_EXPOSURE (if data is a device) |
Medications | 200+ | DRUG_EXPOSURE (if data is a drug) DEVICE_EXPOSURE (if data is a device) |
Immunisations | 5+ | PROCEDURE_OCCURRENCE DRUG_EXPOSURE |
Allergic reactions | 20+ | Stored as observation in OBSERVATION table |
Tests | 300+ | DEVICE_EXPOSURE, MEASUREMENT PROCEDURE_OCCURRENCE data |
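The frequency-threshold rule in Table 4 amounts to a simple filter over term counts; a sketch (the function name is ours) of how candidate terms for manual mapping could be selected:

```python
from collections import Counter

def terms_to_map(term_occurrences, threshold):
    """Frequency-based mapping rule: only terms occurring at least
    `threshold` times in the source table are queued for manual mapping;
    rarer terms are deferred (and can be mapped case-by-case if needed)."""
    counts = Counter(term_occurrences)
    return sorted(t for t, n in counts.items() if n >= threshold)

# With the 200+ threshold used for medications, a term seen 250 times is
# queued for mapping while a term seen only 10 times is deferred.
queued = terms_to_map(["amoxicillin"] * 250 + ["rare compound"] * 10, 200)
```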
To illustrate the frequency-based approach to mapping, Table 5 shows the numbers for the terms that appear in the medications tables of the EMRs.
Category | Number | Term Coverage |
---|---|---|
Distinct drug terms in EMR, post-cleaning | 96,212 | 100% |
Terms with incidence 200+ | 10,051 | 96.8% (49,460,151 out of 51,071,733 drug exposure records total) |
Mapped terms (includes some with incidence < 200, including mappings inherited from OMOP conversions performed by University of NSW, or those found via direct text match) | 30,010 | 96.3% (49,193,190 out of 51,071,733 drug exposure records total) |
Unmapped terms (includes some with incidence > 200 where the term is insufficiently precise e.g., yearly influenza vaccinations, which often don’t specify the specific formulation and thus can’t be mapped to the appropriate RxNorm concept) | 66,202 | 3.7% (1,878,543 out of 51,071,733 drug exposure records total) |
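The term-coverage percentages in Table 5 can be reproduced directly from the record counts: coverage is the share of all drug exposure records accounted for by a given set of terms.

```python
# Record counts taken from Table 5.
TOTAL_RECORDS = 51_071_733

def coverage(records_covered: int) -> float:
    """Percentage of all drug exposure records covered, to one decimal."""
    return round(100 * records_covered / TOTAL_RECORDS, 1)

high_freq = coverage(49_460_151)  # terms with incidence 200+
mapped = coverage(49_193_190)     # all mapped terms
unmapped = coverage(1_878_543)    # remaining unmapped terms
```

This makes the pragmatic trade-off explicit: although only 10,051 of 96,212 distinct terms exceed the frequency threshold, they account for nearly 97% of all drug exposure records.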
The OHDSI data quality dashboard (DQD) produces metrics on data Plausibility, Conformance and Completeness. It performs over 3,000 checks on the data.
The DQD tool goes table by table and field by field to quantify the number of records in a CDM that do not conform to the given specifications. In all, 3,570 checks are performed, each organised within the Kahn framework [11]. For each check, the result is compared to a threshold whereby a FAIL is any percentage of violating rows falling above that value. The results for the primary care OMOP database are presented below (Table 6).
Category | Verification Pass | Verification Fail | Verification Total | Verification % Pass | Validation Pass | Validation Fail | Validation Total | Validation % Pass | Total Pass | Total Fail | Total | % Pass |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Plausibility | 1982 | 53 | 2035 | 97% | 285 | 2 | 287 | 99% | 2267 | 55 | 2322 | 98% |
Conformance | 746 | 30 | 776 | 96% | 157 | 0 | 157 | 100% | 903 | 30 | 933 | 97% |
Completeness | 289 | 14 | 303 | 95% | 6 | 6 | 12 | 50% | 295 | 20 | 315 | 94% |
Total | 3017 | 97 | 3114 | 97% | 448 | 8 | 456 | 98% | 3465 | 105 | 3570 | 97% |
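The overall 97% pass rate follows directly from the per-category pass/fail counts (verification and validation combined):

```python
# Combined (verification + validation) check counts per Kahn category,
# given as (pass, fail) pairs.
totals = {
    "Plausibility": (2267, 55),
    "Conformance": (903, 30),
    "Completeness": (295, 20),
}
passed = sum(p for p, _ in totals.values())
failed = sum(f for _, f in totals.values())
pass_rate = round(100 * passed / (passed + failed))
```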
The OMOP common data model proved to be a logical extension of our existing data warehouse. Once the tables and ETL scripts were established, it proved simple to re-run the process every time we ran a new data extraction. As the CDM is primarily based around SQL, it integrated well into our existing processes. SQL is a commonly used query language, so the conversion work was resourced with existing staff, supplemented by some online training from OHDSI and the European Health Data Evidence Network (EHDEN) Academy [19].
Once connected to ATLAS, the OMOP clinical tables and the derived tables were straightforward to work with. The pre-configured dashboards in ATLAS made it quick and easy to visualise data. However, the real strength of ATLAS lies in the ability to quickly design cohorts and studies.
One of the key advantages of the CDM is that it allows network studies to be carried out across centres. ATLAS allows analysis packages to be designed by research groups and then shared with other groups. These analysis packages can then be imported and run on local OMOP datasets. This allows standardised analyses to be run across research groups, enabling the direct comparison of de-identified patient data between regions and countries without data ever having to leave the source repository. This is advantageous for governance and security purposes. Groups that govern data access are reassured if the data does not have to leave their organisations; the only information that leaves is aggregated data. From a security perspective, existing secure data repositories can be used to store and analyse the data, with no need to secure the data elsewhere.
One significant challenge we encountered was the large volume of free-text terms found in local EMR systems. We noted that the source of truth was the text: it is the text the general practitioner sees, not the code value. In some cases, textual descriptions did not match the allocated codes and the codes were clearly erroneous. This made the task of mapping the terms time consuming, with the potential to introduce errors. The ETL scripts to convert terms to the OMOP-CDM can also be complex, so we found it important to document the scripts to ensure future maintainability.
A good understanding of data quality is central to the use of any dataset. It is especially important for OMOP datasets because the data is derived from source data, which has the potential to mask data quality issues that only become apparent when comparing data to other datasets. Our data quality was assessed with the OHDSI Data Quality Dashboard. We found minor issues with quality that needed to be corrected; for instance, we found some years of birth that were not plausible (e.g., 1900). We also found tables with incomplete or no data, but this is normal with this type of primary care data. Some CDM tables were deliberately not populated in our data. For instance, the ‘PAYER_PLAN_PERIOD’ table refers to insurance data that is not applicable in the Australian context. We also do not extract narrative clinical notes from our practices, so the ‘NOTE’ table was not populated. The default threshold for incomplete data in many of the tables in the data quality tool is 0%, so any table or field with less than 100% of data produces a result of FAIL. This threshold was not appropriate for our data. As an example, the ‘CONDITION_OCCURRENCE’ table had entries for only 62% of patients. In our experience this figure is normal for this type of data, and there are numerous reasons why a condition may not be recorded. To address this, the data quality dashboard pass/fail thresholds can be changed to consider local circumstances and a priori knowledge.
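An illustrative plausibility rule of the kind applied to year of birth can be sketched as follows; the 110-year age cut-off and fixed reference date are our assumptions for illustration, not the dashboard's defaults.

```python
from datetime import date

def plausible_year_of_birth(yob: int, today=date(2022, 1, 1)) -> bool:
    """Illustrative plausibility check: flag years of birth implying a
    negative age or an age beyond a plausible maximum (assumed 110 here)."""
    return 0 <= today.year - yob <= 110

# A 1985 birth year is plausible; 1900 implies an age of 122 and is flagged.
recent_ok = plausible_year_of_birth(1985)
flagged = plausible_year_of_birth(1900)
```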
The ETL proved to be a complex and time-consuming activity, but in future studies this could be made more efficient by modifying some existing practices. The first and most obvious modification would be to have more coding in the source EMR systems rather than free-text terms. This could be implemented as predictive text to reduce manual data entry for ‘time poor’ healthcare professionals (Ahltorp, 2013). This would have the added benefit of allowing validation of data input, reducing typographical errors. However, changing proprietary systems is not a simple endeavour and will require cooperation from vendors and regulators.
To make the mapping process more manageable we employed a frequency-based approach to mapping terms. This relied on the fact that mapping frequently occurring terms converted a high percentage of required concepts. This is a pragmatic way of reducing the effort required to map terms. It is also important to note if rarely occurring concepts falling below the frequency threshold are to be studied, they can be mapped simply on a case-by-case basis. Hence, the frequency-based approach allows efficiency and flexibility in the mapping process.
Another modification to the process that would help with subsequent primary care studies is to share the mappings produced for the ETL process. Indeed, we shared some mappings with the University of New South Wales, who had already mapped some general practice terms. In future, mappings could be shared programmatically via an Application Programming Interface (API) with other research teams using Fast Healthcare Interoperability Resources (FHIR) servers, such as the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Ontoserver [20]. This would streamline the process and cut down on manual tasks.
For the information contained in OMOP data sets to be used in research it is essential that good data governance is in place. This ensures the rights of individual patients are respected and that the data is handled responsibly. Transparency in the governance process underpins trust in the data and is fundamental to successful research. We have developed an extensive data governance process that is being adopted by our data governance committee and we believe this provides a good model for other groups managing OMOP data to adopt and learn from.
As part of this program of work, we are also investigating patient record linkage between primary care and hospital datasets. Whilst this paper describes the conversion of a primary care data repository to the OMOP-CDM, we are also in the process of converting a hospital dataset to OMOP using the same linkage keys. This provides much opportunity, as when linkage keys exist in both primary care and hospital data, records can be linked to create a more comprehensive patient record. This linkage method has previously been demonstrated by other groups (Burn, 2021; Belenkaya, 2021). In these studies, data from EMRs have been enhanced with more detailed health data from other sources, such as cancer registries, to increase the completeness of the datasets.
Given the OMOP-CDM’s wide-scale international usage and the support and training available, it is an opportune way to standardise data for collaborative use. Furthermore, it is easy to share analysis packages between research groups. This allows the rapid and repeatable comparison of data between groups and countries. A full suite of open-source tools is available to support the common data model. For instance, the OHDSI Data Quality Dashboard proved especially useful in examining the quality of our data. The simplicity of the common data model and the standards-based approach make it an easy model to adopt and integrate into existing data acquisition and processing procedures.
SQL Structured Query Language
OHDSI Observational Health Data Sciences and Informatics
OMOP Observational Medical Outcomes Partnership
EMR Electronic Medical Record
PATRON Primary Care Audit, Teaching and Research Open Network
This research was conducted under approval from the University of Melbourne General Practice Human Ethics Advisory Group (HEAG) (HREC ID 24461). Patient consent is not applicable.
All participating practices in the Patron Data for Decisions Program provided consent for data to be extracted from electronic medical records and shared for use in research on entry to the Program.
The Patron program of work (the Program) commenced with its successful ethics approval by the University of Melbourne Human Research Ethics Committee (HREC) on the 12th December 2016 (application 1647396). The terms of the ethics approval are as follows:
1. Australian Government, National Health and Medical Research Council (NHMRC), National statement on ethical conduct in human research. 2007
(Updated 2018). Australian Government, National Health and Medical Research Council: Canberra. https://nhmrc.gov.au/about-us/publications/national-statement-ethical-conduct-human-research-2007-updated-2018
2. A waiver of patient consent has been granted because the Human Research Ethics Committee has been satisfied that the Program meets the NHMRC National statement on ethical conduct in human research guidelines for such a waiver. These include: that involvement carries no more than low risk; that the benefits from the research justify any risks of harm associated with not seeking consent; that it is impracticable to obtain consent from all participants; and that there is sufficient protection of participant privacy. See the Program Protocol document for more information. Australian Government, National Health and Medical Research Council (NHMRC), National statement on ethical conduct in human research. 2007 (Updated 2018), Australian Government, National Health and Medical Research Council: Canberra.
Availability of data and materials
Data is held in a secure enclave at the University of Melbourne and is not publicly available.
Competing interests
No competing interests to declare.
Funding
This project was funded by Melbourne Academic Centre for Health, as part of the Rapid Applied Research Translation 2.2 Medical Research Future Fund (MRFF), Australia.
Authors' contributions
Conceptualization R.W; methodology, R.W and D.O-S.; R.W, and DO-S. formal analysis; investigation, R.W., C.H., D.O-S. and C.C; writing—original draft preparation, R.W., D.O-S; C.H. and C.C.; writing—review and editing, R.W., C.H., D.O-S; C.C. and D.B; visualization, C.C and C.H.; funding acquisition, D.B. All authors have read and agreed to the published version of the manuscript.
Acknowledgements
This research used de-identified patient data from the Patron primary care data repository (extracted from consenting general practices), that has been created and is operated by the Department of General Practice, The University of Melbourne: www.gp.unimelb.edu.au/datafordecisions.