Data Warehouse
The dataset was sourced from a data warehouse, the Primary Care Audit, Teaching and Research Open Network Program (PATRON), curated by the University of Melbourne [8]. The database collects de-identified EMR data from over 129 Australian general practices, chiefly in Victoria. The repository includes data from over 700 consenting GPs who work in general practices that use the Best Practice™, Medical Director™, and ZedMed™ proprietary EMR systems (Table 1).
EMR data are extracted from these systems using the data extraction tool GRHANITE™ [12], and the data are then sent via encrypted transmission to the repository. The GRHANITE™ tool de-identifies each patient by replacing the patient's name with a unique patient identifier that links the patient to the individual visit data in each patient table [12]. Identifiers including patient address, date of birth, Medicare number, and general practitioner/staff member are either removed or de-identified prior to extraction to the data repository.
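The idea of replacing a patient's identifying details with a stable, non-reversible identifier can be illustrated with a keyed hash. The following is a minimal sketch only and is not the GRHANITE™ algorithm; the key, the field names, and the `pseudonymise` function are assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key; a real deployment would generate and store it securely.
SECRET_KEY = b"example-key-not-for-production"

# Fields treated as direct identifiers in this sketch (illustrative names).
IDENTIFYING_FIELDS = {"name", "address", "date_of_birth", "medicare_number"}


def pseudonymise(record: dict) -> dict:
    """Return a copy of the record with identifiers replaced by a keyed hash."""
    # Combine the identifying values so the same person always yields the same ID.
    source = "|".join(
        str(record[f]) for f in ("name", "date_of_birth", "medicare_number")
    )
    patient_id = hmac.new(
        SECRET_KEY, source.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Drop the identifying fields and keep only the clinical content.
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["patient_id"] = patient_id
    return cleaned


# Two visits by the same (fictional) patient.
visit_a = {
    "name": "Jane Citizen",
    "address": "1 Example St",
    "date_of_birth": "1980-01-01",
    "medicare_number": "0000000000",
    "reason_for_visit": "cough",
}
visit_b = dict(visit_a, reason_for_visit="review")

out_a = pseudonymise(visit_a)
out_b = pseudonymise(visit_b)
```

Because the identifier depends only on the identifying values and the key, both visits carry the same `patient_id`, which is what allows visit data to be linked across tables without storing the name.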
Each EMR system held in the primary care repository has its own unique data structures. Hence, to facilitate use of the whole database, the data from each system are harmonised to provide consistency, where possible. For instance, to provide a standardised version of the database, all data pertaining to ‘patient history’ from each EMR are merged into a single table, and likewise all information relating to ‘medications prescribed’ is merged into a single table. Whilst data standardisation provides a single unified view to simplify researcher use, no data are lost in this harmonisation process (see Fig. 2).
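The harmonisation step described above can be sketched as renaming vendor-specific columns to a common schema while retaining anything that has no common equivalent. This is an illustrative sketch only: the column names in `FIELD_MAPS` and the `harmonise` function are hypothetical and do not reflect the actual EMR or repository schemas.

```python
# Hypothetical vendor-specific column names; the real EMR schemas differ.
FIELD_MAPS = {
    "BP": {"PatientKey": "patient_id", "HistoryText": "term", "RecordedOn": "recorded_date"},
    "MD": {"pat_id": "patient_id", "condition": "term", "entry_date": "recorded_date"},
    "Zedmed": {"PID": "patient_id", "Problem": "term", "Date": "recorded_date"},
}


def harmonise(rows_by_system):
    """Merge per-vendor 'patient history' rows into one list with common columns."""
    unified = []
    for system, rows in rows_by_system.items():
        mapping = FIELD_MAPS[system]
        for row in rows:
            # Rename the vendor columns to the common column names.
            out = {common: row.get(src) for src, common in mapping.items()}
            out["source_system"] = system
            # Keep unmapped source columns so that no data are lost.
            out["extra"] = {k: v for k, v in row.items() if k not in mapping}
            unified.append(out)
    return unified


history = harmonise({
    "BP": [{"PatientKey": "a1", "HistoryText": "Asthma", "RecordedOn": "2021-03-01", "Side": "n/a"}],
    "MD": [{"pat_id": "b2", "condition": "Hypertension", "entry_date": None}],
})
```

Recording `source_system` and the leftover columns alongside the common fields is one way to provide a unified view without discarding vendor-specific detail.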
Table 1
Types of EMR Systems Studied
EMR | URL | Approximate Percentage of Practices |
---|---|---|
Medical Director (MD) | https://www.medicaldirector.com | 40% |
Best Practice (BP) | https://bpsoftware.net | 50% |
Zedmed | https://www.zedmed.com.au | 10% |
Figure 2. EMR Harmonisation Process
Mapping Process
Primary care EMR systems contain free text terms and use limited proprietary coding that is unique to each EMR vendor. Therefore, one of the challenging aspects of extracting meaningful data is mapping the free text data to numerical codes in vocabularies such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED) and RxNorm (a United States medication terminology). It is important to note that SNOMED is considered the ‘standard terminology’ for conditions in the OMOP CDM. Similarly, RxNorm is considered the ‘standard’ for medications in OMOP. An added benefit of using these vocabularies to make data within the repository ‘OMOP ready’ was that the mapping produced a repository aligned to international standards.
The mapping process was facilitated by USAGI, a tool developed by the multi-stakeholder international collaborative Observational Health Data Sciences and Informatics (OHDSI) [7]. USAGI converts text terms extracted from the data warehouse into standardised SNOMED and RxNorm terms. Although USAGI is a valuable aid, mapping remains a highly manual process that requires input from data mappers with specific domain knowledge. We employed three final-year medical students to undertake the data mapping, as they provided the required domain knowledge for this work. The students mapped terms independently, and their mappings were then compared in pairs using Excel to determine concordance. Agreement between the student mappings exceeded 90%; where there was a discrepancy, a physician provided clinical input.
The volume of distinct free text terms from each EMR presented challenges. For example, after cleaning, there were 96,000 distinct medication terms, each consisting of some combination of a drug's brand name, generic name, form, strength, and packet size. Up to 10% of the source terms had an exact text match in the SNOMED CT, RxNorm, and LOINC standard vocabularies. As part of the manual mapping process for terms that did not have an exact match, an a priori decision rule provided a cut-off value for terms that occurred with low frequency. Based on this frequency threshold, over 95% of the unmapped terms were mapped manually.
No table in the EMR corresponded directly to the ‘condition occurrence’ table in the OMOP-CDM. Whilst the CDM has one ‘condition occurrence’ table that is expected to contain the matching SNOMED terms relating to a patient’s conditions, the EMR has ‘reason for visit’ and ‘history’ tables that provide data relating to a patient’s current conditions, past conditions, and clinical observations.
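The pairwise comparison of independent student mappings described above was carried out in Excel; the same calculation can be sketched in code. This is an illustrative sketch under stated assumptions: the mapper names, source terms, and concept identifiers are placeholders (not real SNOMED codes), and the functions `pairwise_agreement` and `discrepancies` are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(mappings):
    """Return the proportion of shared terms on which each pair of mappers agrees.

    mappings: {mapper_name: {source_term: concept_id}}
    """
    results = {}
    for a, b in combinations(sorted(mappings), 2):
        shared = mappings[a].keys() & mappings[b].keys()
        agree = sum(1 for term in shared if mappings[a][term] == mappings[b][term])
        results[(a, b)] = agree / len(shared) if shared else None
    return results


def discrepancies(mappings):
    """List terms on which the mappers did not all choose the same concept."""
    terms = set().union(*(m.keys() for m in mappings.values()))
    return sorted(
        t for t in terms if len({m.get(t) for m in mappings.values()}) > 1
    )


# Placeholder concept identifiers, for illustration only.
student_mappings = {
    "student_1": {"htn": 101, "t2dm": 102, "asthma": 103, "uti": 104},
    "student_2": {"htn": 101, "t2dm": 102, "asthma": 103, "uti": 105},
    "student_3": {"htn": 101, "t2dm": 102, "asthma": 103, "uti": 104},
}

agreement = pairwise_agreement(student_mappings)
needs_review = discrepancies(student_mappings)
```

In this toy example the terms in `needs_review` are the ones that would be referred to a physician for clinical input.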
If an entry in the ‘reason for visit’ or ‘history’ tables was accompanied by a date, we deemed it to be a current condition and therefore recorded it as a ‘condition occurrence’ in the OMOP table. When a condition did not have an associated date, we considered it a past observation and therefore recorded it as an ‘observation’ in the OMOP-CDM.
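The date-based rule above amounts to a simple routing decision. A minimal sketch follows, assuming each entry is a dictionary with a hypothetical `recorded_date` field; `condition_occurrence` and `observation` are the OMOP-CDM table names, while the function and sample data are illustrative.

```python
from datetime import date
from typing import Optional


def target_table(recorded_date: Optional[date]) -> str:
    """Choose the OMOP-CDM table for an entry using the date rule."""
    # Dated entries are treated as current conditions; undated ones as observations.
    return "condition_occurrence" if recorded_date is not None else "observation"


# Fictional entries for illustration.
entries = [
    {"term": "asthma", "recorded_date": date(2021, 3, 1)},
    {"term": "appendicectomy", "recorded_date": None},
]

routed = {"condition_occurrence": [], "observation": []}
for entry in entries:
    routed[target_table(entry["recorded_date"])].append(entry["term"])
```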
Extract, Transform, and Load (ETL) Structured Query Language (SQL)
We constructed tables from the mappings produced by the mapping team. These mapping tables were used in the extract, transform, and load (ETL) process to convert the EMR data to the OMOP-CDM. The term ETL refers to the ‘extraction’ of data from a source system, its ‘transformation’ into coded values as prescribed by the OMOP model, and its ‘loading’ into the OMOP-CDM database. We created and executed SQL scripts in the ETL process. SQL (Structured Query Language) provides a set of commands to manage and transform data within a database.
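To make the role of the mapping tables concrete, the sketch below runs a small ETL step against an in-memory SQLite database. It is not the authors' SQL: the source table `emr_history`, the mapping table `term_map`, and the concept identifier are hypothetical, and only a subset of the `condition_occurrence` columns is shown. Unmapped terms are assigned concept ID 0, which follows the usual OMOP convention for source values without a standard concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract: a toy source table, a mapping table, and an empty target table.
conn.executescript("""
CREATE TABLE emr_history (patient_id INTEGER, term TEXT, recorded_date TEXT);
CREATE TABLE term_map (source_term TEXT PRIMARY KEY, concept_id INTEGER);
CREATE TABLE condition_occurrence (
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT,
    condition_source_value TEXT
);
INSERT INTO emr_history VALUES
    (1, 'Asthma', '2021-03-01'),
    (2, 'asthma ', '2021-04-10'),
    (3, 'Unlisted term', '2021-05-05');
-- Placeholder concept identifier, for illustration only.
INSERT INTO term_map VALUES ('asthma', 1001);
""")

# Transform and load: normalise the free text, look up the concept, insert.
conn.execute("""
INSERT INTO condition_occurrence
SELECT h.patient_id,
       COALESCE(m.concept_id, 0),
       h.recorded_date,
       h.term
FROM emr_history AS h
LEFT JOIN term_map AS m
       ON m.source_term = LOWER(TRIM(h.term))
""")

loaded = conn.execute(
    "SELECT person_id, condition_concept_id FROM condition_occurrence ORDER BY person_id"
).fetchall()
```

Keeping the original text in `condition_source_value` means the source term remains available even when it could not be mapped.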
Removal of Inactive Patients
The Royal Australian College of General Practitioners (RACGP) uses active patients as a target group or denominator for reporting. A recognised RACGP definition of active patients was applied to the data set to improve data quality [13]. Based on the RACGP definition of an active patient, an inactive patient is one who has not attended a practice three or more times over the past two years [13]. Initially, this ‘inactive’ definition was applied to the dataset and the inactive patient records were excluded from the analysis. However, we found that adherence to this definition resulted in the exclusion of new patients with only one visit, so we also included patients with at least one visit over the last two years in the data set.
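The difference between the two inclusion rules can be sketched as a visit count over a two-year window with a configurable minimum. This is an illustrative sketch only: the function names are hypothetical, the reference date is arbitrary, and two years is approximated as 730 days.

```python
from datetime import date, timedelta


def visits_in_window(visit_dates, as_of, years=2):
    """Count visits in the window ending at as_of (two years taken as 730 days)."""
    start = as_of - timedelta(days=365 * years)
    return sum(1 for d in visit_dates if start <= d <= as_of)


def is_included(visit_dates, as_of, min_visits=1):
    """Apply the inclusion rule: at least min_visits in the last two years."""
    return visits_in_window(visit_dates, as_of) >= min_visits


as_of = date(2022, 1, 1)  # arbitrary reference date for the example
new_patient = [date(2021, 6, 1)]
regular_patient = [date(2020, 5, 1), date(2021, 2, 1), date(2021, 11, 1)]
lapsed_patient = [date(2018, 1, 1)]

# RACGP-style rule (three or more visits) versus the relaxed rule (one or more).
strict = [is_included(p, as_of, min_visits=3) for p in (new_patient, regular_patient, lapsed_patient)]
relaxed = [is_included(p, as_of, min_visits=1) for p in (new_patient, regular_patient, lapsed_patient)]
```

Under the three-visit rule the new patient is excluded, whereas the relaxed rule retains them while still excluding the lapsed patient.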
Data Analysis
ATLAS is an open-source Java application that allows visualisation and analysis of OMOP datasets and is used as a standard across the OMOP community [14]. Once the primary care data had been converted into an OMOP-compliant format, the dataset was securely connected to the ATLAS application for data visualisation and analysis.
Data Quality
OHDSI has a standard set of checks called ‘Achilles’. These checks run a set of SQL queries that assess data compliance with the OMOP-CDM standard [15]. The OHDSI Achilles checks were performed on the OMOP data set prior to connection to ATLAS. We also applied the Kahn Data Quality Framework [11] to the data set using the OHDSI Data Quality Dashboard [16]. The Data Quality Dashboard runs a series of scripts in RStudio and presents its results in R Shiny. It provides a set of predefined quality checks configured to run on conformant OMOP datasets. The dashboard summarises the data and provides an overall quality metric expressed as a percentage.
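An overall quality metric of this kind can be thought of as the percentage of checks that pass. The sketch below is not the Data Quality Dashboard's code; it assumes each check reports a percentage of violating rows and passes when that percentage does not exceed a threshold. The check names, values, and thresholds are invented for illustration, and the categories follow the Kahn framework (conformance, completeness, plausibility).

```python
def overall_pass_rate(checks):
    """Return the percentage of checks whose violation rate is within threshold."""
    passed = sum(1 for c in checks if c["pct_violated"] <= c["threshold_pct"])
    return 100.0 * passed / len(checks)


# Invented example results, for illustration only.
example_checks = [
    {"name": "person_id is not null", "category": "conformance", "pct_violated": 0.0, "threshold_pct": 0.0},
    {"name": "condition_start_date present", "category": "completeness", "pct_violated": 2.0, "threshold_pct": 5.0},
    {"name": "birth year plausible", "category": "plausibility", "pct_violated": 0.5, "threshold_pct": 1.0},
    {"name": "drug concept is standard", "category": "conformance", "pct_violated": 12.0, "threshold_pct": 5.0},
]

pass_rate = overall_pass_rate(example_checks)  # 75.0 for the values above
```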
Data Governance
All governance procedures we use are underpinned by the concept of beneficence, ‘to do no harm’ [17]. For research using the OMOP-CDM, a data governance framework has been developed, building on the existing comprehensive framework implemented for the primary care repository (Fig. 3), which covers topics such as consent, privacy, and risk management [18].
Consent
Medical practices provided consent for their practice data to be accessed for research purposes via the primary care repository. Practices are also informed that they can change their consent options or withdraw at any time, without prejudice. Regarding individual patient consent for the secondary use of their EMR data, a waiver of consent is applied. Practices inform patients that their data are used for research using various communication strategies (i.e., on practice websites and practice posters). Patients are also informed that they can withdraw consent at any point.
Patient Privacy
Our data repository contains only de-identified patient data. The GRHANITE™ data extraction tool de-identifies patient data in the practice, ensuring only de-identified data is sent to our primary care repository.
Risk Management
We carried out a structured risk assessment of the entire process, considering privacy, organisational, and technical risks (Table 2).
Table 2
OMOP Primary Database Assessment of Risk
Identified Risks | Risk Controls | Risk Rating |
---|---|---|
An authorised user discloses data to a third party. | Development of an access control policy based on researcher compliance with ethical, legal, and regulatory obligations related to privacy, data management, and data security. Access to data on the OMOP platform is not provided if ethics approval is not verified. | Low |
Hackers attack the system. | Application of strong passwords and multifactor authentication. System not connected to a public network. Immediate notification of all data privacy/security breaches to the University of Melbourne to mitigate cybersecurity attacks. | Low |
Individual data is identified in the dataset. | Unique hash ID identifiers cannot be accessed via the OMOP-ATLAS interface. Data aggregation is possible at any level so that only required data is exported. Preview data that is due to be transmitted to researchers. | Low |
Data accessible to researchers outside approved institutions. | Currently, only authorised researchers who are permitted to analyse OMOP data operate OMOP within the University of Melbourne environment. | Low |
Data changes or becomes corrupt. | Version control where the most recent database is always held as a backup. System identification of data extract failures or omissions for immediate notification to engineers. | Low |
Researcher uses data for purposes beyond their ethics permissions. | User requires ethical approval and training to access the data. User accepts professional responsibility to adhere to boundaries of permissions. Users restricted by OMOP database access permissions. External users are restricted by contractual agreement regarding their use of the data. Internal University staff are restricted by employment conditions and Memorandum of Understanding regarding their use of the data. | Medium |
Figure 3. The Governance Model