Obtaining a multi-organization OMOP CDM repository from two heterogeneous EHR ecosystems: a flexible methodology based on Detailed Clinical Models

doi:10.21203/rs.3.rs-3550497/v1

Background

Standardized repositories of real-world data provide a mechanism for semantic convergence of data from different heterogeneous organizations for secondary use. However, it is common for these repositories to be populated from locally designed information systems, which generates inefficient processes that are not reusable in other organizations or projects.

Objective

To design and apply a methodology based on the Detailed Clinical Models (DCM) paradigm that allows the flexible and harmonized implementation of a real-world data (RWD) repository from two technically and organizationally heterogeneous EHR ecosystems.

Material and methods

First, the DCM paradigm was used for the design of common information objects. Second, a set of clinical archetypes was implemented according to the ISO 13606 standard. Third, an OMOP CDM multi-organization repository was implemented for COVID-19 research. Finally, the quality of the data obtained with the aforementioned process was evaluated.

Results

The main result was the proposal of a methodology for obtaining harmonized EHR-derived datasets using clinical archetypes as a convergence mechanism between local organization-dependent EHR designs. In addition, the application of this method also generated a set of reusable implementation results: (1) the catalog of clinical archetypes, (2) the definition of the transformation process from the archetypes to the OMOP CDM model, and (3) the EHR-derived dataset obtained.

Conclusions

The flexibility of the methodology made its adoption possible in two digitally mature tertiary hospitals, without altering the platforms already in place. Likewise, the method is agnostic to organizations, to the persistence and exchange standards to be obtained, and to the health conditions to which it is applied. Therefore, it can be concluded that the implemented methodology constitutes an innovative and transferable solution to obtain RWD datasets in an efficient, flexible and reusable way.

Electronic Health Records

Clinical Research Informatics

Semantic Standardization

Clinical Archetypes

OMOP CDM

FAIR Principles

An electronic health record (EHR) is a digital repository of health data, longitudinal throughout the patient's life, whose primary purpose is the provision of healthcare to the subject from whom the data originated [1]. EHRs can also have uses other than healthcare, known as secondary uses, such as clinical research or health outcomes assessment [2]. For both primary and secondary purposes, it is essential that data can be extracted, exchanged, and combined between different healthcare organizations without altering their original meaning, in accordance with the FAIR data principles [3]. To address this, biomedical informatics research and standardization organizations have created various specifications that harmonize these needs with full meaning and context [4]. To design a standard and semantically interoperable EHR [5], the EN/ISO 13606 and openEHR specifications propose common reference models for the detailed, fully meaningful formalization of health knowledge through clinical archetypes, known as detailed clinical models (DCM) [6, 7]. Similarly, with a focus on clinical research [8], the OMOP CDM standard provides a common data model and semantics to facilitate collaborative observational research based on real-world data (RWD) [9]. Both standards, applied according to their design purposes, are useful in building advanced healthcare data platforms [10].

However, it is common for RWD repositories to be populated through specific extract, transform and load (ETL) processes from locally designed information systems, instead of starting from EHRs modeled and formalized according to clinical archetypes, which generates inefficient processes that are not reusable in other organizations or projects [11]. Therefore, it is essential to first harmonize the information objects available in the EHR through multipurpose archetypes, so that the design of the correspondences to the OMOP CDM model starts from these archetypes and not from the local EHR data models. In the present study, this approach was applied in two tertiary hospitals of the Spanish National Health System, Hospital Universitario 12 de Octubre de Madrid (H12O) and Hospital Clínic de Barcelona (HCB) [12, 13], in the framework of different collaborative health data projects that both organizations have led in recent years. The application was not identical in both hospitals, as the flexibility of the DCM paradigm allowed it to be applied while respecting the EHR data reuse methodologies previously implemented in each organization [14].

Therefore, the main objective of this study was to design and implement a methodology based on the DCM paradigm to enable a flexible and harmonized implementation of RWD repositories from technically and organizationally heterogeneous EHR ecosystems. The methodology was applied for building a multi-organizational OMOP CDM repository in Spain for COVID-19 research, and involved four particular objectives:

Modeling and formalization of common information objects, through clinical archetypes, for semantic convergence between the local EHR designs of H12O and HCB.
Implementation of the clinical archetypes in both EHR ecosystems to obtain extracts according to their constraints, under the methodologies and specific technical and organizational requirements of each hospital.
Design and implementation of an ETL process from EHR extracts based on clinical archetypes to the OMOP CDM model.
Evaluation of the methodology through the quality analysis of the data obtained in the multi-organizational OMOP CDM repository.

The methodology followed in this study is based on previous studies of semantic harmonization of health information and implementation of FAIR data processes [15, 16]. First, the DCM paradigm was used for the design and formalization of the common information objects that harmonize the heterogeneous EHRs of H12O and HCB. Second, the clinical archetypes were implemented according to the different methodologies previously adopted by both organizations. Third, an OMOP CDM multi-organization repository was implemented for COVID-19 research from the harmonized data extracts obtained. Finally, the quality of the data obtained with the aforementioned process was evaluated. Figure 1 schematically describes the methodology of the study.

2.1. Common detailed clinical modeling

The DCM paradigm is based on a dual approach composed of a reference model, i.e., a common set of generic EHR components and their combination restrictions, and an archetype model, i.e., formalizations of health domain concepts built using the reference model. Clinical archetypes, in addition to defining an information structure with constraints such as the cardinality of the elements [17], allow the meaning of the concepts to be represented from the semantic binding to terminologies such as SNOMED CT or LOINC [18, 19]. In this study, the DCM-based standard employed was EN/ISO 13606 because: (1) it defines a rigorous and detailed information architecture for defining clinical domain concepts; (2) it allows the set of clinical concepts to be expanded without altering the structure of the databases; (3) there are several data platforms and tools implemented in health organizations based on this specification [20]; (4) it is proposed by the Spanish Ministry of Health and used by the different Regions as a standard for the definition of interchangeable EHR extracts in the country [21]; and (5) it was adopted by H12O and HCB for the management and governance of clinical concepts and modeling resources [15, 16].
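
The dual approach described above can be illustrated with a minimal Python sketch. It is not the ADL published by the authors: the class names, the simplified "Patient" archetype and its two elements are hypothetical, chosen only to show how an archetype constrains generic components through cardinality and a terminology-bound value set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ElementConstraint:
    """One archetype node: a constrained data element of the reference model."""
    name: str
    min_occurs: int = 0
    max_occurs: Optional[int] = 1           # None means unbounded
    terminology: Optional[str] = None       # e.g. "SNOMED CT" or "LOINC"
    allowed_codes: frozenset = frozenset()  # empty set means any value is accepted


@dataclass
class Archetype:
    """A named set of constraints over generic reference-model components."""
    concept: str
    elements: list = field(default_factory=list)

    def validate(self, instance: dict) -> list:
        """Return a list of violations; an empty list means the instance conforms."""
        errors = []
        for el in self.elements:
            values = instance.get(el.name, [])
            if len(values) < el.min_occurs:
                errors.append(f"{el.name}: expected at least {el.min_occurs} value(s)")
            if el.max_occurs is not None and len(values) > el.max_occurs:
                errors.append(f"{el.name}: expected at most {el.max_occurs} value(s)")
            if el.allowed_codes:
                for v in values:
                    if v not in el.allowed_codes:
                        errors.append(f"{el.name}: code {v!r} not in {el.terminology} value set")
        return errors


# Hypothetical, simplified "Patient" archetype (illustration only).
patient = Archetype(
    concept="Patient",
    elements=[
        ElementConstraint("birth_date", min_occurs=1, max_occurs=1),
        ElementConstraint("sex", min_occurs=1, max_occurs=1, terminology="SNOMED CT",
                          allowed_codes=frozenset({"248152002", "248153007"})),
    ],
)

ok = patient.validate({"birth_date": ["1950-03-01"], "sex": ["248152002"]})
bad = patient.validate({"sex": ["unknown", "248153007"]})
```

Here `ok` is empty, while `bad` reports a missing `birth_date`, a cardinality violation on `sex`, and a code outside the bound value set. New clinical concepts are added as new `Archetype` instances, without changing the two generic classes, which mirrors the argument that the set of concepts can grow without altering the underlying structure.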

Therefore, as a first step, health information experts from both hospitals jointly identified, in a way that was agnostic to their local EHR information models, a total of 11 information objects to be extracted from the EHR: patient, encounter, encounter details, clinical observation, laboratory observation, health problem, diagnosis, prescribed medication, administered medication, cumulative drug dose and procedure. Then, each of the information objects was modeled with a multipurpose approach, taking as a reference previous national and international EHR specifications [22, 23], as well as standardized RWD models. Thus, the data elements that compose the information objects, as well as coded vocabularies and units of measurement, were defined and linked to the standard terminologies and classifications SNOMED CT, LOINC, ICD-10-CM and ATC [18, 19, 24, 25]. Finally, clinical archetypes based on the EN/ISO 13606 standard were implemented in Archetype Definition Language (ADL) using LinkEHR [26, 27], thus formalizing the modeling, semantics and constraints previously defined. In total, a set of 11 clinical archetypes was implemented and published. Figure 2 displays an example of the Clinical Observation archetype, showing (1) the mind map diagram and (2) an excerpt of the developed ADL code.

Convergence on a common detailed information model was essential for each organization to identify in their EHR models, in an analogous manner, the data tables to be included in the ETL process. It also made it possible to define the transformations to the OMOP CDM model, both structural and terminological, from these common information objects. Thus, the process is reusable by any organization that incorporates clinical archetypes as a proxy between their local EHR models and OMOP CDM or any other standard format, avoiding specific transformations from local data models to secondary use specifications.

2.2. Hospital-specific implementation of information resources

The set of clinical archetypes, as well as the conversion to the OMOP CDM model, were implemented in the digital ecosystem of H12O and HCB according to their own technical and functional requirements. On the one hand, H12O has a comprehensive health data platform, InfoBanco [28], which allows transformations of raw EHR data to openEHR and OMOP CDM models. On the other hand, HCB has an ontology-based clinical repository, OntoCR [29], which allows conversions between local and standard data models based on simple and reusable ontology mappings.

2.2.1. Implementation in the InfoBanco platform of H12O

InfoBanco is the comprehensive platform for the governance and use of health data in research and clinical management in the Region of Madrid, Spain [28]. Its design is based on the agnostic application of health information standards, that is, using each one for its design purpose [10]. The platform starts from the centralization of EHR data in a single data lake (DL), from which data are harmonized according to international persistence and exchange standards making use of model and terminology servers.

Therefore, the first step was to integrate the clinical archetypes into the H12O information systems. The flexibility of the DCM-based methodology allowed its application in information systems not designed under the dual-model paradigm. For this reason, the information objects were implemented through virtualized SQL queries in the EHR system database, implemented with Oracle technology [30]. The result was 20 virtualized tables in the EHR database, from which raw data were extracted and loaded into the DL. Data centralized in the DL were then loaded into an openEHR repository implemented with the Better platform [31], making use of its standardized data ingestion mechanisms. This repository allowed incorporating the previously defined archetypes natively, persisting the data according to them, and querying and extracting data through the Archetype Query Language (AQL) [32], obtaining EHR extracts in JSON format. Figure 3 shows (1) the set of virtualized tables implemented in the EHR database, (2) a raw data extract of the patient demographics view, (3) an AQL query to extract data according to clinical archetypes, and (4) an extract of data according to openEHR in JSON format.
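
The idea of a virtualized table can be sketched as a database view that exposes vendor-specific columns in the shape of an archetype. The example below is only an illustration: it uses an in-memory SQLite database so that it runs anywhere (the study used Oracle), and the table `loc_pat`, the view `v_patient` and all column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical local EHR table with vendor-specific naming.
    CREATE TABLE loc_pat (pat_key INTEGER, dob TEXT, sex_cd TEXT);
    INSERT INTO loc_pat VALUES (1, '1950-03-01', 'F'), (2, '1971-11-23', 'M');

    -- View that exposes the local data in the shape of a Patient archetype.
    CREATE VIEW v_patient AS
    SELECT pat_key AS patient_id,
           dob     AS birth_date,
           CASE sex_cd WHEN 'F' THEN '248152002'   -- SNOMED CT: Female
                       WHEN 'M' THEN '248153007'   -- SNOMED CT: Male
           END     AS sex_snomed
    FROM loc_pat;
""")

# Downstream steps read only the view, never the local table.
rows = conn.execute(
    "SELECT patient_id, birth_date, sex_snomed FROM v_patient ORDER BY patient_id"
).fetchall()
```

Because consumers query only the view, a change in the local schema is absorbed by redefining the view, and the extraction and loading steps stay untouched.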

Finally, the data persisted according to clinical archetypes were extracted and transformed, in structure and terminology, to the OMOP CDM. This ETL process was designed according to a formal data operations framework proposed in previous studies [30], defining each element with the 'Rabbit in a Hat' tool of OHDSI [33], and implemented with the open source tool Pentaho [34]. Figure 4 describes the data flow implemented in InfoBanco from the EHR to the final OMOP CDM dataset.
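
As a rough illustration of the structural and terminological transformation, the following sketch maps one flattened, archetype-based patient extract to a row of the OMOP CDM `person` table. The input field names (`patient_id`, `birth_date`, `sex_snomed`) are assumptions, not the authors' extract format, and the real process was built with Rabbit in a Hat and Pentaho rather than Python. Required `person` fields such as `race_concept_id` and `ethnicity_concept_id` are omitted for brevity.

```python
import json
from datetime import date

# SNOMED CT sex codes -> OMOP standard gender concepts (8507 MALE, 8532 FEMALE).
GENDER_TO_OMOP = {"248153007": 8507, "248152002": 8532}


def to_person(extract: dict) -> dict:
    """Map one flattened Patient extract to a (partial) OMOP CDM person row."""
    born = date.fromisoformat(extract["birth_date"])
    sex = extract.get("sex_snomed")
    return {
        "person_id": extract["patient_id"],
        "gender_concept_id": GENDER_TO_OMOP.get(sex, 0),  # 0 = no matching concept
        "year_of_birth": born.year,
        "month_of_birth": born.month,
        "day_of_birth": born.day,
        "gender_source_value": sex,  # original code kept for traceability
    }


# Hypothetical JSON extract, already flattened from the archetype structure.
raw = '{"patient_id": 1, "birth_date": "1950-03-01", "sex_snomed": "248152002"}'
person_row = to_person(json.loads(raw))
```

Since the function depends only on the archetype-shaped extract, the same code could serve any organization that produces extracts conforming to the common archetype.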

2.2.2. Implementation in the OntoCR platform of HCB

The OntoCR platform, designed and implemented by the HCB Medical Informatics Department, uses clinical ontologies for the representation of biomedical knowledge and the instantiation of structured data [29]. Through the ontological representation of metamodels of health information standards, clinical terminologies and common data models, a flexible and scalable methodology was created that allows clinical domain concepts to be modeled with the level of granularity required locally, and then facilitates their transformation to any type of standardized representation, such as OMOP CDM [16].

In this study, locally created metamodels of OMOP CDM and the EN/ISO 13606 reference and archetype models were used to represent each clinical concept as an instance of the corresponding metaclass of these standards. The modeling process was implemented in Web Ontology Language (OWL) using Protégé [35, 36]. This method allowed terminological binding as well as the assignment of the specific node in each metamodel to either the element of the archetype, the class of the reference model or the column of the OMOP CDM table. Figure 5 shows (1) the mapping of the local class Diagnosis to (2) the class "ISO_13606:ENTRY" and to (3) the metaclass "omop:condition_source_value", as well as (4) the mapping process of the corresponding archetype element to the local property containing the diagnosis code.

Hence, raw data coming from the HCB information systems were transformed to EN/ISO 13606 extracts by mapping the local elements to the archetypes using LinkEHR Studio [26]. Subsequently, the insertion of the instantiated EHR extracts into the previously created ontologies of EN/ISO 13606 and OMOP CDM metamodels was performed. Then, we used the ontological mappings to extract the data using SPARQL queries, which are fully reusable because they are based on the archetypes [37]. Finally, through an equally scalable Python post-processing, the necessary modifications were made to generate the datasets adjusted to the OMOP model. Figure 6 describes the data flow implemented in OntoCR from the EHR to the final OMOP CDM dataset.
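
The post-processing step is not detailed in the text, so the following is only a plausible sketch of what such a step might do: take the variable bindings returned by a query (represented here as plain dictionaries with hypothetical keys `patient`, `code` and `onset`) and shape them into rows of the OMOP CDM `condition_occurrence` table. The concept lookup is passed in as a parameter; the single entry used below (ICD-10-CM U07.1 to OMOP concept 37311061) is an assumption that should be verified against the OHDSI vocabularies, and `X99.9` is a made-up code used to show the unmapped case.

```python
from datetime import datetime
from itertools import count


def to_condition_rows(bindings, concept_lookup, start_id=1):
    """Shape query result bindings into (partial) condition_occurrence rows."""
    ids = count(start_id)  # surrogate keys for the target table
    rows = []
    for b in bindings:
        start = datetime.fromisoformat(b["onset"])
        rows.append({
            "condition_occurrence_id": next(ids),
            "person_id": int(b["patient"]),
            "condition_concept_id": concept_lookup.get(b["code"], 0),  # 0 = unmapped
            "condition_start_date": start.date().isoformat(),
            "condition_start_datetime": start.isoformat(),
            "condition_source_value": b["code"],
        })
    return rows


# Hypothetical bindings, as a query over the ontology might return them.
bindings = [
    {"patient": "1", "code": "U07.1", "onset": "2020-03-15T10:30:00"},
    {"patient": "2", "code": "X99.9", "onset": "2020-04-02T08:00:00"},
]
lookup = {"U07.1": 37311061}  # assumed mapping; verify before use
condition_rows = to_condition_rows(bindings, lookup)
```

Keeping the source code in `condition_source_value` preserves the original diagnosis even when no standard concept is found.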

2.3. Multi-organization OMOP CDM repository for COVID-19 research

OMOP CDM defines a common structure and semantic representation that facilitates the combination and federation of health data from heterogeneous EHR databases, as well as their homogeneous analysis. This specification also provides a library of specific analytical tools based on the data model. In this project, OMOP CDM version 5.3 was used because it is the version best supported by the existing suite of tools, including the Data Quality Dashboard (DQD) [38], which is essential for verifying the quality of the datasets obtained. The OMOP CDM design is condition agnostic, since it defines a set of tables and variables common to any observational study. However, due to the availability of COVID-19 data produced by previous projects of both organizations [39], this condition was selected for the application of the methodology for obtaining data in OMOP CDM format from common detailed information models.

The consolidated data were hosted in StarLife [40], an infrastructure dedicated to the biomedical field that provides a secure, high-capacity environment to create, store and analyze massive volumes of data, resulting from the collaboration of the Center for Genomic Regulation (CRG), the Institute for Research in Biomedicine (IRB) and the Barcelona Supercomputing Center - National Supercomputing Center (BSC). The dataset obtained in the multi-organization repository could be employed in international RWD initiatives based on OMOP CDM such as EHDEN [41], a European federated network of observational studies, and DARWIN [42], a project led by the European Medicines Agency (EMA) for building a European network for the use of health data throughout the life cycle of medicines. Likewise, the data could be used in national projects in Spain, such as the ISCIII-COVID Registry [43], to increase knowledge about this disease in order to provide relevant information for patient and public health management through collaborative research.

The main result of the study was the proposal of a methodology for obtaining harmonized EHR-derived datasets from heterogeneous EHR sources, using a common set of clinical archetypes as a convergence mechanism between the local organization-dependent EHR designs. This method was applied to obtain data in OMOP CDM format for COVID-19 research. Thus, this application generated a set of implementation results, reusable by other projects and organizations, such as the catalog of clinical archetypes, the definition of the transformation process from the archetypes to the OMOP CDM model, and the EHR-derived dataset finally obtained.

3.1. Common catalog of multipurpose EHR archetypes

The first result obtained was the common catalog composed of 11 clinical archetypes conforming to the EN/ISO 13606 standard. Table 1 describes each of the archetypes implemented, also indicating the terminologies used.

Table 1. Multipurpose catalog of clinical archetypes conforming to EN/ISO 13606.

Archetype	Description	Terminologies
Patient	Demographic patient data	SNOMED CT
Encounter	Event or period during which the caregiving process took place	SNOMED CT
Encounter details	Patient transfers between healthcare institution areas	SNOMED CT
Clinical observation	Clinical parameters assessed by a healthcare professional	SNOMED CT
Laboratory observation	Laboratory tests performed on a patient	LOINC, SNOMED CT
Health problem	Longitudinal health problems that were recorded for a specific patient	SNOMED CT
Diagnosis	Diagnosis that was recorded for a specific patient	ICD-10-CM, SNOMED CT
Prescribed medication	Drugs prescribed to a patient in the context of an episode	SNOMED CT, ATC
Administered medication	Drugs administered to a patient in the context of an episode	SNOMED CT, ATC
Cumulative drug dose	Cumulative dose of drugs that were administered to a patient in the context of an episode	SNOMED CT, ATC
Procedure	Procedures performed on a patient during the caregiving episode	ICD-10-PCS, SNOMED CT

This resource was published openly under a Creative Commons license for adoption by other health organizations of the Spanish National Health System [44, 45]. This led to its adoption as a reference implementation for obtaining COVID-19 data by the COVID Data Portal of Spain [39], and by the IMPaCT Data project [46]. Similarly, the archetypes were made openly available to international standardization communities, such as openEHR, to enrich the resources currently offered [23].

3.2. Mapping process between clinical archetypes and OMOP CDM

The second result obtained was the definition of correspondences between clinical archetypes and the OMOP CDM v5.3.1 model. Table 2 describes the correspondences between the archetypes and tables of the OMOP CDM model, as well as the terminological mappings defined.

Table 2. Definition of the mapping from clinical archetypes to OMOP CDM format.

Archetype	OMOP CDM tables	Terminological mappings
Patient	Person; Death	SNOMED CT to OMOP codes
Encounter	Visit_occurrence	SNOMED CT to OMOP codes
Encounter details	Visit_detail	SNOMED CT to OMOP codes
Clinical observation	Measurement; Observation	No mapping required
Laboratory observation	Measurement	No mapping required
Health problem	Condition_occurrence	No mapping required
Diagnosis	Condition_occurrence	ICD-10-CM to SNOMED CT
Prescribed medication	Drug_exposure	SNOMED CT and ATC to RxNorm
Administered medication	Drug_exposure	SNOMED CT and ATC to RxNorm
Cumulative drug dose	Drug_exposure	SNOMED CT and ATC to RxNorm
Procedure	Procedure_occurrence; Device_exposure	ICD-10-PCS to SNOMED CT

Subsequently, for each archetype-table pair, the correspondences between the data elements of the archetype and the fields of the OMOP CDM tables were defined. Defining the correspondences from a set of common archetypes avoids the complexity of starting the ETL process from local data models, which can change with tool updates. Instead, the mapping from the EHR to the clinical archetypes, which are governed by the healthcare organizations that have defined and adopted them, is performed only once. The subsequent conversion process is then defined from the archetypes, either to OMOP CDM [9], as in this study, or to any other secondary use specification such as i2b2 or REDCap [47, 48].
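
The table-level correspondences of Table 2 can be written down as a small routing structure, shown here as a sketch. The archetype and table names are taken from Table 2 (table names are lower-cased); the helper `sources_for` is a hypothetical convenience function, and the element-to-field correspondences defined by the authors are not reproduced.

```python
# Archetype -> target OMOP CDM tables, transcribed from Table 2.
ARCHETYPE_TO_OMOP = {
    "Patient": ["person", "death"],
    "Encounter": ["visit_occurrence"],
    "Encounter details": ["visit_detail"],
    "Clinical observation": ["measurement", "observation"],
    "Laboratory observation": ["measurement"],
    "Health problem": ["condition_occurrence"],
    "Diagnosis": ["condition_occurrence"],
    "Prescribed medication": ["drug_exposure"],
    "Administered medication": ["drug_exposure"],
    "Cumulative drug dose": ["drug_exposure"],
    "Procedure": ["procedure_occurrence", "device_exposure"],
}


def sources_for(table: str) -> list:
    """List the archetypes that feed a given OMOP CDM table."""
    return sorted(a for a, tables in ARCHETYPE_TO_OMOP.items() if table in tables)


drug_sources = sources_for("drug_exposure")
measurement_sources = sources_for("measurement")
```

Supporting another target specification would mean adding a second dictionary keyed by the same 11 archetypes, while the EHR-to-archetype mapping remains unchanged.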

3.3. Multi-organization EHR-derived dataset for COVID-19 research

The third and final result was an EHR-derived dataset in OMOP CDM format obtained from both heterogeneous organizations and combined in a secure data space. This COVID-19 dataset was used by different projects in which H12O and HCB participate as data providers [41-43]. Likewise, it demonstrates that the proposed method works and can be adopted by other Spanish and international organizations [8, 39, 46]. Table 3 describes the set of records, concepts and patients obtained in each table of the OMOP CDM model, classified by source hospital.

Table 3. Records, concepts and patients obtained in the OMOP CDM repository from H12O and HCB.

OMOP CDM table	H12O records	H12O patients	H12O concepts	HCB records	HCB patients	HCB concepts
Person	292,306	292,306	8	6,787	6,787	4
Death	7,634	7,634	3	768	768	2
Visit_occurrence	585,729	171,862	6	14,323	6,787	5
Visit_detail	0	0	0	47,806	6,787	3
Observation	883,172	191,232	561	1,109,588	5,885	48
Measurement	361,522	102,528	29	9,000,822	6,063	258
Condition_occurrence	1,760,209	201,668	7,334	71,364	6,787	2,582
Drug_exposure	2,554,791	194,449	2,619	192,793	5,621	942
Procedure_occurrence	502,773	58,872	2,490	17,682	4,976	928
Device_exposure	0	0	0	6,213	44	3

Likewise, on the combined dataset in the OMOP repository it was possible to run the DQD v2.4.0 data quality tool of OHDSI [38], which evaluates conformity with the standard format, as well as completeness, i.e., whether a fact about a patient is present as a data element in the EHR, and plausibility, i.e., whether a data element makes sense according to the existing knowledge about the concept it reports [49]. Table 4 shows the results obtained in the execution.

Table 4. Quality analysis of the combined data in the OMOP CDM multi-organization repository for COVID-19.

	Verification				Validation				Total
	Pass (N)	Fail (N)	Total (N)	Pass (%)	Pass (N)	Fail (N)	Total (N)	Pass (%)	Pass (N)	Fail (N)	Total (N)	Pass (%)
Plausibility	2144	27	2171	99	278	9	287	97	2422	36	2458	99
Conformance	675	53	728	93	106	0	106	100	781	53	834	94
Completeness	384	12	396	97	16	1	17	94	400	13	413	97
Total	3203	92	3295	97	400	10	410	98	3603	102	3705	97

These tables show that the database obtained after combining the data from both hospitals is complete and robust, and conforms to the standards required by the OMOP CDM model. A total of 17,416,282 records from 299,093 different patients were obtained, and 97% of the quality checks were passed.
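
The totals quoted above can be reproduced from Tables 3 and 4 with a few lines of arithmetic. The sketch below copies the record counts of Table 3 and the per-category pass and fail totals of Table 4; the patient total simply adds the two Person counts, which assumes that no patient appears in both hospitals.

```python
# Records per OMOP CDM table, in the row order of Table 3.
H12O_RECORDS = [292306, 7634, 585729, 0, 883172, 361522, 1760209, 2554791, 502773, 0]
HCB_RECORDS = [6787, 768, 14323, 47806, 1109588, 9000822, 71364, 192793, 17682, 6213]

total_records = sum(H12O_RECORDS) + sum(HCB_RECORDS)
total_patients = 292306 + 6787  # Person rows of H12O and HCB

# (pass, fail) totals per DQD category, from the "Total" columns of Table 4.
DQD = {"Plausibility": (2422, 36), "Conformance": (781, 53), "Completeness": (400, 13)}

passed = sum(p for p, _ in DQD.values())
failed = sum(f for _, f in DQD.values())
pass_rate = round(100 * passed / (passed + failed))

# Pass rate per category, rounded to whole percentages as in Table 4.
rate_by_category = {name: round(100 * p / (p + f)) for name, (p, f) in DQD.items()}
```

The computed values match the text: 17,416,282 records, 299,093 patients, and 3,603 of 3,705 checks passed (97%).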

This study describes a methodology for semantic harmonization of health information, and its implementation to obtain an OMOP CDM repository for COVID-19 from two tertiary hospitals with heterogeneous technical and organizational EHR characteristics. Both approaches, adapted to the particular requirements of each organization, are based on the use of the DCM paradigm to harmonize health information with full meaning and context, and with the modeling flexibility needed to represent the richness of the source EHRs. The DCM-based standard used was EN/ISO 13606 [7], due to its previous adoption by the Spanish Ministry of Health, as well as by H12O and HCB [15, 16, 20, 21, 29, 30]. Other standards based on or inspired by DCM were analyzed for use at this point of the methodology, the modeling of domain concepts. On the one hand, openEHR allows equivalent modeling and formalization of information for EHR persistence purposes, because it is based on the same dual modeling principles as EN/ISO 13606 [6]. Thus, thanks to the compatibility between standards based on the DCM paradigm [10], and due to the recent adoption of openEHR in the Spanish National Health System [50], the 13606 archetypes were converted to the openEHR specification, allowing subsequent archetype-based data storage and querying [28]. On the other hand, the HL7 FHIR standard provides a useful solution for agile exchange across basic information resources, but without the modeling flexibility needed to build multipurpose information objects [51]. Therefore, HL7 FHIR is proposed not as the specification for flexible modeling of clinical knowledge, but as an output to be obtained from common clinical archetypes for exchange purposes, as OMOP CDM was in this study [52].

The conversion to OMOP CDM, in format, structure and standard terminologies, was performed from the common set of clinical archetypes. In comparison with previous work on the implementation of OMOP CDM repositories [11], this study proposes an innovative method for obtaining data from a multipurpose, detailed and standard specification of health information objects, thus avoiding ad hoc processes built from local EHR data models. Therefore, the correspondences between each data element of the archetypes and the fields of the OMOP CDM tables were defined, as well as the necessary terminological mappings between the codifications adopted by each organization and those defined by OMOP CDM [53]. First, this method allowed centralizing the definition and maintenance of the conversions between the two data specifications, with governance capacity over the evolution of the archetypes [17], independently of the release of new versions of OMOP CDM or updates in the commercial applications that make up the EHR. Second, the process is applicable to other exchange or persistence specifications for secondary use, since the multipurpose design approach of the clinical archetypes allows for full-meaning conversion of persisted data to, e.g., the i2b2 model and REDCap databases [54, 55]. Finally, the publication of the archetype catalog allows the archetypes to be adopted by other healthcare organizations as an intermediate resource to obtain any specific data format, avoiding the complexity of their local EHR models. This is especially valuable in Spain, a country where health information governance is decentralized to its 18 regions [22].

The implementation in two organizations with high digital maturity such as H12O and HCB was possible because the flexibility of the process does not require altering the methodologies and data platforms already in place [14]. H12O used the InfoBanco platform to load raw EHR data in its DL, incorporate the archetypes into an openEHR repository and extract them through AQL queries as JSON extracts, and then transform them to OMOP CDM using Rabbit-in-a-hat and Pentaho [15]. Conversely, HCB extracted raw data from its EHR, used LinkEHR to obtain EN/ISO 13606 EHR extracts, and inserted them into semantic ontologies based on standard metamodels, which allowed obtaining data conforming to OMOP CDM with a reusable Python script [16]. These implementations demonstrate the flexible and agnostic nature of the DCM-based methodology with respect to standards and technologies. On the one hand, the incorporation of the archetypes into the EHR was performed both with virtualized queries in the EHR databases [30], and with an ontological representation of their components and constraints [29]. Likewise, the EN/ISO 13606 standard was jointly used to model information that was subsequently persisted in a repository based on the openEHR standard [28], and finally transformed to OMOP CDM, demonstrating the compatibility between both standards [56]. Finally, the ETL process from the data stored according to clinical archetypes was performed with tools such as LinkEHR Studio, Pentaho and Protégé [26, 34, 36], and different programming, markup and query languages such as ADL, AQL, OWL and SPARQL [27, 32, 35, 37].

This methodology was applied to COVID-19 data due to the availability of quality data on this health condition, as well as ongoing projects in H12O and HCB. In total, more than 17 million combined records were obtained from almost 300,000 different patients, with 97% of the data quality checks passed using the criteria defined by OHDSI in the DQD tool [38]. At the national level in Spain, the IMPaCT-Data project, which laid the foundations for the implementation of federated clinical and genomic data repositories, performed a proof of concept of COVID-19 data combination in which the variables to be represented were chosen based on the archetypes defined jointly by H12O and HCB [46]. Likewise, the methodology for obtaining and combining data was adopted as a reference implementation by the COVID Data Portal of Spain [39], a repository of information resources for COVID-19. Finally, it allowed H12O and HCB to provide data to the “Registro ISCIII-COVID” initiative, a national data repository for COVID-19 research [43]. At the international level, the process implemented allowed H12O to participate as a data provider in the EHDEN Consortium [41], generating scientific evidence from real-world data [57]. Furthermore, with the experience demonstrated in this study in agile and efficient data acquisition for research, H12O was able to apply to the DARWIN project call [42]. It is important to emphasize that the methodology is agnostic to the disease to which it is applied, so in future studies the process will be applied to other RWD use cases, such as projects in the domains of genetics, oncology, or chronicity.

This study proposed a DCM-based methodology for obtaining harmonized EHR-derived datasets, applied to the implementation of a multi-organization OMOP CDM repository used in different COVID-19 projects, with a large volume of quality data. Specifically, the EN/ISO 13606 standard was used to define a harmonized catalog of 11 multipurpose clinical archetypes, implemented in the InfoBanco platform of H12O and the OntoCR platform of HCB, for the homogeneous extraction of data and the application of ETL processes towards common standard formats. Hence, the flexibility of the methodology made its adoption possible in two tertiary hospitals with different digital ecosystems, without altering the platforms and methodologies already in place. Likewise, the process is agnostic to organization-dependent EHR designs, to the persistence and exchange standards to be obtained, and to the health conditions to which it is applied. Therefore, it can be concluded that the implemented methodology constitutes an innovative solution to obtain RWD datasets in an efficient, flexible and reusable way.

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

The clinical archetypes developed during the current study are available in the H12O Data Science Unit repository, https://github.com/DataDoce/. The ETL processes and semantic ontologies designed and implemented during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was supported by the IMPaCT Data project “IMP/00019” and by the projects “PI18/00981”, “PI18/00890”, and “PI18CIII/00019”, funded by the Carlos III Health Institute of Spain (ISCIII) and the European Regional Development Funds (FEDER), “Una manera de hacer Europa”. It constitutes a reference implementation of the “Infobanco” Platform of the Madrid Region, Spain.

Authors' contributions

MPJ: Conceptualization, Methodology, Project administration, Software, Writing- Original draft preparation, Writing- Reviewing and Editing. SF: Methodology, Project administration, Software, Writing- Original draft preparation, Writing- Reviewing and Editing. NGB: Methodology, Software, Writing- Reviewing and Editing. GBC: Methodology, Software, Writing- Reviewing and Editing. DBT: Methodology, Software, Writing- Reviewing and Editing. DMC: Methodology, Software, Writing- Reviewing and Editing. AMC: Supervision, Writing- Reviewing and Editing. PSB: Supervision, Writing- Reviewing and Editing.

Acknowledgements

We would like to thank the “Centro Nacional de Supercomputación” team (Salvador Capella, Alfonso Valencia), “Hospital Universitario Virgen del Rocío” team (Carlos Luis Parra, Sara González) and the teams of the companies involved in the development of the Infobanco Platform (NTT Data, Veratech For Health, Rhea Group).

Häyrinen K, Saranto K, Nykänen P. Definition, structure, content, use and impacts of electronic health records: A review of the research literature. Int J Med Inform. 2008;77:291–304. 10.1016/j.ijmedinf.2007.09.001.
Safran C, Bloomrosen M, Hammond E, et al. Toward a National Framework for the Secondary Use of Health Data: An American Medical Informatics Association White Paper. J Am Med Inform Assoc. 2007;14:1–9. 10.1197/jamia.M2273.
Parra-Calderón CL, Sanz F, McIntosh LD. The Challenge of the Effective Implementation of FAIR Principles in Biomedical Research. Methods Inf Med. 2020;59(4–05):117–8. 10.1055/s-0040-1721726.
Kalra D, Blobel BG. Semantic interoperability of EHR systems. Stud Health Technol Inform. 2007;127:231–45.
Beale T. Archetypes: Constraint-based Domain Models for Future-proof Information Systems. OOPSLA 2002 Work Behav Semant. 2001:1–69. doi:10.1.1.147.8835.
openEHR Specification. Available at: https://specifications.openehr.org/releases/RM/latest/ehr.html. Accessed October 10, 2023.
ISO 13606. Standard, Part 1: Reference model. Available at: https://www.iso.org/standard/67868.html. Accessed October 10, 2023.
Michaels M, Syed S, Lober WB. Blueprint for aligned data exchange for research and public health. J Am Med Inform Assoc. 2021;28(12):2702–6. 10.1093/jamia/ocab210.
Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–8.
Pedrera-Jiménez M, García-Barrio N, Frid S, et al. Can OpenEHR, ISO 13606 and HL7 FHIR work together? An agnostic approach for the selection and application of EHR standards from Spain. JMIR Preprints. 23/05/2023:48702.
Reinecke I, Zoch M, Reich C, Sedlmayr M, Bathelt F. The Usage of OHDSI OMOP - A Scoping Review. Stud Health Technol Inform. 2021;283:95–103. 10.3233/SHTI210546.
“12 de Octubre” University Hospital. Available at: https://www.comunidad.madrid/hospital/12octubre/. Accessed October 10, 2023.
“Hospital Clínic de Barcelona” University Hospital. Available at: https://www.clinicbarcelona.org/en/teaching. Accessed October 10, 2023.
Goossen W. Representing knowledge, data and concepts for EHRS using DCM. Stud Health Technol Inform. 2011;169:774–8. 10.3233/978-1-60750-806-9-774.
Pedrera-Jiménez M, García-Barrio N, Cruz-Rojo J, et al. Obtaining EHR-derived datasets for COVID-19 research within a short time: a flexible methodology based on Detailed Clinical Models. J Biomed Inform. 2021;115:103697. 10.1016/j.jbi.2021.103697.
Frid S, Pastor Duran X, Bracons Cucó G, et al. An Ontology-Based Approach for Consolidating Patient Data Standardized With European Norm/International Organization for Standardization 13606 (EN/ISO 13606) Into Joint Observational Medical Outcomes Partnership (OMOP) Repositories: Description of a Methodology. JMIR Med Inform. 2023;11:e44547. 10.2196/44547. Published 2023 Mar 8.
Moner D, Maldonado JA, Robles M. Archetype modeling methodology. J Biomed Inform. 2018;79:71–81. 10.1016/j.jbi.2018.02.003.
Donnelly K. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform. 2006;121:279–90.
McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: A 5-year update. Clin Chem. 2003;49:624–33. 10.1373/49.4.624.
Frid S, Fuentes Expósito MA, Grau-Corral I, et al. Successful Integration of EN/ISO 13606-Standardized Extracts From a Patient Mobile App Into an Electronic Health Record: Description of a Methodology. JMIR Med Inform. 2022;10(10):e40344. 10.2196/40344. Published 2022 Oct 12.
Health Ministry of Spain. Clinical modeling resources, reference ISO 13606 archetypes. Available at: https://www.mscbs.gob.es/profesionales/hcdsns/areaRecursosSem/Rec_mod_clinico_arquetipos.htm. Accessed October 10, 2023.
Health Ministry of Spain. Minimum data set for clinical reports. Available at: https://www.boe.es/eli/es/rd/2023/07/04/572. Accessed October 10, 2023.
Clinical Knowledge Manager. https://ckm.openehr.org/ckm/. Accessed October 10, 2023.
International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). Available at: https://www.cdc.gov/nchs/icd/icd-10-cm.htm. Accessed October 10, 2023.
Anatomical Therapeutic Chemical (ATC) Classification. Available at: https://www.who.int/tools/atc-ddd-toolkit/atc-classification. Accessed October 10, 2023.
Maldonado JA, Moner D, Boscá D, Fernández-Breis JT, Angulo C, Robles M. LinkEHR-Ed: a multi-reference model archetype editor based on formal semantics. Int J Med Inform. 2009;78(8):559–70. 10.1016/j.ijmedinf.2009.03.006.
Archetype Definition Language (ADL). Available at: https://specifications.openehr.org/releases/AM/latest/ADL1.4.html. Accessed October 10, 2023.
Infobanco platform. Available at: https://cpisanidadcm.org/infobanco/?lang=en. Accessed October 10, 2023.
Lozano-Rubí R, Muñoz Carrero A, Serrano Balazote P, Pastor X. OntoCR: A CEN/ISO-13606 clinical repository based on ontologies. J Biomed Inform. 2016;60:224–33. 10.1016/j.jbi.2016.02.007.
Pedrera-Jiménez M, García-Barrio N, Rubio-Mayo P, et al. TransformEHRs: a flexible methodology for building transparent ETL processes for EHR reuse. Methods Inf Med. 2022;61(02):e89–e102. 10.1055/s-0042-1757763.
Better platform. Available at: https://www.better.care/es/better-platform/. Accessed October 10, 2023.
Archetype Query Language (AQL). Available at: https://specifications.openehr.org/releases/QUERY/latest/AQL.html. Accessed October 10, 2023.
Rabbit in a Hat tool. Available at: https://ohdsi.github.io/WhiteRabbit/RabbitInAHat.html. Accessed October 10, 2023.
Pentaho tool. Available at: https://www.hitachivantara.com/en-us/products/pentaho-platform/data-integration-analytics.html. Accessed October 10, 2023.
Web Ontology Language (OWL). Available at: https://www.w3.org/OWL/. Accessed October 10, 2023.
Protégé tool. Available at: https://protege.stanford.edu/. Accessed October 10, 2023.
SPARQL query language. Available at: https://www.w3.org/TR/rdf-sparql-query/. Accessed October 10, 2023.
Data Quality Dashboard tool (DQD). Available at: https://github.com/OHDSI/DataQualityDashboard. Accessed October 10, 2023.
COVID-19 Data Portal for Spain. Available at: https://www.covid19dataportal.es/health-variables/. Accessed October 10, 2023.
StarLife infrastructure. Available at: https://www.bsc.es/es/marenostrum/star-life. Accessed October 10, 2023.
EHDEN consortium. Available at: https://www.ehden.eu/vision-and-mission/. Accessed October 10, 2023.
DARWIN project. Available at: https://www.darwin-eu.org/. Accessed October 10, 2023.
Registro ISCIII-COVID-19. Available at: http://hdl.handle.net/20.500.12105/11044. Accessed October 10, 2023.
EHR Archetypes catalog. Available at: https://www.safecreative.org/work/2204281022845-h12o-hcb-isciii_arquetiposhce_catalogo. Accessed October 10, 2023.
COVID-19 Observations Archetypes catalog. Available at: https://www.safecreative.org/work/2102196969593-h12o-covid-19-observations-archetypes. Accessed October 10, 2023.
IMPaCT Data project. Available at: https://impact-data.bsc.es/en/. Accessed October 10, 2023.
Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30. 10.1136/jamia.2009.000893.
Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377–81. 10.1016/j.jbi.2008.08.010.
Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–51. 10.1136/amiajnl-2011-000681.
Catalan Information System Master Plan. Available at: https://www.openehr.org/community/organisation_partners_detail/catalan-health-service. Accessed October 10, 2023.
HL7 FHIR standard. Available at: https://www.hl7.org/fhir/. Accessed October 10, 2023.
Bosca D, Moner D, Maldonado JA, Robles M. Combining Archetypes with Fast Health Interoperability Resources in Future-proof Health Information Systems. Stud Health Technol Inform. 2015;210:180–4.
OMOP CDM vocabularies. Available at: https://github.com/OHDSI/Vocabulary-v5.0/releases. Accessed October 10, 2023.
Haarbrandt B, Tute E, Marschollek M. Automated population of an i2b2 clinical data warehouse from an openEHR-based data repository. J Biomed Inform. 2016;63:277–94. 10.1016/j.jbi.2016.08.007.
Bönisch C, Kesztyüs D, Kesztyüs T. Harvesting metadata in clinical care: a crosswalk between FHIR, OMOP, CDISC and openEHR metadata. Sci Data. 2022;9(1):659. Published 2022 Oct 28. 10.1038/s41597-022-01792-7.
Martínez-Costa C, Menárguez-Tortosa M, Fernández-Breis JT. An approach for the semantic interoperability of ISO EN 13606 and OpenEHR archetypes. J Biomed Inform. 2010;43(5):736–46. 10.1016/j.jbi.2010.05.013.
Voss EA, Shoaibi A, Yin Hui Lai L, et al. Contextualising adverse events of special interest to characterise the baseline incidence rates in 24 million patients with COVID-19 across 26 databases: a multinational retrospective cohort study. EClinicalMedicine. 2023;58:101932. 10.1016/j.eclinm.2023.101932.
