This study describes a methodology for the semantic harmonization of health information, and its implementation to obtain an OMOP CDM repository for COVID-19 from two tertiary hospitals with heterogeneous technical and organizational EHR characteristics. Both approaches, adapted to the particular requirements of each organization, are based on the use of the DCM paradigm to harmonize health information with full meaning and context, and with the modeling flexibility needed to represent the richness of the source EHRs. The DCM-based standard used was EN/ISO 13606 [7], due to its previous adoption by the Spanish Ministry of Health, as well as by H12O and HCB [15, 16, 20, 21, 29, 30]. Other standards based on or inspired by DCM were analyzed for use at this point of the methodology, namely the modeling of domain concepts. On the one hand, openEHR allows equivalent modeling and formalization of information for EHR persistence purposes, because it is based on the same dual modeling principles as EN/ISO 13606 [6]. Thus, thanks to the compatibility between standards based on the DCM paradigm [10], and due to the recent adoption of openEHR in the Spanish National Health System [50], the 13606 archetypes were converted to the openEHR specification, allowing subsequent archetype-based data storage and querying [28]. On the other hand, the HL7 FHIR standard provides a useful solution for agile exchange across basic information resources, but without the modeling flexibility needed to build multipurpose information objects [51]. Therefore, HL7 FHIR is proposed not as the specification for flexible modeling of clinical knowledge, but as an output to be obtained from common clinical archetypes for exchange purposes, as OMOP CDM was in this study [52].
The conversion to the OMOP CDM format, structure and standard terminologies was performed from the common set of clinical archetypes. In comparison with previous work on the implementation of OMOP CDM repositories [11], this study proposes an innovative method for obtaining data from a multipurpose, detailed and standard specification of health information objects, thus avoiding building ad-hoc processes from local EHR data models. To this end, the correspondences between each data element of the archetypes and the fields of the OMOP CDM tables were defined, as well as the necessary terminological mappings between each organization's adopted codifications and those defined by OMOP CDM [53]. First, this method allowed centralizing the definition and maintenance of the conversions between the two data specifications, with governance capacity over the evolution of the archetypes [17], independent of the release of new versions of OMOP CDM or of updates in the commercial applications that make up the EHR. Second, the process is applicable to other exchange or persistence specifications for secondary use, since the multipurpose design approach of the clinical archetypes allows for full-meaning conversion of persisted data to, e.g., the i2b2 model and REDCap databases [54, 55]. Finally, the publication of the archetype catalog allows the archetypes to be adopted by other healthcare organizations as an intermediate resource to obtain any specific data format while avoiding the complexity of their local EHR models. This is especially valuable in Spain, a country where health information governance is decentralized to its 18 regions [22].
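As an illustration of the two kinds of correspondence described above (archetype data element to OMOP CDM field, and local code to standard concept), the following Python fragment is a minimal sketch, not the implementation used at either hospital. The archetype paths, organization names, local codes and concept identifiers are all hypothetical placeholders; only the `measurement` table and its field names, and the convention of using concept 0 for unmapped codes, are taken from OMOP CDM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    """One correspondence between an archetype element and an OMOP CDM field."""
    archetype_path: str   # hypothetical archetype element path
    omop_table: str
    omop_field: str
    is_coded: bool = False  # True when the value needs a terminology mapping


# Structural mappings, defined once on the shared archetypes (illustrative).
FIELD_MAPPINGS = [
    FieldMapping("lab_result/analyte", "measurement", "measurement_concept_id", True),
    FieldMapping("lab_result/value", "measurement", "value_as_number"),
    FieldMapping("lab_result/time", "measurement", "measurement_date"),
]

# Terminological mappings, maintained per organization.
# Keys are (organization, local code); values are placeholder concept ids.
CONCEPT_MAP = {
    ("hospital_a", "LAB-0001"): 1001,
    ("hospital_b", "PCR_X"): 1001,
}

UNMAPPED_CONCEPT_ID = 0  # OMOP convention for "no matching concept"


def to_omop_rows(organization, instance):
    """Convert one archetype instance (path -> value) into OMOP rows by table."""
    rows = {}
    for mapping in FIELD_MAPPINGS:
        if mapping.archetype_path not in instance:
            continue
        value = instance[mapping.archetype_path]
        if mapping.is_coded:
            value = CONCEPT_MAP.get((organization, value), UNMAPPED_CONCEPT_ID)
        rows.setdefault(mapping.omop_table, {})[mapping.omop_field] = value
    return rows
```

Under these assumptions, two organizations that record the same analyte with different local codes produce identical OMOP rows, and only the per-organization concept map differs, which is the point of defining the structural mappings once on the common archetypes.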
The implementation in two organizations with high digital maturity, H12O and HCB, was possible because the flexibility of the process does not require altering the methodologies and data platforms already in place [14]. H12O used the InfoBanco platform to load raw EHR data into its DL, incorporate the archetypes into an openEHR repository, extract the data through AQL queries as JSON extracts, and then transform them to OMOP CDM using Rabbit-in-a-Hat and Pentaho [15]. Conversely, HCB extracted raw data from its EHR, used LinkEHR to obtain EN/ISO 13606 EHR extracts, and inserted them into semantic ontologies based on standard metamodels, which allowed obtaining data conforming to OMOP CDM with a reusable Python script [16]. These implementations demonstrate the flexible and agnostic nature of the DCM-based methodology with respect to standards and technologies. First, the incorporation of the archetypes into the EHR was performed both with virtualized queries in the EHR databases [30] and with an ontological representation of their components and constraints [29]. Second, the EN/ISO 13606 standard was used to model information that was subsequently persisted in a repository based on the openEHR standard [28] and finally transformed to OMOP CDM, demonstrating the compatibility between both standards [56]. Finally, the ETL process from the data stored according to clinical archetypes was performed with tools such as LinkEHR Studio, Pentaho and Protégé [26, 34, 36], and with different programming, markup and query languages such as ADL, AQL, OWL and SPARQL [27, 32, 35, 37].
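To make the step from AQL-based JSON extracts to OMOP CDM rows more concrete, the following Python fragment is a hedged sketch of one possible transformation. It assumes that the extract follows a result-set layout with a `columns` list and a `rows` list, similar to the one defined by the openEHR REST query API; the source does not specify the layout actually produced by InfoBanco, and the column aliases and sample values are hypothetical. The actual transformations in this study were performed with the tools cited above, not with this code.

```python
import json

# Hypothetical JSON extract produced by an AQL query (layout is an assumption).
EXTRACT_JSON = """
{
  "columns": [
    {"name": "patient_id"},
    {"name": "diagnosis_code"},
    {"name": "diagnosis_date"}
  ],
  "rows": [
    [42, "U07.1", "2020-03-15"],
    [43, "U07.1", "2020-03-20"]
  ]
}
"""


def result_set_to_records(result_set):
    """Turn a columns/rows result set into a list of dicts keyed by column name."""
    names = [column["name"] for column in result_set["columns"]]
    return [dict(zip(names, row)) for row in result_set["rows"]]


def to_condition_occurrence(record):
    """Rename the illustrative AQL aliases to OMOP condition_occurrence fields."""
    return {
        "person_id": record["patient_id"],
        "condition_source_value": record["diagnosis_code"],
        "condition_start_date": record["diagnosis_date"],
    }


records = result_set_to_records(json.loads(EXTRACT_JSON))
condition_rows = [to_condition_occurrence(record) for record in records]
```

In this sketch the source code is carried over in `condition_source_value` only; assigning the standard `condition_concept_id` would require the terminological mappings defined for each organization.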
This methodology was applied to COVID-19 data due to the availability of quality data on this health condition, as well as to ongoing projects at H12O and HCB. In total, more than 17 million combined records were obtained from almost 300,000 different patients, with a data quality assessment result close to 100% using the criteria defined by OHDSI in the DQD tool [38]. At the national level in Spain, the IMPaCT-Data project, which laid the foundations for the implementation of federated clinical and genomic data repositories, performed a proof of concept of COVID-19 data combination in which the variables to be represented were chosen based on the archetypes defined jointly by H12O and HCB [46]. Likewise, the methodology for obtaining and combining data was adopted as a reference implementation by the COVID Data Portal of Spain [39], a repository of information resources for COVID-19. Finally, it allowed H12O and HCB to provide data to the regional initiative “Registro ISCIII-COVID”, a national data repository for COVID-19 research [43]. At the international level, the process implemented allowed H12O to participate as a data provider in the EHDEN Consortium [41], generating scientific evidence from real-world data [57]. Furthermore, with the experience demonstrated in this study in agile and efficient data acquisition for research, H12O was able to apply to the DARWIN project call [42]. It is fundamental to emphasize that the methodology is disease-agnostic, so in future studies the process will be applied to other RWD use cases, such as projects in the domains of genetics, oncology, or chronicity.