A general pattern of data governance for data hubs is defined below, based on the conclusions of the in-depth analysis of the 41 survey responses. For the general pattern, a characteristic was considered common if at least 60% of the respondents reported it.
Specific profiles are then defined, each generating a specific pattern of data governance for data hubs, based on the conclusions of the stratified analyses by kind of organisation (centralised vs. decentralised) and by role (controller or processor). For the specific patterns, a characteristic was counted when its prevalence reached 75%. This threshold had to be lowered for the general pattern (from 75% to 60%) because fewer commonalities were found when all the responses were analysed together. For each pattern of data governance (both the general one and the specific ones), data aspects, business processes, and ELSI aspects are defined, preceded by the list of actors involved in these processes.
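The threshold rule described above can be illustrated with a minimal sketch. The function name, the dictionary and the last entry are hypothetical; the other shares are the percentages reported for the simple-choice questions and are used here only to show how the two cut-offs behave, not to reproduce the stratified analysis.

```python
from typing import Dict, List

GENERAL_THRESHOLD = 0.60   # cut-off stated for the general pattern
SPECIFIC_THRESHOLD = 0.75  # cut-off stated for the specific patterns


def common_characteristics(shares: Dict[str, float], threshold: float) -> List[str]:
    """Return the characteristics whose share of respondents meets the threshold."""
    return [name for name, share in shares.items() if share >= threshold]


# Shares of respondents per characteristic (the last entry is hypothetical).
EXAMPLE_SHARES = {
    "quality control applied": 0.83,
    "centrally managed infrastructure": 0.75,
    "anonymisation methods used": 0.65,
    "error and integrity checking tool": 0.61,
    "hypothetical rare aspect": 0.40,
}

general_pattern = common_characteristics(EXAMPLE_SHARES, GENERAL_THRESHOLD)
stricter_pattern = common_characteristics(EXAMPLE_SHARES, SPECIFIC_THRESHOLD)
```

With the lower cut-off, four of the five example characteristics are retained; with the stricter one, only two remain, which mirrors why fewer commonalities survive a 75% threshold.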
The most frequent aspects
The most frequent aspects identified in the in-depth analysis of the 41 survey responses are listed below.
Concerning the simple-choice questions (with percentages): (i) Formal procedure to find out who provides the data (84%); (ii) Quality control is applied to the data (83%); (iii) A catalogue of the different data sources is provided (79%); (iv) There are Standard Operating Procedures (SOPs) that are followed and regularly updated (79%); (v) They receive health data from different sources (76%); (vi) The data infrastructure is centrally managed (75%); (vii) Data anonymisation methods are used (65%), using pseudonymised data (80%); and (viii) A tool is used to check for errors and data integrity (61%). Figure 3 shows a graphical representation of these percentages.
Regarding multiple-choice questions (with absolute values): (i) Data come from the general population (29) or from a group of patients (24); (ii) A data accessibility mechanism is available in accordance with current regulations (28); (iii) The coverage of the data infrastructure is national (27), receiving national funding (19); (iv) They provide health data from different sources (28); (v) They are a digital platform that receives and stores data (27); (vi) They allow the discovery (findability) of health data sets (26); (vii) They have control over stored data (26); (viii) They enable discoverability of health data sets (26); (ix) They have authorisation functionality, provided by the organisation itself or by an external institution (25); and (x) The type of data source used is the electronic health record (EHR) (25). Figure 4 shows a graphical representation of these values.
Identifying common aspects involved in the data hubs’ governance
Below, a general pattern of data governance for data hubs is presented. It defines the common aspects involved in the data hubs’ governance models, based on the conclusions of the in-depth analysis of the 41 survey responses and taking into account the list of most frequent aspects. Data aspects, business processes, and ELSI aspects are defined, preceded by the list of actors involved in these processes.
Actors
In a data hub, the data controller refers to the “party that, alone or jointly with others, determines the purposes and means of the processing of personal data” [13]. Depending on the data hub, there may be one or more data controllers and sometimes there is a data controller for each data set. The data controller can be any institution, such as a research institute, university, hospital, health service, etc.
The data processor is the party in charge of data processing, “which processes personal data on behalf of the controller” [13]. This actor varies depending on the particular case: it can be the data hub itself, another institution, or there may be no data processor at all.
Regarding the organisation’s role in personal data, data hubs can be data controllers or data processors. They can also hold both roles, depending on the specific situation.
The data access provider is defined as “an entity which makes data available for secondary use” [13]. There may be one or several, and it may be a person or a set of mechanisms. Other relevant actors can also be found, such as researchers, ethical and scientific committees, advisory committees, management boards (government bodies that evaluate applications) or data protection agencies, among others.
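As an illustration only, the fact that a data hub may be a controller, a processor, or both can be represented with a combinable flag. The class and field names below are hypothetical and do not come from the survey.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Role(Flag):
    """Roles a data hub can hold with respect to personal data."""
    CONTROLLER = auto()
    PROCESSOR = auto()


@dataclass(frozen=True)
class DataHub:
    name: str      # hypothetical identifier of the data hub
    roles: Role    # one role, or both combined with "|"

    def is_controller(self) -> bool:
        return Role.CONTROLLER in self.roles

    def is_processor(self) -> bool:
        return Role.PROCESSOR in self.roles


# A hub that holds both roles, depending on the specific situation.
both_roles_hub = DataHub("example-hub", Role.CONTROLLER | Role.PROCESSOR)
processor_only_hub = DataHub("other-hub", Role.PROCESSOR)
```

A flag rather than a single-valued field is used so that the "both roles" case does not need a third, separate category.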
Data aspects
Concerning data characteristics that are frequently present in the data hubs, these kinds of data infrastructures usually: (i) are digital platforms that receive and store data; (ii) have control over the stored data and receive data from a single source and/or multiple sources; (iii) are a digital technical infrastructure with the core mission of enabling health data sharing and providing health services data from different sources; and (iv) enable the discovery of health data sets through a published metadata discovery service and a data accessibility mechanism that complies with existing regulation and has an authorisation functionality, provided by the data hub itself or by an external institution. In addition, although less commonly, a data hub can have characteristics such as generating data, being part of one or more overarching data hubs, or having a specific theme or collected data type (e.g., a particular disease, a particular data type, etc.), among others.
Regarding the geographical coverage of the data infrastructure, national coverage is the most common; European, regional or international coverage is less frequent.
As for the organisation of the data infrastructure, centralised management is the most common, and decentralised (federated) management is less frequent. A data hub can also be part of another data hub, although this characteristic is not very frequent.
Regarding the origin of the data, health data usually comes from the general population or from a patient group. Less frequently, health data comes from an experimental setting, among others.
Common types of data sources are EHRs, administrative data, registry data, and healthcare data, such as prescriptions, diagnoses, laboratory data, treatment, surgery, etc. Other, less common types of data sources include clinical trials, surveys, cohorts, biobanks (biological samples), Picture Archiving and Communication System (PACS), imaging data, medical devices, clinical research data, genomic data, biometric data, molecular data, socioeconomic data, specific disease data, survival data, population health data, interview data, customer record data or observational study data, among others.
Regarding the level of aggregation of the stored data (individual vs. aggregated), data hubs frequently store data at the individual level or at both levels; less commonly, they store aggregated data only.
Most of the data hubs have a funding sustainability plan. National funding is the most common, but a data hub can also receive international, European, regional, hospital, project-related or private funding.
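The difference between the two levels of aggregation can be made concrete with a short sketch. The records, field names and function below are hypothetical and serve only to show how individual-level records relate to an aggregated view.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def aggregate_counts(records: Iterable[Mapping[str, str]], field: str) -> Dict[str, int]:
    """Collapse individual-level records into counts per value of one field."""
    return dict(Counter(record[field] for record in records))


# Hypothetical individual-level records (one row per pseudonymised person).
INDIVIDUAL_RECORDS = [
    {"pseudonym": "a1", "diagnosis": "asthma"},
    {"pseudonym": "b2", "diagnosis": "diabetes"},
    {"pseudonym": "c3", "diagnosis": "asthma"},
]

# Aggregated view: only counts per diagnosis remain, no per-person rows.
aggregated_view = aggregate_counts(INDIVIDUAL_RECORDS, "diagnosis")
```

A hub storing "both" levels would keep the individual records and also publish views such as the aggregated one.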
Data hubs can receive data from different sources, providing a catalogue of these different data sources. Data is shared through a website, a secure data exchange portal, APIs, FTP, SFTP or DICOM transfer, among other options. The sharing channel can depend on the specific usage request.
Business processes
Data can be compiled and stored in the data hub through data retrieval, loading, ETL methods, transforming or passing, among others. The storage can be supported by technologies such as SQL, relational databases, Solr, MongoDB, Oracle, cloud data lakes, DataOntap, DICOM, XML, RDF, CSV, JSON, DBs, or a self-developed database/geographic information system. Data can be stored in several formats, most commonly plain text, XML, or files, but also in others such as JSON, DICOM, TSV, RDF, FASTA, Dublin Core, Parquet, NIfTI, FHIR, Oracle tables, OMOP Common Data Model, SAS Data Set, etc.
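As a minimal sketch of what an ETL step into a relational store may look like, the example below extracts rows from CSV text, transforms them, and loads them into an in-memory SQLite database. The column names, the table name and the rule of dropping rows with a missing age are assumptions made for illustration; they are not taken from the surveyed data hubs.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical raw input as it might arrive from a data provider.
RAW_CSV = "patient_id,age\np1,34\np2,\np3,58\n"


def extract(text: str) -> List[Dict[str, str]]:
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Transform: drop rows without an age and convert age to an integer."""
    cleaned = []
    for row in rows:
        if row["age"].strip() == "":
            continue  # assumed rule: incomplete rows are not loaded
        cleaned.append((row["patient_id"], int(row["age"])))
    return cleaned


def load(rows: List[Tuple[str, int]], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patients (patient_id TEXT PRIMARY KEY, age INTEGER)"
    )
    conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    conn.commit()


def run_etl(text: str) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


connection = run_etl(RAW_CSV)
```

SQLite is used here only because it ships with Python; any of the storage technologies listed above could take its place.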
Data hubs usually apply quality controls to their data and require a minimum level of data quality for data to be included in the data infrastructure. Sometimes, a data hub applies quality controls only for internal use. Frequently, passing quality control is not mandatory for the data, but the results of quality control are available when searching the data. It is relevant for data hubs to use tools for checking errors and completeness of data. The most used is the checksum, but there are also many others, such as HEX/SHACL, XSD schemas, SQL scripts, R-dlookr, an automatic web-based check, a data submission portal and manual checks of certain variables, or specific software developed for the purpose of the network, among other options. Less frequently, data hubs use methods to check data source legitimacy, such as a Data Utility Framework, accreditation of the data provider institute, authentication of the individual providing the data, quality/FAIRness/sustainability assessments, etc.
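A checksum-based integrity check can be sketched as follows. SHA-256 is an assumption, since the survey responses do not state which checksum algorithm the data hubs use, and the function names are hypothetical.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """Check that the received bytes match the checksum sent by the provider."""
    return sha256_checksum(data) == expected_checksum


# The provider computes the checksum before sending the file.
original = b"patient_id,age\np1,34\n"
declared_checksum = sha256_checksum(original)

# The data hub recomputes it on receipt; any altered byte changes the digest.
received_ok = is_intact(original, declared_checksum)
received_corrupted = is_intact(original + b"x", declared_checksum)
```

A checksum of this kind detects accidental corruption during transfer or storage; it says nothing about whether the content itself is complete or plausible, which is what the other tools listed above address.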
How often the data sets are updated depends on each specific data set: annual, daily or irregular updates are the most usual, although data sets can also be updated monthly, weekly or even every 12 hours, among others. Another option is to perform a one-time collection without updates.
Data hubs have processes to keep track of the different versions of datasets, such as manually creating versions by saving the date and name of each update, applying a different PID each time a version is stored, tracking model or software changes, documenting versions in the metadata management, or storing them in the log history. Also, each data type may have a different process for versioning.
For logging and auditing user actions, data hubs can timestamp the data deposition, the user's contact with client service, and/or the user's application to download or view the health data.
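Two of the versioning options mentioned above (saving the date and name of each update, and assigning a different PID to each stored version) can be combined in a minimal sketch. The class names and the PID format are hypothetical; a real data hub would obtain PIDs from its own identifier service rather than generating UUIDs locally.

```python
import uuid
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class DatasetVersion:
    name: str         # name of the dataset
    updated_on: date  # date of the update
    pid: str          # identifier that is unique to this version


class VersionRegistry:
    """Keeps the ordered list of versions for each dataset."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[DatasetVersion]] = {}

    def register(self, name: str, updated_on: date) -> DatasetVersion:
        # Hypothetical PID scheme: a local UUID stands in for a real PID.
        version = DatasetVersion(name, updated_on, pid=f"example-pid:{uuid.uuid4()}")
        self._versions.setdefault(name, []).append(version)
        return version

    def history(self, name: str) -> List[DatasetVersion]:
        return list(self._versions.get(name, []))


registry = VersionRegistry()
first = registry.register("cohort-a", date(2023, 1, 15))
second = registry.register("cohort-a", date(2024, 1, 15))
```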
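The three timestamped events named above can be captured in a simple audit log, sketched below. The class and field names are hypothetical, and the use of UTC timestamps is an assumption made so that entries from different sites are comparable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Action(Enum):
    DATA_DEPOSITION = "data_deposition"
    CLIENT_SERVICE_CONTACT = "client_service_contact"
    ACCESS_APPLICATION = "access_application"  # request to download or view data


@dataclass(frozen=True)
class AuditEntry:
    user: str
    action: Action
    timestamp: datetime


class AuditLog:
    """Append-only list of timestamped user actions."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, user: str, action: Action) -> AuditEntry:
        entry = AuditEntry(user, action, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def entries_for(self, user: str) -> List[AuditEntry]:
        return [entry for entry in self._entries if entry.user == user]


log = AuditLog()
log.record("researcher-1", Action.DATA_DEPOSITION)
log.record("researcher-1", Action.ACCESS_APPLICATION)
log.record("researcher-2", Action.CLIENT_SERVICE_CONTACT)
```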
Data hubs commonly have a formal procedure to know who provides the data, in practice materialised in contracts, agreements, regulations, terms of use, licences, accreditation and authentication, alliance membership, a legal framework making formal requests for data collection mandatory, approvals, and records on data processing and provision, among others. It is also important to highlight that data hubs frequently establish standard operating procedures (SOPs) that the organisation follows and updates regularly.
It is highly recommended that data hubs include on their websites a Data Governance section describing the data governance model used, either as a detailed document or as a paragraph.
ELSI aspects
Concerning ethical aspects, before accepting new submissions data hubs may require ethical approval for data to be stored on the infrastructure. Once the ethical approval has been received, the submission can be made.
Regarding anonymisation and pseudonymisation of data, data hubs usually use anonymisation methods. The data can arrive already anonymised, although this is not the most common case. Alternatively, the data hub itself can be in charge of anonymisation. The process can be done at the point of collection or before sharing the data externally (these two are the most common), before sharing it internally, or at the point of publication. Almost all data hubs pseudonymise their data; this can be done by the data hub itself or by an external organisation.
Regarding the legal aspects, when a data requester asks to access data and a data provider accepts the specific request, data hubs may offer a DAA (Data Access Agreement) to be signed between data providers and data requesters. Access can also be granted through a data permission or by accepting a use policy. Data hubs may have a DPA (Data Processing Agreement) to be signed with the data providers, but this can also be handled by accepting a use policy or can depend on the contracting situation. In addition, data hubs may have a DPIA (Data Protection Impact Assessment) model.
Data hubs usually implement mechanisms to control access to the data (authentication and authorisation), such as authorisation with web services backed by a database, OAuth2, OpenID Connect (over HTTPS), or other options.
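One common way to pseudonymise a direct identifier is a keyed hash, sketched below with HMAC-SHA256. This technique is an assumption chosen for illustration: the survey does not report which pseudonymisation methods the data hubs apply, and the function name and key handling are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded).

    The same identifier and key always give the same pseudonym, so records
    can still be linked, while the key holder alone can recompute the mapping.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be generated and stored securely
# by whichever party performs the pseudonymisation.
EXAMPLE_KEY = b"example-key-not-for-real-use"

pseudonym_a = pseudonymise("patient-0001", EXAMPLE_KEY)
pseudonym_a_again = pseudonymise("patient-0001", EXAMPLE_KEY)
pseudonym_b = pseudonymise("patient-0002", EXAMPLE_KEY)
```

Because the key allows the pseudonyms to be recomputed, data treated this way remains pseudonymised rather than anonymised, which is consistent with the distinction drawn above between the two processes.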
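An authorisation check backed by a database can be sketched as follows. The schema, table name and function names are hypothetical, and the sketch assumes that the user has already been authenticated upstream (for example through OpenID Connect), so only the authorisation decision is shown.

```python
import sqlite3


def create_grant_store() -> sqlite3.Connection:
    """Create an in-memory table recording which user may access which dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE grants ("
        "user_id TEXT NOT NULL, dataset_id TEXT NOT NULL, "
        "PRIMARY KEY (user_id, dataset_id))"
    )
    return conn


def grant_access(conn: sqlite3.Connection, user_id: str, dataset_id: str) -> None:
    """Record an approved access request (idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO grants (user_id, dataset_id) VALUES (?, ?)",
        (user_id, dataset_id),
    )
    conn.commit()


def is_authorised(conn: sqlite3.Connection, user_id: str, dataset_id: str) -> bool:
    """Return True only if an explicit grant exists for this user and dataset."""
    row = conn.execute(
        "SELECT 1 FROM grants WHERE user_id = ? AND dataset_id = ?",
        (user_id, dataset_id),
    ).fetchone()
    return row is not None


store = create_grant_store()
grant_access(store, "researcher-1", "cohort-a")
```

Access is denied by default: a user without an explicit grant, for instance one whose data access agreement has not been signed, is not authorised.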
Profiling kinds of data hub organisation
From the subgroup analysis, two profiles were proposed, depending on the kind of data hub organisation: “data hubs managed centralised” and “data hubs managed decentralised”. The peculiarities of these profiles, compared to the general pattern of data governance, are presented in Table 1.
Table 1. Profiles depending on the kind of data hub organisation.
| | Data hubs managed centralised | Data hubs managed decentralised |
| --- | --- | --- |
| Actors | No peculiarities. | May not have a single data controller. No data management strategy. |
| Data aspects | Control the data stored. Data from "General population". Use "Text", "Numbers". Receive and store data from: single source, multiple sources. | Data from "Patient groups", "General population", "Experimental settings". Use "Text", "Images", "Numbers". Data stored in "XML". |
| Business processes | Data quality control. SOPs. Procedure to know who provides data. | No peculiarities. |
| ELSI aspects | Pseudonymised data. Require legal approval. | No peculiarities. |
Profiling roles
From the subgroup analysis, two profiles were proposed, depending on the role of the data hub: “data hubs acting as data controller” and “data hubs acting as data processor”. The peculiarities of these profiles, compared to the general pattern of data governance, are presented in Table 2.
Table 2. Profiles depending on the role performed by the data hub.
| | Data hubs acting as data controller | Data hubs acting as data processor |
| --- | --- | --- |
| Actors | No peculiarities. | No peculiarities. |
| Data aspects | Managed centrally. Pseudonymised data. Receive and store data from: single source, multiple sources. Data from "General population". Use "Text". | Managed centrally. Receives and stores the data. Functional authorisation. Data accessibility mechanism in accordance with existing regulations. |
| Business processes | Procedure to keep track of dataset versions. Procedure to know who provides data. | Procedure to know who provides data. |
| ELSI aspects | SOPs. Catalogue of data sources. | SOPs. Pseudonymised data. |