Landscape analysis of available European data sources amenable for machine learning and recommendations on usability for rare diseases screening

doi:10.21203/rs.3.rs-3448625/v1

Download PDF

Research Article

Landscape analysis of available European data sources amenable for machine learning and recommendations on usability for rare diseases screening

https://doi.org/10.21203/rs.3.rs-3448625/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 06 Apr, 2024

Read the published version in Orphanet Journal of Rare Diseases →

You are reading this latest preprint version

Background

Patient registries and databases are essential tools for advancing clinical research in the area of rare diseases, as well as for enhancing patient care and healthcare planning. They are the only means of data pooling that can result in an adequate sample size for epidemiological and/or clinical research. They are crucial to determine the feasibility of clinical trials, and to promote the enrollment of patients. The primary aim of this study is a landscape analysis of available European data sources amenable to ML and their usability for Rare Diseases screening, in terms of FAIR, legal, and business considerations. Second, recommendations will be proposed to provide a better understanding of the health data ecosystem.

Results

In total, 330 unique replies were processed and analyzed, reflecting the same number of distinct databases (no duplicates included). In terms of geographical scope, we observed 24.2% (n = 80) national, 10.0% (n = 33) regional, 8.8% (n = 29) European, and 5.5% (n = 18) international registries coordinated in Europe. Over 80.0% (n = 269) of the databases were still active, with approximately 60.0% (n = 191) established after the year 2000 and 71.0% last collected new data in 2022. Regarding their geographical scope, European registries were associated with the highest overall FAIR adherence, while registries with regional and “other” geographical scope were ranked at the bottom of the list with the lowest proportion. Responders’ willingness to share data as a contribution to the goals of the Screen4Care project was evaluated at the end of the survey. This question was completed by 108 respondents; however, only 18 of them (16.7%) expressed a direct willingness to contribute to the project by sharing their databases. Among them, an equal split between pro-bono and paid services was observed.

Conclusions

The most important results of our study demonstrate not enough sufficient FAIR principles adherence and low willingness of the EU health databases to share patient information, combined with some legislation incapacities, resulting in barriers to the secondary use of data. The new EU governing structures, some already in action, are expecting to build trust in data providers and stimulate data sharing to promote accessibility and support ethical and sustainable innovation in healthcare.

databases

health data

electronic health records

ERNs

rare diseases

machine learning (ML)

artificial intelligence (AI)

FAIR

legislation

consent

In the European Union (EU), a rare disease (RD) is one that affects no more than 1 person per 2,000 population. There are between 6,000 and 8,000 different RDs that afflict an estimated 30 million citizens in Europe [1].

Although RDs are highly diverse in terms of etiology, pathophysiology, and clinical manifestation, they have common characteristics: many RDs are severe, chronic, and life-threatening and finding the proper diagnosis presents a significant barrier in their treatment [2]. Affected people are frequently faced with multiple years of burdensome diagnostic journey with misdiagnoses, and an average diagnosis delay of up to 8 years [3]. Furthermore, there are no approved therapies for nearly 90% of these disorders [4]. RDs are now more widely acknowledged as a serious public health issue that affects people all over the world and places an undue financial burden on patients, families, and healthcare systems [5].

Effective methods for improving medical care for RD patients include projects and networks that aim to aggregate information and expertise so that healthcare professionals can easily access and share essential data [6]. One of the most comprehensive knowledge bases for RDs is Orphanet [7], which provides information on RDs, and links to specialist centers, patient organizations, and other resources. Other European initiatives include the European Reference Networks (ERNs), the European Joint Program on Rare Diseases (EJP RD), and RDConnect [8].

Patient registries and databases are essential tools for advancing clinical research in RDs, as well as for enhancing patient care and healthcare planning. They are the only means of data pooling that can result in an adequate sample size for epidemiological and/or clinical research [9].

Hundreds of registries and other databases at the national, regional, and local levels in Europe gather information about RD patients. As this information is widely spread, an important tool for researchers, medical professionals, patients, and policymakers is the EU Rare Disease Platform. It aims to provide its users with a standardized instrument to improve knowledge, diagnosis, and treatment of RDs while harmonizing data collection and interchange at the EU level [10].

The design, development, and establishment of a registry, or another type of patient database, involves different factors, including the technicalities of coding languages and data-capture programs; ethical and legal concerns to ensure data privacy and protection while also enabling data sharing and reuse; governance and considerations regarding the various interests of patients, clinicians, researchers, policymakers, and other stakeholders. RD registries are being supported through worldwide initiatives, such as EPIRARE (European Platform for Rare Disease Registries) which seeks to address the regulatory, legal, ethical, and technical difficulties in the registration of European RD patients [11].

In addition to these collaborative efforts and global platforms, advancements in information technology (IT), particularly in the fields of artificial intelligence (AI), are important elements that can enhance the situation of patients with RDs. It is essential to increase the accessibility of data sources, including hospital information systems (HISs), electronic health records (EHRs), and health-related registries, to develop systems that could assist clinicians in their diagnostic decisions. The value of sophisticated analysis techniques like machine learning (ML) in clinical decision-making has been proven by a review of AI-based clinical decision-support technologies [12]. The usage of data sources based in Europe is closely related to legal and ethical standards within the European legislative framework, and it also needs to be facilitated through the FAIR principles for data management (Findability, Accessibility, Interoperability, and Reusability).

A recent review evaluated European RD databases in terms of fulfilment of FAIR principles and meeting EU regulation challenges while considering their potential for genetic newborn screening using AI-based tools [13]. As RDs primarily affect children and account for about 80% of all cases being of genetic origin [14], genetic screening for early RDs identification is of growing importance. The review examined key organizational, FAIR and legal challenges identified during European RD databases mapping which may impede the implementation of ML-based screening technologies for RD patients. It was elaborated within the frame of the European project Screen for Care (Screen4Care), a project aiming to shorten the path to RD diagnosis by using newborn genetic screening and digital technologies [15]. Screen4Care is focusing on the early detection of RDs via advanced IT and clinical decision support tools, using AI and ML. It includes the development of a federated metadata repository amendable to federated ML algorithms, based on existing RD databases. Regarding RD database management, challenges identified include the need for better data quality, sustainability, funding, and governance of RD registries; establishing FAIR-compliant databases and considering the necessity to adapt the legal framework for reliable data collection and accelerated interoperability across Europe, offering further opportunities for RD patients [13].

Databases’ profile

In total, 330 unique replies were processed and analyzed, reflecting the same number of distinct databases (no duplicates included). In terms of geographical scope, we observed 24.2% (n = 80) national, 10.0% (n = 33) regional, 8.8% (n = 29) European, and 5.5% (n = 18) international registries coordinated in Europe. Over 80.0% (n = 269) of the databases were still active, with approximately 60.0% (n = 191) established after the year 2000 and 71.0% last collected new data in 2022. The frequency with which a database was updated varied extensively: 15.5% (n = 51) of respondents perform the action once per month, 6.4% (n = 21) once every six months, 5.8% (n = 19) once a year, 22.4% (74) renew information using another interval approach, 2.7% (n = 9) could not provide an answer, and 47.3% (n = 156) did not respond. The median number of new cases introduced in the databases over the previous year of observation was 110 (31; 400) by 38.5% (n = 127) of the respondents. The median number of observations / cases in the databases (32.1%, n = 106) was calculated to be 966 (300, 6888). The median number of active cases / patients included in the databases (43.6%, n = 144) was reported to be 1,400 (251; 5,893). In 33.0% (n = 112) of databases, information concerning the patient's death was collected.

The multiple response sets (semi-closed enumerated questions) provided the following information – 1) Databases’ operational data (n = 629) (Fig. 1); 2) Registry types (n = 416) (Fig. 2); and 3) Type of data categories (n = 1013) included (Fig. 3).

Other sets of multiple response questions were focused on collecting information about the database characteristics relevant to RDs. We aimed to identify the RDs groups (n = 446) the databases contained information about – metabolic and endocrine disorders: 30.0% (n = 86); neurological and neuromuscular disorders: 29.3% (n = 84); hematological diseases: 17.4% (n = 50); beta oxidation disorders: 9.4% (n = 27); other rare conditions: 61.7% (n = 177); I cannot answer: 7.7% (n = 22). More detailed information about the distribution of subgroup disorders is available in Additional file 1, Table 1. Main rare disease groups and subgroup distribution.

FAIR-ness of the organization – findings and recommendations

The FAIR section of the survey aimed at evaluating the degree of adherence of the responder’s databases to the pre-defined FAIR principles [16]. The section was optional for respondents to complete based on their level of expertise. The overall FAIR-ness of the database is assessed by the question “Are the FAIR principles explicitly mentioned in the database policy of your organization?”. A positive outcome would be considered if the responder had answered, “All FAIR principles are explicitly mentioned”. Similarly, “findability” was assessed by a positive answer to the questions “Are the data produced and/or used in the database discoverable with metadata, identifiable and locatable by means of a standard identification mechanism (e.g., persistent and unique identifiers such as Digital Object Identifiers)?”. The “accessibility” dimension by answers indicated that all datasets can be accessed and made available (“Which data produced and/or used in the database could be made available?”). The “interoperability” was evaluated by positive answers to the question “Does your organization (data site) follow the FHIR (Fast Healthcare Interoperability Resources) standard to support eSource data exchange”. Finally, the outcome for “reusability” dimension was considered as a positive answer to the question “Are the data licensed to permit the re-use possible?”.

Out of a total of 330 survey respondents, 141 (42.7%) participated in this section. The distribution of responses is available in Additional file 1, Table 2. FAIR database characteristics. Regarding overall adherence to FAIR principles, a total of 25 (17.9%) reported full compliance with all 4 components, while more than half either could not provide an answer (n = 56, 40%) or indicated that the FAIR principles were not being applied on the site (n = 23, 16.4%). Only 2 (1.43%) revealed partial database compliance, specifically in terms of findability and accessibility.

Explored systematically by all FAIR components, the majority of respondents (n = 31, 51.7%) indicated that the data repository had a unique and persistent identifier (PID). Moreover, 14 (23.3%) identified several data releases with versions attached. Regarding data accessibility, almost half of the responders (n = 28; 45.9%) ensured accessibility either via a web browser or API. Full data access was reported in 15 (25.0%) of the databases. Interoperability of the data was indicated by 43 (71.7%) of the respondents, and four of them (6.8%) reported implementation of the observational medical outcome model partnership (OMOP). Data licensing enabling re-usability was applied in 20 (47.6%) of the responder’s databases.

Posterior estimates from that Bayesian model (Fig. 4) showed that the proportion of positive responses to the overall FAIR-ness was highest for genetic databases (28.12%; 95% CI 17.7–39.8%), followed by data in health-related registries (22.7%; 95% CI 14.8–32.0%), EHR (24.4%; 95% CI 11.5–40.6%), HIS (14.6%; 95% CI 6.5–25.2%), and EMR (15.5%; 95% CI 7.2–26.3%). In terms of database reusability, on average, 34% of all respondents shared a positive outcome in this fair dimension.

However, small discrepancies could be noticed according to the database type (Fig. 5). For instance, the positive reusability outcome was more frequent in hospital information systems (41.2%; 95% CI 21.2–64.5%) and genetic databases (42.3%; 95% CI 26.5–59.1%), while the least reusable were databases of electronic health records (32.3%; 95% CI 14.5–55.7%). The highest proportion of interoperable databases was reported for electronic health records (21.3%; 95% CI 6.9% − 44.2%) and hospital information systems (20.3%; 95% CI 6.3%-42.2%), while genetic data was estimated as the least interoperable (8.4%; 95% CI 2.2%-20.4%). However genetic databases were found with the highest level of accessibility (47.0%; 95% CI 30.7–63.3%), while the lowest level was estimated in electronic health records (25.8%; 95% CI 11.5–45.1%). Furthermore, genetic databases were also prominent in terms of their findable databases (53.8%; 95% CI 37.2–69.8%), while the lowest estimated proportion was in electronic health records (36,4%; 95% CI 16.5%-59.2%).

Based on the posterior estimates, a ranking-based approach was used to determine the FAIR adherence order by the database type. As shown in Table 1, genetic databases were considered the most adherent to the evaluated FAIR criteria, while health-related registries and electronic health records were at the bottom of the ranking.

Table 1

FAIR components ranking by the posterior estimates of the Bayesian logit model according to database type.
	Findability	Accessibility	Interoperability	Reusability	Overall rank
Genetic data	2	1	5	1	1
HIS	4	4	2	2	3
EMR	3	5	3	3	4
Registries	1	2	4	4	2
EHR	5	3	1	5	4

A similar approach was to evaluate the impact of the disease type reported in the dataset and the geographical scope for the responders of registry-type databases (Table 2). The highest overall FAIR adherence was listed for databases containing information for “other” RDs and neuromuscular disorders. Regarding their geographical scope, European registries were associated with the highest overall FAIR adherence, while registries with regional and “other” geographical scope were ranked at the bottom of the list with the lowest proportion.

Table 2

Overall FAIR adherence – ranking by the posterior estimates of the Bayesian logit model according to the type of diseases included in the database and geographic scope (only for registries)
Disease type	Median	2.5% CI	97.5% CI	Rank
Beta oxidation	6.2%	0.2%	27.9%	4
Endocrine	13.1%	4.8%	26.9%	3
Haematological	3.3%	0.1%	15.2%	5
Neuromuscular	18.8%	9.0%	32.7%	2
Other	22.1%	14.2%	31.6%	1
Geographical scope
European registry	50.4%	27.4%	74.5%	1
International registries coordinated in Europe	20.3%	3.2%	54.2%	3
National registry	20.6%	10.5%	34.6%	2
Other	13.8%	0.7%	53.2%	4
Regional registry	3.6%	0.2%	17.1%	5

Legal and business characteristics – findings and recommendations

The legal and business section included questions, focused on the management of health data which responders’ organization collects and operates with. Six domains of interest were identified: consent and data ownership; sensitive information; data protection; legislation; data sharing and fees.

The consent and data ownership domain were evaluated by 3 questions in the survey. The first outcome for this domain was a positive answer to the question “Do you require re-consent of patients when data is used in ways that do not fall within the original purpose of the registry?” As a second outcome, positive answers (all types of consent) to the question “What type of consent do you collect?” were used. Finally, the third outcome was a negative answer (No consent) to the question “Which of the following consent models do you apply for sharing anonymized patient health information in network electronic exchange for research purposes?”.

Data ownership was evaluated by the question, “Are patients aware that their information may be used for further research, monitoring performance, service planning, auditing, quality assurance purposes, etc.?” and the outcome for that domain was a negative response. Data protection was assessed by the question “Are there national health data security policies regarding the technical standards to be used to ensure health data for primary use are processed and stored securely?”. Responders who answered “There is one national data security policy” or “There are several national data security policies” were combined and used as a positive outcome for this domain. The legislation was evaluated by the estimated share of responders who answered “Yes” to the question “Are there legislative provisions concerning the primary and secondary use of data?”. The final domain was estimated by the proportion of respondents who were willing to share their databases as a contribution to the goals of the Screen4Care project.

The overall characteristics of database management were shown in Additional file 1, Table 3. Overall characteristics of database management – legal and business information. The vast majority of responders indicated that there was at least one national data security policy regulating the technical standards for data sharing (n = 84; 75.0%). Moreover, almost all respondents in this section shared that patients were aware that their information may be used for research, monitoring, service planning, and other data processes (n = 94; 83.0%). Less than a quarter of the databases did not require re-consent when data was used in ways that fell outside the original purpose of data collection (n = 25; 22.3%). Opt-in consent was the preferred model mode in the anonymized patient network-sharing process among participants (n = 18, 16.3%). In terms of the type of consent collected, most of the databases indicated that consent was collected for every use of data (n = 34; 30.4%). Genetics data as a sensitive data item was collected in 61 of the databases (48.4%), while mental health data was gathered in 19 (15.1%) of the responders’ databases. Ownership of the data (n = 101; 92.7%), along with data anonymization (n = 102; 93.6%), were observed as primary items for consideration in the data sharing process among participants. However, less noticeable items were “Salting of the database or the use of fake data to uncover unauthorized use and copying of the database” and “Fees to be charged and protection of the licensee from fee creep,” not considered in 38 (35.5%) and 34 (31.5%) of the database, respectively.

Responders’ willingness to share data as a contribution to the goals of the Screen4Care project was evaluated at the end of the survey. This question was completed by 108 respondents; however, only 18 of them (16.7%) expressed a direct willingness to contribute to the project by sharing their databases. Among them, an equal split between pro-bono and paid services was observed.

Posterior estimates result for the models concerning legal and business characteristics of databases are presented in Fig. 6 and Fig. 7. Regarding the consent dimension, the lowest share of databases in which re-consent is required was estimated for electronic health records (68.5%; 95% CI 46.8–85.0%), while the highest proportion for that outcome was estimated for genetic databases (79.9%; 95% CI 66.6–89.8%). The highest estimates for the second consent outcome were observed for hospital information systems (HIS) (89.1%; 95% CI 71.2–94.2%). The “no-consent” model for sharing anonymized patient health information in network electronic exchange for research purposes is the most common proportion estimated in health-related registries (12.7%; 95% CI 6.6% − 21.1%). Furthermore, this model was estimated at the lowest value for hospital information systems (HIS) (7.2%; 95% CI 1.8–18.7%).

Data ownership represented by patient awareness of data usage for further research, monitoring performance, service planning, audit, and quality assurance purposes was estimated to have the highest share in genetic databases (93.7%; 95% CI 84.3–98.4%). Moreover, genetic databases were found to collect the highest proportion of genetically sensitive information (29.6%; 95% CI 22.0–37.9%). The provision of mental health data as a sensitive information item is estimated to have the highest share in health-related registries (8.5%; 95% CI 5.0–13.2%).

Legislative provisions concerning the primary and secondary use of data were estimated to have the highest share in the hospital information system (87.8%; 95% CI 74.9–95.8%). The lowest share of legislative provisions was estimated for health-related registries (78.5%; 95% CI 68.4–86.2%). Regarding data security, at least one national data security policy is observed in electronic medical records databases (90.8%; 95% CI 79.8–97.1%). In addition, EMRs were estimated with the lowest value of immediate willingness to share data as a contribution to the goals of the Screen4Care project (14.1%; 95% CI 5.8–27.8%). The highest reported proportion of data sharing dimension was found in the genetic database (18.7%; 95% CI 9.1–32.0%).

Based on the posterior estimates, a ranking-based approach was used to determine the re-consent requirements, patients’ awareness their information may be used for further research and legislative provisions concerning the primary and secondary use of data order geographical scope for the responders of registry type databases and by the database type (Additional file 1, Table 4. Legal items – ranking by the posterior estimates of the Bayesian logit model according to geographical scope for the responders of registry-type databases and the database type). The highest ranking of the re-consent requirement was listed for international registries coordinated in Europe and European registries, containing health data of patients with neurological and neuromuscular disorders as well as metabolic and endocrine disorders. Patients included in international registries coordinated in Europe and European registries demonstrated the highest awareness about the use of their health information for further research, monitoring performance, service planning, audit, and quality assurance purposes. The existence of legislative provisions concerning the primary and secondary use of data was with the highest rank for European and regional Beta oxidation disorders and metabolic and endocrine disorders registries.

To estimate the effect of GDPR (General Data Protection Regulation) on database management characteristics, we compared the predefined outcome of interest in the consent, ownership, and sensitive data collection dimensions across databases established before and after the GDPR implementation (2018). One hot encoding was used to create a dummy variable based on the reported year of database establishment. Bayesian models then were fitted using the new variable as a predictor and the predefined outcome of interest as the response variable. The results are shown in Fig. 8 and Fig. 9.

Regarding data ownership, evaluated by the proportion of databases reporting patients’ awareness of their information being used for further research, monitoring performance, service planning, audit, and quality assurance purposes, the databases established before 2018 are estimated to have a lower posterior probability (80%; 95% CI 73.0–89.46%) compared to those established after 2018 (95.8%; 95% CI 82.6–99.7%). The median posterior difference between the two groups is 12.9 percentage points (95% CI: -1.0 p.p. to 23.7 p.p.). The GDPR effect on data ownership can be measured by the relative share of posterior draws suggesting a positive difference compared to those with a negative or null value. The latter is an estimate of the magnitude of the effect and is usually noted as the Bayes Factor (BF). In this case, the BF is 26.59, indicating that the data is 26.59 times more likely to favor the alternative hypothesis (proportion difference) compared to the null hypothesis (no difference). This suggests that the effect of GDPR on data ownership is significant.

The same approach was used for the consent dimension evaluated by the re-consent database requirement. Before GDPR implementation, the estimated positive outcome was 68.9% (95% CI 58.06–78.3%), while after 2018, the estimated positive outcome was 81.9% (95% CI 62.6–94.3%). The median posterior difference between the two groups is 12.6 percentage points (95% CI: -0.8 p.p. to 30.6 p.p.). The Bayes Factor is 8.54, indicating moderate evidence against the no-difference hypothesis.

Finally, sensitive data provision was compared between databases established before and after the GDPR regulations. Even though two outcomes—genetic and mental health data—were included in this dimension, a contracting relationship between them was observed. For instance, genetic data was found to be more likely to be provided after 2018 (37.6%; 95% CI 24.2–53.5%), while the opposite was observed for mental health data (before GDPR: 8.0%; 95% CI 4.6–12.3%; after GDPR: 7.5%; 95% CI 2.0–17.8%). The estimated difference for genetic data (16.0 p.p., 95% CI 1.1 p.p.–32.9 p.p.) was associated with the corresponding BF of 59,6 indicating strong evidence in favor of alternative hypotheses and therefore a significant impact of GDPR enabling databases for genetic data exchange. However, the estimated values for mental health data provision indicated a decline compared to databases established after 2018. The observed difference (-0.4 p.p.; 95% CI -7.7 p.p. 10.3 p.p.) was associated with a BF of 1.6, indicating weak evidence in favor of the null hypothesis.

ML and AI are increasingly being used in healthcare and genetic research to improve patient outcomes and advance scientific knowledge. Further, their application can be beneficial for early diagnosis, improving prognosis, and treatment decisions in the field of RDs [17]. However, such technology relies heavily on the availability of high-quality data, which in the RD domain is often scarce and fragmented [18]. Moreover, the lack of standardized data sharing practices and interoperability standards across different domains can hinder the progress of these innovations [19].

Assessing the FAIR principles that were developed to address these challenges [16] is intended to address specific recommendations and strengthen the process of AI and ML implementation to help RD patients. In the presented study, it was found that the overall adherence to FAIR principles varied significantly among the databases. A total of 25 respondents (17.9%) reported full compliance with all four FAIR components, indicating a significant portion of databases that have successfully implemented these principles. However, more than half of the respondents either could not provide an answer (n = 56, 40%) or indicated that the FAIR principles were not being applied on the site (n = 23, 16.4%). The latter, combined with the relatively low response rate (42.7%) for this section, suggests a lack of awareness and limited implementation of FAIR principles among the surveyed database stakeholders. This is consistent with other studies that have found low levels of FAIR compliance [20]. Therefore, the results underline the importance and need for targeted interventions to promote FAIR compliance and standardize data sharing practices across different domains. Such efforts are proven to be efficient and impactful in the RD and genetic research domains [21, 22].

In this research, a deeper exploration of each FAIR component aimed to identify specific strengths and weaknesses in the surveyed database management and sharing practices. A significant portion of respondents (51.7%) indicated that their data repository utilized a PID, a critical factor that enhances the discoverability and traceability of data [23]. Furthermore, 14 respondents (23.3%) reported multiple data releases with attached versions, which improves data tracking and version control [24]. These findings indicate that efforts have been made to ensure comprehensive access to datasets, fostering openness and transparency, which is proven to be an effective strategy for enhancing quality in RD registries [25]. However, only 45.9% of the respondents ensured accessibility either via a web browser or API, enabling data retrieval through standard protocols. This suggests that there is potential for enhancement in this aspect, specifically in clinical databases, where accessibility is comparatively lower. This can be a barrier to the development of database workflows needed for ML and AI technologies [26] Moreover, only 20 databases (33.9%) reported implementing data licensing, enabling reusability. This result emphasizes the considerable limitations of collaborative research and knowledge dissemination [27].

Database characteristics such as database type, diseases included in the data, and geographical scope of the database were also investigated as potential factors influencing FAIR adherence. Databases containing information on neuromuscular disorders and those with European scope demonstrated the highest overall FAIR adherence. Notably, genetic databases showed the highest proportion of positive responses to overall FAIR adherence, suggesting that these databases have made significant strides in adopting FAIR practices. This may be attributed to the emphasis on data sharing and standardization within the genetics research community [28]. In contrast, low FAIR assessment is found for databases of EHR and HIS. These clinical databases face unique challenges related to data privacy, security, and interoperability, which hinder their ability to fully implement the FAIR principles. [29].

The higher FAIR compliance in databases focused on neuromuscular disorders could be attributed to the relatively specialized nature of these databases and patient advocacy which facilitated more focused and standardized data management practices [30, 31]. Additionally, recent initiatives specially designed for data on neuromuscular disorders might have contributed to the higher FAIR adherence in these databases [32, 33]. The influence of EU policies, efforts, and funding assistance that promote data sharing and FAIR implementation may be related to the observed FAIR adherence in databases with a European focus [34, 35]. European databases might also benefit from standardized data sharing frameworks and infrastructure, enabling smoother data exchange and collaboration across European countries [36].

By following the suggested recommendations, researchers could identify databases that adhere to FAIR principles, providing high-quality, easily accessible, and standardized data. These qualities are vital for effectively implementing ML and AI technologies in RD research. Utilizing such databases will lead to more precise and meaningful results, ultimately contributing to improved patient care and the advancement of scientific knowledge in this complex area of study.

Standardization Databases with standardized data sharing practices and data formats ensuring consistency and interoperability across different databases should be used as a primary data source for ML and AI applications.

When selecting databases for training datasets it is crucial to prioritize those that utilize persistent identifiers (PIDs). PIDs enhance the discoverability and traceability of data repositories ensuring consistent input for ML and AI algorithms.

To facilitate integration with ML and AI technologies databases should ensure data accessibility through web browsers or APIs. This allows for the retrieval and analysis of data.

Databases that offer multiple data releases with attached versions should be preferred, as data versioning enables improved data tracking and version control, which are vital for accurate ML and AI model training.

For ML and AI applications, it may be beneficial to consider specialized databases focused on RD domains like disorders. Such databases often provide standardized data suitable, for these applications. Researchers should actively pursue research projects to discover databases that follow practices, for data sharing and adhere to common protocols, for data exchange. This will greatly facilitate the integration of ML and AI.

Databases focused on specific RD domains, such as those for neuromuscular disorders, should be considered, as they may offer more comprehensive and standardized data suitable for ML and AI applications.

Collaborative research initiatives should be sought by researchers to identify databases with standardized data sharing practices and adherence to common data exchange protocols, facilitating ML and AI integration.

A thorough assessment of the database’s documentation should be conducted to ensure transparency and comprehensive information about data quality, format, and metadata, which are crucial for ML and AI model development.

European-scope databases, with their emphasis on data sharing and standardized practices, should be considered, as they may provide robust and FAIR data suitable for ML and AI research, particularly in the RD domain.

The GDPR provides enhanced protection for health care information in the EU, as reflected in the member countries implementing laws. The GDPR, which entered into force on 24 May 2016 and is applicable from 25 May 2018, creates a harmonized set of rules applicable to all personal data processing taking place in the EU [37]. National data protection authorities are responsible for monitoring and enforcing the application of the GDPR and other national data protection legislation that may be applicable in their territories. In our study, 75% of the respondents (with a proportion over 80% for HER and EMR) declared that there is at least one national data security policy regarding the technical standards to be used to ensure health data for primary use are processed and stored securely and 37.5% of them pointed out the existence of several. The results are similar to those reported in another European study on the topic [38]. Moreover, a regional health authority is traditionally primarily responsible for the containment of individual cases. Thus, it will depend on Member State legislation when in that chain data will be anonymized. Clarifications are however needed under which conditions the further processing of data in order to render them anonymous for the purpose of scientific research would be legitimate [39].

In our study, 76.8% of the respondents confirmed the presence of legislative provisions concerning the primary and secondary use of data. It can be particularly challenging to strike the correct balance between enabling good data use and protecting privacy when it comes to secondary use. Secondary use involves processing data for purposes other than those originally intended when information is gathered, and it may also involve data processors other than the primary data collectors, in contrast to primary use, where data are collected and then used for a specific purpose [38, 40, 41, 42]. In contrast with the study of Skovgaard et al., published in 2019, our results demonstrate that 83.9% of the respondents declare that patients are aware that their information may be used for further research, monitoring performance, service planning, audit, and quality assurance purposes etc. [43]. Moreover, awareness is of key importance for patients involved in RD research, and it could be argued that this becomes even more evident in data sharing, with the onus on researchers, institutions, and collaborations to recognize this as a responsibility. Rare disease patients’ perspectives are needed to contribute to the debate on the management, sharing and protection of data, in order to reconcile tensions within the research process with what matters most to patients [44]. There is also a risk of too much privacy protection in the RD context. Formal legal safeguards and strict transparency requirements leave organizations with less flexibility to share samples and data about RD patients, especially internationally, even where researchers seek explicit patient consent and/or patient involvement in data sharing governance [45].

The informed consent of the citizen is essential for data exchange [46]. The voluntary expression of consent is fundamental to ethical research practices. While patients with RDs often expect that data are shared for scientific advances, they are also concerned about being identified, a risk enhanced in the RD context [47]. In RD research, the consent processes have become increasingly complex, considering the current landscape of technological and genomic advances, together with the extensive collection and dissemination of data worldwide. This has been confirmed by the multiple components included in the consent process and authorization mechanisms for health records exchange in the various databases examined by us. In our study, the most commonly used consent models applied for sharing anonymized patient health information in network electronic exchange for research purposes are opt-in (16.2%) and opt-in with restrictions (10.8%). An additional challenge is the different types of collected consent, including consent for every use of data (30.4%), consent for broader categories of research (27.7%) and consent for all research (17.9%). The need for improving informed consent processes in international collaborative RD research is broadly discussed, namely, there is a need for effective consent in order to conduct effective research. To achieve this aim, the procedure shall address possible ethical and legal hurdles that could hamper research in the future, including opt-in, re-consent and opt-out strategies [48]. We consider this especially relevant while examining informed consent for RD research, in particular, when there are re-consenting requirements for data used in ways that do not fall in the original purpose of the respective registry, or other database research, which we found to be mandatory for 70.5% of the databases we collected responses from.

Although the GDPR harmonizes the regulations governing the processing of sensitive data, such as individual health information, Member States still have the option to establish legal grounds for processing health information. Furthermore, Article 9(4) clearly states that Member States are free to maintain or enact new restrictions, including requirements, in relation to the processing of genetic, biometric, or health data [37]. This could indicate that the GDPR would not be administered uniformly across all Member States in the domain of health. It may also imply that there may be disparities in how the GDPR is implemented within a single Member State, particularly where local law is in effect [39]. The findings of our survey show that 48.4% of the participants collect genetic data, and this is more likely to occur following GDPR's enforcement in 2018.

The responsible sharing of genetic and other health-related data shall be a foundational principle in data collection program management, including compliance with the obligations and norms set by international and national law and policies [40, 49]. According to the Framework for Responsible Sharing of Genomic and Health-Related Data [50], several core elements of responsible data sharing shall be respected, including transparency, data quality and security, privacy, data protection and confidentiality. The terms of data usage are a main quality element of a registry and by prioritizing ethical and legal standards, high quality registries can provide access to data on a platform that ensures data security and patient confidentiality [51]. A very small relative part of the participants (16.6%) declared willingness to share their database as a contribution to the goals of the Scree4Care EU project.

The EU is preparing governance frameworks that permit access to data in the near future. The aim is to increase trust in data intermediaries and boost data sharing inside the EU and between sectors in order to promote data availability and assist ethical and sustainable research and development processes [52].

The following recommendations could be given to facilitate the process of obtaining health information from various data sources for the development of ML algorithms for the screening and early detection of RDs:

Good practices as transparent data use and providing patients with information on how their data might be used for future research, performance monitoring, service planning, audit, and quality assurance purposes, among other things.

Precise legal grounds should be established for the data processing and provide special consideration to the use of informed consent.

Re-consenting requirements should be considered when selecting particular databases.

A solid understanding of data protection law should be obtained to guarantee that IT security standards are strictly followed.

If the results of data processing may benefit the identification of RD patients, pseudonymization of the data should be applied.

Researchers should be aware that data is collected in a manner that permits its utilization across systems without compromising their integrity and that it's readily available where needed.

Limitations

The outcomes of our research should be considered in terms of the limitations of our study design and sampling methods. This was a cross-sectional questionnaire survey that gives an illustration of the current context of health-related databases to respond to the need for rapid identification of RDs using ML technology. Thus, no changes in this environment could be examined over time. This is critical when discussing breakthrough fields such as ML, whose exponential development has already resulted in new EU legislation and initiatives for health-specific data sharing. The geographical scope of the study comprised EU and EEA (European Economic Area)-based health-related datasets, which may limit the generalizability of the results. Although the convenience sampling method was a relevant choice for the narrow and well-defined pool of respondents, the combination with the heterogeneity of the questionnaire might result in nonresponse bias. Given this disadvantage, the questionnaire design included an option for respondents to refer to experts regarding FAIR principles and legal and business information. To limit nonresponse bias, a questionnaire was sent out to 3032 individuals, 2212 of whom were ERN specialists, with the expectation that the most knowledgeable would fill out the survey and answer as many specific questions about the database as possible. Although many definitions and clarifications about organizational, FAIR and legal domains were provided, and the Screen4Care consortium aligned the questionnaire content and design, the survey concepts were complex and heterogeneous; thus, some respondents may not have fully understood the information included. In addition, selection bias cannot be ruled out, as respondents may have been more informed about ML than nonparticipants.

The technological innovation that brought digital transformation in healthcare – telehealth, ML, AI-enabled medical devices, blockchain electronic health records, automation, Internet of Things, Big data, etc. – demands large amounts of health data to be fed with. The most important results of our study demonstrate not enough sufficient FAIR principals’ adherence and low willingness of the EU health databases to share patient information, combined with some legislation incapacities, resulting in barriers to the secondary use of data. This landscape should be transformed in the near future by EU initiatives that already started as European Open Science Cloud (EOSC) and recently adopted EU Data Governance Act, followed by Personal Data Spaces (PDS) and the European Health Data Space (EHDS). These new EU governing structures are expected to build trust in data providers and stimulate data sharing to promote accessibility and support ethical and sustainable innovation in healthcare.

Aim

The primary aim of this study is a landscape analysis of available European-wide data sources amenable for ML and their usability for Rare Diseases screening, in terms of FAIR, legal, and business considerations. Second, recommendations will be proposed to provide a better understanding of the health data ecosystem – accessibility, sharing, interoperability, legislation, etc. – to inform Screen4Care Project tasks of further steps.

Design of the study and participants’ profile

In the period of March 2022 to December 2022, a cross-sectional study using a semi-structured questionnaire was conducted in accordance with the Checklist for Reporting Results of Internet E-Surveys (CHERRIES) [53, 54] (Additional file 2, Table 1. Checklist for Reporting Results of Internet E-Surveys (CHERRIES)). A non-random convenience sampling method was used in recruiting the participants and a list of potential respondents was prepared, including all individuals that had been identified as eligible to answer the questions – the main contact person of a database that could be: health-related registry, EMR, EHR, HIS, and repositories for genomics. Participants were recruited by using individual emails. In addition, based on the heterogeneity of the questionary a non-random snowball sampling method was applied to target experts with extensive experience in FAIR principles for database management, organization, level of access and metadata or broad knowledge about legal, ethical, and business practices in data collection and operation with a focus on consent and data ownership, sensitive information, data protection, legislation, data sharing and fees.

Settings

Screen4Care is a European research project, run by an international public-private consortium of 35 partners. The project is funded by the Innovative Medicines Initiative (IMI 2JU), a joint undertaking of the European Union and the European Federation of Pharmaceutical Industries and Associations (EFPIA) and thus the geographical scope of our study aims to databases operating in EU and EEA countries.

Eligibility criteria

EU and EEA health-related databases as registries, EMR, EHR, HIS, and repositories for genomics that include information about clinical data; laboratory tests; neurological assessment, or other specialized investigation; medical history, including premature rupture of membranes (PROM – Premature rupture of membranes); imaging studies: X-ray, MRI; diagnosis/confirmed diagnosis (e.g. ICD codes); operations/other interventions; medications/therapy; devices/type and collected parameter; health services; genetic data (including human phenotype ontology - HPO); and administrative and billing data.

On-line questionnaire

The design of the self-completed questionnaire survey instrument was based on information drawn from relevant scientific publications, quantitative and qualitative research, and scoping review on challenges in mapping European RD databases, relevant to ML-based screening technologies in terms of organizational, FAIR and legal principles [13]. The survey contained 81 questions distributed over six main panels: 1) introduction; 2) administrative; 3) screening; 4) FAIR-ness; 5) legal and business; and 6) end of the survey. Question types were closed-ended single-choice questions, semi-closed selective questions with a text answer, semi-closed enumerated questions with/without a text answer, a matrix of questions and open-ended questions. Detailed information about the questionnaire’s structure and content is available in Additional file 3, Table 2. Structure and content of the questionnaire survey instrument.

An electronic questionnaire form on a landscape analysis of available data sources and their usability of RDs screening in Europe was developed using the LimeSurvey platform Enterprise plan version and was distributed to 3032 potential respondents. The questionnaire introductory panel started with a general description of the Screen4Care project, outlined the aim of collecting particular information on the topic of interest and ended with a consent statement for the use of the anonymous data, which was agreed to before filling out the survey questions by the respondent. The survey participation was entirely voluntary. In addition, the anonymous nature of the survey did not require ethics committee approval. The study was conducted according to ethical guidelines established by the Declaration of Helsinki.

Statistical analysis

Descriptive statistics were used to present a univariate analysis of the data. Discrete variables were presented with absolute numbers and proportions, whereas median and 25th and 75th percentiles were used for continuous ones. To determine database characteristics associated with the adherence to the FAIR principles, legal and business aspects of database management Bayesian models were fitted. Positive response for the outcomes of interest was used as a dependent variable in the models. Non-informative priors from the binomial distributions were applied. The outcomes were measured by “yes” and “no” coding to predefined questions of interest. Then the total number of responders for each defined question was used for proportion estimation. The models assumed a logit link function and were fitted using the rstanarm package [55] with four Monte-Carlo chains stimulation for each 2000 iteration per model fit. Convergence was assessed using the Gelman-Rubin diagnostic [56] and the effective sample size (ESS) was calculated for each parameter. The results were presented as posterior median and 95% credible intervals (95% CI). The models were fitted using the R software version 4.3.1 [57].

AI: Artificial intelligence

API: Application Programming Interface

CHERRIES: Checklist for Reporting Results of Internet E-Surveys

EEA: European Economic Area

EFPIA: European Federation of Pharmaceutical Industries and Associations

EHDS: European Health Data Space

EJP RD: European Joint Program on Rare Diseases

EMR: Electronic medical records

EOSC: European Open Science Cloud

EPIRARE: European Platform for Rare Disease Registries

ERN: European Reference Networks

ESS: Effective sample size

EU: European Union

FAIR: Findability, Accessibility, Interoperability, and Reusability

FHIR: Fast Healthcare Interoperability Resources

GDPR: General Data Protection Regulation

HER: Electronic health records

HIS: Hospital information system

HPO: Human phenotype ontology

ICD: International Classification of Diseases

IMI: Innovative Medicines Initiative

IQR: Interquartile range

IT: Information technology

ML: Machine learning

MRI: Magnetic resonance imaging

OMOP: Observational Medical Outcomes Partnership

PDS: Personal Data Spaces

PID: Persistent Unique Identifier

PROM: Premature rupture of membranes

RD: Rare disease

Screen4Care: Screen for Care

Ethics approval and consent to participate

The anonymous nature of the survey did not require ethics committee approval. The study was conducted according to ethical guidelines established by the Declaration of Helsinki. At the beginning of the survey tool, a consent statement for the use of the anonymous data was incorporated, which was agreed to before filling out the survey questions by the respondent.

Consent for publication

“Not applicable”

Availability of data and materials

The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request. Data are located in controlled access data storage at the Institute for Rare Diseases (Plovdiv, Bulgaria).

Competing interests

"The authors declare that they have no competing interests."

Funding

The Screen4Care EU-IMI project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 101034427. The JU receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA.

Disclaimer: This study reflects only the author's view, and the JU is not responsible for any use that may be made of the information it contains.

Authors' contributions

Conceptualization, RS, GS, RRay; methodology, RRay, KK; software, RRay, KK; validation, RS, GI, GS, EM; formal analysis, RRay, KK; investigation, KK, RRay, EM; resources, RRay, KK, EM; data curation, RS, GS, GI; writing—original draft preparation, RRay, KK, EM; writing—review and editing, RS, GI, MV, KE, JN, Y-SM, EG, JZ, JS, RRöt; visualization, KK, RRay; supervision, RS, GS, GI; project administration, GS; funding acquisition, GS. All authors have read and agreed to the published version of the manuscript.

Acknowledgements

Screen4Care Team (https://screen4care.eu/) for the critical reading and insights during the internal consortium article review. We are grateful for valuable feedback during its early conception by Josef Schepers and Julian Nam.

European Parliamentc and the council Of European Union. Decision no 1295/1999/EC of the european parliament and of the council of 29 april 1999 adopting a programme of community action on rare diseases within the framework for action in the field of public health (1999 to 2003) [Internet]. 1999. Available from: https://ec.europa.eu/health/archive/ph_overview/previous_programme/rare_diseases/raredis_wpgm99_en.pdf
Kölker S. Gleich F. Mütze U. Opladen T. Rare disease registries are key to evidence-based personalized medicine: Highlighting the european experience. Front Endocrinol [Internet]. 2022 Mar;13. Available from: https://doi.org/10.3389%2Ffendo.2022.832063
EURORDIS-homepage [Internet]. https://www.eurordis.org; Available from: https://www.eurordis.org
Denis A. Mergaert L. Fostier C. Cleemput I. Simoens S. A comparative study of european rare disease and orphan drug markets. Health Policy [Internet]. 2010 Oct;97(2-3):173–9. Available from: https://doi.org/10.1016%2Fj.healthpol.2010.05.017
Lopes-Júnior LC. Ferraz VEF. Lima RAG. Schuab SIPC. Pessanha RM. Luz GS. et al. Health policies for rare disease patients: A scoping review. IJERPH [Internet]. 2022 Nov;19(22):15174. Available from: https://doi.org/10.3390%2Fijerph192215174
Schaefer J. Lehne M. Schepers J. Prasser F. Thun S. The use of machine learning in rare diseases: A scoping review. Orphanet J Rare Dis [Internet]. 2020 Jun;15(1). Available from: https://doi.org/10.1186%2Fs13023-020-01424-6
Orphanet-the portal for rare diseases and orphan drugs [Internet]. https://www.orpha.net/consor/cgi-bin/index.php; Available from: https://www.orpha.net/consor/cgi-bin/index.php
Thompson R. Johnston L. Taruscio D. Monaco L. Béroud C. Gut IG. et al. RD-connect: An integrated platform connecting databases. registries. biobanks and clinical bioinformatics for rare disease research. J GEN INTERN MED [Internet]. 2014 Jul;29(S3):780–7. Available from: https://doi.org/10.1007%2Fs11606-014-2908-8
Orphanet report series - rare disease registries in europe - [Internet]. https://www.orpha.net/orphacom/cahiers/docs/GB/Registries.pdf; 2021. Available from: https://www.orpha.net/orphacom/cahiers/docs/GB/Registries.pdf
European platform on rare disease registration | EU RD platform [Internet]. https://eu-rd-platform.jrc.ec.europa.eu/_en; Available from: https://eu-rd-platform.jrc.ec.europa.eu/_en
Hageman IC. Rooij IALM van. Blaauw I de. Trajanovska M. King SK. A systematic overview of rare disease patient registries: Challenges in design. quality management. and maintenance. Orphanet J Rare Dis [Internet]. 2023 May;18(1). Available from: https://doi.org/10.1186%2Fs13023-023-02719-0
Faviez C. Chen X. Garcelon N. Neuraz A. Knebelmann B. Salomon R. et al. Diagnosis support systems for rare diseases: A scoping review. Orphanet J Rare Dis [Internet]. 2020 Apr;15(1). Available from: https://doi.org/10.1186%2Fs13023-020-01374-z
Raycheva R, Kostadinov K, Mitova E, Bogoeva N, Iskrov G, Stefanov G, Stefanov R. Challenges in mapping European rare disease databases, relevant for ML-based screening technologies in terms of organizational, FAIR and legal principles: scoping review. Frontiers in Public Health. 2023;11. Rare diseases, common challenges. Nature Genetics (2022) 54:215–215.
The project | Screen4Care [Internet]. screen4care.eu; Available from: https://screen4care.eu/
Wilkinson MD. Dumontier M. Aalbersberg IjJ. Appleton G. Axton M. Baak A. et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data [Internet]. 2016 Mar;3(1). Available from: http://dx.doi.org/10.1038/sdata.2016.18
Visibelli A, Roncaglia B, Spiga O, Santucci A. The impact of artificial intelligence in the odyssey of rare diseases. Biomedicines [Internet]. 2023 Mar;11(3):887. Available from: https://doi.org/10.3390%2Fbiomedicines11030887
Santos Vieira B dos, Bernabé CH, Zhang S, Abaza H, Benis N, Cámara A, et al. Towards FAIRification of sensitive and fragmented rare disease patient data: Challenges and solutions in european reference network registries. Orphanet J Rare Dis [Internet]. 2022 Dec;17(1). Available from: https://doi.org/10.1186%2Fs13023-022-02558-5
Seastedt KP, Schwab P, O’Brien Z, Wakida E, Herrera K, Marcelo PGF, et al. Global healthcare fairness: We should be sharing more, not less, data. Aneja S, editor. PLOS Digital Health [Internet]. 2022 Oct;1(10):e0000102. Available from: https://doi.org/10.1371%2Fjournal.pdig.0000102
Bradley SH, Hemphill S, Markham S, Sivakumar S. Healthcare systems must get fair value for their data. BMJ [Internet]. 2022 Jun;e070876. Available from: https://doi.org/10.1136%2Fbmj-2022-070876
Hughes LD, Tsueng G, DiGiovanna J, Horvath TD, Rasmussen LV, Savidge TC, et al. Addressing barriers in FAIR data practices for biomedical data. Sci Data [Internet]. 2023 Feb;10(1). Available from: https://doi.org/10.1038%2Fs41597-023-01969-8
Kaliyaperumal R, Rosinach NQ, Burger K, Silva Santos LOB da, Hanauer M, Roos M. Enabling FAIR discovery of rare disease digital resources. In: Navigating healthcare through challenging times [Internet]. IOS Press; 2021. Available from: https://doi.org/10.3233%2Fshti210101
Meadows A, Haak L. How persistent identifiers can save scientists time. FEMS Microbiology Letters [Internet]. 2018 Jun;365(15). Available from: https://doi.org/10.1093%2Ffemsle%2Ffny143
Bellgard M, Beroud C, Parkinson K, Harris T, Ayme S, Baynam G, et al. Dispelling myths about rare disease registry system development. Source Code Biol Med [Internet]. 2013 Oct;8(1). Available from: https://doi.org/10.1186%2F1751-0473-8-21
Kodra Y, Weinbach J, Posada-de-la-Paz M, Coi A, Lemonnier S, Enckevort D van, et al. Recommendations for improving the quality of rare disease registries. IJERPH [Internet]. 2018 Aug;15(8):1644. Available from: https://doi.org/10.3390%2Fijerph15081644
Gawron P, Hoksza D, Pinero J, Chilet MP, Esteban M, Fernandez JL, et al. Disease and pathway maps for rare diseases. 2020 Jul; Available from: https://doi.org/10.37044%2Fosf.io%2Fgmbjv
Decherchi S, Pedrini E, Mordenti M, Cavalli A, Sangiorgi L. Opportunities and challenges for machine learning in rare diseases. Front Med [Internet]. 2021 Oct;8. Available from: https://doi.org/10.3389%2Ffmed.2021.747612
Corpas M, Kovalevskaya NV, McMurray A, Nielsen FGG. A FAIR guide for data providers to maximise sharing of human genomic data. Ouellette F, editor. PLoS Computational Biology [Internet]. 2018 Mar;14(3):e1005873. Available from: https://doi.org/10.1371%2Fjournal.pcbi.1005873
Chen MT, Lin TH. A provable and secure patient electronic health record fair exchange scheme for health information systems. Applied Sciences [Internet]. 2021 Mar;11(5):2401. Available from: https://doi.org/10.3390%2Fapp11052401
Winder TL, Tan CA, Klemm S, White H, Westbrook JM, Wang JZ, et al. Clinical utility of multigene analysis in over 25,000 patients with neuromuscular disorders. Neurol Genet [Internet]. 2020 Mar;6(2):e412. Available from: https://doi.org/10.1212%2Fnxg.0000000000000412
Lin N van, Paliouras G, Vroom E, Hoen PAC ’t, Roos M. How patient organizations can drive FAIR data efforts to facilitate research and health care: A report of the virtual second international meeting on duchenne data sharing, march 3, 2021. JND [Internet]. 2021 Nov;8(6):1097–108. Available from: https://doi.org/10.3233%2Fjnd-210721
Schoser B, Laforêt P, Kruijshaar ME, Toscano A, Doorn PA van, Ploeg AT van der. 208th ENMC international workshop: Formation of a european network to develop a european data sharing model and treatment guidelines for pompe disease naarden, the netherlands, 2628 september 2014. Neuromuscular Disorders [Internet]. 2015 Aug;25(8):674–8. Available from: https://doi.org/10.1016%2Fj.nmd.2015.04.006
Bladen CL, Salgado D, Monges S, Foncuberta ME, Kekou K, Kosma K, et al. The TREAT-NMD DMD global database: Analysis of more than 7,000 duchenne muscular dystrophy mutations. Human Mutation [Internet]. 2015 Mar;36(4):395–402. Available from: https://doi.org/10.1002%2Fhumu.22758
Stocker M, Stokmans M, Reisen M van. Agenda setting on FAIR guidelines in the european union and the role of expert committees. Data Intelligence [Internet]. 2022;4(4):724–46. Available from: https://doi.org/10.1162%2Fdint_a_00168
Bahim C, Casorrán-Amilburu C, Dekkers M, Herczog E, Loozen N, Repanas K, et al. The FAIR data maturity model: An approach to harmonise FAIR assessments. Data Science Journal [Internet]. 2020;19. Available from: https://doi.org/10.5334%2Fdsj-2020-041
Cezard T, Cunningham F, Hunt SE, Koylass B, Kumar N, Saunders G, et al. The european variation archive: A FAIR resource of genomic variation for all species. Nucleic Acids Research [Internet]. 2021 Oct;50(D1):D1216–20. Available from: https://doi.org/10.1093%2Fnar%2Fgkab960
Complete guide to GDPR compliance. GDPR.EU. Available at: https://gdpr.eu/ Accessed on 23 June 2023.
Tehdas. Why health is a special case for data governance. TEHDAS Consortium Partners, 2021. Available at: https://tehdas.eu/app/uploads/2021/06/tehdas-why-health-is-a-special-case-for-data-governance-2021-06-23.pdf Accessed on 20 June 2023.
Hansen, Johan, Petra Wilson, Eline Verhoeven, Madelon Kroneman, Mary Kirwan, Robert Verheij, and Evert-Ben van Veen. "Assessment of the EU Member States' rules on health data in the light of GDPR." European Commission (2021).
Martani A, Geneviève LD, Pauli-Magnus C, McLennan S, Elger BS. Regulating the Secondary Use of Data for Research: Arguments Against Genetic Exceptionalism. Front Genet. 2019 Dec 20;10:1254. doi: 10.3389/fgene.2019.01254. PMID: 31956328; PMCID: PMC6951399.
Becker, R., Chokoshvili, D., Comandé, G., Dove, E. S., Hall, A., Mitchell, C., Molnár-Gábor, F., Nicolàs, P., Tervo, S., & Thorogood, A. (2022). Secondary Use of Personal Health Data: When Is It “Further Processing” Under the GDPR, and What Are the Implications for Data Controllers?, European Journal of Health Law, 30(2), 129-157. doi: https://doi.org/10.1163/15718093-bja10094
EMA. The General Data Protection Regulation: Secondary Use of Data for Medicines and Public Health Purposes. Discussion Paper for Medicines Developers, Data Providers, Research-Performing and Research-Supporting Infrastructures. European Medicines Agency, 2020. Available at: https://www.encepp.eu/events/documents/Discussionpaper.pdf. Accessed on 20 June 2023.
Skovgaard LL, Wadmann S, Hoeyer K. A review of attitudes towards the reuse of health data among people in the European Union: The primacy of purpose and the common good. Health policy. 2019 Jun 1;123(6):564-71.
Kolitsi Z, Dipak K, Petra W. DigitalHealthEurope Recommendations on the European Health Data Space: Supporting Responsible Health Data Sharing and Use through Governance, Policy and Practice. Policy and Practice. DigitalHealthEurope. 2021.
Thorogood A. International data sharing and rare disease: The importance of ethics and patient involvement. Rare Diseases. 2020. doi:10.5772/intechopen.91237
McClelland, R., & Harper, C. M. (2022). Information Privacy in Healthcare — The Vital Role of Informed Consent, European Journal of Health Law (published online ahead of print 2022). doi: https://doi.org/10.1163/15718093-bja10097
Nguyen M, Goldblatt J, R I. Model consent clauses for rare disease research. BMC Medical Ethics; 2019. p. 55.
Gainotti S, Turner C, S W. Improving the informed consent process in international collaborative rare disease research: Effective consent for effective research. European Journal of Human Genetics. 2016;24: 1248–1254.
Starkbaum, J., & Felt, U. (2019). Negotiating the reuse of health-data: Research, Big Data, and the European General Data Protection Regulation. Big Data & Society, 6(2). https://doi.org/10.1177/2053951719862594
Knoppers BM. Framework for responsible sharing of genomic and health-related data. The HUGO journal. 2014 Dec;8(1):3.
Rare disease patient registries: Guidelines for establishment, governance, and operation. 2017.
Panagopoulos A, Minssen T, Sideri K, Yu H, Compagnucci MC. Incentivizing the sharing of healthcare data in the AI Era. Computer Law & Security Review, Volume 45, 2022, 10567 https://doi.org/10.1016/j.clsr.2022.105670.
Eysenbach G. Improving the quality of Web surveys: the Checklist for Reporting Results of Internet E-Surveys (CHERRIES). J Med Internet Res. 2004 Sep 29;6(3):e34. doi: 10.2196/jmir.6.3.e34. Erratum in: doi:10.2196/jmir.2042. PMID: 15471760; PMCID: PMC1550605.
Eysenbach G. Correction: Improving the Quality of Web Surveys: the Checklist for Reporting Results of Internet E-Surveys (CHERRIES) J Med Internet Res 2012;14(1):e8
stan Development Team. RStan: The R interface to Stan [Internet]. 2023. Available from: https://mc-stan.org/
Gelman A. Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science [Internet]. 1992 Nov;7(4). Available from: http://dx.doi.org/10.1214/ss/1177011136
R Core Team. R: A language and environment for statistical computing [Internet]. Vienna. Austria: R Foundation for Statistical Computing; 2021. Available from: https://www.R-project.org/

Download PDF

Journal Publication

published 06 Apr, 2024

Read the published version in Orphanet Journal of Rare Diseases →

Editorial decision: Major revision
22 Nov, 2023
Reviewers agreed at journal
18 Oct, 2023
Reviewers invited by journal
18 Oct, 2023
Editor assigned by journal
17 Oct, 2023
First submitted to journal
16 Oct, 2023

You are reading this latest preprint version

Landscape analysis of available European data sources amenable for machine learning and recommendations on usability for rare diseases screening

Status:

Journal Publication

Version 1

Abstract

Background

Results

Conclusions

Figures

Background

Results

Databases’ profile

FAIR-ness of the organization – findings and recommendations

Legal and business characteristics – findings and recommendations

Discussion

Limitations

Conclusions

Materials and methods

Aim

Design of the study and participants’ profile

Settings

Eligibility criteria

On-line questionnaire

Statistical analysis

Abbreviations

Declarations

References

Supplementary Files

Status:

Journal Publication

Version 1