ML and AI are increasingly being used in healthcare and genetic research to improve patient outcomes and advance scientific knowledge. Further, their application can be beneficial for early diagnosis, improving prognosis, and treatment decisions in the field of RDs [17]. However, such technology relies heavily on the availability of high-quality data, which in the RD domain is often scarce and fragmented [18]. Moreover, the lack of standardized data sharing practices and interoperability standards across different domains can hinder the progress of these innovations [19].
Assessing adherence to the FAIR principles, which were developed to address these challenges [16], is intended to yield specific recommendations and strengthen the process of AI and ML implementation to help RD patients. In the present study, overall adherence to the FAIR principles varied significantly among the databases. A total of 25 respondents (17.9%) reported full compliance with all four FAIR components, indicating that a significant portion of databases have successfully implemented these principles. However, more than half of the respondents either could not provide an answer (n = 56, 40%) or indicated that the FAIR principles were not being applied on the site (n = 23, 16.4%). The latter, combined with the relatively low response rate (42.7%) for this section, suggests a lack of awareness and limited implementation of FAIR principles among the surveyed database stakeholders. This is consistent with other studies that have found low levels of FAIR compliance [20]. Therefore, the results underline the need for targeted interventions to promote FAIR compliance and standardize data sharing practices across different domains. Such efforts have proven efficient and impactful in the RD and genetic research domains [21, 22].
In this research, a deeper exploration of each FAIR component aimed to identify specific strengths and weaknesses in the surveyed database management and sharing practices. A significant portion of respondents (51.7%) indicated that their data repository utilized a PID, a critical factor that enhances the discoverability and traceability of data [23]. Furthermore, 14 respondents (23.3%) reported multiple data releases with attached versions, which improves data tracking and version control [24]. These findings indicate that efforts have been made to ensure comprehensive access to datasets, fostering openness and transparency, which has proven to be an effective strategy for enhancing quality in RD registries [25]. However, only 45.9% of the respondents ensured accessibility either via a web browser or API, enabling data retrieval through standard protocols. This suggests that there is potential for enhancement in this aspect, specifically in clinical databases, where accessibility is comparatively lower. This can be a barrier to the development of database workflows needed for ML and AI technologies [26]. Moreover, only 20 databases (33.9%) reported implementing data licensing, enabling reusability. This result emphasizes the considerable limitations on collaborative research and knowledge dissemination [27].
Database characteristics such as database type, diseases included in the data, and geographical scope of the database were also investigated as potential factors influencing FAIR adherence. Databases containing information on neuromuscular disorders and those with European scope demonstrated the highest overall FAIR adherence. Notably, genetic databases showed the highest proportion of positive responses to overall FAIR adherence, suggesting that these databases have made significant strides in adopting FAIR practices. This may be attributed to the emphasis on data sharing and standardization within the genetics research community [28]. In contrast, low FAIR adherence was found for EHR and HIS databases. These clinical databases face unique challenges related to data privacy, security, and interoperability, which hinder their ability to fully implement the FAIR principles [29].
The higher FAIR compliance in databases focused on neuromuscular disorders could be attributed to the relatively specialized nature of these databases and to patient advocacy, which have facilitated more focused and standardized data management practices [30, 31]. Additionally, recent initiatives specifically designed for data on neuromuscular disorders might have contributed to the higher FAIR adherence in these databases [32, 33]. The influence of EU policies, efforts, and funding assistance that promote data sharing and FAIR implementation may be related to the observed FAIR adherence in databases with a European focus [34, 35]. European databases might also benefit from standardized data sharing frameworks and infrastructure, enabling smoother data exchange and collaboration across European countries [36].
By following the suggested recommendations, researchers could identify databases that adhere to FAIR principles, providing high-quality, easily accessible, and standardized data. These qualities are vital for effectively implementing ML and AI technologies in RD research. Utilizing such databases will lead to more precise and meaningful results, ultimately contributing to improved patient care and the advancement of scientific knowledge in this complex area of study.
- Standardization: Databases with standardized data sharing practices and data formats, ensuring consistency and interoperability across different databases, should be used as a primary data source for ML and AI applications.
- When selecting databases for training datasets, it is crucial to prioritize those that utilize persistent identifiers (PIDs). PIDs enhance the discoverability and traceability of data repositories, ensuring consistent input for ML and AI algorithms.
- To facilitate integration with ML and AI technologies, databases should ensure data accessibility through web browsers or APIs. This allows for the retrieval and analysis of data.
- Databases that offer multiple data releases with attached versions should be preferred, as data versioning enables improved data tracking and version control, which are vital for accurate ML and AI model training.
- For ML and AI applications, it may be beneficial to consider specialized databases focused on RD domains such as neuromuscular disorders. Such databases often provide standardized data suitable for these applications. Researchers should actively pursue research projects to discover databases that follow standardized practices for data sharing and adhere to common protocols for data exchange. This will greatly facilitate the integration of ML and AI.
- Databases focused on specific RD domains, such as those for neuromuscular disorders, should be considered, as they may offer more comprehensive and standardized data suitable for ML and AI applications.
- Collaborative research initiatives should be sought by researchers to identify databases with standardized data sharing practices and adherence to common data exchange protocols, facilitating ML and AI integration.
- A thorough assessment of the database’s documentation should be conducted to ensure transparency and comprehensive information about data quality, format, and metadata, which are crucial for ML and AI model development.
- European-scope databases, with their emphasis on data sharing and standardized practices, should be considered, as they may provide robust and FAIR data suitable for ML and AI research, particularly in the RD domain.
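The selection criteria listed above could be applied as a simple screening checklist. The following Python sketch is illustrative only and is not the instrument used in this study: the `DatabaseProfile` record, its field names, the example entries, and the cut-off of four criteria are all hypothetical assumptions chosen to mirror the recommendations (PIDs, web or API access, versioned releases, licensing, and documentation).

```python
from dataclasses import dataclass


@dataclass
class DatabaseProfile:
    """Hypothetical metadata record for a candidate database (assumed fields)."""
    name: str
    has_pid: bool             # a persistent identifier is assigned
    web_or_api_access: bool   # data retrievable via web browser or API
    versioned_releases: bool  # multiple data releases with attached versions
    has_license: bool         # an explicit data licence enables reuse
    documented: bool          # documentation covers quality, format, metadata


# Checklist criteria, taken from the recommendations in the text.
CRITERIA = (
    "has_pid",
    "web_or_api_access",
    "versioned_releases",
    "has_license",
    "documented",
)


def criteria_met(db: DatabaseProfile) -> int:
    """Count how many checklist criteria the database satisfies."""
    return sum(bool(getattr(db, field)) for field in CRITERIA)


def shortlist(dbs, min_met=4):
    """Return databases meeting at least `min_met` criteria, best first.

    The default cut-off of 4 is an arbitrary assumption for illustration.
    """
    ranked = sorted(dbs, key=criteria_met, reverse=True)
    return [db for db in ranked if criteria_met(db) >= min_met]


# Invented example entries, not survey data.
candidates = [
    DatabaseProfile("registry_a", True, True, True, True, True),
    DatabaseProfile("registry_b", True, True, False, True, True),
    DatabaseProfile("clinical_c", False, False, False, False, True),
]
selected = [db.name for db in shortlist(candidates)]
print(selected)
```

Counting criteria rather than weighting them keeps the checklist transparent; a research team may reasonably prefer weights, for example to treat licensing as mandatory for reuse.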
The GDPR provides enhanced protection for health care information in the EU, as reflected in the member countries' implementing laws. The GDPR, which entered into force on 24 May 2016 and is applicable from 25 May 2018, creates a harmonized set of rules applicable to all personal data processing taking place in the EU [37]. National data protection authorities are responsible for monitoring and enforcing the application of the GDPR and other national data protection legislation that may be applicable in their territories. In our study, 75% of the respondents (with a proportion over 80% for EHR and EMR) declared that there is at least one national data security policy regarding the technical standards to be used to ensure health data for primary use are processed and stored securely, and 37.5% of them pointed out the existence of several. The results are similar to those reported in another European study on the topic [38]. Moreover, a regional health authority is traditionally primarily responsible for the containment of individual cases. Thus, it will depend on Member State legislation when in that chain data will be anonymized. Clarification is, however, needed on the conditions under which the further processing of data in order to render them anonymous for the purpose of scientific research would be legitimate [39].
In our study, 76.8% of the respondents confirmed the presence of legislative provisions concerning the primary and secondary use of data. It can be particularly challenging to strike the correct balance between enabling good data use and protecting privacy when it comes to secondary use. Secondary use involves processing data for purposes other than those originally intended when information is gathered, and it may also involve data processors other than the primary data collectors, in contrast to primary use, where data are collected and then used for a specific purpose [38, 40, 41, 42]. In contrast with the study of Skovgaard et al., published in 2019, our results demonstrate that 83.9% of the respondents declare that patients are aware that their information may be used for further research, monitoring performance, service planning, audit, and quality assurance purposes etc. [43]. Moreover, awareness is of key importance for patients involved in RD research, and it could be argued that this becomes even more evident in data sharing, with the onus on researchers, institutions, and collaborations to recognize this as a responsibility. Rare disease patients’ perspectives are needed to contribute to the debate on the management, sharing and protection of data, in order to reconcile tensions within the research process with what matters most to patients [44]. There is also a risk of too much privacy protection in the RD context. Formal legal safeguards and strict transparency requirements leave organizations with less flexibility to share samples and data about RD patients, especially internationally, even where researchers seek explicit patient consent and/or patient involvement in data sharing governance [45].
The informed consent of the citizen is essential for data exchange [46]. The voluntary expression of consent is fundamental to ethical research practices. While patients with RDs often expect that data are shared for scientific advances, they are also concerned about being identified, a risk enhanced in the RD context [47]. In RD research, the consent processes have become increasingly complex, considering the current landscape of technological and genomic advances, together with the extensive collection and dissemination of data worldwide. This has been confirmed by the multiple components included in the consent process and authorization mechanisms for health records exchange in the various databases examined by us. In our study, the most commonly used consent models applied for sharing anonymized patient health information in network electronic exchange for research purposes are opt-in (16.2%) and opt-in with restrictions (10.8%). An additional challenge is the different types of collected consent, including consent for every use of data (30.4%), consent for broader categories of research (27.7%) and consent for all research (17.9%). The need for improving informed consent processes in international collaborative RD research is broadly discussed: effective consent is needed in order to conduct effective research. To achieve this aim, the procedure shall address possible ethical and legal hurdles that could hamper research in the future, including opt-in, re-consent and opt-out strategies [48]. We consider this especially relevant while examining informed consent for RD research, in particular when there are re-consenting requirements for data used in ways that fall outside the original purpose of the respective registry or other database research, which we found to be mandatory for 70.5% of the databases we collected responses from.
Although the GDPR harmonizes the regulations governing the processing of sensitive data, such as individual health information, Member States still have the option to establish legal grounds for processing health information. Furthermore, Article 9(4) clearly states that Member States are free to maintain or enact new restrictions, including requirements, in relation to the processing of genetic, biometric, or health data [37]. This could indicate that the GDPR would not be administered uniformly across all Member States in the domain of health. It may also imply that there may be disparities in how the GDPR is implemented within a single Member State, particularly where local law is in effect [39]. The findings of our survey show that 48.4% of the participants collect genetic data, and this is more likely to occur following GDPR's enforcement in 2018.
The responsible sharing of genetic and other health-related data shall be a foundational principle in data collection program management, including compliance with the obligations and norms set by international and national law and policies [40, 49]. According to the Framework for Responsible Sharing of Genomic and Health-Related Data [50], several core elements of responsible data sharing shall be respected, including transparency, data quality and security, privacy, data protection and confidentiality. The terms of data usage are a main quality element of a registry, and by prioritizing ethical and legal standards, high-quality registries can provide access to data on a platform that ensures data security and patient confidentiality [51]. Only a small proportion of the participants (16.6%) declared willingness to share their database as a contribution to the goals of the Screen4Care EU project.
The EU is preparing governance frameworks that permit access to data in the near future. The aim is to increase trust in data intermediaries and boost data sharing inside the EU and between sectors in order to promote data availability and assist ethical and sustainable research and development processes [52].
The following recommendations could be given to facilitate the process of obtaining health information from various data sources for the development of ML algorithms for the screening and early detection of RDs:
- Good practices, such as transparent data use and providing patients with information on how their data might be used for future research, performance monitoring, service planning, audit, and quality assurance purposes, among other things, should be adopted.
- Precise legal grounds should be established for the data processing, with special consideration given to the use of informed consent.
- Re-consenting requirements should be considered when selecting particular databases.
- A solid understanding of data protection law should be obtained to guarantee that IT security standards are strictly followed.
- If the results of data processing may benefit the identification of RD patients, pseudonymization of the data should be applied.
- Researchers should ensure that data are collected in a manner that permits their use across systems without compromising data integrity and that the data are readily available where needed.
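The pseudonymization recommendation above can be illustrated with a keyed hash. The Python sketch below is one possible approach under stated assumptions, not a method prescribed or evaluated in this study: the function names, the `patient_id` field, and the example key are hypothetical, and HMAC-SHA256 from the Python standard library is used only as an example of a technique in which the same identifier always maps to the same pseudonym while the mapping cannot be recomputed without the secret key.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The secret key must be stored separately from the pseudonymized data;
    whoever holds it can recompute pseudonyms for known identifiers.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def pseudonymize_records(records, secret_key, id_field="patient_id"):
    """Return copies of the records with the identifier field replaced.

    `id_field` is an assumed field name; the input records are not modified.
    """
    result = []
    for record in records:
        copy = dict(record)
        copy[id_field] = pseudonymize(str(record[id_field]), secret_key)
        result.append(copy)
    return result


# Invented example data and key, for illustration only.
example_key = b"replace-with-a-securely-stored-key"
example_records = [
    {"patient_id": "P001", "diagnosis_code": "X"},
    {"patient_id": "P002", "diagnosis_code": "Y"},
    {"patient_id": "P001", "diagnosis_code": "Z"},
]
pseudonymized = pseudonymize_records(example_records, example_key)
```

Because the pseudonym is stable for a given key, records belonging to the same patient can still be linked across extracts. Pseudonymized data remain personal data under the GDPR, and in the RD context other attributes such as a rare diagnosis may still allow re-identification, so this step complements rather than replaces the legal and consent safeguards listed above.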
Limitations
The outcomes of our research should be considered in terms of the limitations of our study design and sampling methods. This was a cross-sectional questionnaire survey that gives an illustration of the current context of health-related databases to respond to the need for rapid identification of RDs using ML technology. Thus, no changes in this environment could be examined over time. This is critical when discussing breakthrough fields such as ML, whose exponential development has already resulted in new EU legislation and initiatives for health-specific data sharing. The geographical scope of the study comprised EU and EEA (European Economic Area)-based health-related datasets, which may limit the generalizability of the results. Although the convenience sampling method was a relevant choice for the narrow and well-defined pool of respondents, its combination with the heterogeneity of the questionnaire might result in nonresponse bias. Given this disadvantage, the questionnaire design included an option for respondents to refer to experts regarding FAIR principles and legal and business information. To limit nonresponse bias, the questionnaire was sent out to 3032 individuals, 2212 of whom were ERN specialists, with the expectation that the most knowledgeable would fill out the survey and answer as many specific questions about the database as possible. Although many definitions and clarifications about organizational, FAIR and legal domains were provided, and the Screen4Care consortium aligned the questionnaire content and design, the survey concepts were complex and heterogeneous; thus, some respondents may not have fully understood the information included. In addition, selection bias cannot be ruled out, as respondents may have been more informed about ML than nonparticipants.