Federated learning has been used to address certain aspects of privacy while developing statistical models on real-world data [1, 2, 3, 4, 5, 6]. The Personal Health Train (PHT) paradigm [7] adopts a privacy-by-design approach to protect research subject identities by exchanging only aggregated statistical information (i.e., model coefficients and cohort summaries) instead of individual patient-level data. Federated learning is an active and rapidly growing topic, driven in part by advances in privacy law, such as the European Union (EU) General Data Protection Regulation (GDPR) [8]. However, much of the current focus in federated learning studies is on algorithms and models, so data preparation, though crucial, is sometimes overshadowed.
Since the data will not be processed by human experts during analysis and model building, it is necessary that the data can be acted upon by autonomous software algorithms. The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles emphasize the attributes needed to make data interoperable not only for human operators but also for machine-based processors [9]. Without prescribing any kind of “master” schema or one specific method, the principles identify fifteen attributes of FAIR data [10]. The idea of FAIR data is that a given community, e.g., researchers and cancer clinicians, can achieve a high degree of interoperability and reusability with each other’s data, but the data itself need not be open. This community might be governed by various means, such as research consortia membership; the community itself defines acceptable clinical use cases for its data, and these use cases may evolve over time.
Real-world clinical data have certain attractive characteristics, such as an unbiased patient population, large volume, heterogeneity, and ubiquity, but these benefits have to be weighed against the risk of confounding. Re-use of multi-center clinical data at large scale presents two main challenges. The first is a lack of syntactic interoperability, due to differences in database organization as well as in the human languages used to record the data. Second, clinical study data are commonly horizontally partitioned: healthcare institutions tend to hold similar sets of data fields, but exclusively for their own human subjects.
With regard to interoperability, researchers might agree, in advance, on strictly controlled definitions, clinical protocols, and data dictionaries in order to collaborate; this is typically done in prospective clinical trials. Approaches such as OHDSI (Observational Health Data Sciences and Informatics) [11] and SPHN (Swiss Personalized Health Network) [12] aim to overcome interoperability issues by insisting on a universal “master” schema, data dictionary, and specific data codes, such that everyone must re-cast their data into the prespecified schema as a condition of cooperation.
A potential alternative would be an “open world” paradigm focused on semantic interoperability, in which routine clinical data can be queried and retrieved by independent secondary users [13] without advance knowledge of the database structure or the native coding schema of the data. Ideally, standardized structured data would be created at the source, but this is not always feasible for historical clinical data, and standards may diverge between institutions over time. Therefore, annotating the data with semantic meaning after collection may be a more flexible option than converting the data at the source.
With regard to horizontal data partitioning, great care must be exercised when transferring patient clinical data into a centralized repository, due to its highly sensitive nature. This is an issue where federated learning might offer a distinct advantage. Some data-owning institutions may prefer to keep their data strictly in-house, or on their own trusted servers, rather than send data outside the institution.
The Linked Data [14] standard elegantly captures some of the needs of FAIR by assigning machine-readable uniform resource identifiers (URIs) to data elements, as well as capturing the relationships between data elements. A given URI (and its synonyms) that appears in multiple datasets creates a persistent linkage between them. This attribute of linked data can be exploited to integrate disparate pools of data, even if they comprise different domains, e.g., clinical examinations versus image-based biomarkers extracted from radiological scans. Linked data that can be read by machines over Hypertext Transfer Protocol Secure (HTTPS) makes up a worldwide “Semantic Web” of FAIR data. Semantic web standards define a set of essential tools, such as the Resource Description Framework (RDF) [15] and the SPARQL Protocol and RDF Query Language (SPARQL) [16], for storing and querying data, respectively.
In this work, a clinically meaningful use case for federated learning on multi-institutional datasets is demonstrated. Importantly, this work relies on efficiently converting structured data to RDF and then applying a local annotation to make the data more FAIR. An interactive dashboard, based on a common SPARQL query, was used to explore the data and visualize contents across five unique datasets for head-and-neck cancer, without transferring individual patient data across a network. Prognostic models for survival were developed on clinical features, image-based features, and a combination of both, and then validated through an internal-external validation procedure. The specific challenges addressed in this study were: (1) mapping local data contents to a universally accessible semantic ontology without changing any of the original data, (2) allowing multiple simultaneous annotations to co-exist, e.g., using alternative terminologies, (3) visualizing multi-institutional data in a cohesive manner, and (4) developing robust statistical models that use different sources of data.