Federated learning has been used to address certain aspects of privacy while developing statistical models on real-world data [1, 2, 3, 4, 5, 6]. The Personal Health Train (PHT) paradigm [7] adopts a privacy-by-design approach to protect research subject identities by exchanging only aggregated statistical information (i.e., model coefficients and cohort summaries) instead of individual patient-level data. Federated learning is an active and rapidly growing topic, driven in part by advances in privacy law, such as the European Union (EU) General Data Protection Regulation (GDPR) [8]. However, much of the current focus in federated learning studies is on algorithms and models, so data preparation, though crucial, is sometimes overshadowed.
Since the data will not be processed by human experts during analysis and model building, it is necessary that the data can be acted upon by autonomous software algorithms. The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles emphasize the attributes needed to make data interoperable not only for human operators but also for machine-based processors [9]. Without prescribing any kind of “master” schema or one specific method, the principles identify fifteen attributes of FAIR data [10]. The idea of FAIR data is that a given community, e.g., researchers and cancer clinicians, can achieve a high degree of interoperability and reusability with each other’s data, but the data itself need not be open. This community might be governed by various means, such as research consortia membership; the community itself defines acceptable clinical use cases for its data, and these use cases may evolve over time.
Real-world clinical data have certain attractive characteristics, such as an unbiased patient population, large volume, heterogeneity, and ubiquity, but these benefits have to be weighed against the risk of confounding. Re-use of multi-center clinical data at large scale presents two main challenges. The first is a lack of syntactic interoperability, due to differences in database organization as well as in the human languages used to record the data. Second, clinical study data are commonly horizontally partitioned: healthcare institutions tend to hold similar sets of data fields, but exclusively for their own human subjects.
With regard to interoperability, researchers might agree, in advance, on strictly controlled definitions, clinical protocols, and data dictionaries in order to collaborate; this is typically done in prospective clinical trials. Approaches such as OHDSI (Observational Health Data Sciences and Informatics) [11] and SPHN (Swiss Personalized Health Network) [12] aim to overcome interoperability issues by insisting on a universal “master” schema, data dictionary, and specific data codes, such that everyone must re-cast their data into the prespecified schema as a condition of cooperation.
A potential alternative would be an “open world” paradigm focused on semantic interoperability, in which routine clinical data can be queried and retrieved by independent secondary users [13] without advance knowledge of the database structure or the native coding schema of the data. Ideally, standardized structured data would be created at the source, but this is not always feasible for historical clinical data, and standards may diverge between institutions over time. Therefore, annotating the data with semantic meaning after collection may be a more flexible option than converting the data at the source.
With regard to horizontal data partitioning, great care must be exercised when transferring patient clinical data into a centralized repository, due to its highly sensitive nature. This is an issue where federated learning might offer a distinct advantage. Some data-owning institutions may prefer to keep their data strictly in-house, or on their own trusted servers, rather than send data outside the institution.
The Linked Data [14] standard elegantly captures some of the needs of FAIR by assigning machine-readable uniform resource identifiers (URIs) to data elements, as well as capturing the relationships between data elements. A given URI (and its synonyms) that appears in multiple datasets creates a persistent linkage between them. This attribute of linked data can be exploited to integrate disparate pools of data, even if they comprise different domains, e.g., clinical examinations versus image-based biomarkers extracted from radiological scans. Linked data that can be read by machines over Hypertext Transfer Protocol Secure (HTTPS) makes up a worldwide “Semantic Web” of FAIR data. Semantic web standards define a set of essential tools, such as the Resource Description Framework (RDF) [15] and the SPARQL Protocol and RDF Query Language (SPARQL) [16], for storing and querying data, respectively.
In this work, a clinically meaningful use case for federated learning on multi-institutional datasets is demonstrated. Importantly, this work relies on efficiently converting structured data to RDF and then applying a local annotation to make the data more FAIR. An interactive dashboard, based on a common SPARQL query, was used to explore the data and visualize contents across five unique datasets for head-and-neck cancer, without transferring individual patient data across a network. Prognostic models for survival were developed on clinical features, image-based features, and a combination of both, and then validated through an internal-external validation procedure. The specific challenges addressed in this study were: (1) mapping local data contents to a universally accessible semantic ontology without changing any of the original data, (2) allowing multiple simultaneous annotations to co-exist, e.g., using alternative terminologies, (3) visualizing multi-institutional data in a cohesive manner, and (4) developing robust statistical models that use different sources of data.