Background: Data discovery, the ability to find datasets relevant to an analysis, accelerates science, increases scientific opportunity, and improves scientific rigour. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation.
Methods: A core set of 124 variables, identified as being of broad interest to neurodegeneration research, was harmonised using the C-Surv data model to support data discovery and feasibility analyses. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts.
Results: Core variables covered 15 of 18 data themes used by the C-Surv data model. Correspondence between the harmonised data schema and cohort-specific data models was close for 61 variables that were directly mapped, 48 variables that were transformed to a binary (yes/no) format, and 6 variables that were standardised to the Z distribution. Harmonisation was more challenging for the remaining 9 socioeconomic and lifestyle variables, which required value judgements that prioritised inclusivity over informativeness.
Conclusions: Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to increase the core variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.
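
The three rule types reported in the Results (direct mapping, reduction to a binary yes/no format, and standardisation to the Z distribution) can be illustrated with a minimal Python sketch. This is not the C-Surv implementation: the function names, the example codebook, the smoking categories and the test scores are all hypothetical, and the use of the within-cohort population standard deviation is an assumption made only for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Optional, Sequence, Set


def direct_map(value: Hashable, codebook: Dict[Hashable, str]) -> Optional[str]:
    """Translate a cohort-specific code into the harmonised code.

    Codes absent from the (hypothetical) codebook are returned as None.
    """
    return codebook.get(value)


def to_binary(value: Optional[str], yes_values: Set[str]) -> Optional[int]:
    """Collapse a categorical value to 1 (yes) or 0 (no); missing stays None."""
    if value is None:
        return None
    return 1 if value in yes_values else 0


def z_standardise(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Convert values to z-scores within one cohort, leaving missing values as None.

    Assumption: the population standard deviation of the observed values is used.
    """
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed)
    if sigma == 0:
        raise ValueError("cannot standardise a variable with zero variance")
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical cohort-specific sex coding mapped onto harmonised labels.
sex_codebook = {1: "male", 2: "female"}
harmonised_sex = [direct_map(v, sex_codebook) for v in [1, 2, 9]]

# Hypothetical three-level smoking status reduced to "ever smoked" (yes/no).
ever_smoked = [to_binary(v, {"former", "current"})
               for v in ["never", "former", "current", None]]

# Hypothetical cognitive test scores standardised within the cohort.
z_scores = z_standardise([20.0, 25.0, 30.0, None])
```

The binary rule shows the trade-off named in the abstract: collapsing "former" and "current" into a single "yes" lets cohorts with different category schemes be compared, at the cost of some informativeness.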