Evaluating the harmonisation potential of diverse cohort datasets

doi:10.21203/rs.3.rs-1668271/v2

Background: Data discovery, the ability to find datasets relevant to an analysis, accelerates science, increases scientific opportunity, and improves scientific rigour. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation.

Methods: A core set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model for the purpose of data discovery and conducting feasibility analyses. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts.

Results: Core variables covered 15 of 18 data themes used by the C-Surv data model. Correspondence between the harmonised data schema and cohort-specific data models was close for 61 variables that were directly mapped, 48 variables that were transformed to a binary (yes/no) format, and 6 variables that were standardised to the Z distribution. Harmonisation was more challenging for the remaining 9 socioeconomic and lifestyle variables, which required value judgements prioritising inclusivity over informativeness.

Conclusions: Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to increase the core variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.

Bioinformatics

Epidemiology

Ontology

Cohort

Harmonisation

Epidemiology

Data Model

Data discovery

The importance of consistent approaches to managing and discovering data grows as data assets increase in volume and complexity.
For population cohort data the problem is particularly acute due to the diversity and serial nature of measurement.
In data from four diverse population cohort studies, we found that simple harmonisation rules based on widely used conventions can be meaningfully applied with marginal loss of granularity.
This allows the availability of data at individual variable level, and in combination, to be assessed rapidly across datasets.
This information can then be used to assess the feasibility and likely statistical power available to a proposed analysis, and to incentivise the development of more powerful data discovery tools.

Data discovery, the ability to find datasets relevant to an analysis, is a critical component of a productive science environment. Not only does it help researchers discover critical data elements for an analysis, it is also a key infrastructure component underlying the FAIR principles of data management (ref).

For research cohorts and trials, the complexity and variety of longitudinal datasets present particular challenges, as data structures and labelling conventions are highly variable and typically under-documented. This generates high transaction costs for data discovery. Efficient data discovery increases scientific opportunity, improves rigour, and accelerates activity. For neurodegeneration, the rise of data platforms that support multiple datasets, such as Dementias Platform UK (DPUK) (1) and other global data initiatives (2-11), provides impetus for the development of more efficient data discovery solutions.

A tool for increasing the efficiency of data discovery is data harmonisation. The goal of data harmonisation is to achieve inferential equivalence between two or more variables. This is in contrast to data standardisation, where data are organised according to a standard data model (12). Underlying the inferential equivalence of two or more variables is an implicit latent construct that the variables are considered to represent. For example, different reaction time tasks may be considered to represent the latent construct of cognitive processing speed. However, the utility of a latent construct is purpose-specific. For example, a latent construct of cognitive processing speed may be sufficiently precise for one hypothesis test, but not for another. Harmonisation is also not an exact science. Once an appropriate latent construct has been identified, rules for translating variables into the latent construct will vary across datasets and among scientists. For example, some reaction time distributions may be log-normal whereas others may be highly skewed. These differences require different solutions.

For the purpose of data discovery, in contrast to a formal data analysis, the extent to which data can be meaningfully harmonised has implications for the development of data discovery tools as well as the utility of data platforms. High levels of harmonisation enable the feasibility of a proposed analysis to be evaluated prior to making a formal data access request. They also enable feasibility analyses to be conducted before a formal hypothesis test. Both reduce the risk (and administrative burden) of accessing and processing datasets that are uninformative for a specific scientific question. This applies particularly to combined analyses of multiple datasets.
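As an illustration of the kind of feasibility check that harmonised metadata could support, the following is a minimal sketch rather than a description of any existing tool. The catalogue structure, the cohort and variable names, and all counts are invented for the example.

```python
# Hypothetical catalogue: cohort -> {harmonised variable: participants with a value}.
# Every name and number here is made up for illustration.
CATALOGUE = {
    "cohort_a": {"age": 5000, "smoking_status": 4800, "mmse_score": 1200},
    "cohort_b": {"age": 9000, "smoking_status": 8700},
    "cohort_c": {"age": 2300, "smoking_status": 2250, "mmse_score": 2100},
}


def feasibility(catalogue, required):
    """Return {cohort: upper bound on complete cases} for cohorts that
    hold every variable required by a proposed analysis."""
    result = {}
    for cohort, counts in catalogue.items():
        if all(variable in counts for variable in required):
            # The complete-case sample cannot exceed the least-populated variable.
            result[cohort] = min(counts[variable] for variable in required)
    return result


eligible = feasibility(CATALOGUE, ["age", "smoking_status", "mmse_score"])
print(eligible)                # cohorts holding all three variables
print(sum(eligible.values()))  # crude upper bound on the pooled sample
```

The per-cohort figure is only an upper bound, because participants with a value for one variable need not have values for the others.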

To evaluate the harmonisation potential of population cohort data in relation to dementia, harmonisation for the purpose of data discovery was attempted for a core set of neurodegeneration-related variables and applied to four diverse population cohort datasets.

Variable selection

An initial set of core variables, relevant to neurodegeneration, was identified using the DPUK Cohort Explorer data discovery tool (13). Cohort Explorer is an interactive search tool developed to assist researchers to identify informative cohorts prior to making a data access request. It describes 32 variables from 11 cohort datasets (n=123554) frequently requested in dementia related data access requests. These variables cover demographic, health status, lifestyle and cognitive performance data, and include metadata on biosample availability. Variables were selected to reflect the frequency of being requested in DPUK data access proposals. The initial variable set was expanded by iterative discussion amongst the ADDI Harmonisation Action group to include a range of modifiable and non-modifiable risk factors across data modalities. From these start-points a list of 124 core variables was identified.

Schema development

Variables were curated using C-Surv as the common data model (12). This provides a standard data structure across datasets. The harmonisation schema was optimised to be inclusive of datasets by using relatively simple harmonisation rules and widely used value labelling conventions wherever possible. Three strategies for harmonisation, as described in the Maelstrom harmonisation guidelines (14, 15), were used.

Simple calibration, using direct mapping between the source variable and the harmonised variable, was adopted for widely used standard metrics such as weight or height. Direct mapping, including cut-off points, was used for validated clinical scales. The Gregorian calendar was used for dates and conventional units were used for age (years), durations (hours), concentrations (mg/ml), volumes (mm³), etc.
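A minimal sketch of what simple calibration might look like in practice is given below. The function names, the source date format and the set of source units are assumptions made for illustration; only the target conventions (yyyy-mm-dd dates, height in cm) come from the harmonisation rules.

```python
from datetime import datetime


def to_iso_date(value, source_format):
    """Parse a source date string and return it as yyyy-mm-dd.
    Any time-of-day component in the source is dropped."""
    return datetime.strptime(value, source_format).date().isoformat()


def height_to_cm(value, unit):
    """Convert a height measurement to centimetres (assumed source units)."""
    factors = {"cm": 1.0, "m": 100.0, "in": 2.54}
    return round(value * factors[unit], 1)


print(to_iso_date("07/03/2012 14:30", "%d/%m/%Y %H:%M"))
print(height_to_cm(1.72, "m"))
```

An unrecognised unit raises a KeyError, which is deliberate in this sketch: a silent default would hide a source convention that has no rule yet.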

Algorithmic transformation was used for non-clinical questionnaire responses including lifestyle. The algorithm was selected to be as inclusive as possible by using a relatively simple transformation and was developed iteratively as it was applied to each dataset. Gender was transformed as male, female; smoking as current, past, and never; and ethnicity as white, black, Asian, mixed, other. Cohabitation was coded as single, married/cohabiting, divorced/separated, widowed, whilst education was considered as educational experience and transformed into junior or less, secondary, degree or equivalent, postgraduate or equivalent. For type of accommodation a straightforward transformation was house/bungalow, apartment, sheltered/residential, other.
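The categorical transformations described above can be sketched as lookup tables. In this hedged example the numeric codes follow Table 1 and the raw labels are taken from Table 3, but the lookup structure, the function name and the handling of unmatched labels are assumptions, not the procedure actually used.

```python
# Harmonised codes as listed in Table 1; raw labels as they appear in Table 3.
SMOKING_CODES = {"never": 0, "ex": 1, "past": 1, "current": 2}

COHABITATION_CODES = {
    "single": 1,
    "married": 2, "remarried": 2, "cohabiting": 2, "married/cohabiting": 2,
    "divorced": 3, "separated": 3, "divorced/separated": 3,
    "widowed": 4,
}


def harmonise(raw_label, codebook):
    """Return the harmonised code for a raw label, or None when the
    label is missing or no rule covers it."""
    if raw_label is None:
        return None
    return codebook.get(raw_label.strip().lower())


print(harmonise(" Remarried ", COHABITATION_CODES))
print(harmonise("Ex", SMOKING_CODES))
```

Returning None for unmatched labels keeps gaps visible, which suits a rule set that is developed iteratively as each dataset is added.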

Non-clinical cognitive performance scores were standardised into z-scores by default, with an option for refining this rule on a scale-by-scale basis according to the variable distribution. More sophisticated methods such as latent variable modelling or multiple imputation were not used.
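For reference, a z-score is z = (x - mean) / SD. The sketch below assumes standardisation within a single cohort using the sample standard deviation; the text does not state whether scores were standardised within cohorts or against a pooled distribution, so this choice is an assumption.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Standardise scores to mean 0 and SD 1.
    Missing values (None) are ignored when estimating the mean and SD
    and remain missing in the output."""
    observed = [s for s in scores if s is not None]
    centre, spread = mean(observed), stdev(observed)
    return [None if s is None else (s - centre) / spread for s in scores]


print(z_scores([2, 4, None, 6]))
```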

Schema evaluation

The utility of the variable selection and initial harmonisation rules was tested using four DPUK collaborating cohorts. These were selected on the basis of having diverse primary scientific objectives, providing longitudinal multimodal data, and being frequently requested by DPUK users. The cohorts were the Airwave Health Monitoring Study (Airwave); an occupational cohort (16), the English Longitudinal Study of Ageing (ELSA); a social science focussed study (17), Generation Scotland; a genetics cohort (18), and Memento; a neurodegeneration cohort (19). The coverage of each cohort and overlap of variables across cohorts was assessed, along with the utility of the harmonisation rules.

Core variables

The core variable list comprised a range of modifiable and non-modifiable risk factors and metadata (Table 1). Of the 124 core variables, most (n=103) were present in the baseline data. However, for ELSA data, 18 variables were collected in subsequent waves. For Memento, two outcomes were collected through linkage to health records. For both ELSA and Memento, genetics data are available independently of study wave. The core variables covered 15 of the 18 data themes represented by the C-Surv data model. Themes not represented were linkage data (theme 14), healthcare utilisation data (theme 15), and device data (theme 18).

Representation and distribution

Most variables (n=120; 97%) were found in one or more cohorts. Memento, being primarily designed to investigate neurodegeneration, included most variables (n=92). The other cohorts, designed to address a broader range of questions, had fewer neurodegeneration-focused variables (Table 2). A full description of the representation of variables across cohorts is provided in the supplementary materials (Table S1). Of the 4 variables that were not found in any cohort, one was related to air pollution (pm_2.5concentration) and another was loneliness assessment. That MCI status was not available in any cohort reflects the difficulty of capturing these data in a population setting. That ADAS-Cog score was not available reflects the use of this scale primarily in trials rather than in cohorts. The distribution of variables across cohorts also varied, with 34 variables being common to all cohorts, 10 in three cohorts, 30 in two cohorts and 46 in one cohort (Figure 1). This shows the diversity of cohort data and reflects the range of scientific purpose underlying these datasets. For example, that ELSA and Memento include 13 and 26 unique variables respectively reflects the distinctive scientific foci of these studies: ELSA being focussed on social factors underlying ageing, and Memento focussed more specifically on neurodegeneration.
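The tabulation of variables by the number of cohorts holding them (34 + 10 + 30 + 46 = 120) can be reproduced from a simple availability listing. The sketch below uses an invented toy listing with hypothetical cohort and variable names; it is not the study data.

```python
from collections import Counter

# Toy availability listing: cohort -> set of harmonised variables held.
AVAILABILITY = {
    "cohort_a": {"age", "gender", "mmse_score"},
    "cohort_b": {"age", "gender", "grip_strength"},
    "cohort_c": {"age", "gender"},
    "cohort_d": {"age", "mmse_score", "amyloid_suvr"},
}
CORE = ["age", "gender", "mmse_score", "grip_strength",
        "amyloid_suvr", "loneliness_score"]


def cohort_counts(availability, core):
    """Return {variable: number of cohorts in which it is present}."""
    return {v: sum(v in held for held in availability.values()) for v in core}


counts = cohort_counts(AVAILABILITY, CORE)
distribution = Counter(counts.values())  # cohorts held in -> number of variables
print(counts)
print(sorted(distribution.items(), reverse=True))
```

Variables with a count of zero correspond to core variables not found in any cohort.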

Utility

Of the 124 core variables, 61 (49%) were directly mapped. Direct mapping was generally straightforward but did involve truncation of dates, and the interpretation of text for primary cause of death and medications. For alcohol consumption, although units per week could be derived for most datasets, an ‘other’ option was allowed for cases where consumption was present but not quantifiable.

Fifty-seven (46%) variables were transformed by algorithm. This was relatively straightforward for 48 variables using a yes/no (present/absent) format. For ELSA, the presence or absence of a medical condition was inferred from the date of diagnosis, or a symptom rating score. For Generation Scotland, the presence or absence of angina and myocardial infarction was inferred from self-reported heart disease. For Memento, several outcomes were indicated as present by interpreting a rating scale score or by a clinical diagnosis. More challenging were nine sociodemographic and lifestyle variables (Table 3). For smoking, there was close concordance between the harmonisation rules and the raw data, with some interpolation required for ELSA data. For ethnicity, the harmonisation rules were more detailed than the categories found in the cohorts. Ethnicity was missing in Memento as by law these data are not permitted to be collected in France. Harmonising education was difficult as all the cohorts used qualifications as the index and these varied in detail and across jurisdictions (UK and France). The decision to harmonise on the basis of educational experience rather than qualifications provided a basis for integrating these data, although it may be argued that the harmonised scale is too coarse and less informative. Similarly, for cohabitation and housing type, simplified scales were applied to the more detailed raw data. For exercise (vigorous, moderate and walking), a simple quantification was not possible due to the diversity of measurement, and harmonisation was limited to presence or absence. For household income, local currency was used and aggregated into four quantiles of annual income.
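As a hedged illustration of the income rule, the sketch below assigns four quantile groups within a single cohort. It assumes continuous annual income values in local currency; several cohorts recorded income in bands (Table 3), for which a different aggregation would be needed, and the cut-point method used in the study is not stated. The values are invented.

```python
from bisect import bisect_right
from statistics import quantiles


def income_quartiles(incomes):
    """Assign each income to group 1 (lowest) to 4 (highest) using
    cut points estimated within the cohort."""
    cuts = quantiles(incomes, n=4)  # three cut points
    return [bisect_right(cuts, income) + 1 for income in incomes]


# Invented annual incomes (thousands, local currency).
print(income_quartiles([10, 20, 30, 40, 50, 60, 70, 80]))
```

Because cut points are estimated within each cohort, group membership describes relative position in that cohort's distribution and not absolute income.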

The six cognitive performance scores were standardised to the Z distribution. The distributions for immediate recall (skew = -.42), delayed recall (skew = -.42), digit symbol substitution (skew = -.11), verbal fluency (skew = .31), fluid intelligence (skew = -.54) and choice reaction time (skew = 1.09) were considered sufficiently Gaussian for Z-scores to be meaningful. A fuller description of the harmonisation strategy is provided in the supplementary materials (Table S2).
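Skewness can be checked before standardising a score. The sketch below uses the moment-based coefficient g1 = m3 / m2^1.5; which estimator produced the reported values is not stated, so this choice is an assumption, and no acceptance threshold is implied.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by the
    second central moment raised to the power 1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))   # symmetric data
print(skewness([1, 1, 1, 10]))     # long right tail
```

Positive values indicate a longer right tail, as would be expected for reaction times; negative values indicate a longer left tail.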

For a core set of 124 variables, selected for relevance to neurodegeneration, a harmonisation schema designed for data discovery was applied to data from 4 diverse population cohorts. Correspondence between the harmonised data schema and cohort-specific data models was complete or close for 115 (93%) variables that could be directly mapped, transformed to a binary value, or standardised.

For the remainder, the level of interpretation required varied. For smoking, ethnicity, education, cohabitation, housing type, and household income, moderate interpretation was required (Table 3). However, for three exercise related variables, granularity was lost with only a yes/no format being applicable. For total household income, the use of local currency obviated the problem of exchange rates and differential inflationary pressures. However, it means that the distribution of values is more meaningful than the absolute income values. Aggregating raw data to accommodate these variables was possible but involved value judgements around prioritising inclusiveness of datasets, over scientific informativeness. For several variables, for example, diagnosis of a medical condition, whether these variables should be construed as directly mapped or algorithmically transformed is moot. However, loss of informativeness and harmonisation methods are unlikely to have any material impact on data discovery.

These findings indicate there is value in harmonisation for data discovery. However, this is not a mature process. That the model was developed using only four cohorts, and that not all cohorts had data on all variables, suggests the present harmonisation rules be considered provisional rather than definitive. For many variables the ‘Yes/No’ indicator was sufficiently generic that it could be interpreted as direct mapping or an algorithmic transformation. Applying the schema to further cohorts will clarify this issue. The manual interpretation of free text data used here is neither scalable nor consistent; the potential of natural language processing for rapid and consistent textual interpretation should be explored. The availability of biosamples was included in the core variable list. Technically these are metadata, but they were considered informative for data discovery. For cognitive performance, although the standardisation process was straightforward, the extent to which tests are considered to assess the same latent construct is hypothesis-specific. Without claiming aetiologic commonality, grouping tests according to widely used cognitive domains was considered a pragmatic solution. Harmonisation was not applied to longitudinal data. This was intentional to simplify the problem. However, the inclusive and generic nature of the harmonisation schema suggests that applying it longitudinally would be relatively straightforward.

The selection of the 124 core variables, alongside frequency of request in DPUK and C-Path, undoubtedly reflected the research interest of the ADDI data harmonisation group. However, they did represent a broad range of variables; covering 15 of the 18 data themes represented by the C-Surv data model. That several variables were not found in any of the four cohorts does not necessarily suggest they be dropped at this stage. Nevertheless, from this limited variable set estimates of feasibility are at risk if, as is likely, other variables are also required for an analysis. This makes a strong case for incrementally expanding the range of harmonised variables.

Key to harmonisation is the adoption of a common data model, whether this be implicit or explicit. For data discovery, simplifying and standardising variable definitions increases the usability of data and accelerates science. Adoption of common data models across data platforms is a key component of their interoperability. Widespread use of a common data model incentivises the development of data discovery software. In a broader context, the development of common data models, each optimised for different use cases, may be anticipated. Of interest will be how these models translate to each other to enable federated analyses using diverse data standards.

To conclude, the importance of consistent approaches to managing and discovering data grows as data assets increase in volume and complexity. For population cohort data the problem is particularly acute due to the diversity and serial nature of measurement. In data from four diverse population cohort studies, we found that simple harmonisation rules based on widely used conventions can be meaningfully applied with marginal loss of granularity. This allows the availability of data at individual variable level, and in combination, to be assessed rapidly across datasets. This information can then be used to assess the feasibility and likely statistical power available to a proposed analysis, and to incentivise the development of more powerful data discovery tools.

Supplementary data

Supplementary data are available at IJE online

Funding

Dementias Platform UK: The Medical Research Council supports DPUK through grant MR/TO333771 PI John Gallacher

Data availability

Data from the cohorts used in the study are available as anonymised cohort data by request at https://portal.dementiasplatform.uk/.

Acknowledgements

We would like to acknowledge the pioneering work of Professor Isobel Fortier of McGill University in developing the field of data harmonisation for cohort data.

Airwave: The Airwave Health Monitoring Study is funded by the Home Office (grant number 780-TETRA) with additional support from the National Institute for Health Research (NIHR) Biomedical Research Centre. The Airwave Study uses the computing resources of the UK MEDical BIOinformatics Partnership (UK MED-BIO: supported by the Medical Research Council MR/L01632X/1).

ELSA: The English Longitudinal Study of Ageing was developed by a team of researchers based at University College London, NatCen Social Research, the Institute for Fiscal Studies, the University of Manchester and the University of East Anglia. The data were collected by NatCen Social Research. The funding is currently provided by the National Institute on Aging in the US, and a consortium of UK government departments coordinated by the National Institute for Health Research. Funding has also been received from the Economic and Social Research Council.

Generation Scotland: Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates [CZD/16/6] and the Scottish Funding Council [HR03006]. Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland and was funded by the Medical Research Council UK and the Wellcome Trust (Wellcome Trust Strategic Award “STratifying Resilience and Depression Longitudinally” (STRADL), Reference 104036/Z/14/Z).

Memento: The Memento cohort was supported by a grant from the Fondation Plan Alzheimer (Alzheimer Plan 2008-15 2012) and sponsored by the Bordeaux University Hospital. This work was also conducted by the following: CIC 1401-EC, Bordeaux University Hospital, Inserm, and Bordeaux University

Mike Nalls: This research was supported in part by the Intramural Research Program of the NIH, National Institute on Aging (NIA), National Institutes of Health, Department of Health and Human Services; project number Z01 AG000535, as well as the National Institute of Neurological Disorders and Stroke. Mike Nalls' participation in this project was part of a competitive contract awarded to Data Tecnica International LLC by the National Institutes of Health to support open science research. He also currently serves on the scientific advisory board for Clover Therapeutics and is an advisor to Neuron23 Inc.

Author contributions

JG wrote the first draft of the manuscript and SB provided substantial information and input into the compilation and subsequent drafts. SB compiled the harmonised dataset for the project along with input from the ADDI Harmonisation Action Group. All authors contributed to review and comment on content, revisions and drafts. Cohort owners were consulted for clarity and accuracy of data reporting. All authors approved the final version and the decision to submit for publication.

Conflict of interest

All authors declare no conflict of interest in the preparation of this manuscript.

1. Bauermeister S, Orton C, Thompson S, Barker RA, Bauermeister JR, Ben-Shlomo Y, et al. The Dementias Platform UK (DPUK) Data Portal. Eur J Epidemiol. 2020;35(6):601-11.

2. The Global Alzheimer’s Association Interactive network (GAAIN) [11/04/2022]. Available from: https://gaain.org/

3. Dementias Platform Australia (DPAU) [11/04/2022]. Available from: https://www.dementiasplatform.com.au/

4. Alzheimer's Disease Workbench [08/02/2022]. Available from: https://www.alzheimersdata.org/ad-workbench

5. Critical Path Institute [11/04/2022]. Available from: https://c-path.org/

6. European Medical Framework for Alzheimer’s Disease (EMIF-AD) [08/02/2022]. Available from: http://www.emif.eu/

7. The EU Joint Programme for Neurodegenerative Disease (JPND) [08/02/2022]. Available from: https://www.neurodegenerationresearch.eu/search-our-database/

8. Integrative Analysis of Longitudinal Studies of Aging (IALSA) [08/02/2022]. Available from: https://www.ialsa.org/

9. Centre for Addiction and Mental Health (CAMH) [11/04/2022]. Available from: https://www.camh.ca/

10. Cohen Veterans Bioscience [11/04/2022]. Available from: https://www.cohenveteransbioscience.org/

11. Fraunhofer SCAI [11/04/2022]. Available from: https://www.scai.fraunhofer.de/en.html

12. Bauermeister S, Bauermeister JR, Bridgman R, Felici C, Newbury M, North L, et al. Research-ready data: The C-Surv data model. Preprint: Research Square; 2021.

13. DPUK Cohort Explorer [08/02/2022]. Available from: https://portal.dementiasplatform.uk/CohortExplorer

14. Fortier I, Dragieva N, Saliba M, Craig C, Robson PJ, with the Canadian Partnership for Tomorrow Project's scientific d, et al. Harmonization of the Health and Risk Factor Questionnaire data of the Canadian Partnership for Tomorrow Project: a descriptive analysis. CMAJ Open. 2019;7(2):E272-E82.

15. Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2017;46(1):103-5.

16. Elliott P, Vergnaud AC, Singh D, Neasham D, Spear J, Heard A. The Airwave Health Monitoring Study of police officers and staff in Great Britain: rationale, design and methods. Environ Res. 2014;134:280-5.

17. Steptoe A, Breeze E, Banks J, Nazroo J. Cohort profile: the English longitudinal study of ageing. Int J Epidemiol. 2013;42(6):1640-8.

18. Smith BH, Campbell A, Linksted P, Fitzpatrick B, Jackson C, Kerr SM, et al. Cohort Profile: Generation Scotland: Scottish Family Health Study (GS:SFHS). The study, its participants and their potential for genetic research on health and illness. Int J Epidemiol. 2013;42(3):689-700.

19. Dufouil C, Dubois B, Vellas B, Pasquier F, Blanc F, Hugon J, et al. Cognitive and imaging markers in non-demented subjects attending a memory clinic: study design and baseline findings of the MEMENTO cohort. Alzheimers Res Ther. 2017;9(1):67.

Table 1 Core variable list

#	C-Surv Theme	Variable	Strategy	Harmonisation Rule
1	Administration: theme 1	Cohort ID	SC	Anonymised by cohort
2		Assessment date	SC	Gregorian calendar (yyyy-mm-dd)
3		Date of birth	SC	Gregorian calendar (yyyy-mm-dd)
4		Date of death	SC	Gregorian calendar (yyyy-mm-dd)
5		Cause of death	SC: text	ICD-11 categories 1-18
6		DNA extracted	SC	1 Yes; 0 No
7		Plasma collected	SC	1 Yes; 0 No
8		Serum collected	SC	1 Yes; 0 No
9		CSF collected	SC	1 Yes; 0 No
10	Sociodemographic: theme 2	Age	SC	Value: years 1-130
11		Gender	SC	1 male; 2 female
12		Ethnicity	AT	1 white; 2 Black; 3 Asian; 4 mixed; 5 other
13		Cohabitation	AT	1 single; 2 married/cohabiting; 3 separated/divorced; 4 widowed; 5 other
14		Years education	SC	Value: years range
15		Educational level	AT	1 postgrad; 2 degree; 3 secondary; 4 junior or less
16		Income	AT	Quantiles using local currency
17	Early life experience: theme 3	Childhood physical abuse	SC	1 Yes; 0 No
18		Adolescent physical abuse	SC	1 Yes; 0 No
19		Sexual abuse	SC	1 Yes; 0 No
20		Parental smoking behaviour	SC	1 Yes; 0 No
21	Medical history: theme 4	Type 1 diabetes diagnosis	AT	1 Yes; 0 No
22		Type 2 diabetes diagnosis	AT	1 Yes; 0 No
23		AD diagnosis	AT	1 Yes; 0 No
24		AD FTD diagnosis	AT	1 Yes; 0 No
25		AD mixed diagnosis	AT	1 Yes; 0 No
26		VaD diagnosis	AT	1 Yes; 0 No
27		PD diagnosis	AT	1 Yes; 0 No
28		Depression diagnosis	AT	1 Yes; 0 No
29		Self-report visual difficulty	AT	1 Yes; 0 No
30		Self-report hearing difficulty	AT	1 Yes; 0 No
31		Angina diagnosis	AT	1 Yes; 0 No
32		MI diagnosis	AT	1 Yes; 0 No
33		Hypertension diagnosis	AT	1 Yes; 0 No
34		Stroke diagnosis	AT	1 Yes; 0 No
35		Head injury	AT	1 Yes; 0 No
36		COPD diagnosis	AT	1 Yes; 0 No
37		Arthritis diagnosis	AT	1 Yes; 0 No
38		Current pain	AT	1 Yes; 0 No
39		Self-report general health	AT	1 Yes; 0 No
40		Medications	SC: text	Value: number prescribed
41	Family disease history: theme 5	Dementia parent	SC	1 Yes; 0 No
42		Dementia grandparent	SC	1 Yes; 0 No
43		Dementia sibling	SC	1 Yes; 0 No
44		AD parent	SC	1 Yes; 0 No
45		AD grandparent	SC	1 Yes; 0 No
46		AD sibling	SC	1 Yes; 0 No
47		VaD parent	SC	1 Yes; 0 No
48		VaD grandparent	SC	1 Yes; 0 No
49		VaD sibling	SC	1 Yes; 0 No
50		PD parent	SC	1 Yes; 0 No
51		PD grandparent	SC	1 Yes; 0 No
52		PD sibling	SC	1 Yes; 0 No
53		CHD parent	SC	1 Yes; 0 No
54		CHD grandparent	SC	1 Yes; 0 No
55		CHD sibling	SC	1 Yes; 0 No
56		Stroke parent	SC	1 Yes; 0 No
57		Stroke grandparent	SC	1 Yes; 0 No
58		Stroke sibling	SC	1 Yes; 0 No
59	Psychological status: theme 6	GHQ score	AT	Scale score
60		Self-report depression	AT	1 Yes; 0 No
61		Loss of interest	AT	1 Yes; 0 No
62		Depression score	AT	Scale score
63		EPQ Neuroticism	AT	Scale score
64		EPQ Extraversion	AT	Scale score
65		Life satisfaction score	AT	Scale score
66		Job satisfaction score	AT	Scale score
67		Quality of Life score	AT	Scale score
68		Loneliness scale score	AT	Scale score
69	Cognitive status: theme 7	Immediate recall score	S	Z score
70		Delayed recall score	S	Z score
71		Digit symbol substitution score	S	Z score
72		Verbal fluency score	S	Z score
73		Choice reaction time mSec	S	Z score
74		Fluid intelligence score	S	Z score
75		MMSE score	SC	Scale score
76		ADAS cog total score	SC	Scale score
77		CDR total score	SC	Scale score
78		Subjective memory complaint	AT	1 Yes; 0 No
79		MCI diagnosis	AT	1 Yes; 0 No
80	Lifestyle: theme 8	Alcohol consumption	AT	Alcohol units per week, other
81		Smoking status	AT	0 never smoked; 1 past smoker; 2 current
82		Vigorous exercise	AT	1 Yes; 0 No
83		Moderate exercise	AT	1 Yes; 0 No
84		Walking	AT	1 Yes; 0 No
85		Sleep quality scale	AT	Scale score
86		Sleep hours per night	SC	Hours per night
87	Life functionality: theme 9	ADL score	AT	Scale score (higher value higher independence)
88		IADL score	AT	Scale score (higher value higher functioning)
89	Physical environment: theme 10	Number of house occupants	SC	Value (occupants)
90		Number of rooms	SC	Value (rooms)
91		Type of accommodation	AT	1 house/bungalow; 2 apartment; 3 sheltered/residential; 4 other
92		Pollution (grime in house)	SC	1 Yes; 0 No
93	Social environment: theme 11	Number of contacts/month	SC	Value (number of social contacts)
94		Social media sites used	SC	Value (number of sites used)
95		Social media use daily	SC	Value (types used daily)
96	Physical examination: theme 12	Height	SC	Value (cm)
97		Weight	SC	Value (kg)
98		BMI	SC	Value (kg/m²)
99		Grip strength	SC	Value (kg)
100		Gait (walking) speed	SC	Value (m/sec)
101		Systolic BP	SC	Value (mmHg)
102		Diastolic BP	SC	Value (mmHg)
103	Imaging: theme 13	White matter volume	SC	Value (mm³ standardised)
104		Grey matter volume	SC	Value (mm³ standardised)
105		Left hippocampal volume	SC	Value (mm³ standardised)
106		Right hippocampal volume	SC	Value (mm³ standardised)
107		WM hyperintensities	SC	Value (mm³ standardised)
108		Amyloid PiB SUVR	SC	Ratio
109	Biosample assays: theme 16	Haemoglobin	SC	Value (mg/dl)
110		White cell count	SC	Value (mg/dl)
111		RBC count	SC	Value (mg/dl)
112		Total cholesterol	SC	Value (mg/dl)
113		HDL cholesterol	SC	Value (mg/dl)
114		Creatinine	SC	Value (mg/dl)
115		Glucose	SC	Value (mg/dl)
116		CRP	SC	Value (mg/dl)
117		Cortisol decrease	SC	Value (mg/dl)
118		Abeta 1-42	SC	Value (pg/ml)
119		Abeta 1-40	SC	Value (pg/ml)
120		Abeta 1-42	SC	Value (pg/ml)
121		Abeta 1-40	SC	Value (pg/ml)
122		Total tau	SC	Value (pg/ml)
123		P tau	SC	Value (pg/ml)
124	Molecular: theme 17	APOE status	SC	1 2/2; 2 2/3; 3 2/4; 4 3/3; 5 3/4; 6 4/4

SC= Simple Calibration; AT=Algorithmic Transformation; S=Standardisation

Table 2 Distribution of core variables across cohorts

Harmonised dataset		Number of variables per cohort
C-Surv theme	Variables included: n	Airwave	ELSA	Generation Scotland	Memento
Administration	9	6	7	6	9
Sociodemographic	7	6	7	7	6
Early life environment	4	0	4	0	0
Medical history	20	10	13	10	18
Family disease history	18	0	0	15	12
Psychological status	10	3	7	3	5
Cognitive status	11	4	5	4	6
Lifestyle	7	3	7	2	6
Life functionality	2	0	2	0	2
Physical environment	4	1	2	3	2
Social environment	3	0	3	0	0
Physical examination	7	5	7	5	6
Imaging	6	0	0	0	6
Biosample assays	15	8	8	4	13
Molecular data	1	1	1	1	1

Totals	124	47	73	60	92

Table 3 Application of algorithmic transformations across cohorts
Variable	Transformation	Cohort
Variable	Transformation	Airwave	ELSA	Generation Scotland	Memento
Smoking	Never	Never	Ever smoked Yes/no	Never	Never
	Ex	Ex		Ex	Ex
	current	current		current	current
ethnicity	White	White	white	White	-
	Black	-	-	-	-
	Asian	-	-	Asian	-
	Mixed	-	-	Mixed	-
	other	other	Non white	Other	-
education	Post grad equivalent	Post grad	-	-	Higher dipl
	Degree equivalent	Deg. equiv	Deg. Equiv	College/Uni	Degree
	Secondary	-	Higher ed	-	General Bac
		A level NVQ3	NVQ3	Highers	Tech Bac
		GCSE NVQ2	Nvq2	Standards	CAP/BEP
		NVQ1	Nvq1	CSE equiv	Elementary
	<=Primary	-	-	Certificate	Primary
		None	None	None	None
		-	Foreign/other	other	-
cohabitation	Single	Single	Single	Are you living as a couple? yes/no	Single
	Married/cohabiting	Married	Married		Married/cohabiting
		-	Remarried		-
		cohabiting	-		-
	Divorced/separated	Divorced	Divorced		Divorced/separated
	Divorced/separated	Separated	separated		-
	widowed	-	Widowed		widowed
	other	Other	-		-
Housing type	House/bungalow	-	-	House/bungalow	Single family dwelling
	Apartment	-	-	Apartment/flat	Apartment
	Sheltered/residential	-	-	Hostel	Residential
		-	-	Mobile/caravan	Sheltered
		-	-	Sheltered	Religious community
		-	-	Homeless	Care home
	other	-	-	other	other
Household income	Four quantiles using local currency	Annual: <£25999, 26000-37999, 38000-59999, 60000+	Gross monthly and annual in Pounds Sterling	-	Monthly: €400-<800 800-<1200 1200-<1800 1800-<2500 2500-<4000 4000-<6000 6000+
Vigorous exercise	Yes/No	-	Do you attend sports clubs, gym, exercise classes?	-	Days per week Hours per day Minutes per day No vigorous exercise
Moderate exercise	Yes/No	-	Do you attend sports clubs, gym, exercise classes?	-	Days per week Hours per day Minutes per day No moderate exercise
Walking	Yes/No	-	-	-	Days per week Hours per day Minutes per day No walking

TableS1Distributionofvariablesaccordingtocohort.pdf
S1: Distribution of variables according to cohort
TableS2Descriptionofharmonisationstrategy.pdf
S2: Description of harmonisation strategy
