To design and implement a data model (C-Surv) for the discovery and selection of research cohort data using neurodegeneration as a use case.
The design adopted is a simple four-level taxonomy intended to capture the breadth of data typically collected in research cohorts. This tiered structure supports both grouped and individual variable selection. Class membership and naming at levels one to three were pragmatic decisions based on Data Portal user behaviour and the desire to maintain a four-level structure for tool development purposes. Level four describes the data object, i.e. the measured variable. At this level, naming was designed to uniquely identify the variable in the context of a longitudinal study.
Data model structure
C-Surv uses a four-level acyclic structure comprising 18 data themes (level 1) leading to >130 data ‘domains’ (level 2), >500 data ‘families’ (level 3) and then to a growing number of data ‘objects’ (level 4). Typically, data objects are variable-level observations or, in the case of complex measures such as psychometric tests, test scores (Figure 1). To the extent that evidence was available from DPUK access requests, the organisation of each level reflects the types of variable request that are most frequently made. For example, a request would typically be made for all processing speed variables, rather than just choice reaction time, and so processing speed was used as a domain category.
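The tiered structure can be pictured as a nested mapping in which a selection at any level returns every object beneath it. The sketch below is illustrative only: the theme label, the family labels and the object names are hypothetical placeholders rather than entries from the C-Surv catalogue; only ‘processing speed’ as a domain comes from the text above.

```python
# Hypothetical fragment of the four-level structure:
# theme (level 1) -> domain (level 2) -> family (level 3) -> objects (level 4).
taxonomy = {
    "Cognition": {                      # theme label: assumed for illustration
        "Processing speed": {           # domain: named in the text
            "Choice reaction time": [   # family label: assumed
                "XXX00_CRTMEAN_0_1",    # object name: placeholder
            ],
            "Simple reaction time": [   # family label: assumed
                "XXX00_SRTMEAN_0_1",    # object name: placeholder
            ],
        },
    },
}


def select_objects(tree, theme, domain=None, family=None):
    """Return object names under a theme, optionally narrowed to a domain and family."""
    domains = tree[theme]
    if domain is not None:
        domains = {domain: domains[domain]}
    selected = []
    for families in domains.values():
        for family_name, objects in families.items():
            if family is None or family_name == family:
                selected.extend(objects)
    return selected
```

Calling select_objects(taxonomy, "Cognition", "Processing speed") returns both placeholder objects (grouped selection), whereas adding family="Choice reaction time" returns only the first (individual selection).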
Key to utility is an informative ‘object’ (variable) name. Objects are defined pragmatically as the level of measurement used in most analyses. The object name is a complex proposition with five elements: cohort, data category, measurement, serialisation (repeated measurement within a single data capture period), and study wave (repeated measurement between data capture episodes). These elements are considered to be the minimum required to uniquely and conveniently identify an object in dataspace. An example object name is GEN04_PAINCHESTEVR_0_1.
The cohort is identified using a three-letter alphabetic code (GEN for Generation Scotland), and the data category by a two-digit numeric code (04 for medical history). The measurement is described by an alphanumeric abbreviation (PAINCHESTEVR for: ‘Do you ever get pain or discomfort in your chest?’). This is followed by an integer representing the position of the variable within a sequence of repeat measurements within a study wave (_0 indicates there were no repeat measurements). Finally, an integer suffix indicates the study wave (_1 for recruitment, _2 for first follow-up, etc.).
For survey data the measurement abbreviation is limited to 12 characters. For imaging, omics, and device data it is limited to 17 characters. Where questionnaire item level measurement is relevant, q# is added to the object name. For example, GEN06_SPQq1_0_1 is an item from Generation Scotland (GEN) within the Psychological Status category (category 10), from the Schizotypal Personality Questionnaire (SPQ), question 1 (q1), administered with no repeat measurement in wave 1.
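The naming convention lends itself to mechanical parsing. The following is a minimal sketch, not part of C-Surv itself: the pattern is inferred from the element order described above and from the single published example GEN06_SPQq1_0_1, and the names ObjectName and parse_object_name are hypothetical. The second test name, GEN04_PAINCHESTEVR_0_1, is assembled from the elements described above on the assumption that it follows the same layout.

```python
import re
from typing import NamedTuple, Optional


class ObjectName(NamedTuple):
    cohort: str           # three-letter cohort code, e.g. GEN
    category: str         # two-digit data category, kept as text to preserve leading zeros
    measurement: str      # measurement abbreviation
    item: Optional[int]   # questionnaire item number (q#), if present
    serial: int           # position within repeat measurements in a wave (0 = none)
    wave: int             # study wave (1 = recruitment)


# Assumed layout: <cohort><category>_<measurement>[q<item>]_<serial>_<wave>
_PATTERN = re.compile(
    r"(?P<cohort>[A-Z]{3})"
    r"(?P<category>\d{2})_"
    r"(?P<measurement>[A-Za-z0-9]+?)"
    r"(?:q(?P<item>\d+))?"
    r"_(?P<serial>\d+)_(?P<wave>\d+)"
)


def parse_object_name(name: str) -> ObjectName:
    """Split an object name into its elements, or raise ValueError if it does not fit."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised object name: {name!r}")
    item = match.group("item")
    return ObjectName(
        cohort=match.group("cohort"),
        category=match.group("category"),
        measurement=match.group("measurement"),
        item=int(item) if item is not None else None,
        serial=int(match.group("serial")),
        wave=int(match.group("wave")),
    )
```

Because the category is returned as text, a caller can look it up in whatever category table the cohort documentation provides without losing the leading zero.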
Abbreviations are selected to reflect the meaning of the full variable name used in data capture. They are upper case and syllable based, using word fragments and numeric characters to aid interpretation. Consistency of abbreviations is maintained where possible. Constants are lower case: ‘q’ represents question (or item), ‘r’ represents range and ‘d’ represents a decimal point. For example, AVG08H00r08H59 is an item from accelerometry data (average acceleration between 08h00 and 08h59).
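The lower-case constants make abbreviations machine-readable as well as human-readable. The sketch below decodes the range constant ‘r’ for the accelerometry example and the decimal constant ‘d’ for a single token. The time-of-day pattern is inferred from the one example above, and the function names and the token ‘2d5’ are hypothetical.

```python
import re

# Assumed shape of a time-range abbreviation: <STEM><hh>H<mm>r<hh>H<mm>
_TIME_RANGE = re.compile(
    r"(?P<stem>[A-Z]+)(?P<start>\d{2}H\d{2})r(?P<end>\d{2}H\d{2})"
)


def decode_time_range(measurement):
    """Return (stem, start, end) for a time-range abbreviation, or None if it does not fit."""
    match = _TIME_RANGE.fullmatch(measurement)
    if match is None:
        return None
    start = match.group("start").replace("H", ":")
    end = match.group("end").replace("H", ":")
    return match.group("stem"), start, end


def decode_decimal(token):
    """Read the lower-case 'd' in a numeric token as a decimal point."""
    return float(token.replace("d", "."))
```

Under these assumptions, decode_time_range("AVG08H00r08H59") yields the stem AVG with the interval 08:00 to 08:59, and decode_decimal("2d5") yields 2.5.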
Value labelling conventions
To provide correspondence between native data (the data transferred to the Data Portal by data controllers) and curated data, native data value labels are retained. However, for widely used measures, value labels are standardised using common conventions. For example, missing is scored ‘.’ following the Stata (19) convention, and gender is scored ‘2’ for female and ‘1’ for male. For several widely used measures, imperial units are converted to metric. For example, height is recorded in centimetres and weight in kilograms.
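These conventions can be expressed as small recoding functions. The sketch below is an illustration under stated assumptions: the native gender labels in GENDER_MAP are hypothetical, unrecognised labels are passed through unchanged (in keeping with retaining native labels), and rounding to one decimal place is an arbitrary choice. The conversion factors (2.54 cm per inch, 0.45359237 kg per pound) are the standard definitions.

```python
MISSING = "."  # Stata-style missing value, as described in the text

# Hypothetical native labels; real cohorts will use their own.
GENDER_MAP = {"male": 1, "m": 1, "female": 2, "f": 2}


def standardise_gender(native):
    """Recode gender to 1 (male) or 2 (female); None becomes missing."""
    if native is None:
        return MISSING
    key = str(native).strip().lower()
    # Unrecognised labels are returned unchanged rather than guessed at.
    return GENDER_MAP.get(key, native)


def inches_to_cm(inches):
    """Convert height from inches to centimetres."""
    return MISSING if inches is None else round(inches * 2.54, 1)


def pounds_to_kg(pounds):
    """Convert weight from pounds to kilograms."""
    return MISSING if pounds is None else round(pounds * 0.45359237, 1)
```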
Use case: Cohort Explorer
To explore the potential for C-Surv to support data discovery, it was used to develop the Cohort Explorer feasibility tool. Cohort Explorer allows users to establish the number of participants with data, according to variable, across cohorts, prior to making a data access request. It enables users to avoid requesting combinations of variables that collectively have high levels of missingness.
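The core feasibility check is a count, per cohort, of participants with non-missing values on every requested variable. The sketch below shows the idea on toy records; it is not the Cohort Explorer implementation, and the cohort codes, variable names and values are invented for illustration.

```python
MISSING = "."  # Stata-style missing value

# Toy records; cohort codes and variable names are hypothetical.
records = [
    {"cohort": "AAA", "age": 61, "apoe4": 1},
    {"cohort": "AAA", "age": 70, "apoe4": MISSING},
    {"cohort": "BBB", "age": MISSING, "apoe4": 0},
    {"cohort": "BBB", "age": 66, "apoe4": 2},
]


def count_complete(rows, variables):
    """Count, per cohort, the participants with data on every requested variable."""
    counts = {}
    for row in rows:
        cohort = row["cohort"]
        counts.setdefault(cohort, 0)
        # A variable absent from the record is treated the same as a missing value.
        if all(row.get(v, MISSING) not in (MISSING, None) for v in variables):
            counts[cohort] += 1
    return counts
```

Here count_complete(records, ["age"]) gives two participants in AAA and one in BBB, while requesting both age and apoe4 leaves one in each, which illustrates how combining variables can sharply reduce the usable sample.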
Assessing feasibility in a multi-cohort environment requires the harmonisation of data across datasets. Harmonisation (the equivalence of values and/or distributions for variables across datasets) goes beyond the conventions of a common data model. However, a common data model does provide context for evaluating the suitability of variables for harmonisation. To test this, C-Surv was applied to 11 collaborating DPUK cohorts (n=123,554) and the results were used to generate a 31-variable harmonised dataset (Table 1). The selection of variables reflects the frequency with which variables were requested in dementia-focussed DPUK data access applications. These variables represent a wide range of modalities and formats, including imaging, genetic, and survey data. The C-Surv structure was able to accommodate all the data types and formats found in the native cohort data. C-Surv also provided a structured overview of scientific activity. For example, the omission of life functionality, and physical and social environment variables, suggests either that relatively little attention is paid to these areas or that these data are sparse. The harmonised dataset was used to populate Cohort Explorer.
Cohort Explorer can be found on the DPUK Data Portal (dementiasplatform.uk). As the tool uses individual-level cohort data, a DPUK account is required for access. This can be obtained by registering on the DPUK Data Portal (dementiasplatform.uk). The tool provides an interactive dashboard allowing users to select cohorts, variables and value ranges of interest. For example, of the 123,554 members of the 11 cohorts, 57,499 are aged 50+ and of these 21,867 are lifetime non-smokers (Figure 2). However, if APOE4 status (homozygous or heterozygous) is added, the number drops to 1,666. This is critical information when planning an analysis.
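The successive narrowing in this example amounts to applying selection criteria cumulatively and reporting the count after each step. The sketch below reproduces that logic on a handful of invented records; the field names are hypothetical, the counts bear no relation to the DPUK figures, and APOE4 status is interpreted here, as an assumption, as carrying one or two APOE4 alleles.

```python
# Toy participants; field names and values are invented for illustration.
participants = [
    {"age": 62, "ever_smoked": False, "apoe4_alleles": 1},
    {"age": 55, "ever_smoked": False, "apoe4_alleles": 0},
    {"age": 71, "ever_smoked": True, "apoe4_alleles": 2},
    {"age": 45, "ever_smoked": False, "apoe4_alleles": 1},
    {"age": 58, "ever_smoked": False, "apoe4_alleles": 2},
]

# Each step is a (label, predicate) pair, applied in order.
steps = [
    ("aged 50+", lambda p: p["age"] >= 50),
    ("lifetime non-smoker", lambda p: not p["ever_smoked"]),
    ("APOE4 carrier", lambda p: p["apoe4_alleles"] >= 1),
]


def selection_funnel(rows, criteria):
    """Apply criteria cumulatively; return (label, remaining count) after each step."""
    remaining = list(rows)
    funnel = [("all", len(remaining))]
    for label, predicate in criteria:
        remaining = [row for row in remaining if predicate(row)]
        funnel.append((label, len(remaining)))
    return funnel
```

For the toy data, selection_funnel(participants, steps) reports 5, 4, 3 and 2 participants at the successive steps, mirroring on a small scale the kind of attrition described above.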