Structure of the data quality framework
In accordance with existing data quality concepts [2, 3, 5], completeness and correctness are the two core aspects of data quality (Table 1). Completeness is represented as a single dimension, while correctness is subdivided into the two dimensions consistency and accuracy. The rationale for this separation is given in the section on correctness. A precondition for successfully conducting any data quality assessment is the correct technical setup of study data and metadata. Related aspects are addressed within the integrity dimension.
Table 1
Data Quality Dimensions and Domains
| Dimension | Domain | Definition | Primary reference objects to detect data quality issues | Primary reporting metrics of indicators |
| --- | --- | --- | --- | --- |
| Integrity | | The degree to which the data conforms to structural and technical requirements. | | |
| | Structural data set error | The observed structure of a data set differs from the expected structure. | Data elements, data records | N |
| | Relational data set error | The observed correspondence between different data sets differs from the expected correspondence. | Data sets | N |
| | Value format error | The technical representation of data values within a data set does not conform to the expected representation. | Data fields | N, % |
| Completeness | | The degree to which expected data values are present. | | |
| | Crude missingness | Metrics of missing data values that ignore the underlying reasons for missing data. | Data fields | N, % |
| | Qualified missingness | Metrics of missing data values that use reasons underlying missing data. | Data fields, data elements, data records | N, % |
| Consistency | | The degree to which data values are free of breaks in conventions or contradictions. | | |
| | Range and value violations | Observed data values do not comply with admissible data values or value ranges. | Data fields | N, % |
| | Contradictions | Observed data values appear in impossible or improbable combinations. | Data fields | N, % |
| Accuracy | | The degree of agreement between observed and expected distributions and associations. | | |
| | Unexpected distributions | Observed distributional characteristics differ from expected distributional characteristics. | Data elements, data records | Diverse statistical measures* |
| | Unexpected associations | Observed associations differ from expected associations. | Data elements, data records | Diverse statistical measures* |
| | Disagreement of repeated measurements | Disagreement between repeated measurements of the same or similar objects under specified conditions. | Data elements, data records | Diverse statistical measures* |

N: number of issues; %: the percentage of issues relative to the number of assessed elements in a data structure.
\* A wide range of statistical metrics may apply, such as location, scale, or shape parameters, correlation coefficients, and measures of agreement.
Each dimension is subdivided into different data quality domains; an overview of dimensions and domains is provided in Table 1. The domains differ mainly in the methodology used to assess data quality. The next level defines data quality indicators (Table 2); currently, 34 indicators are distinguished. They describe quality attributes of the data at the level of single data fields, data records, data elements, and data sets [33]. Figure 1 displays the hierarchical structure. Figure 2 illustrates the nomenclature used for data structures within the framework.
Table 2
Overview on Data Quality Indicators with Definitions
| ID | Name of indicator | Definition |
| --- | --- | --- |
| **Integrity** | | |
| DQI-1001 | Unexpected data elements | The set of available data elements does not match the expected set. |
| DQI-1002 | Unexpected data records | The set of available data records does not match the expected set. |
| DQI-1003 | Duplicates | The same data elements or data records appear multiple times. |
| DQI-1004 | Data record mismatch | Data records across different data sets do not match as expected. |
| DQI-1005 | Data element mismatch | Data elements across different data sets do not match as expected. |
| DQI-1006 | Data type mismatch | The observed data type does not match the expected data type. |
| DQI-1007 | Inhomogeneous value formats | The observed data values have an inhomogeneous format across different data fields. |
| DQI-1008 | Uncertain missingness status | System-indicated missing values (e.g. NA, ., Null) appear where a qualified missing code is expected. |
| **Completeness** | | |
| DQI-2001 | Missing values | Data fields without a measurement value. |
| DQI-2002 | Non-response rate | The proportion of eligible observational units for which no information could be obtained. |
| DQI-2003 | Refusal rate | The proportion of eligible individuals who refuse to give the information sought. |
| DQI-2004 | Drop-out rate | The proportion of all participants who only partially complete the study and prematurely abandon it. |
| DQI-2005 | Missing due to specified reason | Information in a data collection that is missing due to a specified reason. |
| **Consistency** | | |
| DQI-3001 | Inadmissible numerical values | Observed numerical data values are not admissible according to the allowed ranges. |
| DQI-3002 | Inadmissible time-date values | Observed time-date values are not admissible according to the allowed time and date ranges. |
| DQI-3003 | Inadmissible categorical values | Observed categorical data values are not admissible according to the allowed categories. |
| DQI-3004 | Inadmissible standardized vocabulary | Data values are not admissible according to the reference vocabulary. |
| DQI-3005 | Inadmissible precision | The precision of observed numerical data values does not match the expected precision. |
| DQI-3006 | Uncertain numerical values | Observed numerical values are uncertain or improbable because they are outside the expected ranges. |
| DQI-3007 | Uncertain time-date values | Observed time-date values are uncertain or improbable because they are outside the expected ranges. |
| DQI-3008 | Logical contradictions | Different data values appear in logically impossible combinations. |
| DQI-3009 | Empirical contradictions | Different data values appear in combinations deemed impossible based on empirical reasoning. |
| **Accuracy** | | |
| DQI-4001 | Univariate outliers | Numerical data values deviate markedly from others in a univariate analysis. |
| DQI-4002 | Multivariate outliers | Numerical data values deviate markedly from others in a multivariate analysis. |
| DQI-4003 | Unexpected locations | Observed location parameters differ from expected location parameters. |
| DQI-4004 | Unexpected shape | The observed shape of a distribution differs from the expected shape. |
| DQI-4005 | Unexpected scale | Observed scale parameters differ from expected scale parameters. |
| DQI-4006 | Unexpected proportions | Observed proportions differ from expected proportions. |
| DQI-4007 | Unexpected association strength | The observed strength of an association deviates from the expected strength of the association. |
| DQI-4008 | Unexpected association direction | The observed direction of an association (e.g. negative, positive) deviates from the expected direction. |
| DQI-4009 | Unexpected association form | The observed form of an association (e.g. linear, quadratic, exponential) deviates from the expected form. |
| DQI-4010 | Inter-class reliability | Differences between classes (e.g. examiners) when measuring the same or similar objects under specified conditions. |
| DQI-4011 | Intra-class reliability | Differences within classes (e.g. examiners) when measuring the same or similar objects under specified conditions. |
| DQI-4012 | Disagreement with gold standard | Differences with a gold standard when measuring the same or similar objects under specified conditions. |

The term "expected" refers to a test criterion as annotated in metadata fields.
Integrity
Integrity related analyses are guided by the question: Do all data comply with pre-specified structural and technical requirements? Addressing this as an independent step is necessary in any data quality assessment, because study data and metadata are often deficient. The three domains within this dimension address:
- the structurally correct representation of data elements or data records within data sets (structural data set error), e.g. a mismatch between the observed and expected number of data records;
- the correspondence between multiple data sets (relational data set error), e.g. the appropriate integration of multiple study data sets; and
- the correct representation of data values within data sets (value format error), e.g. a mismatch between the expected and observed data type.
Deficits at the integrity level may invalidate findings at subsequent stages of a data quality assessment as well as any substantive scientific analyses. Assessments of metadata are confined to the integrity dimension.
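As an illustration of the integrity dimension, the following language-agnostic Python sketch checks study data against a metadata specification for unexpected data elements and data type mismatches. All names (`check_integrity`, the `expected_type` metadata field) are hypothetical and not part of dataquieR.

```python
# Hypothetical integrity check: compare observed study data against a
# metadata specification. Names and the metadata layout are illustrative.

def check_integrity(study_data, metadata):
    """Return integrity issues: unexpected data elements, data type mismatches."""
    issues = []
    observed, expected = set(study_data), set(metadata)
    for name in sorted(expected - observed):
        issues.append(("missing data element", name))       # structural error
    for name in sorted(observed - expected):
        issues.append(("unexpected data element", name))    # structural error
    for name in sorted(observed & expected):
        want = metadata[name]["expected_type"]
        if any(not isinstance(v, want) for v in study_data[name] if v is not None):
            issues.append(("data type mismatch", name))     # value format error
    return issues

study_data = {"id": [1, 2, 3], "waist": ["81.5", "90.1", "77.0"]}  # waist stored as text
metadata = {"id": {"expected_type": int}, "waist": {"expected_type": float}}
print(check_integrity(study_data, metadata))
# → [('data type mismatch', 'waist')]
```

Resolving such issues first is what makes the subsequent completeness, consistency, and accuracy checks computable at all.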
Completeness
Completeness-related assessments are guided by the question: Are the expected data values available? Results provide knowledge about the frequency and distribution of missing data. The two domains within completeness treat missing data differently. Within the "crude missingness" domain, any specific reasons underlying missing data are ignored, because missing data are often improperly coded and meaningful indicators must nevertheless be computable. A common example is the provision of system-indicated missing values only, such as NA in R; without context information, such coding impedes inferences about why data values are unavailable. In contrast, "qualified missingness" makes use of coded reasons for missing data, such as refusals, met exclusion criteria, or any other reason. The use of such missing codes enables the valid computation of non-response or refusal rates [34].
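The contrast between the two domains can be sketched as follows; the missing codes (-99 for refusal, -98 for a met exclusion criterion) and function names are invented for illustration and are not dataquieR conventions.

```python
# Illustrative sketch: crude vs. qualified missingness.
# The missing codes below are assumptions, not a standard.
MISSING_CODES = {-99: "refusal", -98: "exclusion criterion met"}

def crude_missingness(values):
    """Share of data fields without a usable value, reasons ignored."""
    missing = sum(1 for v in values if v is None or v in MISSING_CODES)
    return missing / len(values)

def qualified_missingness(values):
    """Counts of missing data fields broken down by coded reason."""
    reasons = {}
    for v in values:
        if v is None:  # system-indicated missing: reason unknown
            reasons["uncertain (system missing)"] = reasons.get("uncertain (system missing)", 0) + 1
        elif v in MISSING_CODES:
            reasons[MISSING_CODES[v]] = reasons.get(MISSING_CODES[v], 0) + 1
    return reasons

values = [172, 180, None, -99, 169, -99, -98, 175]
print(crude_missingness(values))      # 4 of 8 fields → 0.5
print(qualified_missingness(values))
```

The crude metric stays computable even when only NA-like values are present, while the qualified breakdown is what permits valid non-response or refusal rates.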
Missing data occur at different stages of a data collection. Reasons for participants not entering a study (1: unit missingness) may differ from those prompting a participant to leave the study after initial participation (2: longitudinal missingness, e.g. drop-out). Further restraints may impede the conduct of a segment of the study, such as a specific examination (3: segment missingness, e.g. not taking part in an ultrasound examination). Within segments, information may not be fully collected (4: item missingness, e.g. refusal to respond to a question). Different sets of actionable information may result at each of these stages, both for data quality management and for statistical analyses. Analysing missing data at stages 1 to 3 should precede the assessment of item missingness.
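The staged logic can be sketched with a toy hierarchy; the data layout (one dict per participant, segments mapping to item dicts, None marking a missing value) is an assumption for illustration only.

```python
# Sketch of staged missingness: unit → segment → item.
# Items are only assessed within units/segments that are not fully missing,
# mirroring the rule that stages 1-3 precede item missingness.
participants = {
    "P1": {"interview": {"smoking": "no", "alcohol": "yes"},
           "ultrasound": {"liver": 1.2}},
    "P2": {"interview": {"smoking": None, "alcohol": "no"},
           "ultrasound": {"liver": None}},   # segment fully missing
    "P3": {"interview": {"smoking": None, "alcohol": None},
           "ultrasound": {"liver": None}},   # unit fully missing
}

def stage_missingness(data):
    unit, segment, item = [], [], 0
    for pid, segments in data.items():
        values = [v for seg in segments.values() for v in seg.values()]
        if all(v is None for v in values):
            unit.append(pid)                      # stage 1: unit missingness
            continue
        for seg_name, seg in segments.items():
            if all(v is None for v in seg.values()):
                segment.append((pid, seg_name))   # stage 3: segment missingness
            else:
                item += sum(1 for v in seg.values() if v is None)  # stage 4
    return unit, segment, item

print(stage_missingness(participants))
# → (['P3'], [('P2', 'ultrasound')], 1)
```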
Correctness: Consistency and Accuracy
Correctness-related analyses are guided by the question: Are data values free of errors? The first dimension, consistency, comprises indicators that use Boolean-type checks to identify inadmissible, impossible, or uncertain data values or combinations of data values. The first domain, range and value violations, targets single data values that do not comply with allowed data values or value ranges [35]. The second domain, contradictions, examines impossible or improbable combinations of multiple data values.
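A minimal sketch of such Boolean-type checks, with an invented admissible range and a contradiction rule; both are illustrative examples, not part of the framework's rule set.

```python
# Range and value violations: single values outside an admissible range.
def range_violations(values, low, high):
    """Indices of values outside the admissible range [low, high]."""
    return [i for i, v in enumerate(values) if v is not None and not (low <= v <= high)]

# Contradictions: a logically impossible combination of two data values.
def contradiction_pregnant_male(records):
    """IDs of records with pregnant == 'yes' and sex == 'male'."""
    return [r["id"] for r in records if r["sex"] == "male" and r["pregnant"] == "yes"]

heights_cm = [172, 45, 180, 310]   # admissible adult range assumed to be 100-250 cm
print(range_violations(heights_cm, 100, 250))   # → [1, 3]

records = [{"id": 1, "sex": "male", "pregnant": "yes"},
           {"id": 2, "sex": "female", "pregnant": "no"}]
print(contradiction_pregnant_male(records))     # → [1]
```

Both checks return item-level flags, which is why the consistency domains report N and % over data fields (Table 1).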
In contrast, indicators within the accuracy dimension use diverse statistical methods to identify unexpected data properties. The first domain, unexpected distributions, targets discrepancies between observed and expected distributional characteristics, e.g. the violation of an expected normal distribution. The second domain, unexpected associations, assesses discrepancies between observed and expected associations. The third domain, disagreement of repeated measurements, targets the correspondence between repeated measurements of the same outcome, for example related to the precision of measurements or the correspondence with gold standard measurements.
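One possible accuracy check on distributional characteristics can be sketched as follows; the tolerance rule, the expected values, and the function name are illustrative assumptions, not the framework's prescribed method.

```python
# Sketch: flag unexpected location (DQI-4003) and unexpected scale (DQI-4005)
# when observed mean/SD deviate from expected values beyond a tolerance.
import statistics

def unexpected_location_scale(values, expected_mean, expected_sd, tol=0.2):
    """Flag deviations of the observed mean/SD beyond tol * expected_sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    flags = {}
    if abs(mean - expected_mean) > tol * expected_sd:
        flags["unexpected location"] = round(mean, 2)
    if abs(sd - expected_sd) > tol * expected_sd:
        flags["unexpected scale"] = round(sd, 2)
    return flags

# Systolic blood pressure assumed to be expected around 125 mmHg (SD 15)
values = [150, 155, 148, 160, 152, 158, 149, 154]
print(unexpected_location_scale(values, expected_mean=125, expected_sd=15))
```

Unlike the Boolean checks of the consistency dimension, the output is a statistical comparison, which is why accuracy domains report diverse statistical measures rather than simple counts.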
Implementations
Various methods exist to compute data quality indicators. For example, different approaches are available to calculate response rates [34] or to assess outliers [36, 37]. Implementations describe the actual computation of data quality indicators. They can be tailored to the specific demands of a data quality assessment and may summarize results from different indicators. Implementations may therefore be linked to any level of the data quality framework hierarchy, for example to provide overall estimates of data quality for some dimension. Changes to implementations do not constitute a modification of the data quality concept.
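For instance, Tukey's fences are one of several possible implementations of the univariate outliers indicator (DQI-4001); six sigma or other rules would serve the same indicator. The sketch below is illustrative, not dataquieR's implementation.

```python
# One implementation of DQI-4001 (univariate outliers): Tukey's fences.
# Swapping the rule (e.g. for six sigma) changes the implementation,
# not the indicator or the data quality concept.
import statistics

def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

values = [12, 13, 12, 14, 13, 12, 40, 13, 11, 12]
print(tukey_outliers(values))  # → [40]
```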
Descriptors
Results of data quality assessments should be available in a machine-readable format, a necessary precondition for automated processing and subsequent aggregation of results. Yet not all data-quality-related information can be expressed in a machine-readable format. For example, histograms or smoothed curves [38] may provide important insights in addition to a statistical test of some assumption about a distribution or association. However, the detection of a data quality issue based on graphs relies on the implicit knowledge of the person inspecting the results. Such output without a machine-readable metric is called a descriptor. All descriptive statistics are descriptors as well: considering a sample mean problematic without an explicit rule-based assessment likewise relies on implicit knowledge. A single descriptor may provide information for different indicators, as there are various possible interpretations. For example, a scatterplot may serve to identify outliers but also to detect unexpected associations and distributional properties.
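The distinction can be sketched as follows: the same assessment yields a machine-readable indicator metric and a descriptor (here, crude histogram bin counts) whose interpretation is left to a person. All names and the binning rule are illustrative.

```python
# Sketch: indicator metric (machine-readable count) vs. descriptor
# (bin counts standing in for a histogram a person would inspect).
from collections import Counter

def assess(values, low, high):
    metric = sum(1 for v in values if v < low or v > high)   # indicator metric
    descriptor = Counter((v // 10) * 10 for v in values)     # crude histogram bins
    return metric, dict(sorted(descriptor.items()))

values = [62, 65, 71, 74, 78, 120]
metric, descriptor = assess(values, low=50, high=110)
print(metric)       # → 1, one value above the admissible range
print(descriptor)   # bin counts for visual inspection; no rule attached
```

Only the metric can be aggregated automatically; judging the bin counts as problematic would require an explicit rule or a person's implicit knowledge.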
Data quality and process variables
Data are collected over time, possibly at different sites, by different examiners using diverse methods, and under varying ambient conditions. Such sources of variability, coded as process variables [39], may affect measurements and result in data quality issues. Unexpected associations of statistical parameters with process variables may constitute data quality problems and can be related to almost all data quality indicators. One example of high practical relevance is examiner effects (indicator: unexpected location, Table 2; implementation: examiner effects - margins, Table 3); another is time trends in the data. Such associations with process variables should routinely be targeted.
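A minimal, unadjusted sketch of an examiner-effect screen compares marginal means per examiner with the overall mean; a full implementation such as acc_margins additionally adjusts for covariates like age and sex, so this version is only illustrative.

```python
# Sketch: examiner effects via marginal means per class of a process
# variable ("examiner"). Unadjusted; real implementations adjust for
# covariates before comparing margins.
import statistics

def examiner_margins(records):
    """Deviation of each examiner's mean measurement from the overall mean."""
    by_examiner = {}
    for examiner, value in records:
        by_examiner.setdefault(examiner, []).append(value)
    overall = statistics.mean(v for _, v in records)
    return {e: round(statistics.mean(vs) - overall, 2) for e, vs in by_examiner.items()}

records = [("A", 120), ("A", 124), ("B", 121), ("B", 123), ("C", 135), ("C", 137)]
print(examiner_margins(records))  # examiner C measures markedly higher
```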
Table 3
Example R-Functions and their Links to The Data Quality Framework
| R-function name | Implementations within the function | Linked with the following indicators |
| --- | --- | --- |
| pro_applicability_matrix() | Checks the correspondence of the study data with the metadata and the accessibility of files. Each study data variable is examined regarding its data type and cross-checked against the data type specified in the metadata. | Unexpected data elements; data type mismatch |
| com_unit_missingness() | Evaluates, at the level of entire observational units, whether all measurements are missing. | Missing measurements (unit level) |
| com_segment_missingness() | Evaluates whether all associated measurements at the level of study segments (e.g. single examinations or instruments) are missing for an observational unit. A pattern plot is provided as a descriptor. | Missing measurements (segment level) |
| com_item_missingness() | Examines, for each variable of the study data, the amount and type of missing data according to specified missing/jump codes, including a count of data fields without any data entry (e.g. NA in R). | Missing measurements (item level); specific missingness; uncertain missingness status |
| con_limit_deviations() | Assesses limit deviations with regard to inadmissible and improbable values and counts deviations above/below the specified thresholds. Limits may comprise hard limits to identify inadmissible values, soft limits to identify improbable values, and detection limits, which refer to censoring based on the properties of the measurement devices used. | Inadmissible numerical values; inadmissible time-date values; uncertain numerical values; uncertain time-date values |
| con_inadmissible_categorical() | Compares single data values against the admissible categories, summarizes observed vs. expected data values, and counts the violations. | Inadmissible categorical values |
| con_contradictions() | Compares two data values of the same observational unit using one of 16 logical comparisons and counts the number of contradictions. | Logical contradictions; empirical contradictions |
| acc_distributions() | Creates distributional plots (bar chart or histogram) for numerical measurements (float, integer). If a grouping variable is provided, stratified empirical cumulative distribution functions (ECDF) are plotted as well [16]. | Indicators within the unexpected distributions domain |
| acc_univariate_outlier() | Computes distributional characteristics of numerical measurements (e.g. mean, standard deviation, skewness) and applies four different rules to identify univariate outliers, e.g. Tukey, Hubert, and six sigma [51–53]. Counts the number of outliers and indicates the direction (low/high). | Univariate outliers |
| acc_multivariate_outlier() | Computes the Mahalanobis distance of at least two variables and counts the number of extreme measurements. In a heuristic approach, outlier identification applies simple univariate rules [51–53] to the Mahalanobis distance to reduce computational costs. | Multivariate outliers |
| acc_shape_or_scale() | Tests the observed distribution of measurements against a predefined distributional assumption (normal, gamma, uniform). Deviations from expected distributions are visualized using the idea of rootograms [51, 54]. | Unexpected shape; unexpected scale |
| acc_end_digits() | Examines digit preferences in manually collected data, i.e. the preference of end digits. The function assumes a uniform distribution of end digits and applies a rootogram-like visualization [51, 54]. | Unexpected shape |
| acc_margins() | Compares the marginal distribution of different classes (e.g. examiners, devices) using measurements adjusted for covariates (e.g. age, sex). Adjusted linear models, logistic regression, or Poisson regression are used to model marginal means of continuous measurements, binary data, and count data [55]. | Unexpected location; unexpected proportion |
| acc_varcomp() | Computes the variance proportion explained by different classes (e.g. examiners, devices) in relation to the overall variance of the measurement. Depending on the data, ANOVA or mixed-effects models are applied [56, 57]. | Unexpected location |
| acc_loess() | Computes loess-smoothed trends of measurements across different classes over time and displays them as a descriptor. The raw measurements can be adjusted for covariates such as age or sex, and the resulting residuals are smoothed over time using LOESS [38]. | Indicators within the unexpected distributions domain, foremost unexpected location; unexpected proportion |
Using R and the data quality workflow
Data quality can be assessed using the R package dataquieR. Table 3 provides an overview of the applied computational and statistical methods. dataquieR can be used in two ways: (1) all at once, without in-depth specification of parameters, using the function dq_report() to create complete default reports, or (2) step by step, allowing for a detailed data quality assessment in a sequential approach. The first option checks the availability of metadata and applies all appropriate functions to the specified study data. A flexdashboard [40] is then generated that summarizes the results by data quality dimension and variable.
In contrast, the sequential approach allows for specific parameter settings, changes to the output, corrections and modifications of the data, and stratification according to additional variables. Examples of the step-by-step approach are shown in Fig. 3 using SHIP data. For the sake of clarity, only five variables (data elements) have been selected for display. First, the applicability of implementations to each data element was checked. The data type of "waist circumference" did not comply with the data type specified in the metadata (Fig. 3, panel a, top left). After resolving this issue, further data quality checks were conducted. Item missingness was tabulated to provide insights into the different reasons for missing data at this level (Fig. 3, panel b, bottom left). Afterwards, the consistency of the data was examined with respect to limit deviations (Fig. 3, panel c, top right). Among the different applications addressing accuracy, the adjusted margins function compares mean values across observers to address examiner effects while adjusting for a vector of covariates (Fig. 3, panel d, bottom right). A commented example is available in the tutorial section of the webpage.