A total of 192 candidate articles were identified in the five journals for the period January 1 to July 15, 2018 (TABLE 1). Of these, 25 articles were included in this review (FIGURE 1, TABLE 1).
TABLE 1: Search and selection of articles

| | NEJM | JCO | Lancet | JAMA | CIRC | Total |
|---|---|---|---|---|---|---|
| Selected papers via Pubmed search | 11 | 63 | 21 | 29 | 68 | 192 |
| Included according to criteria after reviewing abstract | 7 | 22 | 12 | 19 | 45 | 105 |
| Included according to criteria after reviewing full-text article | 6 | 21 | 10 | 19 | 44 | 100 |
| Randomly selected for review | 5 | 5 | 5 | 5 | 5 | 25 |
Data sources for these observational studies included national registries, health insurance databases, health records from single or multiple hospitals, and cohort studies (TABLE 2). Twelve of the 25 studies were based in the USA. Studies had large sample sizes (median = 11,422 participants; IQR: 1,850 to 144,816). Survival endpoints (19/25) and binary outcomes (5/25) were the most common outcome types.
TABLE 2: Characteristics of the included studies

| Study | Journal | Location | Years of participant selection* | Study size* | Data source* |
|---|---|---|---|---|---|
| Inohara et al (8) | JAMA | USA | 2013-2016 | 141,311 | Stroke registry |
| Purnell et al (9) | JAMA | USA | 1995-2014 | 453,162 | Transplant registry |
| Reges et al (10) | JAMA | Israel | 2005-2015 | 33,540 | Multiple hospitals |
| Snyder et al (11) | JAMA | USA | 2006-2007 | 8,529 | Cancer registry |
| Yu et al (12) | JAMA | China | 2004-2008 | 271,217 | Nationwide biobank |
| Biccard et al (13) | Lancet | 25 African countries | 2016 | 11,422 | Multiple hospitals |
| Wood et al (14) | Lancet | 19 high-income countries | 1964-2010 | 599,912 | Multiple CVD registries and a biobank |
| Dziadzko et al (15) | Lancet | USA | 2000-2010 | 1,294 | Single hospital and a medical registry of area residents |
| Zylbersztejn et al (16) | Lancet | UK, Sweden | 2003-2013 | 4,946,246 | Hospital episode registries, birth and death registries |
| Gilbert et al (17) | Lancet | UK | 2013-2015 | 22,139 | Hospital episode registry; death registry |
| Alexander et al (18) | Circulation | Australia | 1987-1996 | 80 | Childhood cardiomyopathy registry |
| Nazerian et al (19) | Circulation | Brazil, Germany, Italy, Switzerland | 2014-2016 | 1,850 | Multiple hospitals |
| Pollack et al (20) | Circulation | USA, Canada | 2011-2015 | 2,500 | Resuscitation outcomes registry |
| Puelacher et al (21) | Circulation | Switzerland | 2014-2015 | 2,018 | Single hospital |
| Chao et al (22) | Circulation | Taiwan | 1996-2015 | 32,160 | Health insurance database |
| Chow et al (23) | JCO | USA | 1962-2001 | 13,060 | Multiple hospitals |
| Kenzik et al (24) | JCO | USA | 2000-2011 | 72,408 | Cancer registry and health insurance database |
| Degnim et al (25) | JCO | USA | 1967-2001 | 669 | Single hospital |
| Gundle et al (26) | JCO | USA | 1989-2014 | 2,217 | Single hospital |
| Clarke et al (27) | JCO | USA | 2003-2015 | 944,227 | Multiple hospitals |
| Hoen et al (28) | NEJM | French territories in the Americas | 2016 | 555 | ZIKV pregnancy population cohort |
| Amarenco et al (29) | NEJM | Europe, Asia, Latin America | 2009-2011 | 3,356 | Stroke registry |
| Calderon et al (30) | NEJM | Israel | 1980-2014 | 1,522,731 | Renal registry and population cohort |
| Kyle et al (31) | NEJM | USA | 1960-1994 | 1,384 | Single hospital |
| Mead et al (32) | NEJM | USA | 2016-2017 | 184 | ZIKV male population cohort |

*Only the development sample size (i.e., not the validation sample size), or the population of main interest for the analysis (i.e., not matched populations), is reported here.
Reporting of initial data analyses
Data cleaning
Ten out of 25 papers (40%) included a statement about data cleaning. The statements were often general, as illustrated by the following examples:
- “Clinically improbable laboratory values were removed.” (10)
- “The statistical analysis was performed on the data entered, checked, if necessary corrected and validated by the centers.” (28)
- “Registrars were asked to follow-up with outside institutions in an effort to try to ensure data completeness, but actual data completeness was not measured.” (11)
Sufficient information about the nature of the problems encountered during data cleaning, or about the number of records in which errors were detected and corrected, was not reported. Consequently, even when data cleaning was mentioned, little is known about the process and its potential impact. More detail was provided when the rules for correcting data values were reported explicitly, or when the range of admissible values and the number of records with values outside that range were given in the Supplement (10). One paper included the computer code used for data cleaning in the Supplement (20), which made the data cleaning potentially reproducible.
The information about data cleaning was reported in Methods (n=5), Discussion (n=3) or Supplement (n=4).
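The range checks and error counts described above can be sketched as follows; the variable names and admissible ranges here are hypothetical illustrations, not the rules used in any of the reviewed studies:

```python
# Minimal sketch of range-based data cleaning with an audit trail.
# Variable names and admissible ranges below are hypothetical examples.
ADMISSIBLE_RANGES = {
    "systolic_bp_mmHg": (60, 250),
    "weight_kg": (30, 250),
}

def clean(records):
    """Set clinically improbable values to None and count corrections per variable."""
    corrections = {var: 0 for var in ADMISSIBLE_RANGES}
    for record in records:
        for var, (low, high) in ADMISSIBLE_RANGES.items():
            value = record.get(var)
            if value is not None and not (low <= value <= high):
                record[var] = None        # treat the improbable value as missing
                corrections[var] += 1     # this count is what should be reported
    return records, corrections
```

Reporting the contents of `corrections` alongside the admissible ranges would make the data cleaning step transparent and reproducible.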
Data screening
Data screening examines data properties that are not directly related to the research questions but that may affect the interpretation of results from statistical models or may lead to updating the analysis plan (3). This includes a systematic review of the distribution of variables and of missing data. Understanding associations between variables can support decisions about modeling and the later interpretation of results. Statements about data screening were grouped by outcome and non-outcome variables and by location in the papers (TABLE 3). Such variables could be described with quantitative or graphical data summaries.
TABLE 3. Number of papers with data screening statements by location in the paper.

| | Mentioned in papers, n (%) | M | R | D | S |
|---|---|---|---|---|---|
| Description of non-outcome variables | 25 (100%) | 5 | 24 | 0 | 15 |
| Description of missing values of non-outcome variables | 19 (76%) | 6 | 12 | 0 | 6 |
| Reporting association between non-outcome variables | 14 (56%) | 5 | 6 | 0 | 5 |
| Description of non-outcome variables for subgroups | 21 (84%) | 2 | 19 | 1 | 11 |
| Description of transformation of non-outcome variables | 10 (40%) | 4 | 4 | 0 | 2 |
| Description of outcome variable(s) | 25 (100%) | 2 | 25 | 0 | 9 |
| Information on missing values for outcome variables | 12 (48%) | 3 | 7 | 3 | 4 |
| Description of methods for outcome variables | 19 (76%) | 13 | 4 | 0 | 1 |
| Description of missingness of subjects | 15 (60%) | 1 | 11 | 2 | 5 |
| Description of transformations in outcome variables | 7 (28%) | 1 | 6 | 0 | 0 |

Abbreviations (location in paper): M = Methods, R = Results, D = Discussion, S = Supplement
A common aspect of data screening is the description of non-outcome variables. These were presented in all articles, most commonly in the Results section (n=24), but also in the Supplement (n=15) and occasionally in the Methods (n=5). Most articles reported this information in tables (n=21) and text (n=20); data visualizations were rarely used (n=2). The statistical methods used to describe non-outcome variables were reported in 19 articles. Information about associations between non-outcome variables was included in 14 papers (56%). Information on missing values of non-outcome variables was reported in 19 papers (76%), most often in the Results (n=12), but also in the Methods and the Supplement (n=6 each). Ten papers provided information about distributions of non-outcome variables that later prompted a change in the analysis plan; this information appeared in the Results (n=4), Methods (n=4) and Supplement (n=2), and referred mainly to categorizing numerical non-outcome variables. Some studies reported categories with small frequencies, which led to a sparser grouping than originally intended (27,29). In one study (8), the adequacy of a non-outcome variable was checked in the IDA: “Comparison of the multilevel model to a non-multilevel model (likelihood-ratio test) indicated a significant clustering effect of testing intensity by facility (P < .001). […] Therefore, the [observed/expected] ratio for each facility was calculated based on the sum of the individuals from that facility. The facility was categorized into high intensity or low-intensity categories for comparison.” (11). However, it remained unclear to what degree the variable definition was preplanned, and what action would have been taken if the likelihood-ratio test had not been significant.
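The regrouping of sparse categories reported in (27,29) can be sketched as follows; the 5% threshold and the category labels are hypothetical choices for illustration, not taken from the reviewed studies:

```python
from collections import Counter

def collapse_sparse(values, min_share=0.05, other_label="other"):
    """Merge categories observed in fewer than min_share of records into one group."""
    counts = Counter(values)
    n = len(values)
    frequent = {cat for cat, count in counts.items() if count / n >= min_share}
    return [v if v in frequent else other_label for v in values]
```

An IDA report applying such a rule would ideally state which categories were collapsed and how many records were affected.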
Data screening statements for outcome variables were included in all articles, and 72% (n=18) indicated the methods used to describe them. Item missingness was reported in 11 papers (44%), unit missingness in 15 papers (60%).
Changes in the analysis plan
Eleven papers (44%) mentioned some changes in the analysis plan. Reported changes referred to missing data treatment, unexpected values, population heterogeneity and aspects related to variable distributions or data properties (TABLE 4). The reporting of such changes could be found in all sections of the paper except in the Introduction.
TABLE 4. Number of papers with changes of the analysis plan statements by location in the paper.

| Reasons for change | Number of papers, n (%) | M | R | D | S |
|---|---|---|---|---|---|
| Unexpected values | 2 (8%) | 2 | 0 | 1 | 0 |
| Heterogeneity | 1 (4%) | 0 | 1 | 0 | 0 |
| Unexpected confounding | 2 (8%) | 1 | 1 | 2 | 0 |
| Variable distribution | 4 (16%) | 3 | 1 | 1 | 0 |
| Other data properties | 2 (8%) | 2 | 0 | 0 | 0 |
| Missing data | 5 (20%) | 4 | 1 | 1 | 0 |

Abbreviations (location in paper): M = Methods, R = Results, D = Discussion, S = Supplement
Changes were described as follows:
- Due to variable distributions, categories were grouped or numerical variables were categorized based on findings from IDA.
  - “Because few women were underweight (1.2%), we combined underweight with normal BMI (normal/underweight) and performed a sensitivity analysis excluding the underweight group.” (27)
  - Chow et al resolved patient classification problems by using the lower category: “If insufficient information was available to distinguish between grades, the lower grade was applied.” (23)
  - Gilbert et al observed that “patients had Hospital Frailty Risk Scores ranging from 0 to 99, but this was heavily skewed to the right” and categorised the score into three risk levels (17).
- In some papers, revising the planned statistical model and including additional variables was the result of unexpected confounding found during IDA.
  - In the Discussion, Reges et al acknowledged that “There was a higher proportion of low SES among nonsurgical patients after matching. Given the higher mortality among low SES patients in general, SES could have been a confounder. This and other potential confounding characteristics were adjusted for in the models.” (10)
  - Pollack et al adjusted their analysis for potential confounders: “For example, bystander AED shock was more likely to receive bystander CPR, so we adjusted for this covariate in the analysis,” acknowledging that observed differences in survival could not be attributed solely to the type of help received by patients (20).
- Inclusion and exclusion criteria were modified because of unexpected values or population heterogeneity, leading to a change in the study population.
  - Biccard et al substantially relaxed the inclusion criteria because “more than half the countries in our study could not fulfill the protocol requirements for an included sample, and in hindsight these rules were inappropriately strict despite formal acceptance by the national leaders of these requirements before the study began.” (13)
  - Yu et al excluded from the analyses the “participants from Zhejiang (n=56,813) where heating was rarely reported (0.6%).” (12)
- Methods to handle missing data in the analysis, or inclusion/exclusion criteria, were updated.
  - Snyder et al used multiple imputation for two non-outcome variables with more than 5% missing values: “Two variables, perineural invasion and lymphovascular invasion, had more than 5% missing values. Multiple imputation by chained equations was used to substitute predicted values for missing values with 20 imputed values.” (11)
  - Amarenco et al excluded data from some study sites and performed subgroup analyses, some of which were not prespecified: “Sites with follow-up data on more than 50% of their enrolled patients at 5 years were selected for the analysis in this report, and all reported results pertain to this selected cohort.” (29)
  - Zylbersztejn et al used data screening to exclude hospitals with low-quality data: “We excluded hospitals with high proportions of missing data or evidence of linkage error to address incomplete recording of risk factors at birth. We included hospitals with more than 500 births a year, with high completeness of recorded birthweight and gestational age, and hospitals where at least half of all deaths were linked to a death certificate”, and “We developed criteria for identifying hospitals with high completeness of gestational age and birth weight, and high quality of linkage with ONS mortality data in an iterative process.” (16)
- Other data properties may influence statistical models.
  - Wood et al excluded from combined analyses of several data sources “studies with fewer than five incident cases of a particular outcome” to avoid model overfitting (14).
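As a minimal illustration of a distribution-driven change, a heavily right-skewed score can be categorised into three ordinal risk levels, as Gilbert et al did with the Hospital Frailty Risk Score (17); the cut points below are hypothetical and are not those used in that study:

```python
def risk_level(score, low_cut=5.0, high_cut=15.0):
    """Map a right-skewed numeric score to three ordinal risk levels.

    The cut points here are illustrative only; in practice they would be
    chosen from the observed distribution (e.g. during IDA) or on clinical
    grounds, and reported explicitly.
    """
    if score < low_cut:
        return "low"
    if score <= high_cut:
        return "intermediate"
    return "high"
```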
Sensitivity analyses
Sensitivity analyses are commonly used to check the robustness of models and conclusions. They are often pre-planned in the study design phase, but they can also be a consequence of IDA and planned before the main analyses, rather than relying on post hoc analyses. For example:
- “Because few women were underweight (1.2%), we combined underweight with normal BMI (normal/underweight) and performed a sensitivity analysis excluding the underweight group.” (27)
- Inclusion criteria were relaxed during the data collection process and it was noted that “Before analysis we therefore decided to present the data describing the full cohort, and include a per-protocol analysis of the predefined representative sample for comparison.” (13)
- “Event rates were estimated among the overall study sample (main analysis), among patients evaluated by a stroke specialist within 24 hours after symptom onset (prespecified sensitivity analysis), and among patients from the 33 sites with follow-up data on more than 80% of their patients at 5 years (post hoc sensitivity analysis)” (29)
We note that it was sometimes difficult to decide whether information about a certain action reflected a consequence of IDA or a preplanned step. For example, the statement “If insufficient information was available to distinguish between grades, the lower grade was applied.” (23) may reflect a rule developed during IDA, but it may also reflect a rule already specified in the study protocol.