Real data results
The majority of the 10,000 potential trials were excluded (Table 2). The three most common reasons were:

- the XML file was not available despite the trial being on the open access PubMed Central database,
- there was no baseline table or one could not be detected by our algorithm,
- it was not a randomised trial.
Table 2
Number of excluded trials and reasons.
| Reason | n | Percent |
| --- | --- | --- |
| No baseline table | 2,824 | 37.4% |
| Full text page not available | 2,127 | 28.2% |
| Not a randomised trial | 1,547 | 20.5% |
| No sample size | 408 | 5.4% |
| Just one column in table | 304 | 4.0% |
| Follow-up results in baseline table | 162 | 2.1% |
| Just one sample size | 72 | 1.0% |
| Pre-post comparison | 43 | 0.6% |
| Single-arm study | 43 | 0.6% |
| Difficult layout | 14 | 0.2% |
| Could not detect statistics | 9 | 0.1% |
| Total | 7,553 | 100.0% |
A previous study found that 92% of trials included a baseline table [1]; hence we are likely excluding trials that had a baseline table that our algorithm did not detect. Often this was because the baseline table was in a graphical format, meaning the numbers in the table were not in the XML file. There are likely also exclusions where the study was not a randomised trial and hence no baseline table was included. There were also trials that genuinely did not include a baseline table (e.g., PMC3574512).
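As a hypothetical illustration of this failure mode, a minimal detection check might look like the sketch below. The element names follow the JATS convention used by PubMed Central, but this is illustrative only, not our actual extraction code; a baseline table supplied as an image has no parseable cells and is invisible to any check of this kind.

```python
import xml.etree.ElementTree as ET

def has_detectable_baseline_table(xml_text: str) -> bool:
    """Return True if any <table-wrap> mentions 'baseline' in its text
    AND contains a parseable <table> element.

    Hypothetical sketch: real PubMed Central JATS XML is richer than this,
    and a table supplied only as a <graphic> (an image) has no <table>
    cells, so it cannot be detected here.
    """
    root = ET.fromstring(xml_text)
    for wrap in root.iter("table-wrap"):
        caption_text = "".join(wrap.itertext()).lower()
        if "baseline" in caption_text and wrap.find(".//table") is not None:
            return True
    return False

# A baseline table provided only as an image cannot be detected:
xml_graphic_only = """<article><table-wrap>
  <caption><p>Baseline characteristics</p></caption>
  <graphic href="table1.jpg"/>
</table-wrap></article>"""
```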
The summary statistics extracted from the included trials are shown in Table 3.
Table 3
Summary statistics from the baseline tables and those included in the Bayesian model.
| Included | Statistic | n | Percent |
| --- | --- | --- | --- |
| Yes | Percent | 37,291 | 52 |
| Yes | Continuous | 27,544 | 38 |
| Yes | Numbers | 2,818 | 4 |
| Yes | Confidence interval | 59 | 0 |
| No | Median | 4,117 | 6 |
| No | P-values | 206 | 0 |
| No | Range | 97 | 0 |
| | Total | 72,132 | 100 |
There were 2245 included trials with a total of 51,243 table rows. The median number of rows per baseline table was 13 and the median number of columns was 2. The central 50% of published trial dates were between 15 June 2016 and 15 June 2020.
Table 4
Results for the trials from PubMed Central: percentages of trials that were not flagged or were flagged as under- or over-dispersed, using three thresholds (0.50, 0.80 and 0.99) for the study-specific probability of a non-zero change in the precision.
| Probability threshold | No issue (%) | Under-dispersed (%) | Over-dispersed (%) |
| --- | --- | --- | --- |
| 0.50 | 66.6 | 10.6 | 22.9 |
| 0.80 | 74.6 | 5.5 | 20.0 |
| 0.99 | 81.3 | 2.0 | 16.6 |
The results for the trials from PubMed Central are in Table 4 and show that most trials (81.3% for the 0.99 threshold) had no issue. The most common issue was results that were over-dispersed (16.6% for the 0.99 threshold), with fewer trials being under-dispersed (2.0% for the 0.99 threshold).
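The flagging rule can be sketched as follows. This is a hypothetical re-implementation, not the paper's code: it assumes a signed change in precision where a positive value means the t-statistics are more concentrated than expected (under-dispersed) and a negative value means they are more spread out (over-dispersed).

```python
def flag_trial(prob_nonzero: float, precision_change: float,
               threshold: float) -> str:
    """Classify one trial from the posterior probability of a non-zero
    change in precision and the direction of that change.

    Hypothetical sketch: positive precision_change = under-dispersed,
    negative = over-dispersed; below the probability threshold = no issue.
    """
    if prob_nonzero < threshold:
        return "no issue"
    return "under-dispersed" if precision_change > 0 else "over-dispersed"
```

Raising the threshold from 0.50 to 0.99 moves borderline trials into the "no issue" category, which matches the pattern across the rows of Table 4.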
The t-distributions for three trials that were flagged as over-dispersed are plotted in Fig. 2. For comparison, three randomly selected trials with no issue are also plotted. The three flagged trials were selected using the smallest multiplier of the precision (\(\epsilon\)) and hence show the most extreme over-dispersion. In each case there are a small number of t-statistics that are extremely large.
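The per-row comparisons behind these plots can be illustrated with the classical two-sample t-statistic computed from published summary statistics. This is a sketch of the idea only; the paper's comparisons come from a Bayesian model, and the shifted-decimal example below is hypothetical.

```python
import math

def t_statistic(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance two-sample t-statistic from published means,
    standard deviations and group sizes (classical analogue of the
    per-row comparison described in the text)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

# Similar group means give a small t-statistic, as expected at baseline...
t_small = t_statistic(50.2, 10.0, 100, 49.8, 10.0, 100)
# ...while a data error (e.g. a shifted decimal point) makes it extreme.
t_large = t_statistic(502.0, 10.0, 100, 49.8, 10.0, 100)
```

A single extreme row like `t_large` is enough to flag a trial as over-dispersed, which is the pattern visible in Fig. 2.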
Two flagged trials were due to errors in our data extraction (PMC7005701 and PMC7301747). For PMC7005701, a categorical variable was wrongly included as continuous because the variable label included the word “score”, which generally indicates a continuous summary statistic. The t-statistic for this row is over 400 and hence the trial was flagged as over-dispersed.
The result for PMC7301747 is an example of where an error in our data extraction creates a false impression of variability. The error occurs for large numbers such as “15,170 (7,213)”, which our algorithm extracts as three statistics: 15170, 7 and 213. This is because a comma is used both as a thousands separator in large numbers and as a separator between two statistics, such as a range, within round brackets. The t-statistic for this row is over 1,000 and hence the trial was flagged as over-dispersed.
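The ambiguity can be reproduced with two simple parsing rules (hypothetical regexes, not our actual extraction code): treating “d,ddd” as a thousands separator recovers both numbers correctly, while treating every comma as a statistic separator splits them. The error described above arose from mixing the two rules, applying one outside brackets and the other inside.

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def parse_as_thousands(cell: str):
    """Remove commas that look like thousands separators (a comma between
    a digit and exactly three digits), then extract the numbers."""
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", cell)
    return [float(x) for x in NUM.findall(cleaned)]

def parse_as_separator(cell: str):
    """Treat every comma as separating two statistics: large numbers
    are wrongly split into their digit groups."""
    return [float(x) for x in NUM.findall(cell)]
```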
The baseline table in trial PMC7302483 had a non-standard layout with four summary statistics per group in four separate columns. The comparisons for this trial were between different summary statistics for the same group.
The t-distributions for three trials that were flagged as under-dispersed are plotted in Fig. 3.
One flagged study was not a randomised trial but a case–control study with an age- and gender-matched control group (PMC2176143); hence it was not surprising that the summary statistics in the baseline table were very similar.
One trial used non-standard column headings for the baseline table which meant the data were read in as four rather than two groups (PMC6034465).
One trial labelled proportions (\(\in [0,1]\)) as percentages (\(\in [0,100]\)) and hence it looked as if there were lots of zero percentages which meant the two randomised groups appeared highly similar (PMC7578344).
The results from Figs. 2 and 3 show that the most extreme results in terms of precision and variability are often failures of the algorithm’s data extraction. Hence, we next look at less extreme results by excluding flagged trials in the tails of the precision distribution, i.e., the most extremely under- or over-dispersed results (see Additional file 1 for the distribution). We show three further examples in Fig. 4 for over-dispersion and in Fig. 5 for under-dispersion.
One trial stratified the randomised groups on a severity variable which created large between group differences and hence the over-dispersion (PMC4074719).
A trial that was flagged as over-dispersed had standard deviations for height that were zero (PMC6230406). This is likely a presentation error as zero standard deviations would require all participants in the two groups to have exactly the same height, with a different common height in each group.
One study was not a trial but an observational study with some very large differences between groups at baseline: four absolute t-statistics were larger than 10, including one that was labelled as not significantly different based on a Mann–Whitney test but had a t-statistic of 19 (PMC6820644).
Using the lower threshold of a precision multiplier under 10 flags a trial where the t-statistics for all four comparisons are within −0.3 to 0.3 (PMC3136532). The percentage of women was equal in both groups, which the authors said was due to the groups being “blocked by gender”; however, we presume they meant “stratified by gender”.
A trial that was flagged had four t-statistics within − 0.3 to 0.3 (PMC6709840). The randomisation was stratified on age and one of the percentage variables, which somewhat explains the under-dispersion.
A flagged trial compared only age and gender, but had four groups (PMC7245605). The 12 t-statistics were within −0.6 to 0.5. The trial was a mix of “healthy controls” and randomised groups, but there was no mention of matching for the controls, which could have explained the under-dispersion.
An alternative method for finding under- or over-dispersed trials is to examine where the lower limit of the credible interval for the precision multiplier (\(\epsilon_i\)) is above 0. Example under-dispersed trials from this approach include a trial with 19 t-statistics between −0.8 and 0.6 (PMC2885597), and a trial with 8 t-statistics between −0.8 and 0.6 and with clearly incorrect p-values (PMC6165973).
In the plots above we used histograms to summarise the t-statistics. An alternative approach is using a cumulative distribution function and we show an example of that in Additional file 2.
Predictors of under- or over-dispersion
We examined which study design features were associated with the probability of a non-zero dispersion parameter, indicating under- or over-dispersion. The six predictors selected by the elastic net approach are in Table 5.
Table 5
Estimated predictors of the probability of dispersion showing the mean change in probability and 95% confidence interval.
| Predictor | Mean | 95% CI |
| --- | --- | --- |
| Standard error | 0.32 | 0.21 to 0.43 |
| Difference in group sample sizes of 10+ | 0.12 | 0.08 to 0.16 |
| Number of columns (+1) | 0.10 | 0.08 to 0.12 |
| Proportion continuous (0.0 to 1.0) | 0.10 | 0.06 to 0.14 |
| Sample size (per doubling) | 0.03 | 0.02 to 0.04 |
| Block randomisation | -0.05 | -0.09 to -0.01 |
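The variable-selection step can be sketched as follows: the elastic net adds a penalty mixing L1 (lasso) and L2 (ridge) terms to the model fit, and predictors whose coefficients are shrunk exactly to zero by the L1 part are dropped. This is a minimal illustration of the selection rule with hypothetical coefficient values, not our fitted model.

```python
def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Elastic net penalty: l1_ratio=1 is the lasso, l1_ratio=0 is ridge.
    The L1 term is what shrinks some coefficients exactly to zero."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

def selected_predictors(names, coefs, tol=1e-8):
    """Predictors with a non-zero penalised coefficient are 'selected'."""
    return [n for n, b in zip(names, coefs) if abs(b) > tol]
```

Under this rule, a predictor such as a journal whose coefficient is shrunk to zero simply never appears in a table like Table 5.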
The probability of a non-zero dispersion was much higher in baseline tables that wrongly used the standard error of the mean instead of the standard deviation. This is as expected given that the standard error will be far smaller than the standard deviation and hence small differences could look like over-dispersion.
The probability increased when there were large differences in group sample size and for each additional column in the table. In both cases this could be because the trial was not a simple comparison of, for example, treatment versus control (two columns), but might include subgroups, such as gender or disease severity. These strata will likely create over-dispersion as the comparisons are no longer between randomised groups.
The probability increased when the baseline table had a greater proportion of continuous variables. This is likely because of the greater statistical power for continuous variables compared with categorical. Similarly the probability increased with greater sample size and hence greater power. The number of rows in the table was not a predictor, but in a separate simulation we confirmed that—as expected—the power to detect under-dispersion increased for larger tables (see Additional file 1).
Block randomisation was associated with a decreased probability of dispersion; this technique can help balance group characteristics when there is a strong correlation over time in the characteristics of participants. The size of this effect is surprising, and it may be that some authors confused block randomisation with stratified randomisation (e.g., PMC3136532) or that these two techniques were often used in combination. We did not add stratified randomisation as a predictor because “stratified” was often also used to describe analyses, although we could have assumed that any stratification mentioned in the analysis also meant the randomisation was stratified.
No journals or countries were selected by the elastic net variable selection, meaning none were associated with dispersion. However, the total number of trials was small for most journals and some countries, which reduces the statistical power. The largest number of trials for a single journal was 85.