## Implications for the reliability of trials in spine journals

In these analyses, we were unable to independently replicate the p-value calculation for about one third of the variables with a reported baseline p-value that had been extracted by Levayer and colleagues. Of those we could replicate, about 1 in 5 used a chi-square test even though the data were sparse, with expected cell counts too small for the chi-square test, indicating that an exact test (e.g. mid-p or Fisher’s exact test) should have been considered. In two-arm trials, there was a 20–25% excess, relative to the expected number, of variables in which the between-group difference in frequency counts was 1 or 2. Similar findings have been reported in groups of trials with integrity concerns [4]. There were small differences between the observed and empirically calculated distributions of baseline p-values, particularly for those with sparse data. However, the importance of these findings remains uncertain. Collectively, these issues raise concerns about the reliability of some of the trials in this dataset.
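The sparse-data point above can be illustrated with a minimal sketch. It assumes a 2x2 table (two trial arms by presence or absence of a characteristic) and the common rule of thumb that a chi-square test is questionable when any expected cell count is below 5. That threshold, the function names, and the example counts are illustrative assumptions, not the procedure used in our analyses or by Levayer and colleagues.

```python
from math import comb

def expected_counts(a, b, c, d):
    """Expected cell counts for the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum the probabilities of every table
    with the same margins that is no more likely than the observed table."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x events in the first arm.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small tolerance so that ties with the observed table are included.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def suggested_test(a, b, c, d, min_expected=5):
    """Rule-of-thumb choice (assumed threshold): exact test if any expected count is small."""
    smallest = min(min(row) for row in expected_counts(a, b, c, d))
    return "exact" if smallest < min_expected else "chi-square"

# Hypothetical variable: 2/20 participants in one arm and 1/20 in the other.
print(suggested_test(2, 18, 1, 19))          # smallest expected count is 1.5
print(fisher_exact_two_sided(2, 18, 1, 19))  # every possible table is at least as extreme
```

In this hypothetical example the smallest expected count is 1.5, so the rule of thumb points to an exact test, and the two-sided Fisher p-value is 1 because no table with the same margins is more probable than the observed one.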

## Comparison to the previous results

Levayer and colleagues concluded that there was no evidence of systemic fraudulent behaviour or non-random allocation in these trials [1]. They commented that their approach “will detect fraud, provided it presents as non-random allocation.” They therefore suggested that readers can “generally assume that RCTs published in these spine journals are genuine” [1].

We think that any suggestions of fraud based on assessment of publications should be avoided, because such assessments can only determine the existence of issues such as errors, impossible or improbable data, plagiarism, image manipulation, and statistical patterns inconsistent with randomisation [5]. These issues can be grouped under the term publication integrity, which can be readily assessed. However, if compromised integrity is identified, the reasons for the problems (such as honest error or questionable research practices including fraud or fabrication) can only be determined from an investigation [5]. The reason for the compromised integrity matters little to the readers of the publications, whose main concern is that the publication is reliable. For these reasons, we think it unwise to make any inferences about the presence or absence of fraud in these trials.

Putting that important issue aside, there are a number of other reasons to be cautious. Firstly, 167 trials were assessed, but 11 had only a single baseline p-value, and 98 (59%) had 5 or fewer p-values. The number of p-values in a study needed to draw a reliable conclusion from analyses of baseline p-values is not known, but it is likely to be considerably higher than 5, unless the p-values are extremely skewed. Among trials with at least 5 and at least 10 baseline p-values, 3/69 (4%) and 2/10 (20%), respectively, were identified using the threshold of a study-wise p-value of > 0.99, and a further 5 trials with > 5 baseline p-values were identified using the thresholds of < 0.05 or > 0.95. One of the 3 trials exceeding the 0.99 threshold has been retracted. This seems less reassuring than Levayer and colleagues’ conclusion suggests.

Furthermore, the test the authors used, the Carlisle-Fisher-Stouffer method, only detects situations where there is an excess of baseline p-values close to 0 or close to 1. Carlisle made it clear that he was looking for outliers: trials that did not conform to this test of baseline data [6]. Carlisle was not seeking to detect every case of fraud [6]. Outlying results might have many explanations, one of which, and perhaps the least likely, is fabricated data. As an example of where this technique does not work well, a study could have 10 baseline p-values between 0.45 and 0.55, a very unlikely distribution, but the study-wise p-value would be close to 0.5. In addition, Carlisle used the technique for continuous data, and to our knowledge, it has not previously been applied to categorical data. Categorical data differ from continuous data in that the baseline p-values are not uniformly distributed and, particularly when the sample size is small, can take only a small number of discrete values [2]. The combination of small sample sizes and sparse data means a high proportion of p-values calculated using Fisher’s exact test will be 1 (Fig. 2A). However, Levayer and colleagues chose to use the chi-square test rather than exact tests, which may have affected the calculated study-wise p-values. If a baseline p-value is 1, the z-score used in the Carlisle-Fisher-Stouffer method cannot be derived. Levayer and colleagues chose to substitute a value of 3 for the z-score, comparable to a p-value of 0.998, including for the situation where all the frequency counts were 0 (e.g. all or none of the participants in the trial had the characteristic). It is not certain what the best approach should be for these situations: possibilities include excluding sparse data, calculating one-way p-values for variables with frequency counts of 0, or using different statistical tests. However, if different statistical tests are used, the baseline p-value distribution may change substantially but remain different from the expected distribution (Figs. 2 and 3).
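Two of the points above, the insensitivity to p-values clustered around 0.5 and the undefined z-score when a p-value is 1, can be sketched with a Stouffer-style combination of z-scores. This is a simplified illustration under our own assumptions (unweighted z-scores, p-values near 1 mapped to positive z, and a hypothetical function name); it is not necessarily the exact implementation used by Carlisle or by Levayer and colleagues.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def stouffer_study_p(p_values, cap=3.0):
    """Combine baseline p-values into one study-wise p-value.

    Each p-value is converted to a z-score, the z-scores are summed and
    divided by sqrt(k), and the result is converted back to a p-value.
    A p-value of exactly 0 or 1 has no finite z-score (inv_cdf raises an
    error), so it is replaced by -cap or +cap; cap=3.0 mirrors the
    substitution described in the text.
    """
    zs = []
    for p in p_values:
        if p >= 1.0:
            zs.append(cap)
        elif p <= 0.0:
            zs.append(-cap)
        else:
            zs.append(_STD_NORMAL.inv_cdf(p))
    combined_z = sum(zs) / sqrt(len(zs))
    return _STD_NORMAL.cdf(combined_z)

# Ten p-values spread evenly between 0.45 and 0.55: an implausibly tight
# cluster, yet the positive and negative z-scores cancel out.
clustered = [0.45 + 0.1 * i / 9 for i in range(10)]
print(stouffer_study_p(clustered))

# Five sparse variables whose exact-test p-value is 1, each replaced by z = 3.
print(stouffer_study_p([1.0] * 5))
```

Under these assumptions the clustered example returns a study-wise p-value of about 0.5 and so would not be flagged, whereas a handful of substituted z-scores of 3 is enough to push the study-wise p-value above 0.99, which shows how strongly the handling of sparse variables can drive the result.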

## Limitations

We relied on the publicly available dataset and did not compare these data with the original publications, and so it is possible that some of the issues identified might be due to data extraction or typographical errors. The dataset did not contain the statistical test used to generate or report the p-values in the original trials. However, the reappraised package function calculates p-values using 7 commonly used statistical tests, allowing comparisons to results from all the major statistical approaches. While we have identified some concerns about the publication integrity of this broad group of trials, we cannot comment on individual RCTs, since we have not assessed any individual trial in detail, and because some of the assessments (p-value distribution and frequency distribution) were based on the whole group of trials, preventing inferences about individual RCTs. Finally, we only assessed categorical data in the spine dataset. Continuous variables provide a number of options for analysis, which may be preferable to analysing categorical variables [2].

## Summary

The dataset published by Levayer and colleagues has allowed a more detailed analysis of baseline categorical data in 167 RCTs in spine journals. This showed that incorrectly reported p-values and incorrect usage of statistical tests are common, and that there are differences between the observed and expected distributions of both frequency counts of variables and baseline p-values. Collectively, these findings raise questions about the reliability of some spine RCTs.

A simple potential fix for these issues would be for journals to require authors to provide a table at submission that contains baseline summary continuous and categorical data and p-values for the trial in a standardized format that would permit automated extraction and examination of data. If problems are identified, explanations could be sought during the review process for incorrect baseline p-values or unusual distributions or matching of baseline data. Publishing this table as supplementary information might be useful because, for example, reporting of baseline p-values is considered unnecessary by many experts [7]. Journals should also ensure that the statistical tests applied to categorical variables conform to recommended practice. Independent analysis and publication of anonymized individual patient data is likely to be much more informative for assessment of publication integrity [8].
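As a rough illustration of what automated examination of a standardized table might look like, the sketch below parses a hypothetical CSV layout and flags impossible entries. The column names, the example rows, and the checks are all invented for illustration; the `reported_p` values are arbitrary placeholders, and no journal format is implied.

```python
import csv
import io

# Hypothetical submission table: one row per categorical baseline variable.
EXAMPLE = """variable,n_1,count_1,n_2,count_2,reported_p
female,50,24,50,27,0.55
smoker,50,3,50,1,0.31
diabetes,50,52,50,4,0.04
"""

def check_rows(text):
    """Return (variable, problem) pairs for entries that cannot be correct."""
    problems = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["variable"]
        n1, c1 = int(row["n_1"]), int(row["count_1"])
        n2, c2 = int(row["n_2"]), int(row["count_2"])
        p = float(row["reported_p"])
        if c1 > n1 or c2 > n2:
            problems.append((name, "count exceeds group size"))
        if not 0.0 <= p <= 1.0:
            problems.append((name, "p-value outside [0, 1]"))
    return problems

print(check_rows(EXAMPLE))
```

A fuller pipeline could go on to recalculate each p-value from the counts and compare it with the reported value, but even simple consistency checks of this kind could be run at submission.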