Statistical Analysis of Aneurysmal Subarachnoid Hemorrhage Trials

Background: Many randomized controlled trials (RCT) have assessed new treatments in subarachnoid hemorrhage (SAH), yet most show no treatment efficacy. One explanation is that the statistical analysis of the primary endpoint was not as efficient as possible. We reanalyzed SAH RCTs with various statistical tests to determine whether the statistical method affects the RCT primary outcome.
Methods: Individual patient data for the primary outcome (Glasgow Outcome Scale [GOS]) of two SAH RCTs were analyzed using 15 statistical methods. For tests requiring outcome dichotomization, multiple cut-points in the 5-level GOS were assessed. Next, a synthetic dataset generated using random sampling with replacement from ten SAH RCTs was assessed using the same statistical tests. A Friedman test (two-way non-parametric analysis of variance) determined which tests produced the highest average absolute Z-values. The number of times each test reported significance at p<0.05 across the different datasets was calculated.
Results: Bootstrapping with replacement produced the best-ranking results, followed by three chi-square tests: one differentiating excellent (GOS=5) from good (GOS=4), poor (GOS=2-3), or dead (GOS=1) outcomes; one differentiating favorable (GOS=4-5) from poor or dead outcomes; and one differentiating favorable (GOS=4-5) from unfavorable outcomes. Each of these reported statistical significance for both RCTs, as did the following ranked tests, respectively: Wilcoxon median test, Student's t-test, ordinal logistic regression, median test, and a chi-square test dichotomizing excellent (GOS ≥ 4) and inferior outcomes. Statistical significance for one or neither RCT was reported by two Cochran-Armitage tests, two logistic regressions with alternate versions of bucketing, the Kolmogorov-Smirnov test, and a chi-square test differentiating surviving from dead patients. The synthetic dataset returned similar results, with the same nine most and six least efficient tests.
Conclusions: Bootstrapping produced the most efficient results but is time- and resource-intensive. Chi-square tests grouping outcomes into dichotomous or multi-level categories were the next most efficient.


Introduction
Many randomized clinical trials (RCT) have been conducted on patients with subarachnoid hemorrhage (SAH). 1 Only nimodipine and endovascular coiling have garnered robust support. 2,3 Other changes in the management of these patients have probably contributed to improved outcomes, including early aneurysm repair, increased use of endovascular techniques, neurocritical care, and better diagnosis of minor cases of SAH. 4,5 Nevertheless, still only 36-55% of patients regain independence and 35% succumb to their illness. 6-8 There are many possible reasons for the paucity of RCT demonstrating efficacy in SAH, including inadequate sample size or effect size of the treatment, adverse effects of the tested treatment, efficacy of rescue therapy in the placebo groups, lack of efficacy of the tested interventions, and insensitivity or suboptimal statistical analysis of the primary outcome. While studies of statistical power in SAH RCT have been conducted, 9 the history of examining statistical procedures to optimize RCT is comparatively much stronger in the ischemic stroke literature, 10-13 to the point that a collaboration (the Optimising Analysis of Stroke Trials [OAST] Collaboration) has been formed to address the topic. While no such collaboration exists for SAH, the Subarachnoid Hemorrhage International Trialists (SAHIT) repository possesses a large amount of RCT and clinical registry data in SAH, which provides an opportunity to study questions of statistical optimization. Hence, the objective of this study was to compare statistical methods in post-hoc analyses of SAH RCT, closely following the methodology of a previous ischemic stroke study by the OAST Collaboration, 10 to determine methodologies for the optimization of statistical efficiency.

Materials And Methods
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Trial Selection and Data Acquisition
Individual patient data in the SAHIT repository was obtained from two RCTs that demonstrated efficacy of the experimental interventions: the British Aneurysm Nimodipine Trial (BRANT) 2 and the International Subarachnoid Aneurysm Trial (ISAT). 3 These were chosen because they are the only two SAHIT trials to have positive effects on outcome.
Data regarding trial characteristics, patient demographics, patient severity and outcome on the Glasgow Outcome Scale (GOS) were collected for each trial. The time point at which outcome was recorded was three months for BRANT and two months for ISAT.

Synthetic Dataset
A synthetic dataset designed to have a 20% difference in favorable outcome was created using raw patient outcomes data from ten RCTs in the SAHIT repository. 14 The dataset was composed of 2,000 patients randomly selected with replacement. Half of the patients belonged to the experimental intervention cohort of the respective trial, with the other half having received a placebo. The synthetic placebo cohort was generated through sampling without replacement. Then the synthetic intervention cohort was generated such that it had a 20% relative greater proportion of patients with favorable outcomes, defined as a GOS score of 4 or 5. For trials that reported outcomes on the modified Rankin Scale (mRS) or the extended GOS (GOSE), scores were converted to the GOS using conversion schemes consistent with those of Olsen 15 and Michaud. 16 Outcomes were from 3 or 6 months after SAH, depending on data availability for each trial.
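The resampling scheme described above can be sketched in Python (the study itself used SAS 9.4, so this is a minimal illustration, not the authors' code). The pool composition and cohort sizes are hypothetical, and for simplicity both arms are drawn with replacement:

```python
import random

random.seed(42)  # reproducible sampling

# Pooled 5-level GOS outcomes standing in for the ten source RCTs
# (hypothetical counts; the real study pooled raw SAHIT outcomes).
pooled_gos = [1] * 200 + [2] * 100 + [3] * 150 + [4] * 250 + [5] * 300

def favorable(gos):
    # Favorable outcome defined as GOS 4 or 5
    return gos >= 4

# Placebo cohort: 1,000 patients sampled with replacement from the pool.
placebo = [random.choice(pooled_gos) for _ in range(1000)]

# Intervention cohort: sample so its favorable-outcome proportion is a
# 20% relative increase over the placebo cohort's proportion.
p_placebo = sum(favorable(g) for g in placebo) / len(placebo)
p_target = min(1.0, 1.2 * p_placebo)

fav_pool = [g for g in pooled_gos if favorable(g)]
unfav_pool = [g for g in pooled_gos if not favorable(g)]
intervention = [
    random.choice(fav_pool) if random.random() < p_target
    else random.choice(unfav_pool)
    for _ in range(1000)
]
```

Stratifying the intervention draw by favorable/unfavorable status is one simple way to hit the inflated target proportion while preserving the within-stratum distribution of GOS levels.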

Statistical Tests
Fifteen statistical tests were conducted on the raw data from each trial and on the synthetic dataset to assess the statistical significance of the RCT treatment effect. These tests were the same ones used by a similar study that examined ischemic stroke RCTs. 10 While some tests analyzed the raw trial data, certain analyses, such as the chi-square test, required placing patients into discrete bins or categories based on their outcome scores. Additionally, some tests required grouping of outcomes into two or more categories (e.g., "good"/"bad"); these tests were assessed multiple times with different breakpoints for categorization.
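The effect of different breakpoints can be illustrated with a short Python sketch (the study used SAS; the counts below and the `bucket` helper are hypothetical, and the bucket labels are approximations of those in Table 1):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of patients at each GOS level [1, 2, 3, 4, 5]
# for two trial arms (illustrative numbers only, not trial data).
control = [30, 20, 25, 60, 65]
treated = [20, 15, 20, 65, 80]

def bucket(counts, breakpoints):
    """Collapse 5-level GOS counts into bins; each breakpoint is the GOS
    value that starts a new bin, e.g. (4,) yields the classic
    unfavorable (GOS 1-3) vs favorable (GOS 4-5) dichotomy."""
    edges = [1] + list(breakpoints) + [6]
    return [sum(counts[lo - 1:hi - 1]) for lo, hi in zip(edges, edges[1:])]

# Bucketings analogous to those assessed in the study:
cut_schemes = {
    (2, 4, 5): "dead / poor / good / excellent",
    (2, 4): "dead / poor / favorable",
    (4,): "unfavorable / favorable",
}

# Chi-square test of independence on each bucketed 2 x k table.
results = {}
for cuts in cut_schemes:
    table = [bucket(control, cuts), bucket(treated, cuts)]
    chi2, p, dof, _ = chi2_contingency(table)
    results[cuts] = (chi2, p)
```

Each scheme turns the same 5-level outcome data into a different 2 × k contingency table, so the resulting chi-square statistics and p-values can differ even though the underlying data are identical.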

Comparison of Statistical Tests
The significance of each test's result was determined by the Z-value of its output. The absolute Z-values were then ranked from highest to lowest, such that the highest Z-value (the statistically most significant result) received a rank of 1 and the lowest Z-value (the least significant result) received a rank of 15 (i.e., the lower the rank, the more efficient the test). A non-parametric two-way analysis of variance (ANOVA; Friedman test) was used to determine which test produced the lowest average rank and to assess differences in the results of the studied tests. Tests were re-ordered in terms of overall average rank, and the number of times each test reported significant results across the different datasets was calculated. Analyses were carried out in SAS (version 9.4) and statistical significance was defined as p < 0.05.
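The ranking-plus-Friedman procedure can be sketched as follows (again in Python rather than the study's SAS; the |Z| values are hypothetical and only 5 of the 15 tests are shown):

```python
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical absolute Z-values for 5 statistical tests (columns) across
# 3 datasets (rows); the real study ranked 15 tests over BRANT, ISAT and
# the synthetic dataset.
abs_z = [
    [3.1, 2.8, 2.7, 2.2, 1.1],
    [4.0, 3.9, 3.5, 3.0, 1.5],
    [3.6, 3.4, 3.2, 2.5, 0.9],
]
n_tests = len(abs_z[0])

# Within each dataset, rank tests so that rank 1 = largest |Z|
# (most significant result); negate so rankdata ranks descending.
ranks = [rankdata([-z for z in row]) for row in abs_z]

# Average rank of each test across datasets (lower = more "efficient").
avg_rank = [float(sum(r[j] for r in ranks)) / len(ranks)
            for j in range(n_tests)]

# Friedman test: do the tests differ systematically in their rankings?
# Each argument is one test's |Z| values across the datasets.
stat, p = friedmanchisquare(*[[row[j] for row in abs_z]
                              for j in range(n_tests)])
```

Because the Friedman test operates on within-dataset ranks, it compares the tests' relative performance without assuming the |Z| values are normally distributed.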

Results
The average rank of each statistical test showed that bootstrapping with a non-parametric Wilcoxon test ranked best (i.e., had the lowest mean rank), followed by three chi-square tests with different bucketing criteria (excellent-good-poor-dead, good-poor-dead, and good-poor; Table 1). Among the chi-square tests, those with more categories returned more statistically significant results with lower mean ranks, while those that dichotomized outcomes into two categories received higher mean ranks. Tests that analyzed ordinal variables performed better when analyzing raw outcome data on the 5-point GOS scale than when analyzing data artificially bucketed into categories. For example, the Wilcoxon test, Student's t-test, median test, and ordinal logistic regression all received lower mean ranks than the logistic regression when it was bucketed into two or more categories (Fig. 1).
Results for the synthetic dataset found that the four most efficient tests (bootstrap followed by three chi-square tests) were the same as those obtained by analysis of BRANT and ISAT.

Discussion
A fundamental question, then, is how to maximize the chances that an RCT of a truly effective treatment actually demonstrates this. 17
One way to do this may be to analyze the primary outcome in some way other than the most common method, a simple dichotomization into "good" or "bad". 18 Herein we assessed raw data from BRANT, ISAT, and a synthetic dataset using 15 statistical tests in order to identify a statistical method that might optimize the analysis of outcome in SAH RCT. 10,14 We found that the bootstrap analysis was the most efficient of the included tests. Bootstrapping involves repeated random resampling of the data with replacement; this high-volume repetition of comparisons may improve the precision with which the treatment effect is estimated. A study of ischemic stroke RCTs also found that bootstrapping performed relatively well. 10 Bootstrapping has limitations: it requires large sample sizes in order to ensure similar data distributions between trial groups and to minimize type I error, 19 and it is resource- and time-intensive due to the large extent of resampling required to control error rates.
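A minimal bootstrap comparison can be sketched as below. This is an illustration of the general technique, not the authors' exact procedure (which bootstrapped Wilcoxon tests in SAS); the arm data are hypothetical, and a percentile confidence interval stands in for a formal significance test:

```python
import random

random.seed(0)  # reproducible resampling

# Hypothetical 5-level GOS outcomes for two 200-patient arms
# (illustrative only, not trial data).
control = [1] * 25 + [2] * 15 + [3] * 20 + [4] * 60 + [5] * 80
treated = [1] * 15 + [2] * 10 + [3] * 15 + [4] * 70 + [5] * 90

def fav_prop(arm):
    # Proportion with favorable outcome (GOS 4-5)
    return sum(g >= 4 for g in arm) / len(arm)

observed = fav_prop(treated) - fav_prop(control)

# Bootstrap: resample each arm with replacement and recompute the
# difference in favorable-outcome proportions.
diffs = []
for _ in range(2000):
    c = [random.choice(control) for _ in control]
    t = [random.choice(treated) for _ in treated]
    diffs.append(fav_prop(t) - fav_prop(c))

# Percentile 95% confidence interval for the treatment effect; an
# interval excluding zero suggests a significant effect.
diffs.sort()
ci_low = diffs[int(0.025 * len(diffs))]
ci_high = diffs[int(0.975 * len(diffs))]
```

The resampling loop is where the computational cost noted above arises: each of the 2,000 iterations redraws and re-evaluates both arms in full.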
Three different chi-square tests were the next most efficient. This is opposite to the results of the ischemic stroke RCT study, 10 which reported that, when similar tests were conducted on data from ischemic stroke clinical trials, tests that grouped outcome scores into two or more categories, such as these three chi-square tests, did not perform as well as tests that evaluated ordinal data. 10 Rather, we observed that, among the different chi-square tests, those that divided outcome measurements into 3 or 4 categories performed better than those with only 2 categories. These findings suggest that certain characteristics of SAH trial design, of the way trial data are collected, or of the outcomes of SAH patients may benefit from the division of such outcomes into pre-defined categories.
It is likely that characteristics of the RCT data, such as the distribution of outcomes across the ordinal scale, influence the statistical efficiency of different tests. For example, most patients in the RCTs included in this study had GOS outcomes of 4 or 5, so statistical tests that are able to differentiate between these scores may more easily detect significant effects of treatment. The distribution of outcomes for patients with other neurological diseases may differ from that of SAH patients. Regarding outcome distribution, it is important to ensure that the data distribution meets all the assumptions of a statistical test when designing the analysis for a study.
Not all tests with outcome scores divided into categories performed efficiently. An exception was a chi-square test with categories for "alive" or "deceased" outcomes, which did not find a statistically significant difference between cohorts in either trial. This comparison differs from the other chi-square tests in that it analyzed the effect of the interventions on mortality. The mortality rates in the RCTs were low, however, and larger sample sizes may be needed to detect an effect on mortality.
Improved test performance from bucketing into more categories was not seen for all statistical tests. The Cochran-Armitage tests (divided into either 3 or 4 buckets) and the logistic regressions divided into 3 or 4 buckets performed poorly, each returning statistically significant results for only one of the trials, and all had higher mean ranks than ordinal logistic regression. Notably, all of these tests failed to reach significance for BRANT. This could be due to the small sample size of BRANT, which was about a quarter the size of ISAT.

Conclusion
The efficiency of different statistical tests for the analysis of SAH RCT varied significantly. However, which method to use is only one consideration in the design of SAH RCT. Chi-square analyses that assess outcomes across multiple categories of ordinal outcomes tended to perform more efficiently in the included trials and may help future trials identify effective SAH treatments. A frequent limitation in SAH RCT is inadequate sample size, and these results should not be taken as a way to reduce sample size.

Declarations
Ethics approval and consent: The need for approval was waived as this study utilizes only de-identified datasets without any identifying personal health information.

Consent for publication:
Not applicable.

Availability of Data and Materials:
The datasets analyzed during the current study are not publicly available due to their proprietary nature but are available from the corresponding author on reasonable request.

Funding:
No funding was received for the work done in this study.

Authorship:
WHS, SNN, MLL, EKC and AJS each contributed to the design and draft of the work; they have approved the submitted version and have agreed to be personally accountable for the accuracy or integrity of the work. EKO, JM, JB, MDC, ALOM, NE, HF, DH, BNRJ, PK, PLR, BL, SM, AM, AQ, GJER, TAS, JIS, MT, JCT, WMVDB, MDIV, GW and SY each contributed to the acquisition and interpretation of the data analyzed in the work and substantially revised it; they have approved the submitted version and have agreed to be personally accountable for the accuracy or integrity of the work. RLM contributed to the concept, design, and drafting of the work and substantively revised it; he has approved the submitted version and has agreed to be personally accountable for the accuracy or integrity of the work.

Figure 1
Comparison of Rank Scores for 15 Included Statistical Tests: lower ranks imply the test is more efficient; tests joined by the same band are not statistically significantly different from each other at P<0.05.