Comparison of statistical methods for analysis of small sample sizes for detecting the differences in efficacy between treatments for knee osteoarthritis

largest sample size (n=30).
Conclusion: The perm-test is suggested for the analysis of small sample sizes when comparing the differences in efficacy between two treatments for knee OA.


Background
Osteoarthritis (OA) is a degenerative joint disease that most commonly affects the knee. Knee OA is highly prevalent among the elderly, and a 44% lifetime risk has been reported in American adults. [1] The degenerative process in the knee joint can lead to joint pain, stiffness, and swelling, which can impair physical function and quality of life. Several nonsurgical treatments have been developed to ease the symptoms of knee OA, such as autologous adipose tissue injections, hyaluronic acid injections, platelet-rich plasma treatment, and laser therapy. [2][3][4] Randomized controlled clinical trials are generally designed to compare the efficacy of treatments for knee OA. [5] However, such trials often enroll a small sample (n < 30 patients per treatment group), [6] possibly because of limited human and financial resources. A small sample size is associated with low statistical power, which can produce inconclusive results. It is therefore important to investigate how the choice of statistical method for small samples affects the power to detect clinical differences in efficacy between two treatments for knee OA.
Simulation studies are used to evaluate the accuracy and appropriateness of statistical methods in a setting where the true parameter values are known. [7] With advances in computing, simulation studies have been increasingly used in medical research. [8,9]
The objective of this study was to compare statistical methods for the analysis of small samples in detecting differences in efficacy between two treatments for knee OA. A simulation approach was adopted, and the results may suggest which method yields higher statistical power when the sample size is small.

Parameters
To generate data sets from known parameters, the parameter values were collected from previous studies that investigated the differences in efficacy between treatments for knee OA.

Magnetic resonance imaging (MRI)
Changes in cartilage volume, synovial membrane thickness, and synovial fluid volume before and after treatment, as measured by MRI, are a relatively new way to assess the efficacy of treatments for knee OA. These parameters were collected from a previous study assessing the differences in efficacy between treatments for knee OA (ClinicalTrials.gov NCT01354145). [10] It was a randomized, double-blind, double-dummy, controlled trial with a maximum follow-up of two years.
Two patient groups were treated with chondroitin sulfate and celecoxib, respectively. Eight MRI parameters, each expressed as the mean change between baseline and the 24-month follow-up, were adopted in this study: four parameters that showed a significant or borderline-significant difference between treatments, and four parameters that showed no significant difference and served as negative controls (Table 1).

Self-report Index
The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) score is commonly used to assess treatment efficacy in knee OA. The WOMAC score is a patient-reported outcome comprising three indices: pain, stiffness, and function. These parameters were collected from a randomized, controlled, single-blind study (IRCT2016071513442N11). [11]
Two patient groups were treated with plasma rich in growth factor (PRGF) and hyaluronic acid injections, respectively, and the changes in WOMAC scores between baseline and the six-month follow-up were measured (Table 2).

Statistical methods
To investigate the effects of statistical methods for analysis of small sample size on statistical power, four methods were considered, including the two-sample t-test (t-test), the Mann-Whitney U-test (M-W test), the Kolmogorov-Smirnov test (K-S test), and the permutation test (perm-test).

T-test
The t-test is used to examine whether two population means differ. [12] The test assumes that both populations follow a normal distribution and have equal variances. The test statistic is defined as:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $s_1$ and $s_2$ are the sample standard deviations, and $n_1$ and $n_2$ are the sample sizes.
The null hypothesis is rejected at the level α if:
$$|t| > t_{1-\alpha/2,\,\nu}$$
where $t_{1-\alpha/2,\,\nu}$ is the critical value of the t distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom.
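The calculation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the pooled-variance (equal-variance) form of the statistic; the function name `pooled_t_statistic`, the example data, and the use of Python rather than R are illustrative assumptions and not part of the original analysis.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Return (t, df) for a two-sample t-test with pooled variance.

    Assumes both samples come from normal populations with equal
    variances, as stated for the t-test in the text.
    """
    n1, n2 = len(x), len(y)
    mean1, mean2 = statistics.mean(x), statistics.mean(y)
    # statistics.variance uses the (n - 1) denominator.
    var1, var2 = statistics.variance(x), statistics.variance(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df


# Hypothetical data, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, df = pooled_t_statistic(group_a, group_b)

# Two-sided 5% critical value t_{0.975, 8} from a standard t table.
T_CRIT_DF8 = 2.306
reject_null = abs(t_stat) > T_CRIT_DF8
print(t_stat, df, reject_null)
```

With equal group sizes, as in the simulation design, the pooled and unpooled standard errors coincide, so the choice between them affects only the degrees of freedom.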

M-W test
The M-W test is a non-parametric test used for comparing two population medians. [13] This test is often used as an alternative to the t-test when the samples do not follow a normal distribution. The test statistic U is defined as:
$$U_i = \Sigma R_i - \frac{n_i (n_i + 1)}{2}, \qquad i = 1, 2$$
where $\Sigma R_i$ is the observed rank sum and $n_i$ is the sample size of group i. The smaller of $U_1$ and $U_2$ is taken as U, and the null hypothesis is rejected at the level α if:
$$\frac{|U - \mu_U|}{\sigma_U} > z_{1-\alpha/2}$$
where $\mu_U = n_1 n_2 / 2$ and $\sigma_U = \sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$ are the mean and standard deviation of U under the null hypothesis, and $z_{1-\alpha/2}$ is the critical value of the standard normal distribution.
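As a rough illustration of the rank-based calculation, the sketch below computes both U statistics and a two-sided p-value from the normal approximation. It is written under several assumptions: ties receive average ranks, no tie or continuity correction is applied, and the helper names `midranks` and `mann_whitney_u` are hypothetical. Statistical packages may use exact p-values for small samples, so results can differ slightly.

```python
import math
from statistics import NormalDist


def midranks(values):
    """Rank the values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U1, U2, z, p) using the normal approximation, no corrections."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = sum(ranks[n1:]) - n2 * (n2 + 1) / 2
    u = min(u1, u2)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value
    return u1, u2, z, p


# Hypothetical data, for illustration only.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A useful sanity check is that U1 + U2 always equals n1 × n2.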

K-S test
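The K-S test is a non-parametric test used for comparing the empirical distribution functions of two populations. The statistic is defined as:
$$D = \sup_x |F_1(x) - F_2(x)|$$
where sup is the supremum function, and $F_1$ and $F_2$ are the empirical distribution functions of the two sample groups.
The null hypothesis is rejected at the level α if:
$$D > c(\alpha) \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$
where $n_1$ and $n_2$ are the sample sizes of the two groups and $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$ (approximately 1.36 for α = 0.05).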
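The following sketch shows one way to compute the two-sample K-S statistic and the large-sample critical value. It assumes the asymptotic rejection threshold c(α)·sqrt((n1 + n2)/(n1·n2)); the function names `ks_statistic` and `ks_critical_value` are hypothetical, and software packages may instead use exact small-sample p-values.

```python
import math
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute gap between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    # The supremum is attained at one of the observed data points.
    for v in xs + ys:
        f1 = bisect_right(xs, v) / n1  # share of x values <= v
        f2 = bisect_right(ys, v) / n2  # share of y values <= v
        d = max(d, abs(f1 - f2))
    return d


def ks_critical_value(n1, n2, alpha=0.05):
    """Asymptotic critical value for the two-sample K-S statistic."""
    c_alpha = math.sqrt(-math.log(alpha / 2.0) / 2.0)
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))


# Hypothetical data, for illustration only.
d = ks_statistic([1, 3, 5], [2, 4, 6])
print(d, d > ks_critical_value(3, 3))
```

Because the statistic depends only on the ordering of the pooled observations, it is sensitive to differences in shape and spread as well as location.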

Perm-test
The perm-test is a non-parametric test of whether two populations come from the same distribution. The test statistic calculated from the observed sample groups is compared with the distribution of the test statistic obtained under random rearrangements of the group labels on the observed samples.
The procedure of the perm-test is as follows. In the first step, the mean difference between the two observed groups ($d_{obs}$) is calculated. In the second step, the samples of the two groups are pooled and then randomly divided into two groups of the original sizes, and the mean difference between the two permutation groups ($d_{perm}$) is calculated. The second step is repeated 1000 times to obtain 1000 values of $d_{perm}$. Finally, the two-sided p-value is the proportion of sampled permutations for which the absolute value of $d_{perm}$ is greater than or equal to the absolute value of $d_{obs}$.
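The steps above can be sketched as follows. This is a minimal Python illustration rather than the code used in the study; the function name `perm_test`, the `seed` argument, and the example data are assumptions introduced here, and the permuted groups are assumed to keep the original group sizes.

```python
import random


def perm_test(x, y, n_perm=1000, seed=None):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    n1 = len(x)

    def mean(values):
        return sum(values) / len(values)

    # Step 1: mean difference of the observed groups.
    d_obs = mean(x) - mean(y)

    # Step 2: pool the samples, reshuffle the labels, and recompute
    # the mean difference; repeat n_perm times.
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d_perm = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(d_perm) >= abs(d_obs):
            extreme += 1

    # Step 3: proportion of permutations at least as extreme as observed.
    return extreme / n_perm


# Hypothetical data, for illustration only.
print(perm_test([5.1, 4.8, 6.0, 5.5], [4.2, 4.9, 3.8, 4.4], seed=1))
```

With 1000 permutations the smallest non-zero p-value is 0.001, which is fine enough for a 0.05 threshold; some authors add one to both the numerator and the denominator so that the p-value is never exactly zero.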

Simulation procedure
To investigate how the choice of statistical method affects the power to detect a difference between two treatments for knee OA in small samples, 10,000 replicates were generated for each of five sample sizes (n=10, 15, 20, 25, and 30 per group).
The data were generated from normal distributions whose means and standard deviations were taken from the previous studies (Tables 1 and 2). [10,11] For each replicate, the generated data for the two treatments were compared using the t-test, the M-W test, the K-S test, and the perm-test. The percentage of the 10,000 replicates with a significant difference (p-value < 0.05) was defined as the statistical power for p1-p4 and as the false positive rate for p5-p8.
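The simulation loop can be illustrated with a short Python sketch. It is not the study's R code: only the t-test is shown, the means and standard deviations passed in are placeholders rather than the values in Tables 1 and 2, and the names `t_test_rejects` and `rejection_rate` are hypothetical. The critical values are standard two-sided 5% t-table entries for 2n − 2 degrees of freedom.

```python
import math
import random
import statistics

# Two-sided 5% critical values of the t distribution (standard t table)
# for df = 2n - 2 at the five per-group sample sizes used in the text.
T_CRIT_975 = {10: 2.101, 15: 2.048, 20: 2.024, 25: 2.011, 30: 2.002}


def t_test_rejects(x, y):
    """Pooled-variance t-test at alpha = 0.05 for two groups of equal size."""
    n = len(x)
    # With equal group sizes the pooled standard error is sqrt((v1 + v2) / n).
    se = math.sqrt((statistics.variance(x) + statistics.variance(y)) / n)
    t = (statistics.fmean(x) - statistics.fmean(y)) / se
    return abs(t) > T_CRIT_975[n]


def rejection_rate(mean1, sd1, mean2, sd2, n, n_rep=10_000, seed=None):
    """Share of simulated replicates in which the null hypothesis is rejected.

    This is the power when the true means differ and the false positive
    rate when they are equal.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x = [rng.gauss(mean1, sd1) for _ in range(n)]
        y = [rng.gauss(mean2, sd2) for _ in range(n)]
        if t_test_rejects(x, y):
            rejections += 1
    return rejections / n_rep


# Placeholder parameters (not taken from the original studies);
# a reduced replicate count keeps the example fast.
for n in (10, 15, 20, 25, 30):
    print(n, rejection_rate(0.0, 1.0, 0.5, 1.0, n, n_rep=500, seed=1))
```

Swapping `t_test_rejects` for a function that applies another test at the same significance level gives the corresponding estimate for that test.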
The coefficient of variation (CV) of the simulated parameter means over 10,000 replicates was calculated as follows:
$$CV\% = \frac{sd}{\bar{p}'} \times 100\%$$
where sd is the standard deviation of the simulated parameter means over 10,000 replicates, and $\bar{p}'$ is the mean of the simulated parameter means over 10,000 replicates.
The bias of the simulated parameter mean was calculated as follows:
$$Bias\% = \frac{p' - p}{p} \times 100\%$$
where $p'$ is the simulated parameter mean from the simulation and p is the parameter mean from the original study.
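Both summary measures are simple to compute, as the sketch below shows. It assumes that the CV is the sample standard deviation divided by the mean and that the bias is a relative (percentage) deviation from the original parameter mean, which matches the percentage thresholds quoted in the text; the function names and example numbers are illustrative only.

```python
import statistics


def cv_percent(sim_means):
    """Coefficient of variation (%) of the simulated parameter means."""
    # Note: the sign follows the mean, so a negative mean gives a negative CV.
    return statistics.stdev(sim_means) / statistics.mean(sim_means) * 100.0


def bias_percent(sim_mean, true_mean):
    """Relative bias (%) of one simulated mean against the original mean."""
    return (sim_mean - true_mean) / true_mean * 100.0


# Hypothetical simulated means from three replicates, original mean of 10.
sim_means = [9.0, 10.0, 11.0]
print(cv_percent(sim_means))
print([bias_percent(m, 10.0) for m in sim_means])
```

Taking the 2.5th and 97.5th percentiles of the per-replicate bias values is one way to obtain an interval for the bias across replicates.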
Data analysis was conducted using R software version 3.3.1 (R Development Core Team 2016).

MRI parameters
Eight MRI parameters were adopted in this simulation study: four (p1-p4) with a significant or borderline-significant difference between treatments and four (p5-p8) without a significant difference in the original study (Table 1). The effect sizes of p1-p4 (≥ 0.48) were clearly larger than those of p5-p8 (≤ 0.23) (Table 1). In the sensitivity analysis for sample size, 10,000 replicates of larger sample sizes (n=50, 100, 150, 200, 250, and 300 per group) were simulated, and the difference between treatments was examined by the t-test. For p1, p2, p3, p4, and p6, the mean, median, and variance of the p-values decreased markedly with sample size (Figure S1). In contrast, the mean, median, and variance of the p-values did not appear to be affected by sample size for p5, p7, and p8.
In the analysis of smaller sample sizes (n=10, 15, 20, 25, and 30 per group), the 95% confidence interval (CI) of the bias of the simulated parameter means over 10,000 replicates varied with sample size, and no decreasing trend with sample size was observed for any parameter (Figure 1). The CV% of the simulated parameter means over 10,000 replicates decreased with sample size for all eight parameters (Figure 2). At the largest sample size (n=30), the CV fell below 20% for all parameters except p3, p7, and p8.
The power of the four statistical methods for p1-p4 increased with sample size (Figure 3); however, the power did not exceed 80% even at the largest sample size (n=30), and it varied among parameters. The parameters p1 and p2 had relatively low power. Among the four statistical methods, the perm-test and the t-test had the highest power (Figure 3). The power of the M-W test was lower than that of the perm-test and the t-test but slightly higher than that of the K-S test. The false positive rates for p5, p6, p7, and p8 obtained with the four statistical methods did not increase with sample size, except for p6.
However, the original study did not report whether these parameters differed significantly between treatments. In the sensitivity analysis for sample size, larger sample sizes were simulated for the four parameters (w1-w4), and the differences in these parameters between treatments were examined by the t-test. For all these parameters, the mean, median, and variance of the p-values decreased markedly with sample size (Figure S2), indicating that these parameters had a high probability of differing significantly between treatments.
In the analysis of smaller sample sizes, the 95% CI of the bias of the simulated parameter means over 10,000 replicates varied with sample size for all parameters (Figure 4).
The CV% of the simulated parameter means over 10,000 replicates decreased with sample size for all parameters (Figure 5). At the largest sample size (n=30), the CV fell below 20% except for w3 in the PRGF group.
The power of the four statistical methods for w1-w4 increased with sample size (Figure 6); however, the power did not exceed 80% even at the largest sample size (n=30), and it varied among parameters.
The parameter w3 had lower power than the other three parameters. Among the statistical methods, the perm-test and the t-test had similar power, which was the highest (Figure 6). The power of the M-W test was lower than that of the perm-test and the t-test but slightly higher than that of the K-S test. Interestingly, for w1 the K-S test had clearly higher power than the other three statistical methods when the sample size was ≥ 15.

Discussion
Randomized controlled trials provide a high level of evidence for determining whether one treatment is superior to another; however, such trials often enroll small samples when examining differences in efficacy between treatments for knee OA.
To identify a better statistical method for analyzing small samples, we used simulation to compare four statistical methods for detecting differences in efficacy between treatments. The 95% CI of the bias of the simulated parameter means decreased with sample size; however, the CV% of the simulated parameter means over 10,000 replicates did not show a clear decreasing trend with sample size. In general, the perm-test and the t-test had the highest power. The power of the M-W test was lower than that of the perm-test and the t-test but slightly higher than that of the K-S test. However, neither the perm-test nor the t-test reached a high level of power (80%), even at the largest sample size (n=30).
An acceptable bias (within ±10%) has been suggested for estimated parameters. [14] However, the 95% CI of the bias of the simulated parameter means over 10,000 replicates far exceeded this acceptable level even at the largest sample size (n=30). The variation of these parameters was large (CV > 45%) (Tables 1 and 2), and thus these parameters were seriously over- or underestimated with small sample sizes (n≤30).
On the other hand, the CV% is used to assess the precision of a replication experiment. [15] The CV% of estimated parameters is expected to decrease with sample size. However, the CV% varied with sample size in this study, which might have resulted from the inherent variation in these parameters. A CV of 20% has been suggested as an acceptable level. [15] The CV% of the simulated parameter means reached this level at the largest sample size (n=30) except for p3, p7, p8, and w3 of the PRGF group. A large variation was inherent in these parameters (Tables 1 and 2), and thus the CV% of the simulated parameter means over 10,000 replicates could not reach this level when the sample size was <30.
Parameter p6 showed no significant difference between treatments in the original study [10] and was taken as a negative control in this study. In the sensitivity analysis for larger sample sizes (n=50, 100, 150, 200, 250, and 300 per group), the p-values of p6 showed a decreasing trend with sample size (Figure S1). In addition, the effect size of p6 (0.23) was clearly larger than those of p5 (0.10), p7 (0.04), and p8 (0.05).
These findings indicate that p6 likely differs between treatments and therefore should not be treated as a negative control. This could explain why the false positive rate of p6 obtained with the four statistical methods showed a slightly increasing trend with sample size (Figure 3).
WOMAC score is a traditional measure to assess the efficacy of knee OA patients after receiving treatments. It is a self-reported score and is considered as a

Conclusion
This report used simulation to compare statistical methods for the analysis of small samples in detecting clinical differences in efficacy between treatments for knee OA. The parameters were seriously over- or underestimated with small sample sizes (n≤30). Two types of measures of treatment efficacy were considered in this study: MRI measurements and WOMAC scores. The simulation produced similar results for the two types of measures. Among the non-parametric tests for the analysis of small samples, the perm-test had the highest power, and its false positive rate was not affected by sample size. However, the power of the perm-test did not reach a high value (80%) even at the largest sample size (n=30) in this study.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
Not applicable.

Competing interests
The author claims no conflict of interest.

Funding
Not applicable.