The proposed approach focuses on estimation and hypothesis testing of progression criteria for feasibility outcomes that are potentially modifiable (e.g. recruitment, treatment fidelity/adherence, level of follow-up). It therefore aligns with the main aims and objectives of pilot and feasibility studies and with the stop-amend-go progression recommendations of Eldridge et al. (2) and Avery et al. (18).
Hypothesis concept
The concept is to set up hypothesis testing around progression criteria so that the test is against being in the RED zone (designating a lower/unacceptable feasibility outcome: ‘STOP’), based on an expectation of being in the GREEN zone (designating a higher/acceptable feasibility outcome: ‘GO’). Specifically, testing is against the upper “RED” (stop) cut-off (denoted RUL), with power calculated on the assumption that the true value equals the “GREEN” (go) threshold, i.e. the lowest value of the GREEN zone (denoted GLL), as:-
- Null hypothesis: True feasibility is not greater than the upper “RED” stop limit (RUL)
- Alternative hypothesis: True feasibility is greater than RUL
The test is 1-tailed, with suggested customary values for alpha (a) of 0.1, 0.05 or 0.01 and for beta (b) of 0.1 or 0.2 (i.e. 90% or 80% power), depending on the strength of evidence required of the test.
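In symbols, the hypothesis set-up above can be sketched as follows. The symbol \pi for the true feasibility proportion is introduced here purely for illustration and does not appear in the source; a and b are alpha and beta as defined above.

```latex
H_0:\ \pi \le \mathrm{RUL}
\qquad \text{versus} \qquad
H_1:\ \pi > \mathrm{RUL},
\\[4pt]
\Pr(\text{reject } H_0 \mid \pi = \mathrm{RUL}) \le a,
\qquad
\Pr(\text{reject } H_0 \mid \pi = \mathrm{GLL}) \ge 1 - b .
```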
Progression rules
Let E denote the observed point estimate (ranging from 0 to 1 for proportions, or from 0 to 100% for percentages), RUL the upper limit of the RED zone, and GLL the lower limit of the GREEN zone. Simple 3-tiered progression criteria would follow as:-
- E ≤ RUL [P-value non-significant (P ≥ a)] -> RED (unacceptable - STOP)
- RUL < E < GLL -> AMBER (potentially acceptable - AMEND)
- E ≥ GLL [P-value significant (P < a)] -> GREEN (acceptable - GO)
In this case, we express progression criteria as three distinct progression signals, as illustrated in Figure 1(a).
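The three-tier rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `progression_signal` and the choice to pass RUL and GLL as whole-number percentages are assumptions made here, and the P-value is not computed (the sketch only places E relative to the two cut-offs).

```python
def progression_signal(successes: int, n: int, rul_pct: int, gll_pct: int) -> str:
    """Return the three-tier signal for the observed proportion successes/n.

    rul_pct and gll_pct are RUL and GLL given as whole-number percentages.
    Integer arithmetic is used so that an estimate lying exactly on a
    cut-off (e.g. 40/200 against RUL = 20%) is classified without
    floating-point rounding issues.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    if not 0 <= rul_pct < gll_pct <= 100:
        raise ValueError("need 0 <= RUL < GLL <= 100")
    if successes * 100 <= rul_pct * n:   # E <= RUL
        return "RED"
    if successes * 100 >= gll_pct * n:   # E >= GLL
        return "GREEN"
    return "AMBER"                       # RUL < E < GLL
```

For example, `progression_signal(55, 200, 20, 35)` places an observed uptake of 55/200 between the two cut-offs and returns the AMBER signal.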
Sample size
Table 1 provides a quick look-up grid of sample sizes across a range of anticipated proportions for RUL and GLL, for a one-sample, one-sided test with alpha of 1% or 5% and typical power of 80% or 90% (see Appendix 1 for the corresponding mathematical expression, derived from Fleiss et al. (27)). As the difference between the proportions RUL and GLL increases, the required sample size decreases.
Table 1: Sample size and significance cut-points for (GLL-RUL) differences, power (80%/90%) and 1-tailed significance levels (1%/5%).
| RUL | GLL | a(0.05) b(0.2) | | | a(0.05) b(0.1) | | | a(0.01) b(0.1) | | |
| % | % | n | Asig | AR% | n | Asig | AR% | n | Asig | AR% |
| 10 | 20 | 79 | 16.6 | 66.1 | 112 | 15.6 | 55.5 | 157 | 16.9 | 68.6 |
| 15 | 25 | 101 | 21.5 | 65.5 | 141 | 20.5 | 55.4 | 202 | 21.7 | 67.3 |
|  | 30 | 50 | 24.7 | 64.8 | 69 | 23.3 | 55.1 | 97 | 25.3 | 68.4 |
| 20 | 30 | 119 | 26.5 | 65.3 | 166 | 25.5 | 55.3 | 241 | 26.6 | 66.4 |
|  | 35 | 57 | 29.7 | 64.9 | 79 | 28.3 | 55.1 | 113 | 30.1 | 67.4 |
|  | 40 | 34 | 32.9 | 64.6 | 47 | 31.0 | 55.0 | 66 | 33.7 | 68.4 |
| 25 | 35 | 135 | 31.5 | 64.9 | 186 | 30.5 | 55.3 | 272 | 31.6 | 66.0 |
|  | 40 | 64 | 34.6 | 64.2 | 87 | 33.3 | 55.1 | 126 | 35.0 | 66.7 |
|  | 45 | 37 | 37.9 | 64.5 | 51 | 36.0 | 54.9 | 73 | 38.5 | 67.4 |
|  | 50 | 25 | 40.9 | 63.7 | 34 | 38.7 | 54.6 | 48 | 42.0 | 68.0 |
| 30 | 40 | 146 | 36.5 | 64.9 | 201 | 35.5 | 55.3 | 297 | 36.6 | 65.6 |
|  | 45 | 69 | 39.6 | 63.9 | 94 | 38.2 | 54.8 | 136 | 39.9 | 66.2 |
|  | 50 | 40 | 42.7 | 63.7 | 54 | 41.0 | 54.8 | 78 | 43.4 | 66.8 |
|  | 55 | 26 | 45.9 | 63.8 | 35 | 43.7 | 55.0 | 51 | 46.8 | 67.2 |
|  | 60 | 20* | 48.3 | 61.0 | 26 | 46.0 | 53.5 | 36 | 50.5 | 68.3 |
| 35 | 45 | 155 | 41.5 | 64.7 | 213 | 40.5 | 55.2 | 316 | 41.5 | 65.3 |
|  | 50 | 72 | 44.6 | 63.9 | 98 | 43.2 | 54.8 | 144 | 44.8 | 65.6 |
|  | 55 | 42 | 47.6 | 63.1 | 56 | 45.9 | 54.7 | 82 | 48.2 | 66.0 |
|  | 60 | 27 | 50.8 | 63.2 | 36 | 48.7 | 54.8 | 53 | 51.6 | 66.5 |
|  | 65 | 20* | 53.4 | 61.3 | 26 | 51.1 | 53.8 | 37 | 55.3 | 67.6 |
| 40 | 50 | 161 | 46.4 | 64.5 | 220 | 45.5 | 55.2 | 327 | 46.5 | 65.1 |
|  | 55 | 74 | 49.5 | 63.7 | 100 | 48.2 | 54.8 | 148 | 49.8 | 65.3 |
|  | 60 | 43 | 52.5 | 62.7 | 57 | 50.9 | 54.5 | 84 | 53.1 | 65.5 |
|  | 65 | 28 | 55.5 | 62.1 | 37 | 53.5 | 54.0 | 54 | 56.5 | 65.8 |
|  | 70 | 20* | 58.3 | 61.0 | 26 | 56.0 | 53.5 | 38 | 59.9 | 66.3 |
| 45 | 55 | 164 | 51.4 | 64.2 | 222 | 50.5 | 55.2 | 333 | 51.5 | 64.8 |
|  | 60 | 75 | 54.5 | 63.2 | 100 | 53.2 | 54.8 | 149 | 54.8 | 65.1 |
|  | 65 | 43 | 57.5 | 62.4 | 57 | 55.8 | 54.2 | 84 | 58.0 | 65.2 |
|  | 70 | 28 | 60.4 | 61.5 | 36 | 58.6 | 54.2 | 53 | 61.5 | 65.8 |
|  | 75 | 20#* | 63.0 | 60.1 | 25 | 61.1 | 53.7 | 37 | 64.9 | 66.2 |
| 50 | 60 | 163 | 56.4 | 64.1 | 221 | 55.5 | 55.1 | 331 | 56.5 | 64.7 |
|  | 65 | 74 | 59.5 | 63.0 | 99 | 58.2 | 54.5 | 147 | 59.7 | 64.9 |
|  | 70 | 42 | 62.4 | 62.2 | 55 | 60.9 | 54.3 | 82 | 63.0 | 65.0 |
|  | 75 | 27 | 65.3 | 61.3 | 35 | 63.5 | 53.8 | 52 | 66.3 | 65.1 |
| 55 | 65 | 159 | 61.4 | 63.9 | 215 | 60.5 | 55.0 | 323 | 61.5 | 64.5 |
|  | 70 | 72 | 64.4 | 62.6 | 95 | 63.2 | 54.5 | 143 | 64.7 | 64.5 |
|  | 75 | 40 | 67.4 | 62.0 | 53 | 65.8 | 53.9 | 79 | 67.9 | 64.6 |
| 60 | 70 | 152 | 66.4 | 63.6 | 205 | 65.5 | 54.8 | 309 | 66.4 | 64.3 |
|  | 75 | 68 | 69.3 | 62.3 | 90 | 68.1 | 54.1 | 135 | 69.6 | 64.3 |
|  | 80 | 38 | 72.2 | 61.1 | 49 | 70.8 | 53.8 | 74 | 72.9 | 64.3 |
| 65 | 75 | 143 | 71.3 | 63.0 | 190 | 70.5 | 54.7 | 288 | 71.4 | 64.0 |
|  | 80 | 63 | 74.3 | 61.7 | 82 | 73.1 | 54.1 | 124 | 74.6 | 64.1 |
|  | 85 | 35 | 77.0 | 60.2 | 44 | 75.7 | 53.7 | 67 | 77.8 | 64.1 |
| 70 | 80 | 129 | 76.3 | 62.7 | 171 | 75.4 | 54.5 | 260 | 76.4 | 63.8 |
|  | 85 | 57 | 79.1 | 60.7 | 73 | 78.0 | 53.6 | 111 | 79.5 | 63.6 |
| 75 | 85 | 113 | 81.2 | 61.9 | 147 | 80.4 | 54.3 | 225 | 81.4 | 63.6 |
|  | 90 | 49# | 83.9 | 59.5 | 61 | 83.0 | 53.4 | 94 | 84.5 | 63.3 |
| 80 | 90 | 93 | 86.1 | 60.9 | 119 | 85.4 | 53.8 | 183 | 86.3 | 63.3 |
RUL=upper limit of RED zone; GLL=lower limit of GREEN zone; n=required sample size; a=1-sided alpha; b=beta (1 minus power); Asig=AMBER-statistical significance threshold (within the AMBER zone), where an observed estimate equal to or below the cut-point gives a non-significant result (p≥0.05 or p≥0.01) and an estimate above the cut-point gives a significant result (p<0.05 or p<0.01); AR%=percentage of the AMBER zone yielding a non-significant test result (i.e. the percentage falling within the AMBERR sub-zone, the part of the AMBER zone adjacent to the RED zone).
Sample sizes were derived using the normal approximation to the binomial distribution (with continuity correction) formula given in the Appendix, which by convention is stable for np>5 and n(1-p)>5. This holds for all scenarios in the above table except those marked #, where n(1-p)=4.9 and 5.0. * Derived sample size is less than 25, which may give overly wide confidence intervals (if n≥25, as recommended, the standard error for purposes of 1-sided interval estimation will be no greater than 0.1).
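As an illustration of the kind of calculation behind Table 1, the sketch below uses a standard one-sample, one-sided normal-approximation formula with a continuity correction. Appendix 1 is not reproduced in this section, so the exact expression (in particular the one-sample form of the continuity correction) is an assumption here, and the function name `sample_size` is hypothetical. The sketch applies no minimum sample size, so it may not agree with table entries marked *.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(rul: float, gll: float, alpha: float, power: float) -> int:
    """Sample size for a 1-sided test of H0: p <= rul, powered at p = gll.

    rul and gll are proportions (0-1); alpha is the 1-sided significance
    level and power is 1 - beta.
    """
    if not 0 < rul < gll < 1:
        raise ValueError("need 0 < rul < gll < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    delta = gll - rul
    # Uncorrected normal-approximation sample size.
    n0 = (z_alpha * sqrt(rul * (1 - rul))
          + z_beta * sqrt(gll * (1 - gll))) ** 2 / delta ** 2
    # Continuity correction, one-sample form (assumed, see text above).
    n_cc = n0 / 4 * (1 + sqrt(1 + 2 / (n0 * delta))) ** 2
    return ceil(n_cc)
```

Under these assumptions, `sample_size(0.20, 0.35, 0.05, 0.90)` returns 79, which agrees with the corresponding Table 1 entry (RUL 20%, GLL 35%, a=0.05, b=0.1).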
Multi-criteria assessment
We recommend that progression be considered separately for each key feasibility criterion, so that overall progression is determined by the worst-performing criterion: RED if at least one signal is RED; AMBER if none of the signals fall into RED but at least one falls into AMBER; GREEN if all signals fall into the GREEN zone. Hence, a GREEN signal to ‘GO’ across the set of individual criteria indicates that progression to a main trial can take place without any necessary changes. A signal to ‘STOP’ and not proceed to a main trial is recommended if any of the observed estimates is ‘unacceptably’ low (i.e. falls within the RED zone). Otherwise, where neither ‘GO’ nor ‘STOP’ is signalled, the design of the trial will need amending, as indicated by subpar performance on one or more of the criteria.
Sample size requirements across multiple criteria will vary according to the designated parameters linked to the progression criteria, which may be set at different stages of the study and on different numbers of patients (e.g. those screened, eligible, recruited and randomised, allocated to the intervention arm, or followed up). This is illustrated in Box 1 for statistical testing requirements at different levels, e.g. the number needed to be targeted/screened (for potential recruitment), the number to be randomised, and the number to be randomised specifically to the intervention arm. The overall size needed will be dictated by the requirement to power each of the multi-criteria statistical tests. Since these tests yield separate conclusions with regard to the decision to ‘GO’ across all individual feasibility criteria, there is no need to consider a multiple-testing correction with respect to alpha. However, researchers may wish to increase power (and hence sample size) to ensure adequate power to detect ‘GO’ signals across the collective set of feasibility criteria. For example, powering at 90% across three (assumed independent) criteria gives a collective power of 73% (i.e. 0.9^3), which may be considered reasonable, but 80% power across five criteria gives an overall probability of only 33% (i.e. 0.8^5) of identifying ‘GO’ signals across all five criteria.
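The collective-power arithmetic in this paragraph can be checked with a short sketch. It assumes, as the text does, that the criteria are independent, so that the probability of a ‘GO’ signal on every criterion is the product of the individual powers; the function name `collective_power` is hypothetical.

```python
from math import prod


def collective_power(powers):
    """Probability of detecting 'GO' on every criterion, assuming independence."""
    powers = list(powers)
    if not powers or any(not 0 <= p <= 1 for p in powers):
        raise ValueError("need at least one power, each between 0 and 1")
    return prod(powers)


# Three criteria each powered at 90%, and five criteria each powered at 80%.
three_at_90 = collective_power([0.9] * 3)
five_at_80 = collective_power([0.8] * 5)
print(f"{three_at_90:.0%}", f"{five_at_80:.0%}")
```

Passing a list also allows criteria powered at different levels, e.g. `collective_power([0.9, 0.9, 0.8])`.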
Box 1: A two-arm parallel design (1:1 allocation to intervention and control arms) with three key feasibility objectives, to assess: (i) recruitment uptake (percent of screened patients recruited); (ii) treatment fidelity; (iii) participant retention (follow up). Hypothesis-testing incorporates a(1-sided)=5% and power=90%.
Assume the progression criteria (and affiliated sample size requirements) for each are as follows:-
(i) Recruitment uptake: ≤20% (RED zone), ≥35% (GREEN zone) {RUL = 20% / GLL = 35%}
- Required sample size n = 79 [total screened patients]
(ii) Treatment fidelity: ≤50% (RED zone), ≥75% (GREEN zone) {RUL = 50% / GLL = 75%}
- Required sample size n = 35 [intervention arm only]
(iii) Follow up: ≤65% (RED zone), ≥85% (GREEN zone) {RUL = 65% / GLL = 85%}
- Required sample size n = 44 (total randomised participants with 22 per arm)
The sample sizes across criteria (i)-(iii) are at different levels: (i) is at the level of screened patients, whereas (ii)-(iii) are at the level of randomised patients. To meet criterion (i) we need nscreened≥79, although we anticipate nscreened=200 (i.e. (1/0.35) x nrandomised, where 0.35 is the expected proportion uptake of the total number screened); for (ii)-(iii) we need nrandomised=70 (35 per arm, based on (ii)).
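The conversion from the number randomised to the number screened can be sketched as below. The function name `screened_needed` is hypothetical, and uptake is passed as a whole-number percentage so that the division is done in exact integer arithmetic (dividing by 0.35 in floating point can land fractionally above 200 and round up incorrectly).

```python
def screened_needed(n_randomised: int, uptake_pct: int) -> int:
    """Smallest number screened that yields n_randomised at the given uptake.

    Equivalent to ceil(n_randomised / (uptake_pct / 100)), computed with
    integers only.
    """
    if n_randomised < 0 or not 0 < uptake_pct <= 100:
        raise ValueError("need n_randomised >= 0 and 0 < uptake_pct <= 100")
    return -((-n_randomised * 100) // uptake_pct)


print(screened_needed(70, 35))  # expected uptake at the GREEN threshold
print(screened_needed(70, 20))  # uptake at the RED limit
```

With 70 randomised participants, the two calls give 200 and 350 screened patients respectively.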
Taking each of the objectives in turn (and the updated sample sizes to meet the multi-criteria objectives), we express progression criteria for the three objectives as follows:-
(i) Recruitment uptake [nscreened≥79; expected nscreened=200; maximum nscreened=350 (i.e. (1/0.2)x nrandomised)]
- E ≤ 0.2 [P ≥ 0.05] -> RED (STOP)
- 0.2 < E < 0.35 -> AMBER (AMEND)
- E ≥ 0.35 [P < 0.05] -> GREEN (GO)
(with the following signals for expected nscreened=200: 0 – 40 (RED); 41 – 69 (AMBER); 70 – 200 (GREEN)) [i.e. ≤0.2x200=40 and ≥0.35x200=70]
(ii) Treatment fidelity [nintervention-arm=35 (intervention arm only)]
- E ≤ 0.5 [P ≥ 0.05] -> RED (STOP)
- 0.5 < E < 0.75 -> AMBER (AMEND)
- E ≥ 0.75 [P < 0.05] -> GREEN (GO)
(with the following signals for nintervention-arm=35: 0 – 17 (RED); 18 – 26 (AMBER); 27 – 35 (GREEN)) [i.e. ≤0.5x35=17.5, giving at most 17, and ≥0.75x35=26.25, giving at least 27]
(iii) Follow up [nrandomised=70 (intervention and control arms)]
- E ≤ 0.65 [P ≥ 0.05] -> RED (STOP)
- 0.65 < E < 0.85 -> AMBER (AMEND)
- E ≥ 0.85 [P < 0.05] -> GREEN (GO)
(with the following signals for nrandomised=70: 0 – 45 (RED); 46 – 59 (AMBER); 60 – 70 (GREEN)) [i.e. ≤0.65x70=45.5, giving at most 45, and ≥0.85x70=59.5, giving at least 60]
In accordance with the multi-criteria aim, the decision to proceed would be based on the worst signal:-
- If signal=RED for (i) or (ii) or (iii) -> overall signal is RED
- Else, if no signal is RED but signal=AMBER for (i) or (ii) or (iii) -> overall signal is AMBER
- Else, if signals=GREEN for (i) and (ii) and (iii) -> overall signal is GREEN
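The worst-signal rule at the end of Box 1 amounts to taking the most severe signal across criteria. A minimal sketch, with the hypothetical names `SEVERITY` and `overall_signal`:

```python
# Larger number = more severe signal.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}


def overall_signal(signals):
    """Return the worst (most severe) of the per-criterion signals."""
    signals = list(signals)
    if not signals:
        raise ValueError("at least one criterion signal is required")
    # An unrecognised label raises KeyError rather than being ignored.
    return max(signals, key=lambda s: SEVERITY[s])
```

For instance, GREEN for recruitment uptake, AMBER for treatment fidelity and GREEN for follow up would give `overall_signal(["GREEN", "AMBER", "GREEN"])`, i.e. an overall AMBER signal.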
Further expansion of AMBER zone
Within the same sample size framework, the AMBER zone may be further split to indicate whether ‘minor’ or ‘major’ amendments are required, according to the significance of the P-value. Consider a 2-way split of the AMBER zone, designated AMBERR (the region of the AMBER zone adjacent to the RED zone) and AMBERG (the region of the AMBER zone adjacent to the GREEN zone). This draws on two possible levels of amendment, major amend and minor amend (as shown in Figure 1(b)). Hence, the re-configured approach would follow as:-
- E ≤ RUL [P-value non-significant (P ≥ a)] -> RED (unacceptable - STOP)
- RUL < E < GLL -> AMBER (potentially acceptable - AMEND)
  - RUL < E < GLL and P ≥ a -> AMBERR (major AMEND)
  - RUL < E < GLL and P < a -> AMBERG (minor AMEND)
- E ≥ GLL [P-value significant (P < a)] -> GREEN (acceptable - GO)
Figure 2 illustrates the signals according to this expanded approach. The worked example for this extension is presented in Box 2.
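The 4-tiered rule can be sketched by treating the significance cut-point within the AMBER zone (Asig in Table 1) as a given input rather than computing a P-value. The function name `four_tier_signal` is hypothetical. Following the Table 1 footnote, an estimate equal to the cut-point is treated as non-significant.

```python
from fractions import Fraction


def four_tier_signal(successes, n, rul, a_sig, gll):
    """Classify successes/n as RED, AMBER-major, AMBER-minor or GREEN.

    rul, a_sig and gll are proportions. Pass them as strings (e.g. "0.256")
    so that Fraction represents them exactly.
    """
    rul, a_sig, gll = Fraction(rul), Fraction(a_sig), Fraction(gll)
    if not rul < a_sig < gll:
        raise ValueError("need RUL < Asig < GLL")
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    estimate = Fraction(successes, n)
    if estimate <= rul:
        return "RED"
    if estimate >= gll:
        return "GREEN"
    # Inside the AMBER zone: at or below Asig is non-significant (P >= a).
    return "AMBER-major" if estimate <= a_sig else "AMBER-minor"
```

The cut-point itself must be supplied by the user (for example from Table 1); the sketch does not derive it.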
Box 2:
Taking each of the objectives in turn, we re-express the progression criteria for the three objectives according to the 4-tiered approach, as follows:-
(i) Recruitment uptake [expected nscreened=200]
- E ≤ 0.2 [P ≥ 0.05] -> RED (STOP)
- 0.2 < E < 0.35 -> AMBER (AMEND)
  - 0.2 < E < 0.256 [P ≥ 0.05] -> AMBERR (AMEND-major)
  - 0.256 < E < 0.35 [P < 0.05] -> AMBERG (AMEND-minor)
- E ≥ 0.35 [P < 0.05] -> GREEN (GO)
(with the following signals for nscreened=200: 0–40 (RED); 41–51 (AMBER-major); 52–69 (AMBER-minor); 70–200 (GREEN))
(ii) Treatment fidelity [nintervention-arm=35 (intervention arm only)]
- E ≤ 0.5 [P ≥ 0.05] -> RED (STOP)
- 0.5 < E < 0.75 -> AMBER (AMEND)
  - 0.5 < E < 0.643 [P ≥ 0.05] -> AMBERR (AMEND-major)
  - 0.643 < E < 0.75 [P < 0.05] -> AMBERG (AMEND-minor)
- E ≥ 0.75 [P < 0.05] -> GREEN (GO)
(with the following signals for nintervention-arm=35: 0–17 (RED); 18–22 (AMBER-major); 23–26 (AMBER-minor); 27–35 (GREEN))
(iii) Follow up [nrandomised=70 (intervention and control arms)]
- E ≤ 0.65 [P ≥ 0.05] -> RED (STOP)
- 0.65 < E < 0.85 -> AMBER (AMEND)
  - 0.65 < E < 0.742 [P ≥ 0.05] -> AMBERR (AMEND-major)
  - 0.742 < E < 0.85 [P < 0.05] -> AMBERG (AMEND-minor)
- E ≥ 0.85 [P < 0.05] -> GREEN (GO)
(with the following signals for nrandomised=70: 0–45 (RED); 46–51 (AMBER-major); 52–59 (AMBER-minor); 60–70 (GREEN))
In accordance with the multi-criteria aim, the decision to proceed would be based on the worst signal (as in Box 1).
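The count ranges quoted in Box 2 follow from multiplying each cut-point by the relevant sample size and rounding to whole participants. The sketch below shows that conversion; the function name `count_bands` is hypothetical, and the cut-points (including the significance cut-point within the AMBER zone) are taken as given inputs rather than derived.

```python
from fractions import Fraction
from math import ceil, floor


def count_bands(n, rul, a_sig, gll):
    """Inclusive (low, high) count ranges for each of the four signals.

    rul, a_sig and gll are proportions, passed as strings (e.g. "0.742")
    so that Fraction represents them exactly.
    """
    rul, a_sig, gll = Fraction(rul), Fraction(a_sig), Fraction(gll)
    red_max = floor(rul * n)       # largest count with E <= RUL
    major_max = floor(a_sig * n)   # largest count with E <= cut-point
    green_min = ceil(gll * n)      # smallest count with E >= GLL
    return {
        "RED": (0, red_max),
        "AMBER-major": (red_max + 1, major_max),
        "AMBER-minor": (major_max + 1, green_min - 1),
        "GREEN": (green_min, n),
    }


for label, band in count_bands(70, "0.65", "0.742", "0.85").items():
    print(label, band)
```

For the follow-up criterion (n=70) this gives 0–45, 46–51, 52–59 and 60–70, matching the ranges listed under (iii) in Box 2.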