The proposed approach focuses on estimation and hypothesis testing of progression criteria for feasibility outcomes that are potentially modifiable (e.g. recruitment, treatment fidelity/adherence, level of follow up). Thus, it aligns with the main aims and objectives of pilot and feasibility studies and with the stop-amend-go progression recommendations of Eldridge et al. (2) and Avery et al. (18).
Hypothesis concept
The concept is to set up hypothesis testing around progression criteria that tests against being in the RED zone (designating a lower/unacceptable feasibility outcome – ‘STOP’) based on an expectation of being in the GREEN zone (designating a higher/acceptable feasibility outcome – ‘GO’). Specifically, testing is against the upper “RED” (stop) cut-off (denoted RUL), on the assumption that the “GREEN” (go) threshold, i.e. the lowest value of the GREEN zone (denoted GLL), is true, as:
 Null hypothesis: True feasibility is not greater than the upper “RED” stop limit (RUL)
 Alternative hypothesis: True feasibility is greater than RUL
The test is a 1-tailed test with suggested customary values for alpha (α) of 0.1, 0.05 or 0.01 and beta (β) of 0.1 or 0.2, depending on the required strength of evidence of the test.
Progression rules
Let E denote the observed point estimate (ranging from 0 to 1 for proportions, or 0-100% for percentages); RUL, the upper limit of the RED zone; GLL, the lower limit of the GREEN zone. Simple 3-tiered progression criteria would follow as:
 E ≤ RUL [P-value non-significant (P ≥ α)] → RED (unacceptable - STOP)
 RUL < E < GLL → AMBER (potentially acceptable - AMEND)
 E ≥ GLL [P-value significant (P < α)] → GREEN (acceptable - GO)
In this case, we express progression criteria as three distinct progression signals, as illustrated in Figure 1(a).
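The three rules above can be sketched as a small function. This is an illustrative sketch only: the function name `classify_signal` and its argument names are assumptions introduced here, and the sketch uses only the zone boundaries (RUL and GLL), not the p-value.

```python
def classify_signal(estimate: float, rul: float, gll: float) -> str:
    """Return the 3-tier progression signal for an observed proportion.

    estimate: observed point estimate E (as a proportion, 0 to 1)
    rul:      upper limit of the RED zone
    gll:      lower limit of the GREEN zone
    """
    if not 0.0 <= rul < gll <= 1.0:
        raise ValueError("expected 0 <= RUL < GLL <= 1")
    if estimate <= rul:
        return "RED"      # unacceptable: STOP
    if estimate >= gll:
        return "GREEN"    # acceptable: GO
    return "AMBER"        # potentially acceptable: AMEND


# Hypothetical observed estimates against RUL = 0.20 and GLL = 0.35.
for e in (0.15, 0.28, 0.40):
    print(e, classify_signal(e, rul=0.20, gll=0.35))
```

Estimates exactly equal to RUL or GLL fall into RED and GREEN respectively, following the inequalities as stated.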
Sample size
Table 1 displays a quick look-up grid for sample size across a range of anticipated proportions for RUL and GLL for one-sample one-sided 1% and 5% alpha with typical 80% and 90% power (see Appendix 1 for the corresponding mathematical expression; derived from Fleiss et al. (27)). Clearly, as the difference between proportions RUL and GLL increases, the sample size requirement is reduced.
Table 1: Sample size and significance cut-points for (GLL-RUL) differences, power (80%/90%) and 1-tailed significance levels (1%/5%).
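| RUL (%) | GLL (%) | n [α=0.05, β=0.2] | Asig [α=0.05, β=0.2] | AR% [α=0.05, β=0.2] | n [α=0.05, β=0.1] | Asig [α=0.05, β=0.1] | AR% [α=0.05, β=0.1] | n [α=0.01, β=0.1] | Asig [α=0.01, β=0.1] | AR% [α=0.01, β=0.1] |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 20 | 79 | 16.6 | 66.1 | 112 | 15.6 | 55.5 | 157 | 16.9 | 68.6 |
| 15 | 25 | 101 | 21.5 | 65.5 | 141 | 20.5 | 55.4 | 202 | 21.7 | 67.3 |
| 15 | 30 | 50 | 24.7 | 64.8 | 69 | 23.3 | 55.1 | 97 | 25.3 | 68.4 |
| 20 | 30 | 119 | 26.5 | 65.3 | 166 | 25.5 | 55.3 | 241 | 26.6 | 66.4 |
| 20 | 35 | 57 | 29.7 | 64.9 | 79 | 28.3 | 55.1 | 113 | 30.1 | 67.4 |
| 20 | 40 | 34 | 32.9 | 64.6 | 47 | 31.0 | 55.0 | 66 | 33.7 | 68.4 |
| 25 | 35 | 135 | 31.5 | 64.9 | 186 | 30.5 | 55.3 | 272 | 31.6 | 66.0 |
| 25 | 40 | 64 | 34.6 | 64.2 | 87 | 33.3 | 55.1 | 126 | 35.0 | 66.7 |
| 25 | 45 | 37 | 37.9 | 64.5 | 51 | 36.0 | 54.9 | 73 | 38.5 | 67.4 |
| 25 | 50 | 25 | 40.9 | 63.7 | 34 | 38.7 | 54.6 | 48 | 42.0 | 68.0 |
| 30 | 40 | 146 | 36.5 | 64.9 | 201 | 35.5 | 55.3 | 297 | 36.6 | 65.6 |
| 30 | 45 | 69 | 39.6 | 63.9 | 94 | 38.2 | 54.8 | 136 | 39.9 | 66.2 |
| 30 | 50 | 40 | 42.7 | 63.7 | 54 | 41.0 | 54.8 | 78 | 43.4 | 66.8 |
| 30 | 55 | 26 | 45.9 | 63.8 | 35 | 43.7 | 55.0 | 51 | 46.8 | 67.2 |
| 30 | 60 | 20* | 48.3 | 61.0 | 26 | 46.0 | 53.5 | 36 | 50.5 | 68.3 |
| 35 | 45 | 155 | 41.5 | 64.7 | 213 | 40.5 | 55.2 | 316 | 41.5 | 65.3 |
| 35 | 50 | 72 | 44.6 | 63.9 | 98 | 43.2 | 54.8 | 144 | 44.8 | 65.6 |
| 35 | 55 | 42 | 47.6 | 63.1 | 56 | 45.9 | 54.7 | 82 | 48.2 | 66.0 |
| 35 | 60 | 27 | 50.8 | 63.2 | 36 | 48.7 | 54.8 | 53 | 51.6 | 66.5 |
| 35 | 65 | 20* | 53.4 | 61.3 | 26 | 51.1 | 53.8 | 37 | 55.3 | 67.6 |
| 40 | 50 | 161 | 46.4 | 64.5 | 220 | 45.5 | 55.2 | 327 | 46.5 | 65.1 |
| 40 | 55 | 74 | 49.5 | 63.7 | 100 | 48.2 | 54.8 | 148 | 49.8 | 65.3 |
| 40 | 60 | 43 | 52.5 | 62.7 | 57 | 50.9 | 54.5 | 84 | 53.1 | 65.5 |
| 40 | 65 | 28 | 55.5 | 62.1 | 37 | 53.5 | 54.0 | 54 | 56.5 | 65.8 |
| 40 | 70 | 20* | 58.3 | 61.0 | 26 | 56.0 | 53.5 | 38 | 59.9 | 66.3 |
| 45 | 55 | 164 | 51.4 | 64.2 | 222 | 50.5 | 55.2 | 333 | 51.5 | 64.8 |
| 45 | 60 | 75 | 54.5 | 63.2 | 100 | 53.2 | 54.8 | 149 | 54.8 | 65.1 |
| 45 | 65 | 43 | 57.5 | 62.4 | 57 | 55.8 | 54.2 | 84 | 58.0 | 65.2 |
| 45 | 70 | 28 | 60.4 | 61.5 | 36 | 58.6 | 54.2 | 53 | 61.5 | 65.8 |
| 45 | 75 | 20#* | 63.0 | 60.1 | 25 | 61.1 | 53.7 | 37 | 64.9 | 66.2 |
| 50 | 60 | 163 | 56.4 | 64.1 | 221 | 55.5 | 55.1 | 331 | 56.5 | 64.7 |
| 50 | 65 | 74 | 59.5 | 63.0 | 99 | 58.2 | 54.5 | 147 | 59.7 | 64.9 |
| 50 | 70 | 42 | 62.4 | 62.2 | 55 | 60.9 | 54.3 | 82 | 63.0 | 65.0 |
| 50 | 75 | 27 | 65.3 | 61.3 | 35 | 63.5 | 53.8 | 52 | 66.3 | 65.1 |
| 55 | 65 | 159 | 61.4 | 63.9 | 215 | 60.5 | 55.0 | 323 | 61.5 | 64.5 |
| 55 | 70 | 72 | 64.4 | 62.6 | 95 | 63.2 | 54.5 | 143 | 64.7 | 64.5 |
| 55 | 75 | 40 | 67.4 | 62.0 | 53 | 65.8 | 53.9 | 79 | 67.9 | 64.6 |
| 60 | 70 | 152 | 66.4 | 63.6 | 205 | 65.5 | 54.8 | 309 | 66.4 | 64.3 |
| 60 | 75 | 68 | 69.3 | 62.3 | 90 | 68.1 | 54.1 | 135 | 69.6 | 64.3 |
| 60 | 80 | 38 | 72.2 | 61.1 | 49 | 70.8 | 53.8 | 74 | 72.9 | 64.3 |
| 65 | 75 | 143 | 71.3 | 63.0 | 190 | 70.5 | 54.7 | 288 | 71.4 | 64.0 |
| 65 | 80 | 63 | 74.3 | 61.7 | 82 | 73.1 | 54.1 | 124 | 74.6 | 64.1 |
| 65 | 85 | 35 | 77.0 | 60.2 | 44 | 75.7 | 53.7 | 67 | 77.8 | 64.1 |
| 70 | 80 | 129 | 76.3 | 62.7 | 171 | 75.4 | 54.5 | 260 | 76.4 | 63.8 |
| 70 | 85 | 57 | 79.1 | 60.7 | 73 | 78.0 | 53.6 | 111 | 79.5 | 63.6 |
| 75 | 85 | 113 | 81.2 | 61.9 | 147 | 80.4 | 54.3 | 225 | 81.4 | 63.6 |
| 75 | 90 | 49# | 83.9 | 59.5 | 61 | 83.0 | 53.4 | 94 | 84.5 | 63.3 |
| 80 | 90 | 93 | 86.1 | 60.9 | 119 | 85.4 | 53.8 | 183 | 86.3 | 63.3 |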

RUL = upper limit of RED zone; GLL = lower limit of GREEN zone; Asig = AMBER statistical significance threshold (within the AMBER zone), where an observed estimate equal to or below the cut-point will result in a non-significant result (p≥0.05 or p≥0.01) and figures above the cut-point will be significant (p<0.05 or p<0.01); AR% = percent of the AMBER zone yielding a non-significant test result (% within the AMBER_R sub-zone).
Sample sizes were derived using the normal approximation to the binomial distribution (with continuity correction) formula given in the Appendix, which by convention is stable for np>5 and n(1-p)>5 (this is the case for the scenarios in the above table except where indicated by #, where n(1-p)=4.9 and 5.0). * Derived sample size is less than 25, which may give overly wide confidence intervals (if n≥25 (recommended), the standard error for purposes of 1-sided interval estimation will be no greater than 0.1).
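As an illustration of how such sample sizes can be computed, the sketch below uses a commonly quoted one-sample, one-sided normal-approximation formula with continuity correction. The Appendix expression is not reproduced in this section, so the exact form used here is an assumption, as are the function and argument names; it is offered as a sketch, not as the authors' implementation. Under that assumption it returns the n values quoted in Box 1 (79, 35 and 44) for α=0.05 and 90% power.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(rul: float, gll: float, alpha: float = 0.05, power: float = 0.90) -> int:
    """One-sample, one-sided sample size for testing against RUL when GLL is true.

    Assumed form (normal approximation with continuity correction):
      n' = [z_a * sqrt(RUL(1-RUL)) + z_b * sqrt(GLL(1-GLL))]^2 / (GLL-RUL)^2
      n  = (n'/4) * [1 + sqrt(1 + 2 / (n' * (GLL-RUL)))]^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)       # power = 1 - beta
    delta = gll - rul
    n_uncorrected = (
        z_alpha * sqrt(rul * (1 - rul)) + z_beta * sqrt(gll * (1 - gll))
    ) ** 2 / delta ** 2
    n_corrected = n_uncorrected / 4 * (1 + sqrt(1 + 2 / (n_uncorrected * delta))) ** 2
    return ceil(n_corrected)  # round up to a whole participant


# RUL/GLL pairs taken from Box 1 (alpha = 0.05 one-sided, power = 90%).
for rul, gll in ((0.20, 0.35), (0.50, 0.75), (0.65, 0.85)):
    print(rul, gll, sample_size(rul, gll))
```

The sketch covers only n; it does not attempt to reproduce the Asig or AR% columns.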
Multi-criteria assessment
We recommend that progression for all key feasibility criteria should be considered separately, and hence overall progression would be determined by the worst-performing criterion, e.g. RED if at least one signal is RED; AMBER if none of the signals fall into RED but at least one falls into AMBER; GREEN if all signals fall into the GREEN zone. Hence, the GREEN signal to ‘GO’ across the set of individual criteria will indicate that progression to a main trial can take place without any necessary changes. A signal to ‘STOP’ and not proceed to a main trial is recommended if any of the observed estimates are ‘unacceptably’ low (i.e. fall within the RED zone). Otherwise, where neither ‘GO’ nor ‘STOP’ is signalled, the design of the trial will need amending, as indicated by sub-par performance on one or more of the criteria.
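The worst-performing-criterion rule described above can be sketched as follows. The names `SEVERITY` and `overall_signal` are illustrative assumptions, and each criterion is assumed to have already been assigned one of the three signals.

```python
# Higher number = worse signal.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}


def overall_signal(signals):
    """Return the worst signal across a set of per-criterion signals."""
    signals = list(signals)
    if not signals:
        raise ValueError("at least one criterion signal is required")
    return max(signals, key=SEVERITY.__getitem__)


# Hypothetical signals for three criteria.
print(overall_signal(["GREEN", "AMBER", "GREEN"]))  # AMBER
print(overall_signal(["GREEN", "RED", "AMBER"]))    # RED
print(overall_signal(["GREEN", "GREEN", "GREEN"]))  # GREEN
```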
Sample size requirements across multi-criteria will vary according to the designated parameters linked to the progression criteria, which may be set at different stages of the study on different numbers of patients (e.g. those screened, eligible, recruited and randomised, allocated to the intervention arm, total followed up). This is illustrated in Box 1 for statistical testing requirements at different levels, e.g. number needed to be targeted/screened (for potential recruitment); number to be randomised; number to be randomised specifically to the intervention arm. The overall size needed will be dictated by the requirement to power each of the multi-criteria statistical tests. Since these tests will yield separate conclusions with regard to the decision to ‘GO’ across all individual feasibility criteria, there is no need to consider a multiple testing correction with respect to alpha. However, researchers may wish to increase power (and hence sample size) to ensure adequate power to detect ‘GO’ signals across the collective set of feasibility criteria. For example, powering at 90% across three (assumed independent) criteria will ensure a collective power of 73% (i.e. 0.9^3), which may be considered reasonable; but 80% power across five criteria will give an overall probability of only 33% (i.e. 0.8^5) for identifying ‘GO’ signals across all five criteria.
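The collective-power arithmetic in this paragraph can be checked with a few lines. Both helper names are assumptions introduced for illustration, and both rely on the same independence assumption stated in the text; the second helper simply inverts the calculation to show what per-criterion power a chosen collective target would imply.

```python
def collective_power(power_per_criterion: float, n_criteria: int) -> float:
    """Probability that all (assumed independent) criteria signal 'GO'."""
    return power_per_criterion ** n_criteria


def per_criterion_power(target_collective: float, n_criteria: int) -> float:
    """Per-criterion power needed to reach a collective target (same assumption)."""
    return target_collective ** (1 / n_criteria)


print(round(collective_power(0.90, 3), 2))  # 0.73
print(round(collective_power(0.80, 5), 2))  # 0.33
```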
Box 1: A two-arm parallel design (1:1 allocation to intervention and control arms) with three key feasibility objectives, to assess: (i) recruitment uptake (percent of screened patients recruited); (ii) treatment fidelity; (iii) participant retention (follow up). Hypothesis testing incorporates α (1-sided) = 5% and power = 90%.
Assume the progression criteria (and affiliated sample size requirements) for each are as follows:
(i) Recruitment uptake: ≤20% (RED zone), ≥35% (GREEN zone) {RUL = 20% / GLL = 35%}
 Required sample size n = 79 [total screened patients]
(ii) Treatment fidelity: ≤50% (RED zone), ≥75% (GREEN zone) {RUL = 50% / GLL = 75%}
 Required sample size n = 35 [intervention arm only]
(iii) Follow up: ≤65% (RED zone), ≥85% (GREEN zone) {RUL = 65% / GLL = 85%}
 Required sample size n = 44 (total randomised participants with 22 per arm)
The sample sizes across criteria (i)-(iii) are at different levels: (i) is at the level of screened patients, whereas (ii)-(iii) are at the level of all randomised patients. To meet criterion (i) we need n_screened≥79 (although we anticipate n_screened=200, i.e. (1/0.35) × n_randomised, where 0.35 is the expected proportion uptake of the total number screened), and for (ii)-(iii) we need n_randomised=70 (35 per arm, based on (ii)).
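The arithmetic linking the randomised and screened totals can be sketched as below. The function name `screened_needed` is an assumption; uptake is passed as a whole-number percentage so that the division stays in integer arithmetic and avoids floating-point rounding at exact boundaries.

```python
def screened_needed(n_randomised: int, uptake_percent: int) -> int:
    """Smallest number screened so that uptake_percent of them reaches n_randomised."""
    # Ceiling division on integers: ceil(n_randomised * 100 / uptake_percent).
    return -((-n_randomised * 100) // uptake_percent)


n_per_arm = 35                       # criterion (ii): intervention arm only
n_followup_total = 44                # criterion (iii): total randomised
n_randomised = max(2 * n_per_arm, n_followup_total)  # 1:1 allocation

print(n_randomised)                        # 70
print(screened_needed(n_randomised, 35))   # 200 (uptake at the GREEN threshold)
print(screened_needed(n_randomised, 20))   # 350 (uptake at the RED threshold)
```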
Taking each of the objectives in turn (and the updated sample sizes to meet the multi-criteria objectives), we express progression criteria for the three objectives as follows:
(i) Recruitment uptake [n_screened≥79; expected n_screened=200; maximum n_screened=350 (i.e. (1/0.2) × n_randomised)]
 E ≤ 0.2 [P ≥ 0.05] → RED (STOP)
 0.2 < E < 0.35 → AMBER (AMEND)
 E ≥ 0.35 [P < 0.05] → GREEN (GO)
(with the following signals for expected n_screened=200: 0–40 (RED); 41–69 (AMBER); 70–200 (GREEN)) [i.e. ≤0.2×200=40 and ≥0.35×200=70]
(ii) Treatment fidelity [n_interventionarm=35 (intervention arm only)]
 E ≤ 0.5 [P ≥ 0.05] → RED (STOP)
 0.5 < E < 0.75 → AMBER (AMEND)
 E ≥ 0.75 [P < 0.05] → GREEN (GO)
(with the following signals for n_interventionarm=35: 0–17 (RED); 18–26 (AMBER); 27–35 (GREEN)) [i.e. ≤0.5×35=17.5, rounded down to 17, and ≥0.75×35=26.25, rounded up to 27]
(iii) Follow up [n_randomised=70 (intervention and control arms)]
 E ≤ 0.65 [P ≥ 0.05] → RED (STOP)
 0.65 < E < 0.85 → AMBER (AMEND)
 E ≥ 0.85 [P < 0.05] → GREEN (GO)
(with the following signals for n_randomised=70: 0–45 (RED); 46–59 (AMBER); 60–70 (GREEN)) [i.e. ≤0.65×70=45.5, rounded down to 45, and ≥0.85×70=59.5, rounded up to 60]
In accordance with the multi-criteria aim, the decision to proceed would be based on the worst signal:
 If signal=RED for (i) or (ii) or (iii) → overall signal is RED
 Else, if no signal is RED but signal=AMBER for (i) or (ii) or (iii) → overall signal is AMBER
 Else, if signals=GREEN for (i) and (ii) and (iii) → overall signal is GREEN
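The count ranges quoted for each criterion in Box 1 follow from rounding the RED limit down and the GREEN limit up to whole participants. A minimal sketch is given below; `count_bands` is a hypothetical helper, and thresholds are passed as whole-number percentages so that the rounding is done in exact integer arithmetic.

```python
def count_bands(n: int, rul_percent: int, gll_percent: int) -> dict:
    """Convert RUL/GLL (in percent) into inclusive count ranges for a sample of size n."""
    red_max = (rul_percent * n) // 100          # largest count with E <= RUL
    green_min = -((-gll_percent * n) // 100)    # smallest count with E >= GLL
    return {
        "RED": (0, red_max),
        "AMBER": (red_max + 1, green_min - 1),
        "GREEN": (green_min, n),
    }


# The three criteria of Box 1: (n, RUL %, GLL %).
for label, n, rul, gll in (
    ("recruitment uptake", 200, 20, 35),
    ("treatment fidelity", 35, 50, 75),
    ("follow up", 70, 65, 85),
):
    print(label, count_bands(n, rul, gll))
```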
Further expansion of AMBER zone
Within the same sample size framework, the AMBER zone may be further split to indicate whether ‘minor’ or ‘major’ amendments are required, according to the significance of the p-value. Consider a 2-way split in the AMBER zone, designated AMBER_R (region of the AMBER zone adjacent to the RED zone) and AMBER_G (region of the AMBER zone adjacent to the GREEN zone). This would draw on two possible levels of amendment (major amend and minor amend), as shown in Figure 1(b). Hence, the reconfigured approach would follow as:
 E ≤ RUL [P-value non-significant (P ≥ α)] → RED (unacceptable - STOP)
 RUL < E < GLL → AMBER (potentially acceptable - AMEND)
 RUL < E < GLL and P ≥ α → AMBER_R (major AMEND)
 RUL < E < GLL and P < α → AMBER_G (minor AMEND)
 E ≥ GLL [P-value significant (P < α)] → GREEN (acceptable - GO)
Figure 2 illustrates the signals according to this expanded approach. The worked example for this extension is presented in Box 2.
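The expanded rules can be sketched as a function that takes the observed estimate together with the p-value from the one-sided test against RUL. The function name and the labels "AMBER_R" and "AMBER_G" are illustrative assumptions, and the p-value is assumed to be supplied by the caller from whichever test the analysis plan specifies; the sketch does not compute it.

```python
def classify_signal_4tier(
    estimate: float, rul: float, gll: float, p_value: float, alpha: float = 0.05
) -> str:
    """Return the 4-tier signal: RED, AMBER_R (major), AMBER_G (minor) or GREEN."""
    if estimate <= rul:
        return "RED"                      # unacceptable: STOP
    if estimate >= gll:
        return "GREEN"                    # acceptable: GO
    # Inside the AMBER zone the split depends on the test against RUL.
    return "AMBER_G" if p_value < alpha else "AMBER_R"


# Hypothetical estimate/p-value pairs against RUL = 0.20 and GLL = 0.35.
print(classify_signal_4tier(0.23, 0.20, 0.35, p_value=0.16))   # AMBER_R
print(classify_signal_4tier(0.30, 0.20, 0.35, p_value=0.001))  # AMBER_G
```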
Box 2:
Taking each of the objectives in turn, we re-express the progression criteria for the three objectives according to the 4-tiered approach, as follows:
(i) Recruitment uptake [expected n_screened=200]
 E ≤ 0.2 [P ≥ 0.05] → RED (STOP)
 0.2 < E < 0.35 → AMBER (AMEND)
 0.2 < E < 0.256 [P ≥ 0.05] → AMBER_R (AMEND-major)
 0.256 < E < 0.35 [P < 0.05] → AMBER_G (AMEND-minor)
 E ≥ 0.35 [P < 0.05] → GREEN (GO)
(with the following signals for n_screened=200: 0–40 (RED); 41–51 (AMBER-major); 52–69 (AMBER-minor); 70–200 (GREEN))
(ii) Treatment fidelity [n_interventionarm=35 (intervention arm only)]
 E ≤ 0.5 [P ≥ 0.05] → RED (STOP)
 0.5 < E < 0.75 → AMBER (AMEND)
 0.5 < E < 0.643 [P ≥ 0.05] → AMBER_R (AMEND-major)
 0.643 < E < 0.75 [P < 0.05] → AMBER_G (AMEND-minor)
 E ≥ 0.75 [P < 0.05] → GREEN (GO)
(with the following signals for n_interventionarm=35: 0–17 (RED); 18–22 (AMBER-major); 23–26 (AMBER-minor); 27–35 (GREEN))
(iii) Follow up [n_randomised=70 (intervention and control arms)]
 E ≤ 0.65 [P ≥ 0.05] → RED (STOP)
 0.65 < E < 0.85 → AMBER (AMEND)
 0.65 < E < 0.742 [P ≥ 0.05] → AMBER_R (AMEND-major)
 0.742 < E < 0.85 [P < 0.05] → AMBER_G (AMEND-minor)
 E ≥ 0.85 [P < 0.05] → GREEN (GO)
(with the following signals for n_randomised=70: 0–45 (RED); 46–51 (AMBER-major); 52–59 (AMBER-minor); 60–70 (GREEN))
In accordance with the multi-criteria aim, the decision to proceed would be based on the worst signal (as in Box 1).
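The count ranges in Box 2 can be recovered from the stated cut-points by rounding each boundary to whole participants. The sketch below takes the AMBER significance cut-point as an input (the values 0.256, 0.643 and 0.742 are copied from Box 2 rather than derived); `count_bands_4tier` is a hypothetical helper, and proportions are passed as decimal strings so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from math import ceil, floor


def count_bands_4tier(n: int, rul: str, a_sig: str, gll: str) -> dict:
    """Inclusive count ranges for RED, AMBER_R (major), AMBER_G (minor) and GREEN."""
    red_max = floor(Fraction(rul) * n)      # largest count with E <= RUL
    major_max = floor(Fraction(a_sig) * n)  # largest count at or below the cut-point
    green_min = ceil(Fraction(gll) * n)     # smallest count with E >= GLL
    return {
        "RED": (0, red_max),
        "AMBER_R": (red_max + 1, major_max),
        "AMBER_G": (major_max + 1, green_min - 1),
        "GREEN": (green_min, n),
    }


# Cut-points as quoted in Box 2: (n, RUL, significance cut-point, GLL).
for n, rul, a_sig, gll in (
    (200, "0.2", "0.256", "0.35"),
    (35, "0.5", "0.643", "0.75"),
    (70, "0.65", "0.742", "0.85"),
):
    print(n, count_bands_4tier(n, rul, a_sig, gll))
```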