Panelists
We invited 161 panelists to participate in rounds 1 and 2; 45 completed round 1 (three others partially) and 41 completed round 2 (another five partially). Round 3 was sent only to the 58 panelists who had at least opened the link to one of the previous rounds (52 of whom had participated in round 1 or round 2); 39 panelists completed the third round. Thirty-six panelists completed all three rounds, ten completed two rounds, and twelve completed one round. See Table 1 for descriptive information on the participating panelists.
Preparatory input to the Delphi study
In the COSMIN database of systematic reviews of outcome measurement instruments, we found 174 reviews that included ClinROMs, PerFOMs or laboratory values. Of these, 103 reviews described standards on any measurement property to assess the quality of the study design or the statistical methods used, of which 30 provided standards specifically for reliability and measurement error studies (see Appendix 2). Through references in these reviews, we found three methodological papers describing a relevant checklist or guideline [17-19], one extraction form [20], and one additional systematic review [21]. All standards from this literature were used as input for our Delphi study. Important themes in these standards were standardization of the application of the instrument (e.g. instructions about the specific equipment and settings to be used, the environment, and the professionals involved (e.g. training)), independence and blinding, stability of patients, time interval, and statistical methods. We tested these items on three published studies in which generalizability theory was applied [22-24], and subsequently realized that the research questions in these papers were not specific enough to assess whether the chosen design and statistical models were appropriate. Based on these experiences, we felt the need to disentangle the process of assessing the quality of a study on reliability or measurement error into two steps: (1) understanding how exactly the results of a study inform us about the quality of an instrument, and (2) assessing the quality of the study. As a foundation for elaborating these two steps, we decided to first identify all general components (i.e. potential sources of variation) of a measurement instrument.
Components of outcome measurement instruments
In round 1 we started with a list of components of outcome measurement instruments that can be considered potential sources of variation influencing the score on the measurement instrument. Based on panelists' suggestions and comments, in round 2 the steering committee decided to propose two sets of components: one for outcome measurement instruments that involve biological sampling (i.e. blood or urine tests, tissue biopsy), and one for those that do not (i.e. ClinROMs and PerFOMs). This was proposed because the words 'data' and 'score', which we had proposed for specific components, were not considered appropriate for laboratory values. We reached consensus to use the words 'biological sample' and 'value', respectively, for outcome measurement instruments that involve biological sampling. Except for one, we reached consensus on all terms for the components and their elaboration (see Tables 2 and 3 and Appendices 3 and 4). For the remaining issue, the steering committee decided to use the term 'determination of the value of the biological sample' over its alternative 'actual measurement of the value of the biological sample'.
Elements of a comprehensive research question
In order to understand how exactly the result of a reliability study informs us about the quality of the measurement instrument under study, in round 1 we agreed on seven elements that can be disentangled from the described design of the study and that together form a comprehensive research question (Table 4). In round 2 we proposed an alternative wording for element 4 (see Appendix 5). As a result, agreement on this element increased from 70% to 86%.
Standards on design requirements of studies on reliability and measurement error
To assess reliability or measurement error of an outcome measurement instrument, repeated measurements in stable patients are required. The design of a study assessing either of the two measurement properties is the same, i.e. the same data can be used to estimate reliability and measurement error; only different statistical parameters are applied to the same data to express the two measurement properties.
In round 2 we reached consensus on five standards on design requirements, referring to stable patients, an appropriate time interval, similar measurement conditions, and independent measurements and scoring (Table 5). Alternative wordings for standards 4 and 5 increased consensus from 73% and 78%, respectively, to 92% in round 3.
Standards on preferred statistical methods of studies on reliability
We reached consensus on three standards (Table 6) on preferred statistical methods to assess reliability of outcome measures with continuous, ordinal and dichotomous/nominal scores, respectively, and on how these standards should be rated. The preferred statistical methods are ICCs and (weighted) kappa. Based on suggestions by the panelists, in round 3 we asked whether standard 7 for continuous scores should be rated as inadequate when the data were not normally distributed; this did not reach consensus (54%) and the proposal was therefore not included in the standard. The most important issue was the relatively low degree of consensus on the kappa statistic as the preferred statistical method: in round 2, 67% agreed that the weighted kappa was the preferred statistical method to assess reliability of ordinal scores (standard 8 for reliability), and 56% agreed that kappa was the preferred statistical method to assess reliability of dichotomous/nominal scores (standard 9 for reliability). Issues raised included the difficulty of interpreting a kappa value and its dependence on the prevalence of a specific outcome (i.e. the heterogeneity of the sample). Panelists recommended reporting the marginals as well as the percentage specific agreement. However, specific agreement is considered a parameter of measurement error (agreement), and therefore cannot be proposed as a preferred statistical method to assess reliability. In round 3 we again proposed the (weighted) kappa as the preferred statistical method to assess reliability of ordinal scores, while acknowledging that reliability is less informative than measurement error (standard 8 for reliability); for dichotomous/nominal scores we proposed kappa calculated for each category against the other categories combined (standard 9 for reliability). The percentage consensus for standards 8 and 9 for reliability increased to 73% and 71%, respectively.
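As an illustration of these preferred parameters, the sketch below computes a two-way random-effects ICC for single measurements and a weighted kappa for two ratings of the same patients. The function name, the choice of ICC model (ICC(2,1)), the quadratic weighting scheme, and the example data are our own illustrative assumptions; they are not prescribed by the consensus standards.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def icc_2_1(x):
    """Two-way random-effects, absolute-agreement, single-measurement ICC(2,1).

    x: (n_subjects, k_raters) array of continuous scores.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between-subjects mean square
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between-raters mean square
    ss_e = np.sum((x - x.mean(axis=1, keepdims=True)
                     - x.mean(axis=0, keepdims=True) + grand) ** 2)
    ms_e = ss_e / ((n - 1) * (k - 1))                             # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical example: five patients each measured twice (e.g. two raters or two occasions)
scores_t1 = [2, 4, 3, 5, 1]
scores_t2 = [3, 4, 3, 5, 2]
print(icc_2_1(np.column_stack([scores_t1, scores_t2])))                      # ICC for continuous scores
print(cohen_kappa_score(scores_t1, scores_t2, weights="quadratic"))          # weighted kappa for ordinal scores
```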
We did not reach consensus on what is considered an adequate method to assess reliability of ordinal scores (standard 8 for reliability). In round 2, 60% of the panelists agreed or strongly agreed with the proposal to rate the standard as 'adequate' when in a study 'the weighted kappa was calculated, but the weighting scheme was not described'. In round 3 we proposed to rate the standard as 'adequate' when 'the kappa is calculated, but weighting scheme is not described or does not optimally match the reviewer constructed research question'. This proposal was in line with the proposal for the preferred statistical method to assess reliability of continuous scores. Only 54% agreed or strongly agreed with this proposal. In round 2, 62% consensus was reached on the proposal to rate a study using the unweighted kappa statistic for ordinal scores as 'doubtful', while in round 3 only 49% (strongly) agreed to rate a study as 'adequate' when the unweighted kappa statistic was used. Panelists argued that the weighted kappa is mathematically the same as the ICC. After round 3, we discussed this issue further within the steering committee and decided to keep the wording suggested in round 3 (Table 7), to be in line with the standard for continuous scores.
Standards on preferred statistical methods of studies on measurement error
We reached consensus on two standards on preferred statistical methods to assess measurement error (Table 7). For continuous scores (standard 7 for measurement error) we reached consensus that the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC), Limits of Agreement (LoA) or Coefficient of Variation (CV) were the preferred statistical methods, and for ordinal/dichotomous/nominal scores (standard 8 for measurement error) the percentage specific (e.g. positive and negative) agreement was preferred (see Table 8). In round 3 we agreed on an alternative wording for the responses of the four-point rating system of the standard for continuous scores, to be in line with the proposed wording for the standard on reliability for continuous scores.
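As a worked illustration of these parameters, the sketch below derives them from two repeated measurements in the same (assumed stable) patients. The function names, the 2x2 table layout for specific agreement, and the computational conventions (SEM taken as the standard deviation of the differences divided by the square root of 2, SDC at the individual level, within-subject CV as SEM divided by the overall mean) are illustrative assumptions of one common approach, not the only acceptable formulas.

```python
import numpy as np

def measurement_error_parameters(t1, t2):
    """Illustrative sketch: SEM, SDC, limits of agreement and CV for two repeated
    continuous measurements (test-retest) on the same stable patients."""
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    diff = t1 - t2
    sd_diff = diff.std(ddof=1)
    sem = sd_diff / np.sqrt(2)                      # standard error of measurement
    sdc = 1.96 * np.sqrt(2) * sem                   # smallest detectable change (individual level)
    loa = (diff.mean() - 1.96 * sd_diff,
           diff.mean() + 1.96 * sd_diff)            # Bland-Altman limits of agreement
    cv = sem / np.concatenate([t1, t2]).mean()      # within-subject coefficient of variation
    return {"SEM": sem, "SDC": sdc, "LoA": loa, "CV": cv}

def specific_agreement(a, b, c, d):
    """Percentage positive and negative specific agreement from a 2x2 table of two
    dichotomous ratings: a = both positive, d = both negative, b and c = disagreements."""
    positive = 2 * a / (2 * a + b + c)
    negative = 2 * d / (2 * d + b + c)
    return positive, negative
```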
Sometimes Cronbach’s alpha instead of the ICC is used to calculate the measurement error with the formula SEM = σy√(1 − α), where σy represents the standard deviation (SD) of the sample [25]. The panelists agreed that this method is inadequate, because it is based on one full-scale measurement in which the items are considered the repeated measurements, instead of at least two full-scale measurements using the total score in the calculation of the SEM. Moreover, Cronbach’s alpha is sometimes used inadequately, because it is calculated for a scale that is not unidimensional or that is based on a formative model; in such cases Cronbach’s alpha cannot be interpreted. Some panelists argued that this method of SEM calculation was better than nothing. With the explanation that a rating of ‘inadequate’ means that the SEM resulting from such a study can still be used, but that the results are less trustworthy, 72% agreed to rate ‘a SEM calculated based on Cronbach’s alpha, or using SD from another population’ as ‘inadequate’.
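To make the contrast concrete, a minimal sketch of the single-administration approach described above (the one rated 'inadequate') is given below; the function name and the assumption of a persons-by-items score matrix are illustrative.

```python
import numpy as np

def sem_from_cronbachs_alpha(item_scores):
    """Single-administration SEM = sigma_y * sqrt(1 - alpha): items are treated as the
    repeated measurements and sigma_y is the SD of the total scores, instead of deriving
    the SEM from at least two full-scale measurements."""
    x = np.asarray(item_scores, float)            # shape: (n_persons, n_items)
    k = x.shape[1]
    total = x.sum(axis=1)
    alpha = k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / total.var(ddof=1))
    return total.std(ddof=1) * np.sqrt(1 - alpha)
```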
In round 2 we reached 53% consensus on considering the Coefficient of Variation (CV) the preferred statistical method to assess measurement error for scales with proportion or percentage scores. Several panelists pointed out that the CV is also frequently used for continuous scores, specifically for laboratory values. We therefore proposed adding the CV to standard 7 for measurement error as an appropriate statistical method for continuous scores, and reached 73% consensus.
Term for ‘research question’
One final issue remained without consensus. In general, the statistical methods should match the research question and study design. We proposed to state that the statistical methods should match the ‘retrospectively formulated research question’ (round 2) or the ‘reviewer constructed research question’ (round 3). However, some panelists considered the term ‘retrospectively’ unclear and inappropriate, as it could be interpreted to mean that the research question was defined afterwards (whereas we meant that it was comprehensively formulated afterwards). The term ‘reviewer constructed research question’ was also considered unclear, as it was not clear to whom ‘reviewer’ referred (i.e. the person who uses the Risk of Bias tool and reviews a study). The steering committee finally decided to use the term ‘study design’ instead, and to state in the standards that the statistical methods should match the ‘study design’.