Delphi procedures are used in the health sciences with the primary goal of finding consensus [1–3]. The aim is “to obtain the most reliable consensus of opinion of a group of experts” [4]. The concept of consensus is often understood to be the majority of the participants agreeing on a standardized item [5]. In healthcare, consensus is most frequently measured using percentage agreement [6, 7]. Zarnowitz & Lambros define consensus as “the degree of agreement among point predictions aimed at the same target by different individuals” [8]. To reach consensus, experts participating in a Delphi procedure evaluate concrete epistemic issues over multiple rounds [3, 9]. A crucial difference from one-time surveys is that the expert panel is not randomly selected and there is no claim of statistical representativeness in the results [10].
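Percentage agreement as a consensus measure can be illustrated with a minimal sketch. The ratings, the agreement band, and the 75% cut-off below are hypothetical illustrations, not data from any of the cited studies:

```python
def percent_agreement(ratings, agree_values):
    """Share of panelists whose rating falls within the agreement band."""
    return sum(1 for r in ratings if r in agree_values) / len(ratings)

# Hypothetical round of 20 expert ratings on a 5-point scale,
# where 4 ("agree") and 5 ("strongly agree") count as agreement.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 2, 4, 5, 4, 4, 3, 5, 4, 4, 5, 4, 4]
pa = percent_agreement(ratings, agree_values={4, 5})
print(f"{pa:.0%}")  # prints "85%"; above a 75% cut-off this would count as consensus
```

Whether such a value counts as consensus depends entirely on the cut-off and agreement band the study team defines in advance.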
In a classic Delphi procedure, the experts' judgments are typically collected anonymously using (online) questionnaires [11]. However, researchers will sometimes modify the classic Delphi so that the survey process fits the goals or the resources available for the project. For instance, this can involve having participants meet face-to-face or limiting the number of rounds from the start [12–14]. In a systematic review of Delphi studies that identify healthcare quality indicators, Boulkedid et al. [13] found that more than half of the Delphi studies reported following a modified Delphi process. Yet, these modifications are not always described or justified [15, 16].
Alongside what often appear to outsiders as nontransparent modifications, there are Delphi variants whose approaches are clearly articulated and justified. Among them are the policy Delphi [17] and the real-time Delphi [18]. Some of the reasons for these developments are to better record the general context behind standardized judgments and to enable anonymous debates of the arguments in real time [17, 18]. However, these variants have been methodologically reflected on or evaluated only in individual cases. Moreover, how modifications affect the overall results of Delphi procedures remains largely unclear. In one example, no significant difference was found in the overall results after comparing a real-time Delphi and a classic Delphi, both from futures research [19]. In our view, despite differences in study design, the following characteristics constitute a Delphi study [3, 4, 9]:
- Survey of several people with specialized knowledge (known as experts) (e.g., operational knowledge, experiential knowledge, functional knowledge, contextual knowledge);
- Carrying out at least two survey rounds or the option to respond at least two times;
- Feedback: the (interim) results are presented to the respondents, who have the opportunity to respond to them.
Typically, Delphi procedures also share a focus on complex topics and questions, the answers to which require a certain amount of expertise and experience [9]. The theoretical assumption underlying the Delphi method is that multiple experts must be asked in order to cover different perspectives [20, 21]. To gather valid and practicable results, those conducting the Delphi must understand the response behavior of the experts, especially when the aim of the Delphi is agreement [22, 23]. A consensus procedure has direct practical relevance, for instance, when consensus is sought to make concrete recommendations for routine clinical practice (e.g., [24]) or for medical curricula (e.g., [25]). Furthermore, Delphi studies are very widespread in the field of health [26]. For these reasons, we want to shed light from a methods perspective on how consensus is found in Delphi studies in the health sciences. First, however, we explain how judgments are formed in Delphi procedures and which factors influence them.
1.1 Theoretical insights on judgment formation in consensus Delphi studies
In standardized surveys, response behavior is described as an ideal-typical process consisting of four steps: I. understanding the question, II. retrieving information, III. evaluating the question, and IV. submitting the response. Making the full cognitive effort at every step is described by Krosnick [27] as optimizing. The intensity of the effort at each step of the response process model depends, among other things, on the motivation and ability of the respondent and the difficulty of the task [27]. Depending on how these factors play out, the response strategy of satisficing can come into play [27]. When satisficing, the respondent engages only superficially with the steps of the response process, or skips individual steps entirely, in order to give an answer to the question asked [27, 28].
From a cognitive psychology perspective, judgment formation in consensus Delphi procedures can be augmented by the idea of mental models [21, 29]. In Delphi studies, it must be assumed that participants go through the steps of the response process in a state of cognitive uncertainty because, typically, uncertain and incomplete knowledge exists regarding the topics, which sometimes extend beyond the experts' main area of expertise [20, 21]. Specialized knowledge on the part of the respondents is required to comprehend, contextualize (step I, Fig. 1), and evaluate questions [3, 21]. When forming judgments (steps II and III, Fig. 1), experts are often required to place the question in a larger context and generate transformation knowledge beyond the scope of their specialty [21, 30], e.g., in regard to the consequences of the judgment for affected groups [31], future generations [32], or other specialized areas in an organization [33]. Furthermore, Delphi procedures sometimes integrate additional information which the respondents should consider when forming their judgments, e.g., a summary of the current state of research [24]. Ideally, all of the information is taken into consideration by the respondents when forming judgments (see step II, Fig. 1).
From the second round onward, feedback plays another central role in the process of forming judgments [4, 11]. The difference from the first Delphi round is that the experts have already formed a mental model and they receive the feedback as additional information (step II, Fig. 1), which can consist of arguments put forth by other experts, statistical data regarding the group response, the expert's response from the previous round, or a combination of these [7]. The feedback is meant to encourage the experts to include previously unconsidered aspects in their mental models in order to give a well-founded and carefully thought-out judgment according to the optimizing strategy [30, 35].
The individual process of coming to a judgment in a Delphi procedure is complex and can only function "optimally" under certain conditions (see "Judgement requirements," Fig. 1), specifically, if the respondents [29, 36]:
- possess extensive knowledge of the topic,
- are familiar with the topic and have experience with it, meaning they regularly engage with the topic under investigation (usually for professional reasons, but also because they are personally affected),
- have certain cognitive abilities and the motivation to specify, structure and evaluate information.
In Delphi studies, judgments can also be influenced by other factors, such as the stated aim of finding consensus, thus making optimizing more difficult [37]. There is still little reflection on the theoretical level about how response behavior in Delphi studies takes shape in practice, even though this seems to be highly relevant. If there is little success in getting the experts to optimize their response process, the results will be less precise and less reliable [28]. Satisficing [27, 28] in Delphi procedures could take the following forms:
- Experts avoid a clear judgment on a specific position, e.g., in that they tend toward the middle of the spectrum or consciously take a decision that differs from the majority.
- Experts respond arbitrarily, e.g., in that they select the first answer on offer.
- Experts do not form a judgment, e.g., in that they leave out questions, choose an evasive category (e.g., "don't know") or discontinue the survey process.
- Experts form deliberate judgments but, consciously or unconsciously, do not consider all of the information equally, e.g., in that they only include the first response options or arguments in the qualitative feedback when forming their opinion.
- Experts respond such that the Delphi will be terminated, e.g., in that they more or less agree with the majority opinion as presented in the feedback in order to support a statistical consensus.
Contrary to the process outlined in the ideal model (Fig. 1), evidence shows that judgment formation is subject to suboptimal conditions that can make optimizing more difficult [28]. Hence, respondents' individual personal characteristics, the situation or the questionnaire's content and visual presentation have effects on response behavior and thus on the overall results. In the following we present an overview of methods studies that shed light on these aspects and show the effects on individual judgments.
1.2 Methodological findings on judgment formation in Delphi studies
Different methodological findings exist in regard to how respondents form their judgments in Delphi procedures, namely:
- Systematic reviews based on publications of Delphi studies [1, 2, 6, 7, 14, 16, 38]
- Method experiments [23, 33, 37, 39–44]
- Evaluation studies [22, 45–47]
- Reports by Delphi practitioners [11, 48, 49]
According to these findings, three factors have a direct or indirect influence on the overall result of a Delphi study: the expert panel, the questionnaire design, and the process and feedback design (Fig. 2). How these three factors exert influence on individual response behavior and the overall result of a Delphi study is explained in the following.
1. The expert panel is a central feature of the Delphi method. Empirical evidence demonstrates that five aspects affect the individual judgments of the respondents:
- The subjective perception of the baseline (e.g., estimation of the topic's relevance, majority opinions in the field) [41, 42, 47]
- The actual professional knowledge and experience (e.g., knowledge about current studies, position in an organization, lifeworld experiences) [22, 31, 50–52]
- The intention to participate in the Delphi study (e.g., personal and/or institutional interests and objectives) [22, 47]
- Personal characteristics (e.g., value systems) and sociodemographic profile (e.g., age) [22, 31, 43, 51]
- The assessment of the Delphi study (e.g., relevance or clarity of the study's aims and the Delphi procedure) [45, 46]
Although it must be assumed that, along with expertise, these diversity variables have an effect, they are generally not considered in the selection of the experts for Delphi studies [51]. Selection is typically done on the basis of professional expertise, e.g., through professional associations [16, 24]. Furthermore, the composition and size of the expert panel are relevant to the process of finding consensus. To ensure sufficient professional heterogeneity, it is recommended that experts be recruited through purposive sampling rather than snowballing [10, 11]. How large a Delphi panel should be is not methodologically determined; usually the number is in the low double digits [7]. Statistical models demonstrate that groups of this size can deliver stable final results, which depend, of course, on the topic and the composition of the expert panel [53, 54]. When different expert groups, e.g., from different disciplines, are included in unequal numbers, biases can emerge in favor of the judgments from the expert group with the most members. Beiderbeck et al. [55] therefore advise performing subgroup analyses if there are 15 to 20 members per expert group.
Generally speaking, heterogeneous panels reach consensus overall for fewer aspects of a question than homogeneous panels [33, 52]. Still, the heterogeneity of the expert panel regarding professional knowledge is considered quality-enhancing for Delphi studies, despite the partially unclear influences on the overall study results [14, 16, 48, 49]. Influences arising from the individual personalities of the respondents carry less weight as a result [48].
2. By questionnaire design, we mean how questions in Delphi studies are a) presented visually and in terms of content and sequence and b) designed methodologically, e.g., the types of questions (open/closed) and scales used. As in every standardized survey, bias in filling out the questionnaire can be avoided in Delphi procedures by adhering to the recommendations for designing standardized questionnaires [56].
a) Potential biases due to how questions and responses are formulated and presented can be mitigated in Delphi procedures [48]. The likely reason is that experts (drawing on their mental models) engage with the questions more analytically than citizens do with questions in opinion polls [57].
Nevertheless, a questionnaire's complexity plays an important role in Delphi procedures [58]. An item's length should not exceed 25 words according to a recommendation from futures research [59]. Markmann et al. [23] found that longer and more abstract statements make the formation of individual judgments in Delphi studies more difficult and lead to more moderate judgments.
Brookes et al. [60] investigated order effects on judgments in Delphi studies by presenting topic blocks to the participants in different sequences. They determined effects on the judgments of patients and healthcare professionals that were equally relevant to the overall result but could influence it in different ways. Brookes et al. [60] as well as Hallowell & Gambatese [61] therefore recommend randomizing the question order in Delphi studies, a practice reported in several studies (e.g., [62, 63]).
b) The relevance of open questions varies across Delphi procedures. Standardized items mostly dominate, and open comments are used only in individual cases or in an initial qualitative round [10, 64]. However, there are also Delphi studies that focus on the exchange of arguments from open comments [65]. The aim of free-text responses can be to supplement or specify details, or to justify or appraise the judgments [64]. The problem, though, is that the handling and analysis of such responses in the Delphi procedure is often not undertaken systematically [64]. In these cases it is questionable whether the increased cognitive effort required of the respondents can be justified [66].
The study findings are unclear on scale range and the design of rating scales in Delphi studies. Different reviews show that rating scales are typically used to measure consensus in Delphi procedures and often have five or more gradations [6, 7, 16, 38]. Based on the results of their review of Delphi studies in histopathology, Taze et al. [16] recommend the use of a "nine-point Likert scale with a 'no opinion' option and a free-text comment box" [16]. Initial analyses indicate that scales of different lengths lead to different end results [67, 68]. In a comparison of three scale lengths (3-point, 5-point, and 9-point rating scales), Lange et al. [67] determined that the 5-point rating scale with a cut-off value of 75% achieved the least consensus and the 9-point scale the most. It must be noted that, while the scales had the same defined cut-off value, different numbers of scale points were included in the definition of consensus [67]. Meyer et al. [68] found higher consensus with a longer scale, whereby consensus was also defined here using one (3-point scale) or more (9-point scale) scale points. Both studies concluded that recommendations for direct action in clinical practice can be derived with a 3-point scale (e.g., “main goal,” “secondary goal” and “no goal”) and the result is simpler to interpret than with longer scales [67, 68].
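The interaction of scale length, agreement band, and cut-off described above can be sketched as follows. The ratings and the agreement bands are hypothetical illustrations, not the data of Lange et al. [67] or Meyer et al. [68]:

```python
def consensus_reached(ratings, agree_points, cutoff=0.75):
    """True if the share of ratings within the agreement band meets the cut-off."""
    share = sum(r in agree_points for r in ratings) / len(ratings)
    return share >= cutoff

# The same hypothetical panel of 10 experts rates one item on two scales.
# On the 3-point scale only the single top point counts as agreement;
# on the 9-point scale a band of three points (7-9) counts.
ratings_3pt = [3, 3, 2, 3, 3, 3, 2, 3, 3, 1]
ratings_9pt = [9, 8, 6, 7, 9, 8, 7, 7, 8, 4]

print(consensus_reached(ratings_3pt, {3}))        # False: 70% < 75%
print(consensus_reached(ratings_9pt, {7, 8, 9}))  # True: 80% >= 75%
```

The identical 75% cut-off thus yields different consensus outcomes solely because the longer scale admits a wider agreement band, which is the comparability problem noted above.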
3. Another factor influencing individual judgments and the overall result is process and feedback design. This influence is seen on three levels: a) the communicated consensus, b) the aggregated feedback and c) the individual feedback on a participant's response from the previous Delphi round.
a) Barrios et al. [37] differentially analyzed the influence of the level of agreement on individual judgments and observed that if the consensus was over 75%, participants were more likely to converge with the opinion of the group than if the group's aggregated consensus was below this value [37]. Signs of a conscious judgment against the consensus were observed by Barrios et al. [37] when the value in the feedback lay below the consensus level of the percent agreement. They speculate that revealing the consensus can lead experts to consciously manipulate results [37]. Although this influence is theoretically probable, we are not aware of other publications on the effect of disclosing the level of consensus or dissent on individual judgments, e.g., in the communication of the Delphi study's aims or as part of the feedback.
b) Feedback involving the statistical group response dominates in Delphi studies, while peer feedback is less frequently given [7]. The extent to which the form and type of feedback (qualitative or quantitative) affects judgment behavior is disputed [12, 41, 48]. Some Delphi practitioners support the use of qualitative instead of quantitative feedback so that the experts do not prematurely side with the majority opinion (bandwagon effect) [48, 49]. However, this assumes that the open responses are not presented in an unfiltered form but are systematically analyzed, which, as already described, is not always the case [48, 64]. In addition, effects of differentiating the feedback according to expert groups have also been demonstrated [39]. Brookes et al. [39] showed that, if the feedback contains information on different groups of participants, the level of agreement between the expert groups increases compared to peer feedback. MacLennan et al. [69] also carried out a randomized Delphi study with different feedback strategies but were unable to confirm the effects observed by Brookes et al. [39]. Fish et al. [43] hypothesize that, in comparison to healthcare professionals, patients less often integrate the feedback of other expert groups and hence do not reflect as much on judgments made from other perspectives. Turnbull et al. [40] report similar findings.
c) A randomized experimental study on urban sustainability by Meijering & Tobi [44] demonstrated that experts less often adjust their judgment when they see their response from the previous round; however, an effect on the final consensus could not be determined [44].
We are not aware of analyses of other factors affecting the process and feedback design, e.g., how the termination criterion or the numbers of rounds influence individual judgments and the overall result. Having said this, though, the termination criterion and the number of rounds are relevant in order to ascertain whether the consensus is stable and valid [1, 2].
1.3 Aims of the systematic review
The methodical tests, experiments and discussions presented here concerning Delphi procedures in the health sciences ultimately identify three factors which can be assumed to exert an influence on the results of a Delphi procedure and thus on the consensus: the expert panel, questionnaire design, and process and feedback design (Fig. 2). These three factors serve as the basis for the present systematic review. Finding consensus is the most common aim of Delphi studies in healthcare [1–3].
The following research questions are answered in this systematic review:
- How are the influential factors described here used in the practice of consensus Delphi studies in the health sciences?
- Which conclusions can be drawn for the conduct of quality-assured Delphi procedures?
In addition to these three proven factors, there are indications of other factors that influence individual judgments and the overall results of Delphi studies, e.g., the effect of the time between rounds [43]. Still other factors, e.g., the effect of sponsors or members of a supervisory group on the overall result, have not yet been explicitly examined for Delphi studies. In general, publications that examine Delphi studies in terms of method have declined relative to Delphi primary studies [26]: Flostrand et al. [26] showed that the ratio of methods studies to Delphi primary studies was 1:1 in 1975 and 1:19 in 2016. Due to this lack of evidence, such factors will not be considered in this systematic review. Beyond the three proven factors, we identify general criteria for describing Delphi studies, including the Delphi variants and the definition of consensus. It must be noted that this review of research practice is based on publications of Delphi studies, even though a lack of clarity and sometimes even errors have repeatedly been shown to exist in such publications [5, 7, 14].