Method.
The Delphi method is a pragmatic process of community-engaged inquiry used to gain insight into a complex problem.30 This method systematically collects the opinions and lived experiences of content experts to create a collective judgement on the topic in question, often to inform policy and practice.30,31 Experts include both researchers and practitioners. The aim is to reach group consensus on the problem, accomplished through iterative rounds of inquiry and confidential feedback from respondents.32 As a participatory action research technique, the Delphi method involves participants at every stage of the process: survey design, interaction with others, interpretation of results, and application of results.33,34 The method must also serve the needs of participants, not just the researcher, with participants generating solutions and additional action items.33 This includes privileging participant opinions in designing the study, translating the results, and informing policy.34 Supplement A provides more methodological details.
Recruitment Procedures.
Recruitment took place in August 2018, and data were collected from August to December 2018. It was determined a priori that there would be three rounds of primary data collection: individual semi-structured interviews (Round 1) and two structured online surveys (Rounds 2 and 3). Consistent with Delphi practices, the procedure was iterative, with each round informed by the previous one. The first three rounds were conducted independently to ensure anonymity; only the researcher knew the identities of panelists. The study culminated in a debrief with all participants. Participants were invited to help co-design and/or pilot each round of the study; no one elected to do so. However, procedural suggestions made by participants during the Delphi study process did inform each round.
A maximum of 15 participants were sought, based on the suggestion that 8 to 15 panelists are ideal for a Delphi study.35 Because all participants must move through the process collectively, rolling recruitment was not possible. To allow for attrition, it was determined that a minimum of 10 participants would be necessary to begin the study. Using purposeful sampling,36 individuals were invited to the study based on study staff's knowledge that they would meet the experiential criteria (described below); additional participants were recommended by the initial pool of potential participants. Expertise warranting inclusion on the panel was determined by peer nomination (i.e., researchers and practitioners identified by the author nominated other researchers and practitioners as potential participants), screening questions on self-reported experience, and additional inquiry into details of experience during the Round 1 interview.
Three narrow criteria were used to nominate participants, with the expectation that each participant meet at least one: Criterion A) staff (e.g., physicians, nurses, behavioral health specialists, administrators) working currently or previously (for at least one year) at a clinic/practice that has integrated behavioral health and primary care after previously operating without an integrated system; Criterion B) consultants or technical assistance providers who had supported clinics integrating care for at least one year; Criterion C) researchers who had studied both implementation science and integrated care for at least one year. These criteria were selected to capture the three systemic levels of influence on implementation success within the framework that provided the conceptual basis for the R = MC2 determinants;22 in the Interactive Systems Framework (ISF),37 these correspond to actors within the delivery system (Criterion A), the support system (Criterion B), and the synthesis and translation system (Criterion C). Multiple criteria ensured a diversity of experience on the study panel, consistent with Delphi method best practices for dependability.
Data Collection Procedures.
Qualitative and quantitative data were collected simultaneously throughout the procedure. There were two aspects of data collection and analysis: primary data collection with iterative analysis (the Delphi study process) and secondary data collection (a post-debrief process evaluation survey) with post-study analysis. Interim analyses prioritized qualitative or quantitative data depending on the round, with both given equal weight in determining final study results (QUAL + QUAN38).
The first round was conducted as a one-on-one phone interview. This interview obtained background information on the participant and confirmed their eligibility for inclusion, gathered initial opinions on the determinant framework's relevance for integrated care, and elicited views on when during the process of implementing integrated care the determinants seem most relevant for success. Participants were sent an information sheet one week in advance of the interview. The sheet contained information about the study, the R = MC2 determinant framework's components and subcomponents, and information on two existing stage frameworks (one derived from implementation science, one derived from the integrated care literature).

Qualitative data from Round 1 provided the basis for Round 2. Based on data from Round 1, the implementation science stages were deemed more appropriate for the study. The surveys for Rounds 2 and 3 were structured into three parts by the determinant framework's components, with specific questions for each subcomponent. Within each item, the subcomponent was defined in terms of integrating behavioral health and primary care, followed by a selection of panelists' anonymized comments from the previous round. Questions for each item asked how important the subcomponent is for integrated care during each implementation stage (Exploration, Installation, Initial Implementation, and Full Implementation), with responses on a 7-point Likert scale from "totally unimportant" to "very important." A comment box encouraged participants to describe the rationale for their choice.

The study culminated in a debrief, which served several purposes. First, participants were provided with a preliminary results document; this transparency aids confirmability of the study's rigor.39 The document was created using the quantified consensus results from Round 3 in two tables, along with an overview of the study purpose, process, research questions, and potential implications. Second, the debrief gave panelists a chance to discuss openly their reactions to the results and the process and to generate potential implications. Third, the debrief allowed panelists to introduce themselves to each other and openly debate their opinions on the topic; their identities, experience levels, and professional affiliations had previously been kept confidential. Fourth, as part of the reflection, panelists could suggest next steps for the data.

Interviews were recorded and transcribed. As a source of secondary data, panelists were invited to complete a process evaluation survey after the debrief. This served as an additional form of member checking to ensure trustworthiness of the results.40
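To make the structure of the Round 2 and 3 ratings concrete, the sketch below shows one way such responses could be organized for analysis. The panelist identifiers, subcomponent name, ratings, and comments are hypothetical and are not drawn from the study's instrument or data.

```python
# A minimal sketch (hypothetical panelists, subcomponent, and ratings) of one
# way the Round 2/3 responses could be organized: one row per panelist x
# subcomponent x implementation stage, with the 7-point Likert rating
# (1 = "totally unimportant" ... 7 = "very important") and an optional comment.
import pandas as pd

ratings = pd.DataFrame(
    [
        {"panelist": "P01", "subcomponent": "Example subcomponent", "stage": "Exploration",
         "rating": 7, "comment": "Essential for securing buy-in early."},
        {"panelist": "P02", "subcomponent": "Example subcomponent", "stage": "Exploration",
         "rating": 6, "comment": ""},
        {"panelist": "P01", "subcomponent": "Example subcomponent", "stage": "Installation",
         "rating": 5, "comment": ""},
    ]
)

# Ordinal-friendly summaries for each determinant-by-stage cell.
summary = ratings.groupby(["subcomponent", "stage"])["rating"].agg(["count", "median", "min", "max"])
print(summary)
```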
Analyses.
There were three components of data analysis: process analysis, initial data analysis, and exploratory data analysis.
In initial analyses, the qualitative data were deductively analyzed using the Framework Method as a guide for coding and interpretation.41 After familiarization with recordings and transcripts, first cycle coding42 was conducted in NVivo with a priori deductive codes. For second cycle coding,42 data sections on validity and implications were organized into framework matrices by participant (rows) and first cycle codes (columns). Pattern coding42 was employed to interpret themes in validity-related perceptions, study implications, and additional analyses suggested by participants. Pattern coding is a second-cycle coding process and is distinct from pattern matching (a qualitative method of deriving causal relationships). Additionally, to assess validity, descriptive statistics from the process evaluation surveys were converged with the qualitative data.
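As an illustration of the framework-matrix step (the study itself used NVivo), the following sketch pivots coded excerpts into a matrix with participants as rows and first cycle codes as columns. The codes and excerpts shown are hypothetical.

```python
# A minimal sketch (hypothetical codes and excerpts; the study used NVivo) of a
# framework matrix: rows = participants, columns = first cycle codes,
# cells = the coded text assigned to that participant and code.
import pandas as pd

coded_excerpts = pd.DataFrame(
    [
        {"participant": "P01", "code": "validity", "excerpt": "The framework fits what we see in clinics."},
        {"participant": "P01", "code": "implications", "excerpt": "Could help sequence technical assistance."},
        {"participant": "P02", "code": "validity", "excerpt": "Some subcomponents overlap in practice."},
    ]
)

framework_matrix = coded_excerpts.pivot_table(
    index="participant",
    columns="code",
    values="excerpt",
    aggfunc=lambda texts: " | ".join(texts),  # concatenate multiple excerpts per cell
)
print(framework_matrix)
```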
Initial analyses identified a concern from participants that preliminary results were quantified only by the proportion of items rated at the highest end of the Likert scale ("important" or "very important"). Participants suggested addressing this by considering the results by the distribution of scores rather than solely by percent agreement, to capture whether determinants were also rated as "unimportant." This suggestion led to the exploratory analysis.
Exploratory analysis considered results by the distribution of scores rather than solely by consensus percentages. Broadening the data interpretation in this way shows strength of agreement differently: it indicates whether respondents perceived determinants as wholly unimportant or merely less important than other determinants at a given stage. Percent agreement is the most commonly applied metric of consensus in Delphi studies,43 which is why it was chosen for the preliminary analyses in Phase I. However, based on the participant suggestion to consider results in different ways, the Delphi literature was revisited. Agreement and internal reliability are valid metrics, in addition to consensus, for assessing Delphi results.35 In Delphi studies, descriptive statistics and graphical displays of data are appropriate for assessing internal validity.35 However, measures of dispersion (e.g., inter-quartile range, standard deviation) are not appropriate for ordinal Likert scale data32,35,44 and thus cannot be applied here. Variability (e.g., minimum, maximum, range), central tendency (e.g., median, mode), and frequency distributions (e.g., percentages) are alternative options for analyzing ordinal data. Therefore, final study results were gleaned from a matrix display of percent agreement on Likert scale ratings, range of ratings, and a frequency distribution (i.e., histogram).
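The summaries named above (percent agreement, range, median, and a frequency distribution) can be illustrated with a brief sketch; the ratings below are hypothetical and do not reproduce the study's data.

```python
# A minimal sketch (hypothetical ratings) of the ordinal summaries used in the
# matrix display: percent agreement (ratings of 6-7), range, median, and a
# frequency distribution for one determinant at one implementation stage.
import numpy as np

ratings = np.array([7, 6, 6, 7, 5, 6, 7, 6, 4, 7])  # one panel's 7-point Likert ratings

percent_agreement = np.mean(ratings >= 6) * 100                     # % rating 6 or 7
rating_range = (int(ratings.min()), int(ratings.max()))             # variability (no SD/IQR)
median_rating = float(np.median(ratings))                           # central tendency
frequencies = {k: int(np.sum(ratings == k)) for k in range(1, 8)}   # histogram counts

print(f"Percent agreement (6-7): {percent_agreement:.0f}%")
print(f"Range: {rating_range}, Median: {median_rating}")
print(f"Frequency distribution: {frequencies}")
```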
Descriptive statistics are not generally subject to systematic interpretive methods. However, to prevent bias and inconsistencies in translating the quantitative results into practice implications, the study employed an iterative coding process to make inferences from the matrix display.42 Two coders first collaboratively determined the method for extracting relevancy (Table 1) from the descriptive statistics and visual display, then independently categorized each determinant, and finally reviewed their category labels and resolved discrepancies. Relevancy was defined as "how important this determinant seems at this stage." Using fuzzy set theory (specifically type 2 fuzzy sets45), five relevancy categories were defined (Table 1): highly relevant, relevant, less relevant, irrelevant, and variable relevancy/more information needed. Coding benchmarks were constructed for each relevancy category. To retain the integrity of participants' perceptions rather than the coders' perceptions, both coders were blinded to the items within the matrix and applied the relevancy category labels based solely on the coding benchmark and definition. After labels were applied for all items, category labels were compared and discrepancies were resolved via discussion to ensure consistent interpretation across items.
Table 1
Interpretative coding definitions
| Category | Benchmark | Definition |
| --- | --- | --- |
| Highly relevant | Clear consensus (75–100% of participants rated 6–7 on the Likert scale); narrow range; high median; left-skewed distribution. | Participants indicated that this determinant is very important for implementation success at this stage. |
| Relevant | Moderate consensus (50–74% of participants rated 6–7 on the Likert scale); narrow range; high median; distribution somewhat left-skewed. | Participants indicated that this determinant is important for implementation success at this stage. |
| Less relevant | Low consensus (0–49% of participants rated 6–7 on the Likert scale); and/or medium-to-wide range; and/or distribution approximately normal. | Participants indicated that this determinant is less important for implementation success at this stage. |
| Irrelevant | Right-skewed distribution; and/or median of 1, 2, or 3 on the 7-point Likert scale; narrow range. | Participants indicated that this determinant is not important for implementation success at this stage. |
| Variable relevancy, more information needed | Unclear due to wide range, abnormal distribution, or high discrepancy between consensus percentage and other benchmarks. | Participants did not clearly indicate whether this determinant is important for implementation success at this stage. |
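To illustrate how the Table 1 benchmarks could be operationalized, the sketch below applies simplified decision rules to a set of ratings for one determinant at one stage. The numeric cut-offs for "narrow range" and "high median" are assumptions made for illustration, the skewness criteria are omitted, and in the study the categories were applied by two human coders inspecting the matrix display rather than by code.

```python
# A simplified, hypothetical operationalization of the Table 1 benchmarks.
# The numeric cut-offs and the omission of the skewness checks are assumptions
# for illustration; in the study the categories were applied by two coders
# inspecting the matrix display, not by code.
import numpy as np

def relevancy_category(ratings):
    """Assign a relevancy label to one determinant-by-stage set of ratings."""
    ratings = np.asarray(ratings)
    pct_6_7 = np.mean(ratings >= 6) * 100         # percent agreement on ratings of 6-7
    rating_range = ratings.max() - ratings.min()  # proxy for "narrow" vs. "wide" range
    median = np.median(ratings)

    if pct_6_7 >= 75 and rating_range <= 2 and median >= 6:
        return "Highly relevant"
    if 50 <= pct_6_7 < 75 and rating_range <= 3 and median >= 5:
        return "Relevant"
    if median <= 3 and rating_range <= 2:
        return "Irrelevant"
    if pct_6_7 < 50 and rating_range <= 4:
        return "Less relevant"
    return "Variable relevancy, more information needed"

print(relevancy_category([7, 7, 6, 7, 6, 6, 7, 7, 6, 6]))  # -> Highly relevant
print(relevancy_category([2, 3, 1, 2, 3, 2, 2, 3, 1, 2]))  # -> Irrelevant
```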