Twenty individuals responded to the email invitations. Two respondents were deemed ineligible due to a lack of firsthand guideline development experience, and 18 interviews were conducted, ranging in length from approximately 30 to 80 minutes. Five participants were male, and 13 were female. Half of the participants had between five and ten years of experience in evidence synthesis; five participants had between ten and 20 years of experience; two had more than 20 years of experience, and two had less than five years of experience. Ten participants’ current primary affiliation was with a university, seven with a government body, and one with the private sector. Eleven participants were based in Australia, with the remaining seven based in either the United States or the United Kingdom. In addition to NICE and NHMRC, organizations represented included the Agency for Healthcare Research and Quality (AHRQ, United States), the Joanna Briggs Institute (JBI, Australia), the World Health Organization (WHO), and private consultancies. No participants withdrew from the study, and no repeat interviews were required.
Coding of the interview transcripts was distributed consistently across the Diffusion of Innovations thematic framework. Following initial coding (Stage 1), Compatibility was the most prominently discussed theme across all participants. Relative Advantage and Observability also received substantial attention in participants’ discussions, though to a lesser extent than Compatibility. Trialability and Complexity were the least discussed themes among all participants.
All participants discussed their values as guideline developers at length, both within the context of potential use of automation and independently of it. Emphasis on the Compatibility theme was consistently far stronger than on any of the other four themes in the deductive framework applied in this analysis. Examples of these values included a “rigorous” approach to the evaluation of evidence and careful construction of questions. With respect to automation specifically, participants highlighted a need for human and organizational involvement.
“how you synthesize it, how you pull it together is kind of key” Participant 3
“I think it would be a shame if humans weren’t involved in [synthesis].” Participant 9
Two sub-themes were identified within the Compatibility theme that further detailed participants’ desire to match new practices with the values that underpin current practices: ability to double check, and transparency as accountability.
Ability to double check
Most participants indicated the importance of the ability to double check the output of automation with human researcher input. These discussions often cited the rationale that current practices usually involve a human double checking the work of another human and posited that newer workflows should therefore maintain this pattern.
Some, but not all, participants indicated that reproducibility was the underlying reason for the double-checking status quo. Participants’ views might differ if rigorous research were to alter overall perceptions of the reproducibility of automated screening and extraction; this is discussed further as a contextual factor in subsequent sections.
“I can see it could be done. But surely it would need to be checked by someone anyway. Because even if it’s done by a human with vast experience, it’s always important to have a second person to check it.” Participant 5
“At the minute the standard is for two operators. So you’d want it to have been checked by a second method, if not person. So that would be my only thing – reproducibility.” Participant 7
Transparency as accountability
Several participants wanted to ensure that any automation methods used in synthesizing evidence were freely accessible and open to examination. Many emphasized that they are accountable to stakeholders who need to be sure that no information has been missed, and therefore require the ability to freely examine the methods used, including any automation.
Trustworthiness of evidence is integral to the professional culture of guideline development and was emphasized by the participants. In their view, trustworthiness and the methods to verify it therefore extend to new tools that use automation, in the form of transparency and validation.
“A group of experts can apply judgement to that body of evidence, and needs to know they can trust the evidence that you’d found.” Participant 12
“The key part of working with a face to face committee ... Is you have they have to have total confidence in what the technical team has done” Participant 16
When discussing Relative Advantage, participants focused on the freeing up of human resources, and to a lesser extent on time and cost saving. When prompted to discuss ML directly (in contrast to general views of evidence synthesis and guideline development approaches), participants tended to discuss ideas relating to the Relative Advantage of automation more frequently. Participants were interested in freeing time and money, but only on the condition that the automation matched perceived human quality.
Freeing up human resources
The primary advantage specified in discussion with participants was the potential to free up human resources for rededication to additional tasks within the health evidence ecosystem.
“In research time is always limited and you know there’s never enough grant money to help employ staff … by having a machine do it, it would be cost-effective, and spare the researchers’ time to do other research-related tasks.” Participant 17
Time and cost saving
Some participants also identified that automation might potentially save time and/or save money. Strikingly, no participants indicated an openness to any trade-off between accuracy and time.
“No matter how quickly a guideline’s done, everybody always wants it faster and to be of high quality. So anything that can improve on that would be welcome, I think.” Participant 11
Participants communicated that they would like to see evidence prior to implementing new practices, as well as a sustained ability to cross-examine the behavior of the technologies.
Need for evidence
The need for rigorously produced, disseminated, and easily accessed evidence was clear in the data. Several participants expressed an openness to automation being integrated into evidence synthesis, on the condition that accuracy has been demonstrated.
“I think at the moment it has a potentially high level of risk of being incorrect. But I don’t really know enough about it. I’d need to be convinced about it I think to consider it.” Participant 9
“If the whole process were done by some machine or machine learning application, I think it would need to be properly trialed.” Participant 5
“As long as there was clear data to support that ... machine-learning is a reliable method, but you know, better than or equal to humans doing it.” Participant 17
One notable outlier indicated they were already convinced of automation’s abilities within the specific context of screening. This unusual case raises the possibility that these results would be different given further evidence production and dissemination.
“I do think it’s been well demonstrated for the screening aspects, for the hit rates of what gets included and what doesn’t, and how correct it is.” Participant 11
Personal need for double-checking
Participants often wanted an established and ongoing method of observing the inner workings of the ML processes, frequently described as a desire to “check” what ML had done. This is similar to the Compatibility sub-theme of ability to double check, but the two are distinct: the Compatibility sub-theme captured participants’ belief that the ability to check methods should be available as a matter of principle, whereas this Observability sub-theme, personal need for double-checking, captures their desire to do such checking themselves.
This need to be able to continually check how the machine learning has processed information could be interpreted as a desire to maintain control over the evidence synthesis process. As previously discussed, guideline developers must convince other stakeholders of their recommendations’ integrity, so personal quality control fits in with the cultural expectations of guideline development.
“The thing that’s sort of a little bit distressing from a novice point of view with machine-learning is not feeling like I have a way to check it… I’d need some way to be confident …. [I’d need] a way to check the algorithms” Participant 3
Complexity and Trialability
Some participants identified that the learning process would need to be simple if researchers were to adopt ML. They also expressed a preference for the familiar over the novel.
“Whenever you try and really change things, I think there’s a degree of skepticism anyway...I think that might just be the nature of human beings.” Participant 9
“If they have to learn the process, and if it’s hard, then that sort of discourages them.” Participant 18
“So unless the technology offers a value add that’s substantial enough to overcome the learning curve…however much time it takes to do that has to not be more time than you’re gonna save.” Participant 3
Upon re-examination of how the data informed the deductive framework (described in Step 5 of the Methods section), several contextual factors were identified.
Participant familiarity with automation
Participants nearly always offered disclaimers prior to commenting, indicating that they felt they did not have sufficient experience with automation technologies to comment at their desired level of expertise. These data were of particular interest, as they demonstrated a current lack of robust knowledge about the capabilities of automation within the target population.
“I’ve done a very little bit with machine-learning.” Participant 3
“It’s just my concern would be that I’ve not had any experience with it.” Participant 7
“I haven’t had much to do with machine-learning. Like I’ve kind of heard about it” Participant 17
“I think that’s something I have no personal experience with” Participant 11
“To be honest I actually haven’t had much experience with it” Participant 8
“Yeah, I don’t know, I don’t really understand that process.” Participant 5
Overall skepticism towards Machine Learning
Overall skepticism or mistrust towards automation, both towards current technologies and anticipated future ones, was clear in participants’ contributions. They particularly expressed doubt over the ability of a machine to mimic the human judgement calls they felt were currently essential to well-formulated health guidelines.
“It would be very difficult to train a machine to make the sort of value decisions that we have to make” Participant 10
“I’m still a bit nervous about some of the interpretation of that…it just might be a distrust about it, I think?” Participant 13
“How can a computer apply judgement? ...There’s judgement required when it comes to things like quality or – they are not things I expect to be evidence that could be accurate.” Participant 12
“I don’t think it could fully replace a human ... I think there can be subtleties between how things can interact... I think there’s always going to be some sort of human element.” Participant 9
“I don’t know if we’re there yet. Maybe we’ll get to the point where we can do that, but to do that, like quality rating, or to do a level of evidence, or strength of evidence… I mean there’s still a lot of value judgements in that. And I don’t know how much machine learning could help with that at this point.” Participant 3