The literature search generated a list of 1,186 relevant publications, from which a total of 882 unique emails were identified and extracted. In six weeks, from the 4th of October to the 15th of November 2018, a total of eighty-one valid responses were received from international experts, a response rate of 9.2%. Valid responses were those that completed all survey sections, responded to all statements, and answered all questions in each section. We also received seven invalid responses, which started the survey but did not complete all sections, or left some statements or questions unanswered.
Experts' Agreement on GRASP Criteria
On average, the eighty-one respondents strongly agreed with the eight closed-ended statements regarding the evaluation criteria of the GRASP framework, with an overall mean of 4.35 on a five-point Likert scale. Respondents strongly agreed with six of the eight statements, somewhat agreed with one, and were neutral about the remaining one. Table 2 shows the average agreement score, standard deviation, and 95% confidence interval for each of the eight closed-ended statements. Figure 3 shows the averages and distributions of respondents' agreement on each statement. The country distribution of the respondents is shown in Table 7 in the Appendix.
Table 2: Average Scores, Standard Deviations, and Confidence Intervals of Expert Respondents' Agreement with the GRASP Framework Evaluation Criteria

| SN | Statement | Mean Score | Meaning | SD | 95% CI |
|----|-----------|------------|---------|----|--------|
| 1 | Predictive Performance: We should consider the evidence on validating the tool's predictive performance. | 4.88 | Strongly Agree | 0.43 | [4.87, 4.88] |
| 2 | Evidence Levels on Predictive Performance: The evidence level could be High (internal + multiple external validation), Medium (internal + external validation once), or Low (internal validation only). | 4.44 | Strongly Agree | 0.87 | [4.44, 4.45] |
| 3 | Usability: We should consider the evidence on the tool's usability. | 4.68 | Strongly Agree | 0.70 | [4.67, 4.68] |
| 4 | Potential Effect: We should consider the evidence on the tool's potential effect. | 4.62 | Strongly Agree | 0.68 | [4.61, 4.62] |
| 5 | Usability is Higher: The evidence level on tools' usability should be considered higher than the evidence level on tools' potential effect. | 2.96 | Neither Agree nor Disagree | 1.23 | [2.95, 2.97] |
| 6 | Impact: We should consider the evidence on the tool's impact on healthcare effectiveness, efficiency, or safety. | 4.78 | Strongly Agree | 0.57 | [4.78, 4.79] |
| 7 | Evidence Levels of Post-Implementation Impact: The evidence level could be High (based on experimental studies), Medium (observational studies), or Low (subjective studies). | 4.18 | Somewhat Agree | 1.14 | [4.17, 4.19] |
| 8 | Evidence Direction: Based on the conclusions of published studies, the overall evidence direction could be Positive, Negative, or Mixed. | 4.25 | Strongly Agree | 0.78 | [4.25, 4.26] |
|  | Overall Average | 4.35 | Strongly Agree | 1.01 | [4.349, 4.354] |
Experts' Comments, Suggestions, and Recommendations
Of the eighty-one valid respondents, sixty-four (almost 80%) provided suggestions or discussed recommendations for two or more of the six open-ended free-text questions. These questions asked experts for feedback on adding, removing, or changing any of the GRASP framework evaluation criteria; on defining and capturing successful predictive performance when different clinical predictive tasks have different predictive requirements; and on managing conflicting evidence across studies that vary in quality and specifications. Using a coding strategy to quantify the experts' feedback, responses were classified in relation to each of the GRASP framework evaluation criteria. Responses were coded as positive when they suggested adding to the existing criteria or positively rewording some of them, and as negative when they suggested removing existing criteria or negatively rewording them. The strength of each piece of feedback, from less important to important, was coded on a score from 1 to 10. In addition to quantifying the qualitative feedback, the authors worked together to decide which suggestions were most valuable, feasible, and essential to include in the GRASP framework evaluation criteria.
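The coding strategy described above can be sketched as a simple tally. This is a minimal illustration, not the authors' actual codebook: the field names (`criterion`, `sentiment`, `strength`) and the example comments are hypothetical.

```python
from collections import defaultdict

def summarise_feedback(coded_comments):
    """Tally coded expert comments per GRASP evaluation criterion.

    Each comment is a dict with (hypothetical) fields:
      criterion: the criterion it addresses (e.g. "usability")
      sentiment: +1 (add to / positively reword the criteria)
                 or -1 (remove / negatively reword them)
      strength:  importance score from 1 (less important) to 10 (important)
    """
    summary = defaultdict(lambda: {"positive": 0, "negative": 0, "strength": 0})
    for comment in coded_comments:
        entry = summary[comment["criterion"]]
        entry["positive" if comment["sentiment"] > 0 else "negative"] += 1
        entry["strength"] += comment["strength"]
    return dict(summary)

# Two hypothetical coded comments on the "usability" criterion
coded = [
    {"criterion": "usability", "sentiment": +1, "strength": 8},
    {"criterion": "usability", "sentiment": -1, "strength": 3},
]
tally = summarise_feedback(coded)
```

Aggregating per criterion in this way makes it straightforward to compare the volume and strength of positive versus negative feedback for each part of the framework.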
Predictive Performance and Performance Levels
Most of the respondents, 59/64, noted that the method, type, and quality of internal and external validation studies should be reported in the GRASP framework detailed report. When external validation studies are conducted multiple times, using different patient populations, in different healthcare settings, at different institutions, in different countries, over different times, or by different researchers, the tool is said to have a broad validation range, meaning it can be used more reliably across these varied healthcare settings. Some of the respondents, 21/64, said that a tool's predictive performance is considered stable and reliable when multiple external validation studies produce homogeneous predictive performances, e.g. similar sensitivities and specificities. They also discussed adding the concept of "Strength of Evidence", which should be based mainly on the quality of the reported study and how closely the study conditions match the original specifications of the predictive tool, in terms of clinical area, population, and target outcomes. It should be one of the components used to decide the direction of evidence (positive, negative, or mixed). It should also be reported in the detailed GRASP framework report, so that users can consider it when selecting among two or more tools of the same assigned grade. For example, if two predictive tools are assigned grade C1 (each externally validated multiple times) but one shows strong positive evidence and the other medium or weak positive evidence, it is logical to select the tool with the stronger evidence if both have similar predictive performances for the same tasks.
Usability and Potential Effect
Most of the respondents, 49/64, noted that the methods and quality of the usability studies and the potential effect studies should be reported in the GRASP framework detailed report. Some of the respondents, 19/64, pointed out that potential effect and usability are not measured during implementation; rather, they are measured while planning for implementation, before wide-scale implementation. They also suggested that the details on the potential effect should report whether the focus is on clinical patient outcomes, healthcare outcomes, or provider behaviour. Most of the respondents, 51/64, said that potential effect is more important than usability and should have a higher evidence level. This feedback came mainly from respondents who gave a low rating, "strongly disagree" or "disagree", to statement Q6: "The evidence level on tools' usability should be considered higher than the evidence level on tools' potential effect". They argued that a highly usable tool that has no potential effect on healthcare is useless, while a less usable tool with a promising potential effect is surely better. Some respondents, 11/64, suggested that evaluating both the potential effect and the usability together should be considered higher evidence than either alone.
Post-Implementation Impact and Impact Levels
Most of the respondents, 57/64, noted that the method and quality of the post-implementation impact study should be reported in the GRASP framework detailed report. Again, some respondents, 13/64, discussed adding the concept of "Strength of Evidence": within each evidence level of the post-implementation impact there could be several sub-levels, or at least a classification of study quality. For example, not all observational studies are equal in quality; a case series is very different from a case-control study or a large-scale prospective cohort study. Experimental studies could likewise have sub-levels of evidence, for example quasi-experimental studies vs. randomised controlled trials. These sub-levels should be included in the GRASP framework detailed report when reporting the individual studies; this will provide the reader with more detail on the strength and quality of the evidence on the tools.
Direction of Evidence
Most of the respondents, 47/64, said that the direction of evidence should take the quality and strength of evidence into account. Most respondents here used the terms "quality of evidence" and "strength of evidence" synonymously. Many respondents, 38/64, said that the quality or strength of evidence should consider many elements of the published study, such as the methods used, the appropriateness of the population and settings, the clinical practice, the sample size, the type of data collection (retrospective vs. prospective), the outcomes, the institute of the study, and any other quality measures. When multiple studies reach conflicting conclusions, the direction of evidence depends mainly on the quality and strength of the evidence.
Defining and Capturing Predictive Performance
Some of the respondents, 24/64, noted that predictive performance evaluation depends fundamentally on the intended prediction task, so it differs from one tool to another according to the task each tool performs. The clinical condition under prediction and the cost-effectiveness of treatment also strongly influence the evaluation, as do the actions recommended on the basis of the tool. For example, screening tools should perform with high sensitivity, high negative predictive value, and a low negative likelihood ratio, since a further level of checking by clinicians or other tests follows; diagnostic tools, in contrast, should perform with high specificity, high positive predictive value, and a high positive likelihood ratio, since decisions are based directly on the tool's outcomes, and some of these decisions may be risky to the patient or expensive to the healthcare organisation. A few of the respondents, 7/64, noted that for diagnostic tools predictive performance is more naturally expressed through sensitivity and specificity, while for prognostic tools it is better expressed through probability or risk estimation. Predictive tools must always be adjusted to the settings, populations, and intended tasks before their adoption and implementation in clinical practice.
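The screening versus diagnostic trade-offs above follow from the standard 2×2 confusion-matrix definitions. A minimal sketch with hypothetical counts (the function name, field names, and cohort numbers are illustrative):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 confusion-matrix metrics for a predictive tool."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "lr_pos": sensitivity / (1 - specificity),  # helps rule a condition in
        "lr_neg": (1 - sensitivity) / specificity,  # helps rule a condition out
    }

# Hypothetical validation cohort: 90 true positives, 10 false negatives,
# 80 true negatives, 20 false positives
m = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
```

Under this framing, a screening-oriented tool aims for high sensitivity, high NPV, and a low negative likelihood ratio, while a diagnostic tool aims for high specificity, high PPV, and a high positive likelihood ratio.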
Managing Conflicting Evidence
Some of the respondents, 27/64, said that resolving conflicting evidence should consider the quality of each study, or the strength of its evidence, in deciding the overall direction of evidence. Relevant measures include whether the study methods are appropriate, whether the population and settings are appropriate, whether the study was conducted in clinical practice, whether the sample size is large, whether data collection was prospective rather than retrospective, whether the outcomes are clearly reported, whether the institute conducting the study is credible, whether the study involved multiple sites or hospitals, and any other quality measures related to the methods or the data. We should rely primarily on conclusions from high-quality, low-risk-of-bias studies, as recommended in other fields, e.g. systematic reviews. A well-designed and well-conducted study should carry more credibility than a poorly designed and conducted one. If different results are obtained for sub-populations, this should be further investigated and explained; the predictive tool may perform well only in certain sub-populations, depending on the intended tasks. Evidence from settings outside the target population of the tool should carry less weight, such as non-equivalent studies, which are conducted to validate a tool for a different population, predictive task, or clinical setting. Much of the important information lies in the details of this variability, so it is important to report it in the framework detailed report, providing as much detail as possible for each reported study to help end users make more accurate decisions based on their own settings, intended tasks, target populations, practice priorities, and improvement objectives.
Updating the GRASP Framework
Based on the respondents' feedback on both the closed-ended evaluation criteria agreement statements and the open-ended suggestions and recommendations questions, the GRASP framework concept was updated, as shown in Figure 4. Regarding Phase C, the pre-implementation phase, which includes the evidence on predictive performance evaluation: the three levels of internal validation, external validation once, and external validation multiple times were additionally labelled "Low Evidence", "Medium Evidence", and "High Evidence", respectively. Phase B, "During Implementation", has been renamed "Planning for Implementation". Potential effect is now assigned a higher evidence level than usability, and evidence of both potential effect and usability together is higher than either alone. There are now three levels of evidence: B1 = both potential effect and usability are reported, B2 = potential effect evaluation is reported, and B3 = usability testing is reported. Figure 5 in the Appendix shows a clean copy of the updated GRASP framework concept.
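The updated Phase B levels reduce to a simple decision rule; a minimal sketch, where the function name and the simplification of each input to a single boolean are illustrative rather than part of the published framework:

```python
def phase_b_grade(potential_effect_reported, usability_reported):
    """Assign the updated GRASP Phase B (Planning for Implementation) level."""
    if potential_effect_reported and usability_reported:
        return "B1"  # both together: highest Phase B evidence
    if potential_effect_reported:
        return "B2"  # potential effect alone outranks usability alone
    if usability_reported:
        return "B3"
    return None      # no Phase B evidence reported
```

The ordering of the branches encodes the respondents' consensus that potential effect carries more weight than usability, and that both together carry more weight than either alone.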
The GRASP framework detailed report was also updated, as shown in Table 5 in the Appendix. More details were added to the predictive tools information section, such as the internal validation method; dedicated support from research networks, programs, or professional groups; the total citations of the tool; the number of studies discussing the tool; the number of authors; the sample size used to develop the tool; and the name and impact factor of the journal that published the tool. Table 6 in the Appendix shows the Evidence Summary. This summary table provides users with more information, in a structured format, on each study discussing the tools, whether a study of predictive performance, usability, potential effect, or post-implementation impact. Information includes the study name, country, year of development, and phase of evaluation. The evidence summary also provides quality-related information, such as the study methods, population and sample size, settings, practice, data collection method, and study outcomes. Furthermore, it provides information on the strength of evidence, plus a label highlighting the most prominent or important predictive functions, potential effects, or post-implementation impacts of the tools.
We developed a new protocol to decide on the strength of evidence. The strength of evidence protocol considers two main criteria of the published studies. Firstly, it considers the degree of matching between the evaluation study conditions and the original tool specifications, in terms of the predictive task, target outcomes, intended use and users, clinical specialty, healthcare settings, target population, and age group. Secondly, it considers the quality of the study, in terms of the sample size, data collection, study methods, and credibility of institute and authors. Based on these two criteria, the strength of evidence is classified into 1) Strong Evidence: matching evidence of high quality, 2) Medium Evidence: matching evidence of low quality or non-matching evidence of high quality, and 3) Weak Evidence: non-matching evidence of low quality. Figure 8 in the Appendix shows the strength of evidence protocol.
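The strength-of-evidence protocol above amounts to a two-criterion decision rule. A minimal sketch, simplifying each criterion to a single boolean (the published protocol weighs several sub-factors per criterion, so this compression is an assumption of the example):

```python
def strength_of_evidence(matches_tool_spec, high_quality):
    """Classify one study per the strength-of-evidence protocol.

    matches_tool_spec: study conditions match the tool's original
        specifications (predictive task, target outcomes, intended use
        and users, clinical specialty, settings, population, age group).
    high_quality: study quality is high (sample size, data collection,
        study methods, credibility of institute and authors).
    """
    if matches_tool_spec and high_quality:
        return "Strong Evidence"
    if matches_tool_spec or high_quality:
        return "Medium Evidence"   # one criterion met, the other not
    return "Weak Evidence"
```

Note that the two "Medium Evidence" cases (matching but low quality, and non-matching but high quality) collapse into one branch, exactly as in the protocol's classification.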
The GRASP Framework Reliability
The two independent researchers assigned grades to the eight predictive tools and produced a detailed report on each of them. The grades they assigned, compared to the grades assigned by the authors of this study, are summarised in Table 3. More detailed information on the justification of the assigned grades is shown in Table 8 in the Appendix. The Spearman's rank correlation coefficient was 0.994 (p<0.001) comparing the first researcher to the authors, 0.994 (p<0.001) comparing the second researcher to the authors, and 0.988 (p<0.001) comparing the two researchers to each other. This shows a statistically significant and strong correlation, indicating strong interrater reliability of the GRASP framework. Accordingly, the GRASP framework produced reliable and consistent grades when used by independent users. In their feedback to the five open-ended questions, after assigning grades to the eight tools, both independent researchers found the GRASP framework design logical, easy to understand, and well organised. They both found GRASP useful, given the variability in tools' quality and levels of evidence, and easy to use. They both thought the criteria used for grading were logical, clear, and well structured. They did not wish to add, remove, or change any of the criteria; however, they asked for some definitions and clarifications to be added to the evaluation criteria of the GRASP framework, which were included in the framework update. A screenshot of the post-task questionnaire given to the two independent reviewers is shown in the Appendix.
Table 3: Grades Assigned by the Two Independent Researchers and the Paper Authors
| Tool | Grading by Researcher 1 | Grading by Researcher 2 | Grading by Paper Authors |
|------|-------------------------|-------------------------|--------------------------|
| Centor Score [42] | B2 | B3 | B3 |
| CHALICE Rule [43] | B2 | B2 | B2 |
| Dietrich Rule [44] | C0 | C0 | C0 |
| LACE Index [45] | C1 | C1 | C1 |
| Manuck Scoring System [46] | C2 | C2 | C2 |
| Ottawa Knee Rule [47] | A1 | A2 | A1 |
| PECARN Rule [48] | A2 | A2 | A2 |
| Taylor Mortality Model [49] | C3 | C3 | C3 |
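Interrater agreement of this kind is computed as Spearman's rank correlation over ordinal encodings of the assigned grades. A minimal pure-Python sketch, assuming a hypothetical ordinal mapping of the GRASP grades (the exact encoding used in the paper's analysis is not stated here, so no attempt is made to reproduce the reported 0.988 figure):

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical ordinal encoding of the grades appearing in Table 3
ORDER = {"A1": 8, "A2": 7, "B2": 6, "B3": 5, "C1": 4, "C2": 3, "C3": 2, "C0": 1}
researcher_1 = ["B2", "B2", "C0", "C1", "C2", "A1", "A2", "C3"]
researcher_2 = ["B3", "B2", "C0", "C1", "C2", "A2", "A2", "C3"]
rho = spearman_rho([ORDER[g] for g in researcher_1],
                   [ORDER[g] for g in researcher_2])
```

With only eight tools and two single-step disagreements, any monotone encoding yields a correlation close to 1, which is consistent with the strong interrater reliability reported above.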