Validating and Updating GRASP: An Evidence-Based Framework for Grading and Assessment of Clinical Predictive Tools

Background: When selecting predictive tools, clinicians are challenged with an overwhelming and ever-growing number, most of which have never been implemented or evaluated for comparative effectiveness. The authors developed an evidence-based framework for grading and assessment of predictive tools (GRASP). The objective of this study is to update GRASP and evaluate its reliability. Methods: A web-based survey was developed to collect the responses of a wide international group of experts who had published studies on clinical prediction tools. Experts were invited via email, and their responses were quantitatively and qualitatively analysed using NVivo software. The interrater reliability of the framework, in assigning grades to eight predictive tools by two independent users, was evaluated. Results: We received 81 valid responses. On a five-point Likert scale, experts overall strongly agreed with the GRASP evaluation criteria (4.35/5, SD=1.01, 95%CI [4.349, 4.354]). Experts strongly agreed with six criteria: predictive performance (4.88/5, SD=0.43, 95%CI [4.87, 4.88]), evidence levels of predictive performance (4.44/5, SD=0.87, 95%CI [4.44, 4.45]), usability (4.68/5, SD=0.70, 95%CI [4.67, 4.68]), potential effect (4.62/5, SD=0.68, 95%CI [4.61, 4.62]), post-implementation impact (4.78/5, SD=0.57, 95%CI [4.78, 4.79]), and evidence direction (4.25/5, SD=0.78, 95%CI [4.25, 4.26]). Experts somewhat agreed with one criterion: post-implementation impact levels (4.18/5, SD=1.14, 95%CI [4.17, 4.19]). Experts were neutral about one criterion, that usability is a higher evidence level than potential effect (2.96/5, SD=1.23, 95%CI [2.95, 2.97]). Sixty-four respondents provided recommendations to six open-ended questions regarding updating the evaluation criteria. Forty-three suggested that potential effect is a higher evidence level than usability.

decisions [22,23]. Some clinicians, especially those developing clinical guidelines, search the literature for the best available published evidence. Commonly, they look for research studies that describe the development, implementation, or evaluation of predictive tools. More specifically, some clinicians look for systematic reviews of predictive tools, comparing their predictive performance or development methods. However, there are no available approaches to objectively summarise or interpret such evidence in order to identify the most appropriate predictive tool(s) for a given application while considering various implementation challenges and clinical setting constraints [24][25][26].

The GRASP Framework
To overcome this major challenge, the authors have developed a new evidence-based framework for grading and assessment of predictive tools (The GRASP Framework) [27]. This framework aims to provide clinicians with standardised objective information on predictive tools to support their search for and selection of effective tools for their tasks. Based on the critical appraisal of the published evidence on predictive tools, the GRASP framework uses three dimensions to grade predictive tools: 1) Phase of Evaluation, 2) Level of Evidence and 3) Direction of Evidence.

Phase of Evaluation
A grade of A, B, or C is assigned based on the highest phase of evaluation reported in the published literature. If a tool's predictive performance has been tested for validity, it is assigned phase C. If a tool's usability and/or potential effect have been tested, it is assigned phase B. Finally, if a tool has been implemented in clinical practice, and there is published evidence evaluating its impact, it is assigned phase A.

Level of Evidence
A numerical score, within each phase, is assigned based on the level of evidence associated with each tool. A tool is assigned grade C1 if it has been tested for external validity multiple times; C2 if it has been tested for external validity only once; and C3 if it has been tested only for internal validity. C0 means that the tool did not show sufficient internal validity to be used in clinical practice. Similarly, B1 is assigned to a predictive tool that has been evaluated during implementation for its usability, while B2 is assigned if it has been studied for its potential effect on clinical effectiveness, patient safety, or healthcare efficiency. Finally, if a predictive tool has been implemented and then evaluated after implementation for its impact on clinical effectiveness, patient safety, or healthcare efficiency, it is assigned A1 if there is at least one experimental study of good quality evaluating its impact, A2 if there are observational studies evaluating its impact, and A3 if the impact has been evaluated through subjective studies, such as expert panel reports. The final grade assigned to a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports a positive conclusion. The GRASP framework concept is shown in Fig. 1 and the GRASP framework detailed report is presented in
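To make the grading logic concrete, the following is a minimal sketch, in Python, of how the original GRASP grades could be derived from a structured summary of a tool's published evidence. The `EvidenceSummary` structure and its field names are illustrative assumptions, not part of the framework, and the sketch ignores the direction of evidence for simplicity.

```python
from dataclasses import dataclass

@dataclass
class EvidenceSummary:
    # Illustrative fields summarising the published evidence on one tool (assumed names)
    internally_valid: bool = False          # internal validation showed sufficient performance
    external_validations: int = 0           # number of external validation studies
    usability_tested: bool = False          # Phase B: usability evaluated
    potential_effect_tested: bool = False   # Phase B: potential effect evaluated
    impact_study: str = "none"              # Phase A: "experimental", "observational", "subjective", or "none"

def grasp_grade(e: EvidenceSummary) -> str:
    """Assign a grade from the highest phase of evaluation and its level of evidence
    (original, pre-update version: B1 = usability, B2 = potential effect)."""
    # Phase A: post-implementation impact (highest phase)
    if e.impact_study == "experimental":
        return "A1"
    if e.impact_study == "observational":
        return "A2"
    if e.impact_study == "subjective":
        return "A3"
    # Phase B: during implementation
    if e.usability_tested:
        return "B1"
    if e.potential_effect_tested:
        return "B2"
    # Phase C: before implementation (predictive performance)
    if e.external_validations > 1:
        return "C1"
    if e.external_validations == 1:
        return "C2"
    if e.internally_valid:
        return "C3"
    return "C0"

print(grasp_grade(EvidenceSummary(internally_valid=True, external_validations=2)))  # C1
```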

Study Objectives
Updating clinical instruments, healthcare models, and evaluation frameworks through the feedback of experts is a well-established approach, especially in the area of CDS [12,28,29]. A mixed approach of qualitative and quantitative methods has proved useful in healthcare research because of the complexity of the topics studied [30]. Using open-ended questions in quantitative surveys adds significant value and depth to both the results and the conclusions of the studies conducted [31]. The aim of this study is to update the GRASP framework and evaluate its initial reliability. The primary objective is to update the criteria used by the GRASP framework for grading and assessment of predictive tools, through the feedback of a wide international group of healthcare experts in the areas of developing, implementing, and evaluating clinical decision support systems and predictive tools. The secondary objective is to evaluate the initial reliability of this preliminary version of the GRASP framework, to ensure that the outcomes produced by independent users, when grading predictive tools using the framework, are consistent and reliable.

Methods
The Study Design
The study is composed of two parts: the first updates the GRASP framework and the second evaluates the framework's reliability. For the first part, a survey was designed to solicit the feedback of experts on the criteria used by the GRASP framework for grading and assessment of predictive tools. The main outcome of this part of the study is to measure the experts' agreement and update the design and content of the GRASP framework. The analysis includes evaluating the degree of agreement of experts on how essential the different criteria used to grade predictive tools are, across the three dimensions: phases of evaluation (before, during, and after implementation), levels of evidence, and directions of evidence within each phase. In addition, experts' feedback on adding, removing, or updating any of the grading criteria, together with their further suggestions and recommendations, was also analysed and considered. Figure 2 shows a flow diagram of the study design.
Based on similar studies validating and updating systems through surveying expert users, it was estimated that the required sample size for this study was around fifty experts [32-35]. Experts were identified as researchers who had published at least one paper on developing, implementing, or evaluating predictive tools and clinical decision support systems. Expert researchers were not required to be practicing clinicians; rather, they should have experience in evidence-based methods and in clinical decision support development and evaluation approaches. To search for such researchers and publications, the concepts of Clinical Decision Support, Clinical Prediction, Developing, Validating, Implementing, Evaluating, Comparing, Reviewing, Tools, Rules, Models, Algorithms, Systems, and Pathways were used. The search engines used included MEDLINE, EMBASE, CINAHL, and Google Scholar. To ensure the currency of email addresses, the search was restricted to the last three years.
The study was approved by the Human Research Ethics Committee, Faculty of Medicine and Health Sciences, Macquarie University, Sydney, Australia, on the 4th of October 2018. The authors expected the distribution of the survey to take two weeks and the collection of feedback to take another four weeks, with a response rate of around 10%. Before the survey was deployed, pilot testing was conducted by asking ten experts to take the survey. The feedback from the pilot testing was used to improve the survey design and content; some questions were rephrased, some were rearranged, and some were supported with definitions and clarifications. Experts who participated in the pilot testing were excluded from participation in the final survey. An invitation email, introducing the study objectives, the estimated survey completion time of 20 minutes, and a participation consent, was sent to the identified experts with the link to the online survey. A reminder email was sent after two weeks to experts who had not responded or had not completed the survey.

The Study Survey
The online survey was developed using the Qualtrics experience management platform [36]. The online survey, as illustrated in the screenshots in the Appendix, included eight five-point Likert-scale closed-ended agreement questions and six open-ended suggestions and recommendations questions, distributed over seven sections. The introduction informed participants about the aim of developing the GRASP framework, its design, and the task they were requested to complete; it also informed them that they could request feedback and acknowledgement, and provided contacts for further information or complaints. The second section asked experts about their level of agreement with the evaluation of the published evidence on the tools' predictive performance before implementation, such as internal and external validation. The third section asked experts about their level of agreement with the evaluation of the published evidence on the tools' usability and/or potential effect on healthcare during implementation.
The fourth section asked experts about their level of agreement with the evaluation of the published evidence on the tools' post-implementation impact on clinical effectiveness, patient safety, or healthcare efficiency. The fifth section asked experts about their level of agreement with the evaluation of the direction of the published evidence on the tools, whether positive, negative, or mixed.
Experts were also requested to provide free-text feedback on adding, removing, or changing any of the criteria used for the assessment of phases of evaluation, levels of evidence, or directions of evidence. The sixth section asked experts to provide free-text feedback on the best methods to define and capture successful predictive performance, given that different clinical prediction tasks have different predictive performance requirements. Experts were also asked about managing conflicting evidence when there is variability in the quality and/or subpopulations of the published studies.

Reliability Testing
The second part of this study, evaluating the framework's reliability, followed the completion of the first part and used the validated and updated version of the GRASP framework. Two independent and experienced researchers were trained, for four hours each by the authors, on using the framework to grade predictive tools. The two researchers hold PhD degrees in closely related health disciplines, have several relevant publications, and have long experience in evidence-based methods and systematic reviews. The researchers were asked to grade eight different predictive tools independently, using the validated and updated version of the GRASP framework, the full-text studies describing the development of the tools, and the comprehensive list of all the published evidence on each tool along with the full text of each study. The objective of this part of the study was to measure the reliability of the framework when used by independent users to grade predictive tools. Since the tested function of the GRASP framework here is grading tools, interrater reliability was the best measure to evaluate its reliability. The eight predictive tools were selected, before conducting the reliability testing, after the authors had examined and assigned grades to thirty candidate predictive tools using the validated and updated version of the GRASP framework. The eight tools were selected for their diversity: they cover eight of the ten designed grades of the framework, from Grade C0 up to Grade A1, which should support the validity of the interrater reliability evaluation. Interrater reliability, also called interrater agreement, interrater concordance, or interobserver reliability, is the degree of agreement, or the score of how much homogeneity or consensus there is, in the ratings given by independent reviewers [37]. Since the target ratings of the GRASP framework are ordinal, correlation testing is an appropriate method for estimating interrater reliability with the available collected data. Spearman's rank correlation coefficient is a nonparametric correlation estimator; it is widely used in the applied sciences and is reported to be a robust measure of correlation [38]. This coefficient tests for a monotonic relationship, while in practice we may also be interested in capturing concealed statistical relationships; therefore, some researchers prefer to add divergence measures, which was not feasible with the available collected data [39]. After grading the tools, the two independent researchers were asked to provide their open-ended feedback. Through a short five-question survey, they were asked whether the GRASP framework design was logical, whether they found it useful and easy to use, their opinion of the criteria used for grading, and whether they wished to add, remove, or change any of them.
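As an illustration of this analysis, and assuming a mapping of the ordinal GRASP grades to ranks (the mapping and the example grades below are demonstration assumptions, not study data), Spearman's rank correlation between two raters could be computed as follows:

```python
from scipy.stats import spearmanr

# Map ordinal GRASP grades to ranks (C0 lowest ... A1 highest); this ordering is an assumption
GRADE_RANK = {"C0": 0, "C3": 1, "C2": 2, "C1": 3, "B3": 4, "B2": 5, "B1": 6, "A3": 7, "A2": 8, "A1": 9}

# Hypothetical grades assigned to eight tools by two independent raters
rater_1 = ["C0", "C3", "C2", "C1", "B2", "B1", "A2", "A1"]
rater_2 = ["C0", "C3", "C2", "C2", "B2", "B1", "A2", "A1"]

rho, p_value = spearmanr([GRADE_RANK[g] for g in rater_1],
                         [GRADE_RANK[g] for g in rater_2])
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.4f}")
```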

Analysis and Outcomes
Three major outcomes were planned. Firstly, through the eight closed-ended agreement questions of the survey, the average scores and distributions of experts' opinions on the different criteria used by the GRASP framework to grade and assess predictive tools should help to improve those criteria. A five-point Likert scale ranging from strongly agree to strongly disagree was used, where the first was assigned a score of five and the last a score of one, to translate qualitative values into quantitative measures for the analysis [40,41]. Secondly, the six open-ended free-text questions gave experts the opportunity to suggest adding, removing, or updating any of the framework criteria. The qualitative analysis should help categorise such suggestions and recommendations into specific information and should also support updating the framework's design and detailed content. The qualitative data analysis was conducted using the NVivo Version 12.3 software package [42]. Thirdly, interrater reliability testing was designed to measure how accurate and consistent the grading of eight predictive tools by two independent researchers was, compared to each other and to the grading of the same tools by the authors.
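As a minimal sketch of this scoring step, assuming the exact Likert labels shown and using fabricated placeholder responses rather than study data, the mean, standard deviation, and 95% confidence interval for one question could be computed as follows:

```python
import numpy as np
from scipy import stats

# Map the five-point Likert labels to scores (strongly agree = 5 ... strongly disagree = 1)
LIKERT_SCORE = {"Strongly agree": 5, "Somewhat agree": 4, "Neutral": 3,
                "Somewhat disagree": 2, "Strongly disagree": 1}

# Placeholder responses to one closed-ended question (not actual study data)
responses = ["Strongly agree", "Somewhat agree", "Strongly agree", "Neutral", "Strongly agree"]
scores = np.array([LIKERT_SCORE[r] for r in responses], dtype=float)

mean = scores.mean()
sd = scores.std(ddof=1)
# 95% confidence interval for the mean using the t distribution
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=stats.sem(scores))
print(f"mean = {mean:.2f}, SD = {sd:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```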

Results
The literature search generated a list of 1,186 relevant publications. A total of 882 unique emails were identified and extracted from the publications. In six weeks, from the 4th of October to the 15th of November 2018, a total of eighty-one valid responses were received from international experts, with a response rate of 9.2%.

Experts Agreement on GRASP Criteria
The overall average agreement of the eighty-one respondents to the eight closed-ended questions was 4.35, meaning the respondents overall strongly agreed with the criteria of the GRASP framework. Respondents strongly agreed with six of the eight closed-ended agreement questions regarding the criteria used by the GRASP framework for evaluating predictive tools, somewhat agreed with one, and were neutral about another. Table 1 shows the average agreement of the respondents on each of the eight closed-ended questions and Figure 3 shows the averages and distributions of respondents' agreement on each question. The country distribution of the respondents is shown in Table 6 in the Appendix.

Usability and Potential Effect
Respondents suggested that the methods and quality of the usability studies and the potential effect studies should be reported in the GRASP framework detailed report. Some respondents noted that the potential effect and usability are not measured during implementation; rather, they are measured while planning for implementation, that is, before wide-scale implementation.
They also suggested that the details on the potential effect should report whether the focus is on clinical patient outcomes, healthcare outcomes, or provider behaviour. Most respondents said that the potential effect is more important than usability and should have a higher evidence level: a highly usable tool that has no potential effect on healthcare is useless, while a less usable tool with a promising potential effect is surely better. Some respondents argued that evaluating both the potential effect and the usability should be considered higher evidence than evaluating either alone.
Post-Implementation Impact and Impact Levels
Respondents suggested that the method and quality of the post-implementation impact study should be reported in the GRASP framework detailed report. Again, respondents discussed adding the

Direction of Evidence
Respondents discussed that the direction of evidence should consider the quality and strength of the evidence. Most respondents used the terms "quality of evidence" and "strength of evidence" synonymously. Respondents noted that the quality or strength of evidence should consider many elements of a published study, such as the methods used, the appropriateness of the population and settings, the clinical practice, the sample size, the type of data collection (retrospective vs prospective), the outcomes, the institute conducting the study, and any other quality measures.
The direction of evidence depends largely on the quality of the evidence, especially when there are conflicting conclusions from multiple studies.

Defining and Capturing Predictive Performance
Respondents discussed that the evaluation of predictive performance depends fundamentally on the intended prediction task, so it differs from one tool to another based on the task each tool performs. A tool may only perform well in certain subpopulations, depending on its intended tasks. Evidence from settings outside the target population of the tool, such as non-equivalent studies (conducted to validate a tool for a different population, predictive task, or clinical setting), should carry less weight in supporting the tool. Much of the important information lies in the details of this evidence variability, so it is important to report it in the framework detailed report, providing as much detail as possible for each reported study, to help end users make more accurate decisions based on their own settings, intended tasks, target populations, practice priorities, and improvement objectives.

Updating the GRASP Framework
Based on the respondents' feedback on both the closed-ended evaluation criteria agreement questions and the open-ended suggestions and recommendations questions, the GRASP framework concept was updated, as shown in Figure 4. Regarding Phase C, the pre-implementation phase covering the evidence on predictive performance evaluation, the three levels of internal validation, external validation once, and external validation multiple times were additionally assigned "Low Evidence", "Medium Evidence", and "High Evidence" labels, respectively. Phase B, During Implementation, has been renamed "Planning for Implementation". Potential effect is now assigned a higher evidence level than usability, and the evidence of both potential effect and usability together is higher than either alone. There are now three levels of evidence: B1 = both potential effect and usability are reported, B2 = potential effect evaluation is reported, and B3 = usability testing is reported. Figure 5 in the Appendix shows a clean copy of the updated GRASP framework concept.
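A minimal sketch of the updated Phase B level assignment is shown below; the function and argument names are illustrative assumptions rather than part of the framework.

```python
from typing import Optional

def phase_b_level(potential_effect_reported: bool, usability_reported: bool) -> Optional[str]:
    """Updated GRASP Phase B ('Planning for Implementation') levels:
    B1 = both potential effect and usability, B2 = potential effect only, B3 = usability only."""
    if potential_effect_reported and usability_reported:
        return "B1"
    if potential_effect_reported:
        return "B2"
    if usability_reported:
        return "B3"
    return None  # no Phase B evidence reported

print(phase_b_level(potential_effect_reported=True, usability_reported=False))  # B2
```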
The GRASP framework detailed report was also updated, as shown in Table 4 in the Appendix. More details were added to the predictive tool information section, such as the internal validation method; dedicated support from research networks, programs, or professional groups; the total citations of the tool; the number of studies discussing the tool; the number of authors; the sample size used to develop the tool; and the name and impact factor of the journal that published the tool. Table 5 in the Appendix shows the Evidence Summary. This summary table provides users with more structured information on each study discussing a tool, whether a study of predictive performance, usability, potential effect, or post-implementation impact. The information includes the study name, country, year of development, and phase of evaluation. The evidence summary also provides quality-related information, such as the study methods, population and sample size, settings, practice, data collection method, and study outcomes. Furthermore, the evidence summary provides information on the strength of evidence and a label to highlight the most prominent or important predictive functions, potential effects, or post-implementation impacts of the tools.
We developed a new protocol to decide on the strength of evidence. The strength of evidence protocol considers two main criteria of the published studies. Firstly, it considers the degree of matching between the evaluation study conditions and the original tool specifications, in terms of the predictive task, target outcomes, intended use and users, clinical specialty, healthcare settings, target population, and age group. Secondly, it considers the quality of the study, in terms of the sample size, data collection, study methods, and credibility of institute and authors. Based on these two criteria, the strength of evidence is classified into 1) Strong Evidence: matching evidence of high quality, 2) Medium Evidence: matching evidence of low quality or non-matching evidence of high quality, and 3) Weak Evidence: non-matching evidence of low quality. Figure 8 in the Appendix shows the strength of evidence protocol.
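The protocol reduces to a simple two-criteria rule. The sketch below is an assumed encoding in which "matching" and "high quality" are treated as boolean judgements made by the expert user reviewing each study:

```python
def strength_of_evidence(matches_original_specs: bool, high_quality: bool) -> str:
    """Classify a published study per the strength of evidence protocol:
    matching + high quality -> Strong; exactly one of the two -> Medium; neither -> Weak."""
    if matches_original_specs and high_quality:
        return "Strong Evidence"
    if matches_original_specs or high_quality:
        return "Medium Evidence"
    return "Weak Evidence"

print(strength_of_evidence(matches_original_specs=True, high_quality=False))  # Medium Evidence
```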

The GRASP Framework Reliability
The two independent researchers assigned grades to the eight predictive tools and produced a detailed report on each of them. A summary of the grades assigned by the two independent researchers, compared to the grades assigned by the authors of this study, is shown in Table 2. More detailed information on the justification of the assigned grades is shown in Table 7 in the Appendix.
The Spearman's rank correlation coefficient was 0.994 (p<0.001) comparing the first researcher to the authors, 0.994 (p<0.001) comparing the second researcher to the authors, and 0.988 (p<0.001) comparing the two researchers to each other. This shows a statistically significant and strong correlation, indicating strong interrater reliability of the GRASP framework. Accordingly, the GRASP framework produced reliable and consistent grades when used by independent users. Providing their feedback to the five open-ended questions after assigning the grades to the eight tools, both independent researchers found the GRASP framework design logical, easy to understand, and well organised. They both found GRASP useful, considering the variability of tools' quality and levels of evidence, and easy to use. They both thought the criteria used for grading were logical, clear, and well structured, and did not wish to add, remove, or change any of them. However, they asked for some definitions and clarifications to be added to the evaluation criteria of the GRASP framework, which were included in the update of the framework.

Discussion
Brief Summary
It is a challenging task for most clinicians to critically evaluate a growing number of predictive tools,

Predictive Performance
The internal validation of the predictive performance of a tool is essential to make sure the tool performs the prediction task as designed [25,51]. Predictive performance is evaluated using measures of discrimination and calibration [52]. Discrimination refers to the ability of the tool to distinguish between patients with and without the outcome under consideration, while calibration refers to the accuracy of the prediction and shows how much the predicted and observed outcomes agree [53]. Discrimination is usually measured through sensitivity, specificity, and the area under the curve (AUC) [54]. Calibration, on the other hand, can be summarised using the Hosmer-Lemeshow test or the Brier score [55]. The external validation of predictive tools is essential to establish their reliability and generalisability [56]. Predictive tools are more reliable and trustworthy not only when their predictive performance is better, but more importantly when they undergo high-quality, multiple, and wide-ranging external validation [57]. The quality is usually reflected in the type and size of the data samples used in the validation, while repeating the external validation on different patient populations, at different institutions, in different healthcare settings, and by different researchers shows higher reliability of the predictive tools [25].
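For illustration, the snippet below computes two of the measures named above, the AUC for discrimination and the Brier score for calibration, using scikit-learn on fabricated predicted probabilities and observed outcomes; it is a sketch of the measures themselves, not the validation procedure of any particular tool.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Fabricated example: observed outcomes (1 = event) and predicted probabilities from a tool
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.10, 0.25, 0.70, 0.85, 0.30, 0.60, 0.20, 0.90, 0.55, 0.40])

auc = roc_auc_score(y_true, y_prob)        # discrimination: area under the ROC curve
brier = brier_score_loss(y_true, y_prob)   # calibration-related: mean squared error of the predictions
print(f"AUC = {auc:.2f}, Brier score = {brier:.3f}")
```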

Usability and Potential Effect
In addition to the predictive performance, clinicians are usually interested in learning about the potential effects of the tools on improving patient outcomes, saving time, costs, and resources, or supporting patient safety [2,58]. They need to know more about the expected impact of using the tool on different healthcare aspects, processes, or outcomes, assuming the tool has been successfully implemented in clinical practice [59,60]. If a CDS tool has little potential to improve healthcare processes or clinical outcomes, it will not be easily adopted or successfully implemented in clinical practice [61]. Some clinicians might also be interested in the usability of predictive tools; that is, whether these tools can be used by the specified users to achieve specified and quantifiable objectives in the specified context of use [62,63]. CDS tools with poor usability will eventually fail, even if they provide the best performance or potential effect on healthcare [7,64]. Usability includes several measurable criteria, based on the perspectives of the stakeholders, such as the mental effort needed, user attitude, interaction, ease of use, and acceptability of systems [65,66]. Usability can also be evaluated by measuring the effectiveness of task management with accuracy and completeness, the efficiency of resource utilisation, and the users' satisfaction, comfort, and positive attitudes towards the use of the tools [67,68], in addition to learnability, memorability, and freedom from errors [69,70].

Post-Implementation Impact
Clinicians are interested to learn about the post-implementation impact of CDS tools, on different healthcare aspects, processes, and outcomes, before they consider their implementation in the clinical practice [71][72][73]. The most interesting part of the impact studies for clinicians is the effect size of the CDS tools and their direct impact on physicians' performance and patients' outcomes [74,75].
Clinicians consider that high quality experimental studies, such as randomised controlled trials, are the highest level of evidence, followed by observational well-designed cohort or case-control studies and lastly subjective studies, opinions of respected authorities, and reports of expert committees or panels [76][77][78]. For many years, experimental methods have been viewed as the gold standard for evaluation, while observational methods were considered to have little or no value. However, this ignores the limitations of randomised controlled trials, which may prove unnecessary, inappropriate, inadequate, or sometimes impossible. Furthermore, high-quality observational studies have an important role in comparative effectiveness research because they can address issues that are otherwise difficult or impossible to study. Therefore, we need to understand the complementary roles of the two approaches and appreciate the scientific rigour in evaluation, regardless of the method used [79,80].

Direction of Evidence and Conflicting Conclusions
It is not uncommon to encounter conflicting conclusions when a predictive tool is validated, or implemented and evaluated, in different patient subpopulations or for different prediction tasks or outcomes [81,82]. The cut-off value that determines what good predictive performance is, for example, depends not only on the clinical condition under consideration but largely on the requirements, conditions, and consequences of the decisions made accordingly [83]. One of the main challenges here is dealing with the huge variability in the quality, types, and conditions of studies published in the literature. This variability makes it impossible to synthesise the different measures of predictive performance, usability, potential effect, or post-implementation impact into simple quantitative values, as in meta-analyses or systematic reviews [84,85].

The GRASP Framework Overall
The grades assigned to predictive tools, using the GRASP framework, provide relevant evidence-based information to guide the selection of predictive tools for clinical decision support. However, the framework is not meant to be precisely prescriptive. An A1 tool is not always and absolutely better than an A2 tool. A clinician may prefer an A2 tool showing improved patient safety in two observational studies rather than an A1 tool showing reduced healthcare costs in three experimental studies. It all depends on the objectives and priorities the users are trying to achieve, through implementing and using predictive tools in their clinical practice. More than one predictive tool could be endorsed, in clinical practice guidelines, each supported by its requirements and conditions of use and recommended for its most prominent outcome of predictive performance, potential effect, or post-implementation impact on healthcare and clinical outcomes.
The GRASP framework remains a high-level approach to provide clinicians with an evidence-based and comprehensive, yet simple and feasible, method to evaluate and select predictive tools.
However, when clinicians need further information, the framework detailed report provides them with the required details to support their decision making. The GRASP framework is designed for two levels of users: 1) expert users, such as healthcare researchers experienced in evidence-based evaluation methods, who will use the framework to critically appraise published evidence, assign grades to predictive tools, and report their details; and 2) end users, such as clinicians and healthcare professionals responsible for selecting tools for implementation in their clinical practice or for recommendation in clinical practice guidelines, who will use the GRASP framework detailed reports and assigned grades, produced by the expert users, to compare existing predictive tools and select the most suitable tools for their intended tasks.
Challenges, Limitations, and Future Work
Analysing the feedback of experts using open-ended questions is rather difficult [86]. Qualitative content and thematic analysis of free-text feedback is challenging, since extracting significance becomes more difficult with diverse opinions, different experiences, and variable perspectives [87]. Many healthcare researchers advise using Delphi techniques, through successive rounds of feedback, to reach consensus among experts when developing clinical guidelines or selecting evaluation criteria and indicators [88][89][90]. However, many researchers recommend that Delphi panels include between ten and fifty members, as this count is more appropriate and realistic, considering the large amount of data collected and the multiple rounds of analysis each panellist generates [91]. Based on our expectation that the number of respondents would be almost a hundred, and due to time and resource limitations, we preferred a single round of an online survey of experts' feedback.
Even though we contacted a large number of experts (882) in the area of developing, implementing, and evaluating predictive tools and CDS systems, we received only 81 valid responses, a very low response rate of 9.2%. This response rate might have been improved if participants had been motivated by incentives beyond acknowledging their participation in the study, or if more support had been provided through the organisations these participants belong to, which would have required considerably more resources to coordinate. To keep the survey feasible for busy experts, the numbers of closed-ended and open-ended questions were kept limited and the time required to complete the whole survey was kept to around 20 minutes. However, some participants might have been willing to provide more detailed feedback, through interviews for example; this was out of the scope of this study and was not feasible to conduct with all the invited experts, and would likely have resulted in a much lower response rate.
To evaluate the impact of the GRASP framework on clinicians' decisions and examine the application of the framework to grade predictive tools, the authors are currently working on two more studies.
The first study should evaluate the impact of using the framework on improving the decisions made by end-user clinicians and healthcare professionals regarding selecting predictive tools for their clinical tasks. Through an online survey of a wide international group of clinicians and healthcare professionals, the study should compare the performance and outcomes of selection decisions made with and without using the framework. The first study should also evaluate the usability and usefulness of the GRASP framework on a wider scale, seeking the feedback of over a hundred clinicians and healthcare professionals, which should give a more accurate and reliable measure of usability and usefulness than the feedback currently provided by only two researchers. The second study aims to apply the framework to a large, consistent group of predictive tools used for the same clinical prediction task. This study should show how the framework provides clinicians with an evidence-based method to compare, evaluate, and select predictive tools, through grading and reporting tools based on the critical appraisal of their published evidence.

Conclusion
The GRASP framework grades predictive tools based on the critical appraisal of the published evidence across three dimensions: 1) Phase of evaluation; 2) Level of evidence; and 3) Direction of evidence. The final grade of a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports a positive conclusion. The GRASP framework aims to provide clinicians with a high-level, evidence-based, and comprehensive, yet simple and feasible, approach to evaluate and compare clinical predictive tools, considering their predictive performance before implementation, potential effect and usability during planning for implementation, and post-implementation impact on healthcare processes and clinical outcomes.
To enable end-user clinicians and clinical practice guideline developers to access the detailed information, reported evidence, and assigned grades of predictive tools, it is essential to discuss implementing the GRASP framework as an online platform. However, keeping such a grading system up to date is a challenging task, as it requires continuous updating of the predictive tools' grades and assessments as new evidence is published and becomes available. It is important to discuss using automated or semi-automated methods for searching and processing new information to keep the GRASP framework updated. Furthermore, we recommend that the GRASP framework be used by working groups of experts within professional organisations to grade predictive tools, in order to provide consistent results and increase reliability and credibility for end users. These professional organisations should also support disseminating such evidence-based information on predictive tools, in a similar way to announcing and disseminating new updates of clinical practice guidelines.
Finally, the GRASP framework is still in the development phase. This is a first-pass experiment to update the framework and its evaluation criteria, and further work will confirm what else needs to be added, removed, or changed. We can only claim its validity and reliability after it has been used, over a period of time, by real expert users and end users, who will use the framework to grade predictive tools and then use the assigned grades and detailed reports to select the most effective tools.

Supplementary Files
This is a list of supplementary files associated with this preprint: Appendix.pdf