The research team created video recordings of simulated pharmacy consultations. Pharmacists and a pharmacy technician were involved, with medical actors playing the part of patients. The simulated consultations focused on long-term conditions and acute presentations in a primary care setting, for example a medication review in a care home, a post-discharge asthma review, a type 2 diabetes medication review, a urinary tract infection and knee pain. Fifteen recorded simulated consultations were selected for the validation study, representing the three levels of practice in the MR-CAT: below expectations (n = 5), competent (n = 5) and excellent (n = 5). A further three recordings were created specifically to train participants in how to use the MR-CAT.
Educators who train pharmacy professionals for advanced practice roles, and who had themselves completed prior consultation skills training, were invited to participate in the study. These participants (hereafter ‘raters’) completed training in which they were first familiarised with the content of the MR-CAT and given the opportunity to learn how to use the tool.
Following this, raters were asked to independently view and assess, using the MR-CAT, the three simulated video-recorded patient consultations created for the training component of the validation work. They then attended a second training session in which they discussed their rating of each consultation and the rationale behind it, to ensure all raters understood how to apply the MR-CAT.
After completing the training, raters independently assessed the 15 pre-recorded simulated consultations using the MR-CAT. Raters were blinded to the levels of practice demonstrated in the video recordings. Key descriptors within each section of the MR-CAT support skill discrimination [see Table 2]. Raters recorded their ratings via an online survey tool using a unique identifier.
Two rounds of data collection were completed. In the first round (January 2020), the raters independently assessed the 15 pre-recorded simulated consultations using the MR-CAT. The second round of data collection (March 2020) assessed intrarater reliability (test-retest analysis): participants independently re-assessed a sub-sample (n = 6) of the original 15 recordings approximately eight weeks later, comprising two recordings from each level of practice (two below expectations, two competent and two excellent).
At each stage, raters submitted section and global ratings to the research team via an online survey tool. Data were downloaded from the survey platform and analysed using the Statistical Package for the Social Sciences (SPSS) version 25 (SPSS Inc., Chicago, IL) and Stata version 14 (StataCorp, College Station, TX). A range of statistical tests was used to assess the discriminant validity, intrarater and interrater reliability, and internal consistency of the MR-CAT. Two-tailed p-values < 0.05 were considered statistically significant.
Initially, Cronbach’s alpha was used to evaluate internal consistency, that is, how closely a set of items in a tool are related as a group. The global rating and each of the section ratings of the MR-CAT (each a three-point ordinal scale) were entered into the analysis. A high value of Cronbach’s alpha was taken to indicate relatively high internal consistency (reliability).
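To illustrate the calculation (the study’s own analysis was performed in SPSS and Stata), the Python sketch below computes Cronbach’s alpha from a hypothetical ratings table; the six-item layout, data values and variable names are invented for illustration and are not taken from the study.

```python
# Hedged sketch: Cronbach's alpha for a hypothetical MR-CAT ratings table.
# Each row is one rater-consultation observation; each column is one item
# (section or global rating) coded 1-3. Item count and data are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.integers(1, 4, size=90)  # shared signal so the items correlate
items = pd.DataFrame(
    {f"item_{i}": np.clip(base + rng.integers(-1, 2, size=90), 1, 3)
     for i in range(1, 7)}
)

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Standard formula: alpha = k/(k-1) * (1 - sum(item var) / var(total))."""
    k = df.shape[1]
    item_variances = df.var(axis=0, ddof=1)
    total_variance = df.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```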
To explore the extent to which the MR-CAT could discriminate between consultations of different quality, that is, between consultations that were below expectations, competent or excellent, a Kruskal-Wallis test with post-hoc Wilcoxon rank-sum analysis was used. This compared raters’ global ratings across the consultations that had a priori been classified as below expectations, competent or excellent, testing for statistically significant differences between the grouped consultation types; means and standard deviations were reported descriptively.
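A minimal sketch of this comparison, assuming invented global ratings grouped by the a priori level, is given below; in a full analysis the post-hoc p-values would typically also be adjusted for multiple comparisons.

```python
# Hedged sketch: Kruskal-Wallis test on global ratings grouped by the a
# priori consultation level, followed by pairwise post-hoc Wilcoxon
# rank-sum tests. All ratings below are invented for illustration.
from itertools import combinations
from scipy import stats

global_ratings = {  # 1 = below expectations, 2 = competent, 3 = excellent
    "below expectations": [1, 1, 2, 1, 1, 2, 1, 1],
    "competent":          [2, 2, 2, 3, 2, 2, 1, 2],
    "excellent":          [3, 3, 2, 3, 3, 3, 3, 2],
}

h, p = stats.kruskal(*global_ratings.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon rank-sum tests between each pair of levels.
for (name_a, a), (name_b, b) in combinations(global_ratings.items(), 2):
    z, p_pair = stats.ranksums(a, b)
    print(f"{name_a} vs {name_b}: z = {z:.2f}, p = {p_pair:.4f}")
```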
The degree to which raters awarded similar ratings when observing the same consultations was investigated (interrater reliability). Each rater’s global and section ratings on the MR-CAT were ranked across the 15 simulated consultations, and Kendall’s coefficient of concordance (W) was calculated to assess the degree of agreement between raters’ ranked ratings at each level of practice.
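The sketch below shows one way to compute Kendall’s W from a raters-by-consultations matrix; the matrix is invented, and the simple formula omits the correction for tied ranks that a full analysis of three-point ordinal ratings would normally apply. The same calculation can be repeated within each level-of-practice subset, as described above.

```python
# Hedged sketch: Kendall's coefficient of concordance (W) for agreement
# between raters' rankings. The rating matrix is invented, and the tie
# correction is omitted for brevity.
import numpy as np
from scipy.stats import rankdata

# Rows = raters; columns = the 15 consultations; values = global ratings.
ratings = np.array([
    [1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2],
    [1, 2, 1, 1, 1, 2, 2, 3, 2, 2, 3, 3, 2, 3, 3],
    [1, 1, 1, 2, 1, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3],
])

m, n = ratings.shape                                # m raters, n consultations
ranks = np.apply_along_axis(rankdata, 1, ratings)   # rank within each rater
rank_sums = ranks.sum(axis=0)
s = ((rank_sums - rank_sums.mean()) ** 2).sum()
w = 12 * s / (m ** 2 * (n ** 3 - n))
print(f"Kendall's W = {w:.2f}")  # 0 = no agreement, 1 = perfect agreement
```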
The extent to which raters produced consistent ratings when applying the MR-CAT to the same simulated consultation at two time points was investigated (intrarater reliability). Ratings from Time 1 (the original application of the tool) and Time 2 (eight weeks later) were compared: Spearman’s correlation coefficients (rho) were calculated for each section of the MR-CAT using the rank orders of the ordinal data.
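As a final illustration, the sketch below computes Spearman’s rho for one hypothetical rater’s Time 1 and Time 2 ratings of the six re-assessed recordings; all values are invented.

```python
# Hedged sketch: test-retest (intrarater) reliability via Spearman's rho
# between one rater's Time 1 and Time 2 ratings of the same recordings.
from scipy.stats import spearmanr

time1 = [1, 1, 2, 2, 3, 3]  # ratings of the six re-assessed recordings
time2 = [1, 2, 2, 2, 3, 3]  # the same recordings roughly eight weeks later

rho, p = spearmanr(time1, time2)
print(f"Spearman's rho = {rho:.2f} (p = {p:.4f})")
```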