Consensus Analysis of Panel Responses and Psychometric Properties in Script Concordance Testing

Background: Clinical reasoning is an essential attribute in the teaching, learning, and assessment of undergraduate medical education. In using the Script Concordance Test (SCT) to foster clinical reasoning, expert panel members' responses are created first. There is no agreement on how to optimize the panel members' responses. Our study aimed to develop and validate an SCT and to test the utility of the consensus index and the panel response pattern. Methods: The methodology was an evolving process of constructing SCTs, administering them to the panel members, and optimizing the panel with the response pattern and the consensus index. The SCT's final items were then administered to the students. Item-total correlation and Cronbach's alpha were calculated from the students' scores. Results: Our study developed an SCT with 98 items, which was administered to 20 panel members. The mean score of the panel members on these 98 items was 79.5 (+/- 4.4 SD). On optimizing with the panel responses, 14 items had a uniform response pattern and 2 had bimodal response patterns. The consensus index calculated for the 98-item SCT ranged from 25.81 to 100. When the 16 items with bimodal and uniform response patterns were eliminated, the consensus index ranged from 58.65 to 100. We administered this 82-item SCT to 30 undergraduate and 10 postgraduate students. The mean score of undergraduate students was 61.1 (+/- 7.5 SD) and that of postgraduate students was 67.7 (+/- 6.3 SD); the difference was statistically significant on an independent t-test. Cronbach's alpha for this 82-item SCT was 0.74. On analysing the item-total correlation, 22 items had a correlation of less than 0.05. Excluding these 22 poor items, the final SCT instrument of 60 items had a Cronbach's alpha of 0.82. Conclusion: The consensus index can be used to optimize the items for panel responses in SCT. Our study revealed that items with a consensus index above 60 had good item-total correlation with good internal consistency. Our study also revealed that the panel response clustering pattern could be used to categorize the items, though bimodal and uniform distribution patterns need further differentiation.


Background
Clinical reasoning is an essential attribute in the teaching, learning, and assessment of undergraduate medical education. The Script Concordance Test (SCT) is a tool to assess clinical reasoning (1,2). Various studies have shown the SCT's utility in fostering and assessing clinical reasoning among medical undergraduates (3)(4)(5). In the Script Concordance Test model, expert panel members' responses are created first, and the students are then evaluated and scored against them. The degree of concordance of the student's response with that of the expert panel decides the student's credit. However, SCTs are rarely used in undergraduate medical education in India.
Despite the presence of guidelines for constructing an SCT (6), there is no uniform agreement on how to optimize the panel members' responses. In a 2015 study, Wan et al. (3) clustered the panel responses and identified four patterns to characterize the panel members' agreement. Robert Gagnon et al. (7) analysed the 'outlier', 'distance-from-mode', and 'judgment-by-experts' methods of optimizing the panel responses for the Script Concordance Test. The consensus index is a measure of dispersion applied to determine the agreement in any given data on an ordinal scale (8). Though it can theoretically assess the agreement between panel members' responses, so far no study has tested the utility of the consensus index in evaluating panel member responses in an SCT.
Our study aimed to develop and validate a Script Concordance Test in the domains of diagnosis, investigation, and management for ENT undergraduates. We also collated the panel responses and tested the utility of the consensus index and the panel response pattern in assessing agreement between the panel members when evaluating the SCT.

Materials And Methods
The present study was carried out in the Department of ENT at the Jawaharlal Institute of Postgraduate Medical Education and Research (JIPMER), Puducherry, India, a tertiary care teaching hospital, to develop Script Concordance Testing to foster clinical reasoning among undergraduate medical students. The Institutional Ethics Committee of JIPMER approved the study (JIP/IEC/2014/9/460). All methods were carried out in accordance with the guidelines and regulations of JIPMER. The methodology was designed as an evolving process: constructing SCTs, administering them to the panel members, and analysing the panel with the response pattern and the consensus index, based on which the SCT's final items were chosen to be administered to the students. Item-total correlation and Cronbach's alpha were calculated from the students' scores.

Construction of the Scripts
The study used the guidelines put forth by Fournier et al. (6) for the construction of script concordance tests. The case scenarios were written by two ENT specialists, each with more than five years' experience in ENT practice and undergraduate teaching. The authors designed an SCT comprising 26 clinical scripts, each with three to six items, for a total of 98 items. The developed 98-item SCT is attached as Supplementary File 1. The items were set at the standard of the undergraduate curriculum. Each script was designed to reflect the common ENT conditions undergraduate students encounter and learn during their clinical postings. The scripts covered rhinology, otology, and laryngology and were developed to promote clinical reasoning in the domains of diagnosis, investigation, and management for an undergraduate student.

Construction of the Expert panel
To achieve the highest possible reliability for our study, we set up a panel comprising 20 members (9). Each panel member was a certified ENT specialist with more than five years of work experience who had undergone undergraduate curriculum pedagogy training. The SCT was mailed to the panel members. The experts took the SCT independently, and their responses were recorded. After collecting all the panel members' responses, the number of panel members choosing each response was aggregated for every item, as shown in Table 1. In this way, the aggregated responses were recorded for all 98 items.

Construction of the Scoring Grid
From the responses obtained from the panel, a credit score was calculated for each item corresponding to the proportion of panel members who chose the same response. The credit scores ranged from 0 to 1. The maximal score of 1 was given to the response chosen by the greatest number of panel members (the modal response). A partial credit score was calculated for any nonmodal panel member response: the number of panel members who chose the nonmodal response was divided by the number who gave the modal response. This method of aggregate scoring has better construct validity than consensus scoring (10), as well as better reliability and validity coefficients (11).
For example, for item number 46 in Table 1, 14 members chose "+1" on the Likert scale. Hence "+1" is the modal response and earns the full credit of 1, while every other response earns a proportional partial credit.
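The proportional credit calculation can be sketched as below; the response counts used here are illustrative placeholders, not the study's actual panel data.

```python
# Sketch of the aggregate (proportional) credit scoring described above.
# These response counts are illustrative, not the study's actual panel data.
panel_counts = {-2: 0, -1: 1, 0: 2, 1: 14, 2: 3}   # 20 panel members in total

modal_count = max(panel_counts.values())            # 14 members chose "+1"
credit = {resp: count / modal_count for resp, count in panel_counts.items()}

# The modal response earns the full credit of 1; every other response earns
# a partial credit proportional to the number of members who chose it.
print(credit[1])               # 1.0  (modal response)
print(round(credit[2], 2))     # 0.21 (3 of 14 -> 3/14)
```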
In this way, the credit scores for all 98 items in the 26 clinical scripts were tabulated to create the scoring grid (Table 3).

Panel Response Patterns
Wan et al. (3) clustered panel responses into four types: ideal response, uniform response, bimodal response, and outlier response. When we attempted to classify our panel responses in a similar manner, we found an additional pattern, which we labelled the partial ideal response, elaborated below (Fig. 1). We noticed that for some questions the panel members were split between the extremes of the available options; we called this a bimodal response (Fig 1a). When the members were spread equally across all five options, the item was classified as a uniform divergence response (Fig 1b). A discrete outlier response (Fig 1c) was labelled when one or more responses lay beyond an option with no responses. The ideal response pattern meant close convergence, with variation limited to three or fewer contiguous options (Fig 1d). We noticed a fifth pattern in which there was relatively close convergence with variation limited to four contiguous options chosen by the panel members, and we labelled this the partial ideal response pattern (Fig 1e).
We eliminated the items showing uniform and bimodal patterns (as elaborated in the Results section), so the SCT administered to the students had 82 items. Analysing the panel response patterns to identify the uniform and bimodal patterns was a time-consuming process, so we looked for a simpler tool for the same purpose and tested the consensus index.

Consensus Index
The consensus index reflects the agreement among the panel members for each item. It is calculated with the following formula (8), where p_i is the proportion of panel members choosing option X_i, mu_X is the mean of item X, and d_X is the width of X, d_X = X_max - X_min:

Cns(X) = 100 x [1 + SUM_i p_i log2(1 - |X_i - mu_X| / d_X)]
The consensus index takes a value from 0 to 100, with 0 indicating complete disagreement and 100 complete agreement. As an ordinal measure of the panel members' scoring, the consensus index is argued to be superior to the mean and standard deviation (8).
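Assuming reference (8) denotes the Tastle-Wierman consensus measure, the index can be computed as in this minimal sketch; the response counts passed in are hypothetical illustrations.

```python
import math

def consensus_index(counts):
    """Consensus index on a 0-100 scale for one SCT item.

    counts maps each Likert option (here -2..+2) to the number of panel
    members choosing it: Cns(X) = 100 * [1 + sum p_i*log2(1 - |X_i - mu|/d)].
    """
    n = sum(counts.values())
    mu = sum(x * c for x, c in counts.items()) / n   # mean response
    d = max(counts) - min(counts)                    # scale width X_max - X_min
    cns = 1.0
    for x, c in counts.items():
        p = c / n
        if p > 0:                                    # options nobody chose contribute nothing
            cns += p * math.log2(1 - abs(x - mu) / d)
    return 100 * cns

# Unanimous agreement scores 100; a panel split between the extremes scores 0.
print(consensus_index({-2: 0, -1: 0, 0: 20, 1: 0, 2: 0}))   # 100.0
print(consensus_index({-2: 10, -1: 0, 0: 0, 1: 0, 2: 10}))  # 0.0
```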

SCT Administration to Students
Thirty undergraduate (UG) students and 10 postgraduate (PG) students of ENT volunteered for this study. Informed consent was taken after explaining their role in the study. The SCT was provided in printed format in a pre-designated hall, and the students marked their responses on an answer sheet. No time limit was set, but all participants were required to answer all 82 items of the SCT.

Scoring of students
Based on the scoring grid, credits were awarded to the students for all 82 items, and each student's total marks and the mean marks scored by all students on each item were calculated. Each student's total credit score was converted to a 100-point scale to express the result as a percentage. Mean marks were calculated separately for undergraduate and postgraduate students.

Statistical Analysis
With the confidence interval set at 95%, the students' mean score for each item was calculated along with its standard deviation (SD). These student scores were compared with the experts' responses using a t-test, with a p-value of less than 0.05 taken as significant. The reliability of the test was calculated with Cronbach's alpha, and Pearson's correlation was used to ascertain the item-total correlation. IBM SPSS software (Version 17; SPSS Inc., Chicago, Illinois, USA) was used for the statistical analysis.
Item-total correlation is the correlation between the score on a particular question and the collective score on all the remaining questions. In short, questions with low item-total correlations do not produce responses consistent with the remainder of the test (12). We used the item-total correlation to identify questions with low values (r<0.05). Items with low item-total correlation were discarded, and the Cronbach's alpha reliability coefficient was recalculated after their deletion.

Results
The panel members' summary statistics were presented above, in the Materials and Methods, as part of the SCT's development.

Panel Members Response Summary
The scoring grid was applied to the panel members' responses. The expert panel members had a mean score of 79.5 (+/-4.4 SD), and all panel members scored within two standard deviations of the mean.

Optimization of Panel members responses
On analysing the panel members' response patterns, we noticed the five types of response pattern described above (Fig 1). Based on this classification, 37 items showed an ideal response pattern, another 37 showed a partial ideal response pattern, and 8 showed a discrete outlier pattern. The remaining items had uniform (n=14) or bimodal (n=2) response patterns. These 16 items (2 bimodal and 14 uniform) were deleted, and the resulting 82-item SCT was administered to the students.
On calculating the consensus index for each of the 98 items, we found that it ranged from 25.81 to 100. We then examined whether the panel response patterns were related to the consensus index (Table 4). We noticed that an item with low agreement among the expert panel members (a low consensus index) is an item that is poorly constructed or overtly confusing.

Panel response pattern and consensus index
The responses recorded from the 20-member expert panel to the original 98-item SCT, and the response pattern type identified for each item, were sorted in descending order of the item's consensus index (Table 5). In this study, we analysed the panel members' response patterns and identified 16 items having bimodal or uniform response patterns. Interestingly, the items with a consensus index of less than 60 comprised 2 bimodal, 13 uniform, and 1 partial ideal response pattern. Setting a consensus index cut-off at 60 thus identified 15 of the 16 items with bimodal or uniform response patterns. On this basis, we propose using the consensus index to improve the quality of the items in an SCT.
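The cut-off screening amounts to a simple filter over the per-item indices; the values below are hypothetical illustrations, not the study's data.

```python
# Hypothetical per-item consensus indices (not the study's actual values).
consensus = {"item_01": 95.2, "item_02": 58.7, "item_03": 41.3, "item_04": 100.0}

CUTOFF = 60  # the cut-off that flagged 15 of the 16 bimodal/uniform items
flagged = sorted(item for item, ci in consensus.items() if ci < CUTOFF)
retained = sorted(item for item, ci in consensus.items() if ci >= CUTOFF)

print(flagged)    # ['item_02', 'item_03']
print(retained)   # ['item_01', 'item_04']
```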
We need more studies of SCT to determine the permissible cut-off value for the consensus index among the panel members for each SCT item.

Students Response Summary
On administering the 82-item SCT to the participants, the 30 undergraduate students had a mean score of 61.1 (+/-7.5 SD), and the 10 postgraduate students had a mean score of 67.7 (+/-6.3 SD) (Table 6).

Item-total Correlation
Item-total correlation was computed with Pearson's correlation. We categorised the 82 items into three groups: r<0.05 = poor item; 0.05<r<0.2 = fair item; r>0.2 = good item. Twenty-two items had r<0.05, 15 items had 0.05<r<0.2, and 44* items had r>0.2. *In the analysis, the undergraduate scores for one item (item no. 6) had a mean of 1 with SD 0, as all the students chose the modal panel responses (+1 and +2).

Cronbach's Alpha
The Cronbach's alpha of the 82-item SCT administered to the students was 0.74. Among these 82 items, 22 had a poor item-total correlation (<0.05). Excluding those poor items, the final SCT instrument of 60 items had a Cronbach's alpha of 0.82.

Relationship of Consensus index with Cronbach's alpha
We extrapolated our study and tried to identify a permissible cut-off for the consensus index. We tested different consensus indices to learn how they influence Cronbach's alpha and the number of items in the SCT (Table 7).

Discussion
In our study, we developed a 98-item SCT. We found that 14 items had a uniform response pattern and 2 had bimodal response patterns. The consensus index calculated for the 98-item SCT ranged from 25.81 to 100. When we eliminated the 16 items with bimodal and uniform response patterns, the consensus index ranged from 58.65 to 100. We administered this 82-item SCT to 30 undergraduate and 10 postgraduate students. The mean score of undergraduate students was 61.1 (+/-7.5 SD), and that of postgraduate students was 67.7 (+/-6.3 SD); the difference was statistically significant on an independent t-test. Cronbach's alpha for this 82-item SCT was 0.74. On analysing the item-total correlation, we found 22 items with a correlation of less than 0.05. Excluding these 22 poor items, the final SCT instrument of 60 items had a Cronbach's alpha of 0.82.
The currently available methods to evaluate an undergraduate medical student include case presentations, the Objective Structured Clinical Examination (OSCE), and Multiple Choice Questions (MCQs). These strategies have their own shortfalls: they can be cumbersome, time-consuming, and resource-intensive. Script Concordance Testing offers an innovative approach that addresses these shortcomings. As a simple yet effective instrument, it can be used to assess undergraduate students' clinical reasoning and analytical knowledge.
In this study, we found the panel response pattern and the consensus index to be good drivers for designing a script concordance testing tool. Based on our observations, we noticed the five patterns described above. The study by Wan et al. (3) collated the panel responses into four types. We additionally described a partial ideal response, which seemed necessary because the response clustering remained robust towards either polarity. They described the ideal pattern as one of relatively close convergence with some variation, meaning the responses should cluster within three contiguous options, including zero. This leaves out a meaningful pattern found in our study: responses extending into a fourth contiguous option, with the last option chosen by very few members, which led us to classify it as partial ideal. The partial ideal category also fulfilled the requirement of close convergence with minimal variation.
In our study, we calculated the consensus index for the panel members' responses. Although the consensus index has not been calculated in other SCT studies, the literature suggests it may be appropriate, or even superior, for analysing the ordinal pattern of the Likert scale used in Script Concordance Testing (8). The consensus index is the closest measure capturing the collective opinion of the panel members and may be used effectively to optimize the items. Items with a low consensus index reflect more variation and need further modification before being administered to the students. It is also clear that the range of the consensus index matches the panel response clusters. The relationship between the consensus index and the response clusters, though evident from the table, will require further detailed statistical analysis. Determining the optimal consensus index cut-off to differentiate a bad item from a good one is a prospect for future study and will require a larger sample size.
From the undergraduate students' response perspective, we calculated the item-total correlation for each item. We noticed that 22 items correlated poorly with the total score despite having agreeable responses from the panel members. Such poor items were identified and excluded to improve the strength of the SCT. The Cronbach's alpha of the final 60 items was 0.82, indicating very good internal consistency of the Script Concordance Test; even with the initially administered 82 items, the Cronbach's alpha was 0.74. These values are comparable to other studies: 0.80 in Iravani et al. (4) and 0.73 in Humbert et al. (11).
It is also clear that the mean scores of the undergraduate and postgraduate students are significantly different, demonstrating the ability of the SCT to discriminate the proficiency levels of the students. Notably, the maximum scores of the undergraduate and postgraduate students were very close, showing no statistical difference; a top-performing undergraduate student can reason at the postgraduate's proficiency level. Similar findings of SCT scores differentiating participants by expertise level have been noted in the literature (13)(14)(15).
Robert Gagnon et al. (7) concluded that optimization by an independent review by another three experts to remove deviant answers ('judgment-by-experts') was superior of the three methods, and that 'distance-from-mode' is also a practical and efficient method. The 'distance-from-mode' method they describe is broadly similar to the response clustering theme described by Wan et al. (3) and used in our study. In our study, a consensus index of more than 60 appeared to optimize the panel responses, and the consensus index also mirrored the response clustering pattern. It is easy to calculate and provides an objective mathematical value for optimizing the panel responses. Finding a consensus index cut-off that can be used universally in script concordance testing is a prospect for future work. One limitation of proportional credit scoring is that when the panel members mark -2 as their response, a student marking -3 might get 0 credit even though his thinking was in the right direction (16). Similarly, the 5-point Likert scale was found to be better than the 3-point Likert scale in various studies (6,11).
The panel in our study comprised 20 members, in accordance with studies (9) showing no statistically significant advantage of using more than 20 panel members.

Conclusions
This study clearly revealed that the consensus index can be used for item analysis of the script concordance test. It also provides a clear-cut numerical value, making further analysis objective. Our study revealed that items with a consensus index above 60 had good item-total correlation with good internal consistency. Further studies are required to find the exact cut-off values for the consensus index. Our study also revealed that the panel response clustering pattern can be used to categorize the items, though bimodal and uniform distribution patterns need further differentiation.
Our study also showed that Cronbach's alpha and the item-total correlation were useful in assessing the psychometric properties of the Script Concordance Test.
The developed Script Concordance Test (SCT) tool has high internal reliability and can be used in the routine assessment of undergraduate students. In the future, such SCTs can be integrated into the medical curriculum during the students' academic years.