Setting and participants
The study was done in a clinical setting, at the Department of Internal Medicine in our University Medical Centre. The outpatient clinic of Internal Medicine has two consulting rooms equipped with a fixed two camera set-up to be used for educational purposes which were used in this study. Video-observation and feedback are a standard part of medical training during the clerkship. The clerkship of internal medicine is the first clinical clerkship in the first master year (M1).
Participating students were asked to record the history taking of their meetings with real patients. They were asked to start doing this from the third patient they would see. The first meetings could be used to get accustomed to the process of the clinical practical setting and the contacts with real patients. As common practice, students were not accompanied by their supervisor during the encounter with the patient but received feedback from the supervising physician directly after the consultation during the presentation of the patient. Thereafter, the student and supervisor would see the patient together. After the encounter, the supervisor registered among other clinical criteria, a global rating for the clinical reasoning in this case, as a grade from 1-10 in accordance to standard practice.
For this study, the students would complete an extra activity; the completion of a post-encounter form (PEF). This was done after history taking and physical examination, before receiving feedback of their supervising physician. Students would send their completed PEF digitally to the researchers. The assessors would afterwards observe and rate the history taking and rate the PEF based on their observation of the video registration and would then complete the ORT and the PERT of this patient-student interaction.
Participating assessors involved in assessment of the students by means of the ORT and PERT, were six principal lecturers, i.e., clinicians with degrees and assignments in medical education. They were not involved as the supervising physicians of the students at the time the recordings were made. They assessed the recordings of the students at the time of their choosing. These assessors received a short instruction about the two rating tools. They were asked to observe the recordings as long as they deemed needed to complete the observation rating tool (ORT). After completing this form, they also completed the post-encounter rating tool (PERT). The time needed to fulfil the assessment was registered by the participants.
Materials
The observation rating tool (ORT) was developed in a previous qualitative research project. (8) It consisted of eleven pairs of opposite statements about student behaviour related to clinical reasoning. On 5-point scale participants could rate which of the opposite statements was most applicable. (additional file 1)
The student post- encounter form (PEF) and the PERT were modified after Durning et al (6). (additional file 2&3) This PEF the students had to complete was made up of items capturing the essential parts of clinical reasoning; summary statement, problem list, differential diagnosis, most likely diagnosis with foundation for the most likely diagnosis. A physical examination plan was added. Assessment of these 5 items by the assessors was done by means of the PERT that includes a 5-point Likert-scale for the mentioned elements.
Case selection
Twenty students were invited to participate in the study. One case per student was used. The first case that did not meet the exclusion criteria was selected. Cases were excluded when a student was not well visible or when audio quality was low, or the patient was not able to communicate easily, e.g., because of language barrier. Cases that would not induce clinical reasoning, for example when a patient presented with an established diagnosis, were also excluded. After inclusion of 15 cases, selection was stopped. We calculated that 15 cases and 6 observers were needed to reach a power of > 80% to detect an intra-class correlation of 0.30 for the new observation form (9).
Measurements
Feasibility was measured as student completion rate and the time the assessors needed to complete both the rating instruments. Assessors were asked about their approval and satisfaction with the instruments. We included the time for completion, because it is a limiting factor in clinical practice.
Validity for both instruments was measured as follows: (10) Content validity; For both instruments content validity was measured in two previous studies (6, 8). Internal structure; analysis of internal consistency and Generalizability study to explore factors of variance. Response process: This was measured by rater evaluation of both instruments. Relation with other variables: This was done by analysing the association between both instruments.
Reliability of the use of the PERT was already analysed in an OSCE(6), but not for assessment in clinical practice. For both rating tools (PERT and ORT) inter-rater reliability was tested and a Generalizability study (G-study) was performed. A G-study is a means of identifying systematic and unsystematic error variation and estimating the contribution of each one.
Since it is known for workplace-based assessment tools that reliability, when using one rater, is often low, a decision study (D-study) to assess how many raters would be needed for each instrument to reach an acceptable reliability was done. To that purpose the information from the G-study is used in the D-study that pinpoints to one particular element. In this way a more precise estimate can be made.
Statistics
Cronbach’s alpha was used to calculate internal consistency of scales. Inter-rater agreement was computed using the Intra Class Correlation. We used the ‘two-way mixed model’, because assessments will be done by a selected group of assessors, with measures of ‘consistency’, because the assessment will not be used as a pass or fail test. We used ‘single measures’, since one assessor on the work floor usually performs the assessment during the clerkships.
A G-study was conducted to identify various sources of variance. For evaluation of the observation rating tool a two-facet crossed design with six assessors and 11 items was used. For evaluation of the post-encounter rating tool a two-facet crossed design with six assessors and 5 items was used. A relative G coefficient was computed since we were mostly interested in the rank order of the measurement objects rather than consistency in raw scores.
A D-study was performed to forecast changes in G coefficients with alternate levels of facets (assessors and checklist items).
The association between the two rating tools was calculated using Pearson’s correlation coefficient.
We used SPSS version 20 for intra-class correlation and Pearson’s correlation coefficient. Generalizability study was done using the G1 SPSS program (11)
Ethics
Participation was voluntary for all students. All patients were informed about the study, agreed and signed the informed consent form for recording of history taking for educational purposes. Approval of the institutional ethical review board was not required according to the guidelines of the institutional research board based on the Dutch law (WMO)