This paper presents initial validity evidence for the use of the UCSOM CPE as an assessment of PE competence in medical students. In the Results we detailed the validity evidence obtained from each of Messick's five sources. Here we interpret that evidence within Kane's validity framework to summarize the inferences drawn from CPE assessment scores, articulate the overall validity argument for the use of the CPE as an assessment of PE competence in medical students, and identify evidence gaps. The argument follows a stepwise approach through each of Kane's four inferences: scoring, generalization, extrapolation, and implications.10
The scoring inference (translating an observation into a score) was supported by expert review of UCSOM CPE items and by ongoing quality assurance processes in the clinical performance center. Relying on these processes as evidence of response process requires that they occur annually to ensure acceptable rater training and performance. Formal evaluation of the inter-rater reliability of both the real-time quality checks and the video reviews of borderline and failing students would further strengthen this inference. High overall mean scores across the PE assessments in the first two years of medical training suggest that students are able to perform recently learned PE skills in a clinical performance center assessment setting.
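Should such an evaluation be undertaken, one plausible index (a sketch only; the appropriate statistic would depend on the rating design and item format) is Cohen's kappa for agreement between a real-time rater and a video reviewer on a dichotomous CPE item:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of agreement between the two raters and \(p_e\) is the agreement expected by chance given each rater's marginal pass/fail rates; values approaching 1 would indicate dependable scoring of the real-time checks.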
The generalization inference involves the extent to which a score on a given assessment is representative of performance in the testing setting. The overall phi coefficient of 0.258 for the G study suggests low reliability for a single assessment and indicates that sources of error not modeled as facets in this G study contribute substantially to the variance in scores. The largest contributors to score variability were the person-occasion (5.5%) and person-item (5.4%) interactions, indicating that individual student performance varied by occasion (from Spring M1 to Spring M2) and across items. Learner-specific factors contributing to score differences between occasions may include learning or unlearning (decay) of PE skills between Spring of M1 and Spring of M2 and/or changes in individual learners' motivation to prepare for the assessment. Rater-specific factors related to the scoring of specific items are another likely source of error contributing to variance. The low generalizability coefficient suggests that inferences about PE skills based on the UCSOM CPE alone should be made with caution and that the UCSOM CPE in isolation should be used primarily as a formative assessment.
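For context, and assuming a fully crossed persons × items × occasions (p × i × o) design consistent with the interactions reported above, the phi (dependability) coefficient is the ratio of person variance to person variance plus absolute error variance:

\[ \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}, \qquad \sigma^2_\Delta = \frac{\sigma^2_i}{n_i} + \frac{\sigma^2_o}{n_o} + \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{po}}{n_o} + \frac{\sigma^2_{io}}{n_i n_o} + \frac{\sigma^2_{pio,e}}{n_i n_o} \]

where \(n_i\) and \(n_o\) are the numbers of items and occasions. A phi of 0.258 implies that person variance is small relative to these absolute error terms, consistent with the large person-occasion and person-item interactions noted above.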
The extrapolation inference relates to using the score as a predictor of real-world performance.
Absent performance measures in clinical settings, the relationship of UCSOM CPE scores to other assessments of PE skills may provide some indication of the transfer of skills beyond the UCSOM CPE. The correlations between the various UCSOM clinical skills assessments are similar to the between-station correlations reported for OSCE cases, which have been shown to fall in the range of 0.1 to 0.3.11,12 These modest correlations between different PE assessments are consistent with case specificity, since each of the system-based assessments included a different subset of CPE and non-CPE items. Correlations between performance on these PE assessments and measures of PE skill during clerkships would strengthen this inference.
The relatively low M3 OSCE PE scores are striking and consistent with low PE scores in other studies of end-of-M3 OSCEs.13,14,15 The lower scores in M3 may be due to decay of PE skills during clinical clerkships; alternatively, M3 students may know how to perform the maneuvers but struggle to select the appropriate PE items to perform in a given encounter, indicating a need for more practice in clinical reasoning. This could be addressed by adding a hypothesis-driven (i.e., HDPE) component to PE instruction, to provide practice in using the PE in the service of accurate diagnosis of the patient.2
The implications inference (applying the score to inform a decision) was probed by exploring the impact of different passing standards. The consensus pass-fail cut score and the normative cut scores were significantly lower than the cut score established using the modified Angoff procedure, which produced an unacceptably high failure rate. Our experience at UCSOM suggests that the Angoff 90% cut score may represent the PE competence of a student who is well prepared to enter supervised clinical practice in the clinical clerkships, rather than that of the targeted minimally competent (borderline) student who is preparing for entry into a pre-clinical supervised preceptorship. Repeating the Angoff exercise after a more detailed discussion of the target student and the intended inference might resolve this discrepancy. In the meantime, the clinical skills course directors considered the 80% consensus pass-fail cut score for entry into supervised practice within a clinical preceptorship to be both defensible, given the lack of high correlations between the CPE and other PE competence assessments, and practical, given the costs involved in remediating large numbers of students.
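To make the standard-setting arithmetic concrete (a sketch of the generic procedure; the operational details of the UCSOM exercise may differ), a modified Angoff cut score is the mean, over J judges and K items, of each judge's estimated probability that a borderline student would perform the item correctly:

\[ \text{cut score} = \frac{1}{JK} \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{p}_{jk} \]

If judges implicitly anchor their estimates \(\hat{p}_{jk}\) on a well-prepared student rather than on the minimally competent one, every estimate, and therefore the cut score, is inflated, which is consistent with the 90% standard observed here.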
Next steps in the evolution of the UCSOM PE curriculum toward a Core + Cluster curriculum include a transition away from body systems to specific PE clusters tied to specific chief complaints, with a continuing emphasis on the UCSOM CPE. Additional assessments of PE performance should be considered in clerkship experiences. Composite reliability of PE assessments across the pre-clinical and clinical years may allow high-stakes decisions to be made about PE performance, as sketched below. Programmatic assessment is an emerging approach in which multiple assessments over time are combined to make high-stakes decisions about advancement and promotion.16 Reasonable next steps in the evolving considerations for teaching and assessing PE competence in medical students using a programmatic approach include scholarly work on the development and incorporation of PE clusters into the PE curriculum for the Core + Cluster curricula at UCSOM, expansion of the items included in the UCSOM CPE, and correlation of scores with additional variables, such as performance in clerkships or on the USMLE Step 2 Clinical Skills assessment.
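As an illustrative projection only (assuming the component assessments function as roughly parallel measures, an assumption that would require empirical verification), the Spearman-Brown prophecy formula indicates how aggregating n comparable assessments could raise the dependability of a composite:

\[ \Phi_n = \frac{n\,\Phi_1}{1 + (n-1)\,\Phi_1} \]

Starting from the single-assessment phi of 0.258, a composite of four such assessments would project to approximately 0.58, and a composite of eight to approximately 0.74, approaching the reliability conventionally expected for high-stakes decisions.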
Limitations
Our study uses the UCSOM CPE as a representative exemplar of the Core Physical Exam approach to teaching and assessing the PE. Because the UCSOM CPE is an institution-specific, 25-item version of the published 37-item CPE, the applicability and generalizability of these results to the full CPE and to other settings are limited. The low generalizability coefficient further suggests that inferences about PE skills based on the UCSOM CPE should be made with caution.