Validity Evidence for the Medical Student Core Physical Examination

The Core Physical Exam (CPE) has been proposed as a set of key physical exam (PE) items for teaching and assessing PE skills in medical students, and as the basis of a Core + Cluster curriculum. Beyond the initial development of the CPE and the proposal of the Core + Cluster curriculum, no additional validity evidence has been presented for the use of the CPE to teach or assess PE skills of medical students. Accordingly, a modified version of the CPE was developed by faculty at the University of Colorado School of Medicine (UCSOM) and implemented in the school's clinical skills course in the context of an evolving Core + Cluster curriculum.


Introduction
Teaching strategies and methodologies to promote medical students' physical examination (PE) skills include the Head-to-Toe Physical Examination, the Hypothesis-Driven Physical Examination, and the Core + Cluster Physical Examination. 1,2,3 The strengths and limitations of these approaches have been debated in recently published literature, with no single best methodology for teaching the PE emerging. 4 The Head-to-Toe physical examination (HTT), a screening PE comprising about 140 maneuvers performed on Standardized Patients (SPs), has been used to assess the acquisition of foundational PE skills prior to entering clerkships, and has been shown to be useful as a summative assessment of those skills. 1 The Hypothesis-Driven Physical Examination (HDPE) was developed to promote critical thinking related to the PE in the context of diagnostic challenges within a patient presentation. 2 The HDPE approach provides students with targeted practice in anticipating, eliciting, and interpreting PE maneuvers in the context of patient cases with focused diagnostic challenges. Unlike the HTT method, the HDPE has the potential to promote the development of clinical reasoning through targeted practice and feedback in the selection and interpretation of PE maneuvers using patient-based diagnostic challenges. 2 More recently, the Core Physical Examination (CPE) has been promoted as an instructional and assessment methodology for teaching PE skills as part of a Core + Cluster curriculum. 5 The CPE consists of 37 key PE items based on a survey of internal medicine clerkship directors and clinical skills course directors. 3 Advocates of the CPE intend that the CPE maneuvers be taught in combination with symptom-driven Clusters of additional PE maneuvers.
Beyond the initial development of the CPE and the proposal of the Core + Cluster curriculum by Gowda et al, no additional validity evidence has been presented for the use of the CPE to assess PE skills of medical students.
A modified version of the CPE was developed by faculty at the University of Colorado School of Medicine (UCSOM) and implemented in the school's clinical skills course in the context of an evolving Core + Cluster curriculum. The purpose of this study was to provide initial validity evidence for the use of the UCSOM CPE, as an exemplar of the CPE approach, in the assessment of PE skills in medical students. We gather validity evidence using Messick's unified validity framework in the methods and use Kane's validity framework to make arguments about score inference and interpretation in the discussion.

Instructional Methods and Assessments
The UCSOM PE curriculum is a body system-based curriculum. The CPE was introduced as the first step towards a Core + Cluster curriculum. The UCSOM CPE consists of a 25-item subset of the originally published 37-item CPE; 3 the remaining 12 items are taught at UCSOM in the context of body-system maneuvers. Most of these 12 items are neurologic items, which were not included in the first iteration of the CPE because they are not taught until the second year of training, when the basic science and anatomy of the neurologic system are introduced. The UCSOM PE curriculum is implemented by PE teaching assistants (who serve as SPs and raters for the clinical performance center), senior medical students, and clinical faculty in the pre-clinical years in the context of a clinical skills course. See Appendix A for a detailed listing of the UCSOM CPE items and scoring criteria.
In the first year of medical school, UCSOM students learn six complete body systems: head and neck, pulmonary, cardiovascular, abdominal, upper musculoskeletal, and lower musculoskeletal. In the second year, the neurologic body system is taught. Each set of body-system PE maneuvers is made up of a subset of items contained within the UCSOM CPE plus additional PE maneuvers. The total number of items contained within the six body systems is similar in scope to most versions of the HTT, with a total of 104 PE items taught at the UCSOM. Later in the first year, the UCSOM CPE is taught as a 25-item cohesive subset of PE maneuvers that borrows from all six body systems. Appendix A includes the CPE items with the instructions that students are provided. UCSOM students are assessed on their clinical skills each semester during their clinical skills course. In the first two years of training, clinical skills PE assessments emphasize either body systems or the UCSOM CPE. In the third year, a single 10-station clinical skills assessment emphasizes the selection and performance of PE maneuvers in the context of a series of clinical cases; students were expected to select PE items based on the presenting symptoms of the patient. Table 1 summarizes these assessments. PE items were scored as performed, performed incorrectly (half credit), or not performed. One of the cases was a telephone-only case and did not include any PE items.

Study Participants
This study was conducted with longitudinal cohort data from clinical skills assessments of the medical student Classes of 2019 and 2020 during the first three years of training of each cohort. Data collection occurred from September 2015 through December 2019.
The study was considered exempt by the University of Colorado and University of Illinois Chicago institutional review boards.

Validity Evidence
Validity evidence was sought based on Messick's unified validity framework. 7 Content evidence was obtained by reviewing the process of developing and selecting the items of the UCSOM CPE.
Response process evidence was based on review of the materials provided to students and of the quality assurance processes related to scoring the assessment.
Internal structure. Reliability estimates were obtained through a Generalizability (G) study using G-String IV (Hamilton, Canada) across the M1-Spring: CPE and M2-Spring: CPE assessments for the Class of 2020. 8 Persons (p) were the objects of measurement, items (i) were fixed (the set of CPE items assessed), and the occasion (o) for the assessment was considered random (M1-Spring: CPE and M2-Spring: CPE). The design was fully crossed: person (p) crossed with UCSOM CPE items (i) and occasion (o). Raters were not included as rater data were not available.
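The variance-component analysis itself was run in G-String IV. For readers who want to see the general machinery, the following is a minimal sketch of a fully crossed p × i × o G study in Python, estimating variance components from the standard three-way ANOVA mean squares. It treats all facets as random (the study above fixed items, which changes which components enter the error term), and the function names are ours, not G-String's.

```python
import numpy as np

def g_study(X):
    """Variance components for a fully crossed person x item x occasion
    design with one observation per cell; X has shape (n_p, n_i, n_o).
    Negative component estimates are clamped to zero, as is conventional."""
    n_p, n_i, n_o = X.shape
    grand = X.mean()
    m_p, m_i, m_o = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_pi, m_po, m_io = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares from the three-way ANOVA decomposition
    ms_p = n_i * n_o * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_i = n_p * n_o * np.sum((m_i - grand) ** 2) / (n_i - 1)
    ms_o = n_p * n_i * np.sum((m_o - grand) ** 2) / (n_o - 1)
    ms_pi = n_o * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_i - 1))
    ms_po = n_i * np.sum((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_o - 1))
    ms_io = n_p * np.sum((m_io - m_i[:, None] - m_o[None, :] + grand) ** 2) \
        / ((n_i - 1) * (n_o - 1))
    resid = (X - m_pi[:, :, None] - m_po[:, None, :] - m_io[None, :, :]
             + m_p[:, None, None] + m_i[None, :, None] + m_o[None, None, :]
             - grand)
    ms_pio = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1) * (n_o - 1))

    # Solve the expected-mean-square equations for the components
    v = {'pio': ms_pio}
    v['pi'] = max((ms_pi - ms_pio) / n_o, 0.0)
    v['po'] = max((ms_po - ms_pio) / n_i, 0.0)
    v['io'] = max((ms_io - ms_pio) / n_p, 0.0)
    v['p'] = max((ms_p - ms_pi - ms_po + ms_pio) / (n_i * n_o), 0.0)
    v['i'] = max((ms_i - ms_pi - ms_io + ms_pio) / (n_p * n_o), 0.0)
    v['o'] = max((ms_o - ms_po - ms_io + ms_pio) / (n_p * n_i), 0.0)
    return v

def coefficients(v, n_i, n_o):
    """G (relative) and phi (absolute) coefficients for a measurement
    averaging over n_i items and n_o occasions; doubles as a D study
    when n_i, n_o differ from the observed design."""
    rel = v['pi'] / n_i + v['po'] / n_o + v['pio'] / (n_i * n_o)
    abs_err = rel + v['i'] / n_i + v['o'] / n_o + v['io'] / (n_i * n_o)
    return v['p'] / (v['p'] + rel), v['p'] / (v['p'] + abs_err)
```

Calling `coefficients(v, n_i, n_o)` with larger facet sizes is what a D study does: the error components shrink as each facet is averaged over more conditions.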
Relationship to other variables. Spearman correlation coefficients were calculated to measure associations between the five clinical skills assessments across the first three years of the curriculum. Spearman correlations were performed in lieu of Pearson correlations because the results of the assessments were not normally distributed, given the high overall means for PE performance.
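As an illustration of the statistic used here (not the authors' analysis code), Spearman's coefficient is simply a Pearson correlation computed on average ranks, which is why it tolerates the skewed, ceiling-heavy score distributions described above. A self-contained sketch:

```python
def average_ranks(values):
    """Rank the values 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the block of tied values
        avg = (i + j) / 2 + 1           # mean rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```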
Consequences. The consequences of establishing pass-fail cut scores at the UCSOM using normative standards (1.5 or 2 standard deviations (SD) below the mean), clinical course director-determined consensus scores, and an item-level, modified Angoff score were explored. Historically, pass-fail cut scores for assessments had been established either as clinical skills course director-determined consensus pass-fail cut scores (80% or 75%) or as normatively determined pass-fail cut scores. For the UCSOM CPE, an item-level, modified Angoff standard-setting exercise was conducted with 8 faculty, including 2 clinical preceptors, 2 clinical block directors, and 4 clinical skills course directors. The experts were asked to estimate the percentage of borderline students who would correctly perform each item. The borderline student was defined as a minimally competent student entering into supervised practice with an individual preceptor in their clinical practice setting. Prior to the start of the standard-setting process, judgments were informed by performance data from the initial M1-Spring: CPE assessment. The pass-fail cut score was determined following two iterations of discussion at the item level. 9
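In a modified Angoff procedure of this kind, the cut score is typically the mean, over judges and items, of the estimated probability that a borderline examinee performs each item correctly. A minimal sketch with illustrative numbers (not the panel's actual judgments):

```python
def angoff_cut_score(judgments):
    """Percent cut score from an Angoff panel.

    judgments[j][i] is judge j's estimated probability that a borderline
    examinee correctly performs item i. The cut score is the grand mean
    of the item-level means, expressed as a percentage."""
    item_means = [sum(col) / len(col) for col in zip(*judgments)]
    return 100.0 * sum(item_means) / len(item_means)
```

The iterative discussion described above amounts to recomputing this mean after judges revise their item-level estimates in light of performance data.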

Results
Performance data were obtained for 366 students comprising the medical student Classes of 2019 (N = 182) and 2020 (N = 184). Response Process. Assessment materials created by the clinical skills course directors included the SP case, scoring rubrics, and instructions for students, SPs, and raters. The clinical performance center training process involved both a 4-hour SP and rater training session and a 4-hour SP portrayal and rater practice session. A subset of all ratings for each SP and rater was reviewed in real time by another rater as a quality check of rater performance. Expert raters, in a blinded fashion, re-watched and re-scored videos of all borderline and failing students, corrected any errors in the initial rater scoring, and provided feedback to the raters for any errors identified.
Internal Structure. Descriptive statistics (means and standard deviations) for each assessment across the first three years of medical training are shown in Table 1. The results of the G study for the Class of 2020 are shown in Table 2. The largest contributors to score variability were the person-occasion (5.5%) and person-item (5.4%) interactions, indicating variability in learner performance depending on occasion and on specific items performed, respectively. The overall phi coefficient reliability was 0.258 and the G coefficient reliability was 0.308. Decision (D) studies determined that increasing the number of iterations of assessing the UCSOM CPE to six occasions would increase the phi coefficient to 0.486. Increasing the number of items in the UCSOM CPE to 37 items (similar to the published CPE) would increase the phi coefficient to 0.281.

Relationship to Other Variables. Spearman correlations between the assessments are shown in Table 3. The UCSOM CPE assessments showed low correlations to the assessments during years one and two and the M3-Spring: OSCE. Correlations between the M1-Spring: CPE assessment and the M2-Spring: CPE assessment, both of which contain all UCSOM CPE items, were generally higher than correlations to the body-system assessments.

Consequences. The outcome of the modified Angoff pass-fail score determination was 90%, which would have resulted in a failure rate of 10-13% in the M1-Spring CPE and 36-39% in the M2-Spring CPE. Failure rates for the 1.5 SD below the mean pass-fail cut score were in the range of 6-8% in the M1-Spring CPE and 5-10% in the M2-Spring CPE. Failure rates for the 80% consensus pass-fail cut score were in the range of 1-2% in the M1-Spring CPE and 8-10% in the M2-Spring CPE. Table 4 shows the numbers of students who would have failed each year based upon the various pass-fail cut scores.
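The consequence analysis above reduces to computing, for each candidate cut score, the fraction of students whose percent score falls below it. A minimal sketch, with hypothetical scores rather than study data:

```python
def failure_rate(scores, cut):
    """Percent of examinees scoring below the cut score (score == cut passes)."""
    failures = sum(1 for s in scores if s < cut)
    return 100.0 * failures / len(scores)

# Hypothetical percent scores for a small cohort
scores = [72.0, 81.5, 88.0, 90.5, 95.0, 97.5]
```

Evaluating the same score distribution against each candidate standard (e.g. `failure_rate(scores, 80.0)` vs. `failure_rate(scores, 90.0)`) makes the kind of trade-off summarized in Table 4 concrete.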

Discussion
This paper presents initial validity evidence for the use of the UCSOM CPE as an assessment of PE competence in medical students. In the results we detailed the validity evidence obtained from each of Messick's five sources. Here we interpret that evidence using Kane's validity framework, summarizing the inferences supported by CPE assessment scores, the overall validity argument for the use of the CPE as an assessment of PE competence in medical students, and the remaining evidence gaps. The argument follows a stepwise approach through each of the four inferences in Kane's validity framework: scoring, generalization, extrapolation, and implications. 10 The scoring inference (translating an observation into a score) was supported by expert review of UCSOM CPE items and ongoing quality assurance processes in the clinical performance center. Reliance on these quality assurance processes as evidence of response process requires that they occur annually to ensure acceptable rater performance and training. Formal evaluation of the interrater reliability of both the real-time quality checks and the video reviews of the borderline and failing students would further strengthen this inference. High overall means of PE performance across the assessments in the first two years of medical training suggest that students are able to perform recently learned PE skills in a clinical performance center assessment setting.
The generalization inference involves the extent to which a score on a given assessment is representative of performance in a testing setting. The overall phi coefficient of 0.258 for the G study suggests low reliability for a single assessment, and that sources of error not considered as facets in this G study contribute significantly to the variance in scores. The largest contributors to score variability were the person-occasion (5.5%) and person-item (5.4%) interactions, indicating that individual student performance varied by occasion (from Spring M1 to Spring M2) and across items. Learner-specific factors contributing to score differences between occasions may include learning or unlearning (decay) of PE skills from Spring of M1 to Spring of M2 and/or changes in the motivation of individual learners to prepare for the assessment. Rater-specific factors related to the scoring of specific items are another likely source of error contributing to variance. The low generalizability coefficient suggests that inferences about PE skills based on the UCSOM CPE alone should be made with caution, and that the UCSOM CPE in isolation should be used primarily as a formative assessment.
The extrapolation inference relates to using the score as a predictor of real-world performance.
Absent performance measures in clinical settings, the relationship of UCSOM CPE scores to other assessments of PE skills may provide some indication of the transfer of skills beyond the UCSOM CPE.
The correlations between the various UCSOM clinical skills assessments are similar to the correlations between cases of an OSCE, which have been shown to be in the range of 0.1 to 0.3 between stations. 11,12 These correlations between different PE assessments are consistent with case specificity, since each of the system-based assessments included different subsets of CPE and non-CPE items. Correlations between performance on these PE assessments and measures of PE skill during clerkships would strengthen this inference.
The relatively low M3 OSCE PE scores are striking, and consistent with low PE scores in other studies of end-of-M3 OSCEs. 13,14,15 The lower scores in M3 may be due to a decay in PE skills during clinical clerkships; alternatively, M3 students may know how to perform the maneuvers but struggle to select the appropriate PE items to perform in a given encounter, indicating the need for more practice in clinical reasoning. This could be addressed by adding a hypothesis-driven (i.e., HDPE) component to PE instruction, to provide practice in the use of the PE in the service of accurate diagnosis of the patient. 2

The implication inference (applying the score to inform a decision) was probed by exploring the impact of different passing standards. The consensus pass-fail cut scores and the normative cut scores were significantly lower than the cut score established using the modified Angoff procedure, which resulted in an unacceptably high failure rate. Our experience at UCSOM suggests that the Angoff 90% cut score may represent the PE competence of a student who is well prepared for entering supervised clinical practice in the clinical clerkships, rather than that of the targeted minimally competent or borderline student who is preparing for entry into a pre-clinical supervised preceptorship. Repeating the Angoff exercise after a more detailed discussion of the target student and the intended inference might correct this disjunction. In the meantime, the clinical skills course directors considered the 80% consensus pass-fail cut score for entry into supervised practice within a clinical preceptorship experience to be both defensible, because of the lack of high correlations of the CPE to other PE competence assessments, and practical, because of the costs involved in remediating large numbers of students.
Next steps in the evolution of the UCSOM PE curriculum towards a Core + Cluster curriculum include a transition away from body systems to specific PE clusters related to specific chief complaints, with a continuing emphasis on the UCSOM CPE. Additional assessments of PE performance should be considered in clerkship experiences. Composite reliability of PE assessments across the pre-clinical and clinical years may allow for high-stakes decisions to be made about PE performance. Programmatic assessment is an emerging approach in which multiple assessments over time may be combined to make high-stakes decisions about advancement and promotion. 16 Reasonable next steps for teaching and assessing PE competence in medical students using a programmatic approach include scholarly work on the development and incorporation of PE clusters within the Core + Cluster curricula at UCSOM, an expansion of the items included in the UCSOM CPE, and correlations to additional variables, such as performance in clerkships or on the USMLE Step 2 Clinical Skills assessment.

Limitations
Our study uses the UCSOM CPE as a representative exemplar of the Core Physical Exam approach to teaching and assessing the PE. As the UCSOM CPE is an institution-specific, 25-item version of the published 37-item CPE, the applicability and generalizability of these results to the full CPE and to other settings is limited. The low generalizability coefficient suggests that inferences about PE skills based on the UCSOM CPE should be made with caution.

Conclusion
This paper presents the initial argument for the use of the UCSOM CPE in the assessment of the PE skills of pre-clinical medical students. Initial validity evidence supports the use of the UCSOM CPE as an instructional strategy for teaching medical students physical examination skills and as a formative assessment of physical exam skills in readiness for precepted clinical experiences.