Generalizability of a Progress Clinical Skills Examination and the Assessment of the Growth in Clinical Skills in a Medical School Curriculum with Early Clinical Experiences



Background
This study evaluates the generalizability of an eight-station progress clinical skills examination and assesses the growth in performance for six clinical skills domains among first- and second-year medical students over four time points during the academic year.

Methods
We conducted a generalizability study for longitudinal and cross-sectional comparisons and assessed growth in six clinical skill domains via repeated measures ANOVA over the first and second year of medical school.

Results
The generalizability of the examination domain scores was low but consistent with previous studies of data gathering and communication skills. Variations in case difficulty across administrations of the examination made it difficult to assess longitudinal growth.
It was possible to compare students at different training levels and the interaction of level of training and growth. Second-year students outperformed first-year students, but first-year students' clinical skills performance grew faster than second-year students', narrowing the gap in clinical skills over the students' first year of medical school.

Conclusions
Case specificity limits the ability to assess longitudinal growth in clinical skills through progress testing. Providing students with early clinical skills training and authentic clinical experiences appears to result in the rapid growth of clinical skills during the first year of medical school.

Background
Progress testing uses broad-based examinations that are designed to assess end-of-curriculum objectives and are given at regular intervals over a course of study. Progress testing was first implemented in the 1970s and has grown to be widely used in medical schools and, to some extent, residency programs [1]. There is a considerable body of research suggesting that progress testing is particularly well suited to less structured, problem-focused curricula and that it encourages learners to study in ways that promote understanding rather than rote memorization [2]. Progress testing also provides a rich and systematic source of data for student feedback, program evaluation, and identifying students who need remediation.
Progress testing has almost exclusively been implemented in the form of written examinations and rarely as clinical skills examinations. We could identify only one systematic implementation of clinical skills progress testing, in an internal medicine residency program [3]. Many medical schools are now placing first-year students into authentic clinical settings, increasing the potential value of Progress Clinical Skills Examinations (PCSEs) for providing systematic feedback on the growth of students' clinical skills. Furthermore, in the USA, the expectations for medical school graduates entering residency training are being operationalized through the Core Entrustable Professional Activities (EPAs) for Entering Residency [4], which do not necessarily lend themselves to assessment via written examinations.
The changes in both the structure of undergraduate medical education and the expectations for graduates have increased the value of PCSEs as an integral part of medical school assessment and evaluation programs. Although there has been ample research on the psychometrics, acceptance, and impact of written progress testing [1], relatively little research has been done on PCSEs.
This study focuses on the psychometric and practical challenges of implementing a progress clinical skills program and on our early findings about the growth in clinical skills over the first two years of medical school in a curriculum that includes early clinical experiences and extensive clinical skills training. Specifically, the study estimates the generalizability of an eight-station PCSE for assessing longitudinal growth in clinical skills over the course of the curriculum, as well as for assessing cross-sectional differences in student performance at different levels of training. Secondly, the study assesses the impact of authentic clinical experiences and weekly clinical skills training in a simulation laboratory with faculty feedback on the growth of clinical skills over the first two years of medical school.

The standardized patients (SPs) used in the PCSE are trained to the specific PCSE cases they simulate. Both their portrayal of the case script and the accuracy of their completion of the checklist/rating forms are assessed by the simulation center staff before each PCSE is given, including measurements of inter-rater reliability. Adjustments to either the case or the SP's training are made when these quality assurance efforts identify a problem.
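The article does not specify which inter-rater statistic is used in these quality assurance checks. As a minimal sketch, assuming binary checklist items double-scored by the SP and a staff auditor, Cohen's kappa could be computed as follows (the function and the demo data are illustrative, not the school's actual procedure):

```python
import numpy as np

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    # Observed agreement: proportion of items where the raters match.
    p_o = np.mean(r1 == r2)
    # Expected agreement under independence, from each rater's marginal rates.
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical example: 12 binary checklist items scored by the SP and an auditor.
sp_scores    = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]
staff_scores = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
print(f"kappa = {cohen_kappa(sp_scores, staff_scores):.2f}")
```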

Subjects and Setting
Design - The PCSE is given as part of a broad-based progress assessment that also includes written examinations. These progress assessments occur twice each semester, for a total of 20 assessments over the course of the medical school curriculum. Third- and fourth-year students are assessed using the PCSE once each semester; depending on their rotational schedule, each third- and fourth-year student is assessed in either the first or the second PCSE given that semester. To pass in a semester, students must pass at least one of the two PCSEs given that semester with scores deemed appropriate for their level of training. Third- and fourth-year students who do not meet course-specific expectations for all skill areas on the PCSE take a make-up exam to demonstrate their competency.
Since students in all four years of training take the same PCSE at roughly the same time, we can potentially observe growth in clinical skills both longitudinally, over the course of each student's medical training, and cross-sectionally, between students at different levels of training taking the same PCSE. The SP cases for each PCSE are drawn from a pool of cases that is continually being developed. SP cases will eventually be reused, but only after the students who were originally assessed with a case have graduated. As a result, students will not encounter cases from a previous PCSE in which they were evaluated.
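As an illustration of this reuse rule (the data structures and field names here are hypothetical, not the school's actual case-bank system), eligibility filtering might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    case_id: str
    # Graduation years of classes that have already been assessed with this case
    # (hypothetical bookkeeping; the actual pool structure is not described).
    exposed_class_years: set[int] = field(default_factory=set)

def eligible_cases(pool: list[Case], enrolled_class_years: set[int]) -> list[Case]:
    """A case may be reused only after every previously exposed class has
    graduated, i.e., no currently enrolled class has seen it before."""
    return [c for c in pool if not (c.exposed_class_years & enrolled_class_years)]

pool = [Case("chest-pain-01", {2019}), Case("abd-pain-02"), Case("cough-03", {2021})]
print([c.case_id for c in eligible_cases(pool, enrolled_class_years={2021, 2022})])
# -> ['chest-pain-01', 'abd-pain-02']  (cough-03 excluded: class of 2021 still enrolled)
```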
As noted, third- and fourth-year students take a single PCSE each semester, with a portion of the students taking the first administration of the PCSE in a given semester and the remainder the second. Given this complication, we chose to focus on first- and second-year student performance for this study. During fall semester 2017 and spring semester 2018, four PCSEs were conducted as part of the SDC progress assessment. Second-year students from the first matriculating class in the SDC and first-year students from the second matriculating class completed the assessments. The scores from these four administrations of the PCSE for the two classes of students were used to assess growth in the students' clinical skills during the first two years of the curriculum and the psychometric characteristics of the PCSE.
Generalizability Study - We conducted a generalizability analysis [8] of the PCSE domain scores separately for first- and second-year students. We considered standardized patient cases as the only facet in the universe of admissible observations, resulting in a student-by-SP-case ANOVA design for estimating the variance components used in the generalizability study.
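The paper's variance components were estimated with GENOVA. For readers who want to see the mechanics, a minimal sketch of a crossed persons-by-cases (p x c) G-study follows; the function and the demo scores are ours, not the study data:

```python
import numpy as np

def g_study_pxc(scores, n_cases_decision=None):
    """Variance components and generalizability coefficients for a fully
    crossed persons x cases design with one observation per cell.
    scores: (n_persons, n_cases) array of case-level points."""
    X = np.asarray(scores, dtype=float)
    n_p, n_c = X.shape
    n_prime = n_cases_decision or n_c            # cases per form in the D-study
    grand = X.mean()
    ss_p = n_c * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_c = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_p - ss_c
    ms_p, ms_c = ss_p / (n_p - 1), ss_c / (n_c - 1)
    ms_res = ss_res / ((n_p - 1) * (n_c - 1))
    # Expected-mean-square solutions for the variance components.
    var_pc = ms_res                              # person x case interaction + error
    var_p = max((ms_p - ms_res) / n_c, 0.0)      # true-score (person) variance
    var_c = max((ms_c - ms_res) / n_p, 0.0)      # case-difficulty variance
    # Relative (norm-referenced) coefficient: everyone sees the same cases,
    # as in the cross-sectional comparisons.
    e_rho2 = var_p / (var_p + var_pc / n_prime)
    # Absolute (domain-referenced) coefficient: case difficulty enters the error,
    # as in longitudinal comparisons across different case sets.
    phi = var_p / (var_p + (var_c + var_pc) / n_prime)
    return {"var_p": var_p, "var_c": var_c, "var_pc": var_pc,
            "E_rho2": e_rho2, "phi": phi}

# Hypothetical data: 5 students x 8 SP cases (points per case).
rng = np.random.default_rng(0)
print(g_study_pxc(rng.normal(10, 2, size=(5, 8))))
```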
As noted above, we are interested both in cross-sectional comparisons of first- and second-year students' performance and in the longitudinal growth of students' performance across multiple administrations of the PCSE. These two types of comparisons have different generalizability coefficients and standard errors of measurement [9]. In the cross-sectional comparisons, the students at each level of training are assessed on the same eight SP cases, and the error variance is equivalent to the error variance as defined in classical test theory [10]. When making longitudinal comparisons of the same students over multiple examinations, the comparisons are based on different SP cases that are not perfectly parallel. As such, longitudinal comparisons include an additional source of error from the variation among cases and therefore have lower generalizability and larger standard errors of measurement than cross-sectional comparisons. The difference between these two types of measurement is often referred to as "norm-referenced" versus "domain-referenced" measurement [11].

We used GENOVA to conduct the generalizability analyses [12]. As noted, PCSE scores are reported as the percentage of possible points a student achieves in a domain across all eight cases. Since the generalizability analysis is based on case-level data, we conducted it on the number of points achieved for each case. Because the generalizability coefficients are ratios of expected variance components, this difference in metric did not affect the coefficients; it did, however, affect the standard error of measurement reported by GENOVA. To avoid this problem, we calculated standard errors of measurement from the observed standard deviation of the domain scores and the generalizability coefficients using a formula provided by Magnusson [10]. We conducted the analysis separately for first- and second-year students. Since there was no easy way to combine the estimated variance components from multiple administrations of the PCSE, we conducted the generalizability analysis on a single administration, using the data from the first administration of the PCSE given in spring semester 2018.

Results
Table 1 presents the means and standard deviations for each of the six domains in each class over the four administrations of the PCSE. Figure 1 presents the mean performance for each class across the four administrations as graphs. Table 2 provides a summary of the generalizability coefficients and estimated standard errors of measurement for cross-sectional and longitudinal comparisons.
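The standard errors of measurement in Table 2 were derived, as described in the methods, from the observed domain-score standard deviations and the generalizability coefficients. The classical relation involved (our notation, not the paper's) is:

```latex
\[
  \mathrm{SEM} \;=\; s_X \sqrt{1 - \rho}
\]
```

where s_X is the observed standard deviation of the domain scores and rho is the generalizability coefficient appropriate to the comparison (cross-sectional or longitudinal).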

Repeated Measures Analysis -
In the repeated measures analysis, the main effect for medical school year was significant (p < 0.001). As can be seen in Table 1 and Figure 1, the second-year students outperformed the first-year students in all six domains. The main effects for administration (linear, quadratic, and cubic) were also significant (p < 0.001); case difficulty appears to have had a significant impact on the change in scores from administration to administration. There was a statistically significant (p < 0.001) interaction between year and linear change over administrations for all six domains, and the quadratic and cubic components were also significant for the history domain (p < 0.001). As can be seen in Figure 1, the gap in performance between the first- and second-year students narrowed across the four administrations for all six domains.

Discussion
PCSEs offer many of the benefits of written progress testing while assessing skills that cannot be measured via written examinations. Unfortunately, measurement error is a significant challenge when implementing a PCSE program, particularly for assessing longitudinal growth. We found the generalizability of the PCSE domain scores, as our examination is currently configured, to be lower than is generally acceptable for high-stakes examinations. Physical examination and the post-encounter stations had the highest generalizability coefficients, at around 0.50 for cross-sectional comparisons. The generalizability coefficients were lowest for second-year students in the patient interaction domain and for both first- and second-year students in the safety domain.
The low generalizability coefficients in the patient interaction domain for second-year students appear to be due mainly to a ceiling effect. Students in the SDC have largely mastered these skills by the end of the first year of the curriculum; at that point, students on average achieve over 90% of the possible points in this domain, and the variability that remains in the scores appears to be mostly error variance. This, in and of itself, is not necessarily a problem. It means we cannot easily differentiate among second-year students in their ability to communicate with patients because, for the most part, they have mastered this skill domain, and what little difference there is in the scores does not replicate from case to case.
There is general agreement that ensuring patient safety and avoiding potential medical errors is a very important focus of medical training. These skills often cannot be adequately assessed through written examinations, and we believe our PCSE is the first clinical skills examination to break them out as a specific domain that is scored.

Measurement error in classical test theory is assumed to be random with an expectation of zero [10]. While this source of measurement error remains a significant problem for assessing the performance of individual students using the PCSE, it is less of a problem for assessing groups of students in research and evaluation, since the error in individual student scores tends to cancel out when averaged over multiple students. This is not the case for the error associated with measuring longitudinal growth over multiple PCSE administrations using different sets of cases: the difficulty of the cases in different administrations is confounded with growth in student performance, making it difficult to assess longitudinal growth in clinical skills. This is unfortunate, because assessing longitudinal growth is one of the important benefits of progress testing.
The interaction between longitudinal growth and level of training is not directly subject to this type of measurement error. The repeated measures analysis demonstrated a statistically significant interaction between level of training and linear growth in all six domains: as can be seen in Table 1 and Figure 1, first-year students improved faster than second-year students, narrowing the gap between the two classes across the four administrations.
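The paper does not state the software or computational details behind these contrasts. One minimal way to test a year-by-linear-growth interaction, assuming four equally spaced administrations, is to score each student's linear trend with an orthogonal polynomial contrast and compare the two classes; all data and names below are hypothetical:

```python
import numpy as np
from scipy import stats

# Orthogonal linear contrast for four equally spaced administrations.
LINEAR = np.array([-3.0, -1.0, 1.0, 3.0])

def linear_trend_scores(scores):
    """scores: (n_students, 4) domain scores over the four administrations.
    Returns one linear-trend score per student."""
    return np.asarray(scores, dtype=float) @ LINEAR

# Hypothetical domain scores (% of points) for the two classes.
rng = np.random.default_rng(1)
year1 = rng.normal([55, 62, 68, 74], 6, size=(50, 4))   # first-years: steep growth
year2 = rng.normal([72, 74, 76, 78], 6, size=(50, 4))   # second-years: flatter growth

# Year x linear-growth interaction: do the classes differ in linear trend?
t, p = stats.ttest_ind(linear_trend_scores(year1), linear_trend_scores(year2))
print(f"t = {t:.2f}, p = {p:.4f}")
```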

Conclusions
PCSEs provide a standardized methodology for assessing important clinical skills that cannot be evaluated by written examinations. As in previous research on clinical skills examinations, we found that case specificity and random measurement error are major impediments to using PCSE scores to make high-stakes decisions about student competency. While the situation is not ideal, this limitation can be addressed by focusing on pass/fail decisions based on rigorous standard setting and by using sequential testing to help ensure the accuracy of decisions about student competency.
One of the advantages of progress testing is the ability to assess longitudinal growth in knowledge and skills over a course of training. Unfortunately, the variability in case difficulty among different administrations of the PCSE limited our ability to assess longitudinal growth. Over time, as we develop a pool of cases that have been used previously in the PCSE, we hope to be able to use the data from previous administrations to balance case difficulty across administrations and be in a better position to assess longitudinal growth in clinical skills.
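As a sketch of one way such balancing could work (everything here, including the difficulty metric and function names, is an assumption rather than the school's procedure), a greedy heuristic can assign cases to administrations so that mean historical difficulty stays similar across forms:

```python
def balance_forms(case_difficulty, n_forms, cases_per_form):
    """Greedy assignment of cases to forms so that each form's mean historical
    difficulty (e.g., mean % of points earned in past administrations) is
    similar. case_difficulty: {case_id: difficulty estimate}."""
    forms = [[] for _ in range(n_forms)]
    totals = [0.0] * n_forms
    # Place cases with the largest difficulty values first, always giving the
    # next case to the form with the lowest running total that still has room.
    for case, diff in sorted(case_difficulty.items(), key=lambda kv: -kv[1]):
        open_forms = [i for i in range(n_forms) if len(forms[i]) < cases_per_form]
        i = min(open_forms, key=lambda i: totals[i])
        forms[i].append(case)
        totals[i] += diff

    return forms

difficulty = {"case-a": 0.71, "case-b": 0.62, "case-c": 0.55, "case-d": 0.48}
print(balance_forms(difficulty, n_forms=2, cases_per_form=2))
# -> [['case-a', 'case-d'], ['case-b', 'case-c']]  (mean difficulties 0.595, 0.585)
```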
While measurement error associated with using different cases makes it difficult to directly assess longitudinal growth in clinical skills, we did find that first-year students in the SDC gain basic clinical skills rapidly, narrowing the gap in skill level with their second-year counterparts over the course of the first year.

Consent for publication - Not relevant for this paper.
Availability of data and material -The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Competing interests -The authors declare that they have no competing interests.

Funding -
The authors received no financial support for the research, authorship, and/or publication of this paper.
Authors' contributions - HF and ME oversee the administration of the PCSE, case development, and scoring. DS performed the analyses and wrote the original draft. CC and LW provided advice on the psychometric and statistical approaches. All authors provided advice and support to the project, and edited and approved the manuscript.