Internal Structure
Model Fit
The 34 items showed an acceptable fit to the Rasch Rating Scale Model, based on Linacre's (13) guidelines. All items had an infit mean-square statistic between .79 and 1.49 (M = 1.03, SD = .15), and 32 had an outfit mean-square statistic between .75 and 1.43 (M = 1.07, SD = .29), with two items exceeding 1.50: Items 11 and 6 had outfit mean-square values of 1.59 and 1.93, respectively. We nevertheless decided to retain both items for two reasons. First, removing them would negatively affect content validity, as these are the 34 items retained from a larger set of competencies to better represent the CanMEDS-FM framework (9). Second, items with infit or outfit mean-square statistics between 1.5 and 2.0 are considered "unproductive for construction of measurement, but not degrading" (13). Infit and outfit mean-square statistics for persons had means of 0.97 (SD = .42) and .98 (SD = 1.16), respectively. Of the 1,432 persons observed, 43 (3%) had a statistically significant infit or outfit value at the .01 level of significance (i.e., an absolute standardized value greater than 2.58) and were removed from subsequent analyses. Upon removal, mean item and person fit statistics improved slightly: item infit and outfit mean-square values were thereafter 1.01 (SD = 0.12) and 1.00 (SD = 0.38), respectively, while person infit and outfit mean-square values were 0.98 (SD = 0.36) and 0.90 (SD = 0.89).
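The 2.58 cutoff used above is the two-tailed critical z value at α = .01, which can be verified with Python's standard library:

```python
from statistics import NormalDist

# Two-tailed test at alpha = .01: each tail holds alpha/2 = .005,
# so the critical value is the standard normal quantile at .995.
alpha = 0.01
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))  # 2.58
```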
Rating Scale Functioning
Option characteristic curves are illustrated in Fig. 1. Analysis of the rating scale structure was carried out using Linacre's (14) eight guidelines, summarized in Table 1. Guidelines 1, 3, 4, 5, and 7 were met, while guidelines 2, 6, and 8 were not. Violation of the second guideline (regular observation distribution) reflected the fact that only 0.2% of the observations received the lowest rating (1 = Direct supervision), while the majority (85.7%) received the highest rating (3 = Independent). Regarding the sixth guideline (ratings imply measures, and measures imply ratings), the low congruence between ratings and measures concerned only the lowest rating (option 1), whose estimate therefore relied on only 54 observations. Violation of the eighth guideline (step difficulties advance by less than 5.0 logits) implies large gaps on the latent variable between rating options and therefore reduced measurement precision.
Table 1
Analysis of the rating scale structure using Linacre’s (14) eight guidelines
Linacre’s (2004) guidelines

Result

1. At least 10 observations of each category

There were at least 10 observations per response option (54 for option 1; 3,615 for option 2; and 22,023 for option 3).

2. Regular observation distribution

Distribution of observations across response options was irregular: option 3 was by far the most frequent, followed by option 2, while option 1 was seldom chosen.

3. Average measures advance monotonically with category

Average ability estimates advanced monotonically with options, from −1.10 logits (option 1) to 2.77 logits (option 2) and 6.59 logits (option 3).

4. Outfit mean-squares less than 2.0

Infit and outfit indices were acceptable, all between .99 and 1.30.

5. Step calibrations advance

Step calibrations advanced, indicating no disordered thresholds. The step between options 1 and 2 was estimated at −3.61 logits, and the step between options 2 and 3 at 3.61 logits.

6. Ratings imply measures, and measures imply ratings

Congruence between measures and ratings, and between ratings and measures, was generally good, varying between 66% and 93% for options 2 and 3. For option 1, the congruence between measures and ratings was acceptable at 55%, but the congruence between ratings and measures was only 11%.

7. Step difficulties advance by at least 1.4 logits

The distance of 7.22 logits between the two steps was larger than 1.4 logits.

8. Step difficulties advance by less than 5.0 logits

The distance of 7.22 logits between the two steps exceeded 5.0 logits, so this guideline was not met.
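Under the Rating Scale Model these guidelines examine, category probabilities are determined by person ability, item difficulty, and the shared step calibrations. A minimal sketch, assuming thresholds of −3.61 and +3.61 logits (consistent with the reported 7.22-logit step distance) and an illustrative item difficulty of 0:

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Rating Scale Model category probabilities for one item.

    theta: person ability (logits); delta: item difficulty (logits);
    taus: step calibrations shared across items (logits).
    Returns probabilities for ratings 1 .. len(taus) + 1.
    """
    # Cumulative logits: the lowest category is the reference (0);
    # each step adds (theta - delta - tau_k).
    psi = [0.0]
    for tau in taus:
        psi.append(psi[-1] + (theta - delta - tau))
    denom = sum(math.exp(p) for p in psi)
    return [math.exp(p) / denom for p in psi]

# Assumed thresholds; delta = 0 is for illustration only.
probs = rsm_category_probs(theta=0.0, delta=0.0, taus=[-3.61, 3.61])
print([round(p, 2) for p in probs])  # [0.03, 0.95, 0.03]
```

With such widely spaced steps, the middle rating dominates over a broad ability range, which is exactly what the violated eighth guideline flags.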

Dimensionality and Local Independence
A principal component analysis of residuals showed that the first dimension had an Eigenvalue of 33.3 and explained 49.5% of score variability. The second dimension had an Eigenvalue of 1.9 and explained 2.8% of score variability. Since the second dimension had a strength of less than two items, the structure of the DBSFM was considered unidimensional. Regarding local independence, the largest standardized residual correlation between items was .48 (between items 1 and 2), indicating that the maximum amount of shared variance between two items was 23%. Items were therefore considered locally independent.
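The 23% figure follows directly from squaring the largest residual correlation:

```python
# Shared variance between two items equals the squared residual correlation.
r = 0.48
print(f"{r ** 2:.0%}")  # 23%
```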
Differential Item Functioning
We tested the invariance of the measurement scale between year 1 and year 2 observations by investigating the presence of differential item functioning (DIF) based on residency level (year 1 versus year 2) using Welch's t-test. Because this analysis involved 34 tests (one per item), a Bonferroni correction was applied to guard against the inflation of Type I error; the alpha level of statistical significance was therefore set at .05/34 ≈ .001. Two items (21 and 22) showed significant DIF, both being easier for year 2 residents. The Item 21 (Clinical expertise – Technical gestures) parameter estimate was 3.05 logits for year 1 residents and 1.84 logits for year 2 residents, an estimated difference of 1.22 logits. The Item 22 (Clinical expertise – Investigation and treatment) parameter estimate was 2.38 logits for year 1 residents and 1.39 logits for year 2 residents, an estimated difference of .98 logits. To test the impact of this DIF on ability estimates, we correlated resident ability estimates computed with and without these two items; the correlation between the two score sets was .99.
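A minimal sketch of the two computations involved, the Bonferroni-corrected alpha and Welch's t statistic with its Welch–Satterthwaite degrees of freedom, working from summary statistics (the SDs passed below are hypothetical, not the study's):

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite df from summary stats."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Bonferroni-corrected alpha for 34 item-level tests.
alpha = 0.05 / 34
print(f"{alpha:.3f}")  # 0.001

# Hypothetical call: item parameter estimates 3.05 vs 1.84 logits,
# with assumed SDs of 1.0 and the two cohorts' assessment counts.
t, df = welch_t(3.05, 1.0, 803, 1.84, 1.0, 629)
```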
Reliability of CASs
The reliability of residents' CASs was estimated at .83 for observations without an extreme score (n = 752; i.e., ability parameter of 7.00 logits or lower), and at .66 (n = 1,389) when the 637 observations with an extreme score were included. As can be seen in Fig. 2, the extreme scores, especially those at the top of the scale, have the highest standard errors or, in other words, the lowest measurement precision. Classical reliability estimates (Cronbach's alpha) for the subsets of items used in the different clinical rotations were between .76 and .93.
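The classical estimates follow the standard Cronbach's alpha formula; a self-contained sketch with toy ratings (hypothetical data, not the study's):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha from a list of respondents' item-score lists."""
    k = len(scores[0])  # number of items

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: 5 respondents rating 3 items on the 1-3 scale.
data = [[1, 2, 2], [2, 3, 3], [3, 3, 3], [1, 1, 2], [2, 2, 3]]
print(round(cronbach_alpha(data), 2))  # 0.91
```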
Item Targeting
Residents’ ability parameters ranged from −4.33 to 9.45 logits (M = 6.34, SD = 2.43). More precisely, as illustrated in Fig. 3, ability parameters ranged from −4.33 to 9.45 logits (M = 4.89, SD = 2.46) for year 1 residents (n = 803 assessments), and from −0.09 to 9.45 logits (M = 7.75, SD = 1.85) for year 2 residents (n = 629 assessments). In comparison, difficulty parameters for the 34 items of the DBSFM ranged from −4.24 to 2.72 logits (M = 0.00, SD = 1.79). The Wright map (Fig. 4) shows the location of the candidates (“person” column) and items (“measure” column) relative to each other on the latent variable. The “BOTTOM P = 50%” column shows the Rasch-Thurstone thresholds for the lowest rating (option 1) on each item, where the probability of being rated “1” or higher is 50%. The “TOP P = 50%” column shows the Rasch-Thurstone thresholds for the highest rating (option 3) on each item, where the probability of being rated 3 or below is 50%. The distance between the bottom and top Rasch-Thurstone thresholds is the operational range of the scale, in other words the latent variable range over which the scale can discriminate between competency levels, i.e., between approximately −8.00 and 7.00 logits. The scale therefore cannot discriminate between the highest-scoring residents, located between 7.00 and 9.45 logits. Of the 803 year 1 assessments, 232 were higher than 7.00 logits, as were 489 of the 629 year 2 assessments; year 1 thus accounted for 32% and year 2 for 68% of these 721 assessments beyond the operational range.
Item Hierarchy
The expected item hierarchy corresponded to the ordering of competencies by time of expected achievement by the 28 experts in the last phase of the Delphi study (9). This ordering was highly reliable, both the generalizability coefficient (15) and the dependability index (16) being .91. The empirical item hierarchy estimate was also reliable (Rasch item reliability = .99). The correlation between the expected item hierarchy according to the experts and the empirical item hierarchy estimated by the Rasch item difficulty parameters was .78, p < .0001.
Global Score Responsiveness
Figure 5 shows the average CAS on the DBSFM, with 95% confidence intervals, for the 26 periods of the residency program. The average CAS was .71 (SD = .18) for year 1 residents (clinical rotations 1 to 13) and .83 (SD = .10) for year 2 residents (clinical rotations 14 to 26). A paired-sample t-test showed that the difference between the average CAS for year 2 and year 1 residents was statistically significant, t(94) = 7.52, p < .0001. Using the Rasch ability parameters rather than the CASs yielded similar results, t(1427.6) = 25.00, p < .0001.
However, the difference between the two years was smaller than expected. The expected CAS (Fig. 6) for the first year of residency varied between .23 and .49 for an average student, much lower than the observed CAS, which varied between .59 and .74. The expected CAS for year 2 residents varied between .73 and .91, comparable to the observed CAS, which varied from .74 to .94.