Following all item development activities, the authors arrived at a field-test-ready item pool of 35 items. The item context for all items is “Since starting medical school,” with response options 0=Never, 1=Rarely, 2=Sometimes, 3=Often, and 4=Always.
Calibration Testing Results
In total, 348 medical students completed the survey (M1=144 (41%), M2=145 (42%), M3=26 (7%), M4=33 (9%)), of which 175 (50%) were male and 173 (50%) were female. In terms of ethnicity, 197 (57%) responded white/Caucasian, 155 (45%) responded non-Caucasian, and 2 (1%) did not respond.
As is typical with factor enumeration, the indices did not provide clear support for the same number of factors. The AIC and BIC favored a 3-factor solution (1-factor: AIC 95% CI 29733-29734, BIC 30403-30405; 2-factor: AIC 29284-29286, BIC 30085-30087; 3-factor: AIC 29124-29126, BIC 30053-30055), while the MAP slightly favored the 2-factor solution (1-factor 0.014; 2-factor 0.010; 3-factor 0.011). As the 3-factor model was more consistent with the theoretical model, the authors selected it and labeled the factors: 1) Social Challenges; 2) High Activation; and 3) Low Activation. The Social Challenges factor comprised items such as difficulty asking for help, feeling unsupported by faculty and peers, feeling taken advantage of by faculty, and feeling pressure to get good grades. The High Activation factor comprised items such as feeling anxious, being unable to relax, being overly self-critical, and feeling overwhelmed. Finally, the Low Activation factor comprised items such as feeling hopeless, feeling depressed, having difficulty motivating oneself, and feeling like dropping out of school. See Table 2 below for item coefficients by factor. Note that coefficients >.30 generally characterize each respective factor; however, in the case of cross-loadings (due to conceptual overlap), the item is assigned to the factor with the higher coefficient.
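The information-criterion comparison above can be sketched numerically. The following is a minimal illustration of how AIC and BIC trade off fit against parsimony; the log-likelihoods and parameter counts below are hypothetical, not the fitted values reported here:

```python
import numpy as np

def aic_bic(log_likelihood, n_params, n_obs):
    """Akaike and Bayesian information criteria; lower values indicate a
    better trade-off between model fit and parsimony."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * np.log(n_obs) - 2 * log_likelihood
    return aic, bic

# Hypothetical values for 1-, 2-, and 3-factor models (illustrative only)
candidates = {1: (-14761.0, 105), 2: (-14432.0, 139), 3: (-14287.0, 172)}
for k, (loglik, p) in candidates.items():
    aic, bic = aic_bic(loglik, p, n_obs=348)
    print(f"{k}-factor: AIC={aic:.0f}, BIC={bic:.0f}")
```

Because BIC penalizes each additional parameter by log(n) rather than 2, it can favor a smaller model than AIC, which is one reason such indices need not agree on the number of factors.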
The authors removed two items that did not load well on any factor (stress about finances and exercising less), as well as two items with negative cross-loadings (pressure to get good grades and need to be perfect). Other cross-loading items were retained when they were conceptually and clinically relevant for content validity. It is important to note that while a 3-factor structure is plausible, the factors are not necessarily conceptually “separate,” and a single scale could still yield a precise score encompassing all three.
[INSERT Table 2 HERE]
The authors proceeded with testing a 3-factor solution in a restricted, hierarchical CFA model, allowing cross-loading items to co-load between High and Low Activation. The High and Low Activation factors came to represent locally dependent doublets or triplets (e.g., alcohol & drugs: msss9 & msss10; pressure: msss26, msss27, & msss28; dropping out: msss4, msss34, & msss35; and feeling unmotivated: msss14 & msss25), rather than two distinct factors. Based on content relevance and available item information, the authors removed one item from each locally dependent pair, one to two items from each locally dependent triplet, and items with poor remaining relationships. The optimal final model was a bi-factor model with a general factor representing stress/burnout and a specific factor with six items (msss2, msss19, msss20, msss21, msss22, msss24) representing both general stress/burnout and social challenges. This is graphically represented in the supplementary online material (Figure-SF2).
Item Calibration Using Item-Response Theory Modeling
The retained 22 items underwent item response theory (IRT) bi-factor calibration, which provides specific information about each item’s discriminability and performance along a severity continuum from mild to severe. Marginalizing out the social challenges factor to emphasize stress/burnout, the primary factor provided item slopes (i.e., how discriminating each item is), item thresholds (i.e., the level of the latent trait at which a person endorses a specific response category), marginal item characteristic curves (i.e., a visual depiction of each item’s discrimination between response categories and how informative the item is across the continuum), and a test information function (i.e., how informative and precise the entire set of items is across the continuum of the latent trait) (18). The IRT parameters, marginal item characteristic curves, and marginal information plots are available as online supplementary material (Table-ST1).
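For ordinal items like these, slopes and thresholds are typically combined under a graded response model, in which the probability of each response category at a given trait level is the difference between adjacent cumulative logistic curves. A minimal sketch with hypothetical item parameters (not the calibrated MSSS values):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Graded response model: probability of each response category.

    theta: latent trait value(s); a: item slope (discrimination);
    b: ordered thresholds, one per category boundary (4 here, for
    the five response options 0=Never ... 4=Always).
    """
    theta = np.atleast_1d(theta)
    # Cumulative probability of responding at or above each boundary
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(b)[None, :])))
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    # Category probabilities are differences of adjacent cumulative curves
    return np.hstack([ones, p_star]) - np.hstack([p_star, zeros])

# Hypothetical parameters for one item (illustrative only)
probs = grm_category_probs(theta=0.0, a=1.8, b=[-1.5, -0.5, 0.6, 1.7])
print(probs.round(3))  # one probability per response option; rows sum to 1
```

A steeper slope a concentrates these category curves, which is what makes an item more discriminating at its thresholds.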
Figure 1 below illustrates how well the MSSS estimates a respondent’s latent trait of medical student stress over the whole range of scores. Because the test information function is the sum of the individual item information functions, it is much higher than any single item’s information function, and the full scale therefore measures the latent trait more precisely than any single item. The MSSS is a reliable scale (Cronbach’s α=0.89; ω-total=0.94; ω-hierarchical=0.91), covers a wide range of medical student stress, and declines in precision (i.e., reliability) only toward the very extremes of stress.
[INSERT FIGURE 1 HERE]
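The link between test information and conditional precision can be made explicit: the standard error of measurement at trait level θ is 1/√I(θ), so reliability at that level is I/(I+1) when the trait is scaled to unit variance. A small sketch with hypothetical information values:

```python
import numpy as np

def conditional_reliability(test_information):
    """IRT conditional reliability at each trait level: rel = I / (I + 1),
    equivalent to 1 - SE^2 where SE = 1 / sqrt(I)."""
    info = np.asarray(test_information, dtype=float)
    return info / (info + 1.0)

# Hypothetical test information across the stress continuum (illustrative):
# information peaks mid-range and drops toward the extremes, as in Figure 1.
info = np.array([4.0, 9.0, 12.0, 9.0, 3.0])
print(conditional_reliability(info).round(2))
```

This is why precision declines only at the extremes: information there is lower, so the standard error grows and conditional reliability falls.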
Administering and Scoring the MSSS-22
See Figure 2 for the MSSS-22 instructions, recall period, response options, and items.
[INSERT FIGURE 2 HERE]
Items from the MSSS-22 may be summed into a total score and converted into a T score (mean 50, standard deviation 10) using the conversion table available as online supplementary material (Table-ST2).
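The T metric itself is simply a linear rescaling of a standardized trait estimate; the actual raw-sum-to-T mapping comes from the conversion table (Table-ST2). A sketch of the rescaling:

```python
def theta_to_t(theta: float) -> float:
    """Rescale a standardized trait estimate (mean 0, SD 1) to the
    T metric used by the MSSS-22 (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta

print(theta_to_t(0.0))   # average stress -> T = 50
print(theta_to_t(1.5))   # 1.5 SD above average -> T = 65
```

In practice, scores should be obtained from the published lookup table rather than this formula, since the table reflects the calibrated IRT model rather than a simple linear transform of the raw sum.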
Validity Evidence with External Validity Measures
Convergent validity was established with moderately high associations with the Burnout Scale Short Version (r=0.800, p<.01), PROMIS Anxiety (r=0.672, p<.01), PSS-4 (r=0.739, p<.01), and a stress visual analog scale (r=0.641, p<.01) (see Table 3). Criterion-related validity was established with small inverse associations with self-reported regularity of exercise (r= -0.261, p<.01) and hours of sleep on average (r= -0.237, p<.01).
[INSERT TABLE 3 HERE]
Known-groups validity was established with statistically significant differences in MSSS scores between M1s and M2s, and between male and female medical students (Table 4). Given the more complete samples for M1s and M2s, the known-groups analyses focused on these cohorts rather than on the M3s and M4s, whose samples were less complete. Other validity measures, such as the Burnout Measure Short Version and the PSS-4, were unable to detect a significant difference between M1s and M2s; however, the Burnout Measure did demonstrate a significant difference between male and female students (Table 4).
[INSERT TABLE 4 HERE]
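A known-groups comparison of this kind reduces to an independent-samples t-test on the two cohorts' scores. A minimal sketch using simulated (not actual) T scores with the reported cohort sizes; the group means below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated MSSS-22 T scores for two hypothetical cohorts (illustrative
# only; not the study's data): M1s drawn slightly higher than M2s.
m1 = rng.normal(loc=53.0, scale=10.0, size=144)
m2 = rng.normal(loc=49.0, scale=10.0, size=145)

t_stat, p_value = stats.ttest_ind(m1, m2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A significant difference in the hypothesized direction (here, between cohorts) is what supports the known-groups claim; a measure that cannot separate the groups, as with the comparison measures above, fails this test.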