This study sought to establish evidence for the construct validity and reliability of the psychological and social modules of the PPLA-Q in grade 10 to 12 adolescents (15–18 years) by investigating their dimensionality, measurement invariance, and reliability (total-score and test–retest).
Dimensionality
We used Mokken Scale Analysis (MSA) to gather evidence on the dimensionality of each of the eight scales composing the psychological and social modules of the PPLA-Q. Most local dependencies occurred within items initially designed for the same difficulty level (i.e., Foundation or Mastery) and within the same specific trait, with similar wording (e.g., P9 and P11, which target the same motivational regulation). This was expected, since scale development ensured a desirable degree of redundancy (DeVellis, 2017).
All eight scales, after removal of offending items, adhered to the assumptions of the MHM (scalability, local independence, and monotonicity), with total-scale scalability coefficient estimates (H) ranging from .46 to .62 – thus evaluating as moderate to strong scales. These values support the convergent validity (at item level) of each scale (Sijtsma et al., 2011). Sum scores of the items in these scales can therefore be considered a sufficient indicator of each individual's position on the latent trait (Wind, 2017).
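To make the total-scale scalability coefficient concrete, the sketch below computes H from an item-score matrix using its covariance-ratio definition (the sum of inter-item covariances over the sum of the maximum covariances attainable under the observed item marginals). This is a hypothetical numpy re-implementation for illustration only; analyses of this kind are typically run with the R mokken package, and this sketch omits standard errors and significance tests.

```python
import numpy as np

def scalability_H(X):
    """Total-scale Loevinger H: the sum of all inter-item covariances
    divided by the sum of their maxima given the item marginals.
    Assumes no item is constant across respondents."""
    n, k = X.shape
    num = den = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            xi, xj = X[:, i].astype(float), X[:, j].astype(float)
            cov = np.mean(xi * xj) - xi.mean() * xj.mean()
            # Sorting both margins ascending (comonotone coupling) yields the
            # largest covariance the observed marginals allow.
            cov_max = np.mean(np.sort(xi) * np.sort(xj)) - xi.mean() * xj.mean()
            num += cov
            den += cov_max
    return num / den

# A perfect Guttman pattern scales maximally (H = 1)
guttman = np.array([[0, 0, 0],
                    [1, 0, 0],
                    [1, 1, 0],
                    [1, 1, 1]])
print(scalability_H(guttman))  # 1.0
```

With fully independent items the numerator covariances vanish and H falls to 0, which is why the .46–.62 range reported above reads as moderate-to-strong scalability.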
For all eight scales, the additional invariant item ordering (IIO) assumption held – assessed through the method of Manifest IIO (Ligtvoet et al., 2010) – and they thus adhered to the DMM. This evidence supports the interpretation that an invariant ordering of item difficulty can be established, for all students, across different ranges of development in the respective constructs (Wind, 2017), as warranted in the initial development of these scales. However, four of the scales (Confidence, Emotional Regulation, Collaboration, and Relationships) had an HT coefficient lower than .30, meaning that their IRFs are too close together and that respondents might find it difficult to distinguish between neighboring items in terms of difficulty (Sijtsma et al., 2011). Although these scales still provide an overall valid assessment of the position of a student (and of the items) on a difficulty continuum, no specific use of this ordering (e.g., application of the scales from an estimated difficulty point onward) is recommended for these four scales.
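The core idea of a manifest-IIO check can be sketched as follows: order the items from easiest (highest overall mean) to hardest, split respondents into rest-score groups, and flag any item pair whose conditional means reverse that order by more than a tolerance (minvi). This is a deliberately simplified, hypothetical sketch; the actual procedure of Ligtvoet et al. additionally merges small groups and tests violations for statistical significance.

```python
import numpy as np

def iio_violations(X, n_groups=3, minvi=0.03):
    """Count manifest-IIO violations: with items ordered from easiest
    (highest overall mean) to hardest, conditional item means within
    rest-score groups should preserve that same order."""
    X = X[:, np.argsort(-X.mean(axis=0))].astype(float)  # easiest first
    k = X.shape[1]
    violations = 0
    for i in range(k):
        for j in range(i + 1, k):
            rest = X.sum(axis=1) - X[:, i] - X[:, j]
            # split respondents into roughly equal-sized rest-score groups
            for g in np.array_split(np.argsort(rest), n_groups):
                # item i is easier overall; a reversal beyond minvi is a violation
                if X[g, j].mean() - X[g, i].mean() > minvi:
                    violations += 1
    return violations
```

A low HT, as reported for four scales above, corresponds to item response functions lying so close together that such conditional orderings become unstable even when no formal violation is flagged.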
For the Motivation scale, items generally formed a difficulty continuum from controlled to more autonomous forms of motivation (Table 19), with weak accuracy (HT = .33). Despite this, the continuum found does not entirely adhere to the order posited by the Organismic Integration mini-theory of Self-Determination Theory (Ryan & Deci, 2017): P8 (“I feel good when I practice PA”), developed to assess intrinsically regulated motivation, was deemed easier (i.e., had a higher mean score) than P2, which targets externally regulated motivation at the diametrically opposite end of the theoretical continuum. We argue that this might be due to the wording of P8 targeting a general perception of well-being, which makes it easier to endorse than more targeted expressions of intrinsic motivation, such as pleasure or satisfaction. As such, we recommend rewriting this item so that it more closely adheres to its expected difficulty range. Similarly, P7 (developed to assess intrinsically regulated motivation, mean = 2.6) and P11 (integrated regulation, mean = 2.2) switched places, as the former is usually expected to be the most autonomous form of motivational regulation. This result agrees with previous bifactor modelling results (Howard et al., 2016) suggesting that these two regulations might be closely placed on the continuum. For the intended application of the scale, however, this switch might have little consequence, as we discuss in the next paragraphs.
For the Physical Regulation scale, items formed a moderately accurate (HT = .41) continuum from identifying physiological signs of effort and awareness of physical limits to using strategies to manage effort during PA, adhering to a priori expectations. The wording of P42 (“I take action to improve my physical skills”, mean = 2.9) might need to be adjusted in the future, as it appears to be interpreted as identical, difficulty-wise, to P37 (“I recognize my physical limits”, mean = 2.8) – as evidenced by near-touching IRFs – although the two were designed to have different difficulties (i.e., P42 developmentally more complex than P37).
For the Culture scale, items formed a weakly accurate (HT = .32) continuum from participation in the movement culture, through use of specific PA terminology, to endorsing and encouraging others to do so as well. Albeit designed to be among the easier items in this scale, S2 (“I participate in PA rituals (e.g., greetings, hymns/chants, cheers, applauses)”) figured, difficulty-wise, among the harder items. This might result from a misunderstanding of what “rituals” in a movement context truly means, despite examples being provided in the item; as such, this item might merit further scrutiny in the future. The wording of S6 (“I like to keep up with PA events (e.g., competitions, spectacles, shows)”) might also be refined, to differentiate it from S5 (“I watch PA events (e.g., competitions, spectacles, shows)”) in terms of difficulty.
For the Ethics scale, items formed a strongly accurate (HT = .51) continuum from immature (i.e., pragmatic) to mature (i.e., value-based) forms of moral development, adhering to the a priori development expectations based on the model of Gibbs (2014).
Items developed to figure as global items (P1, P13, P23, P33, S1, S12, S23, S32) – included to act as convergent validity indicators in future analyses (Cheah et al., 2018) – showed adequate scalability in all scales, strengthening the evidence for their convergent validity, as they were developed to represent each latent construct in general terms. Only in one scale (Culture) was one of these items (S1) flagged for local dependence – likely due to similar wording – and removed. Difficulty-wise, in scales with interpretable IIO (HT > .30), they figured in the middle to more difficult part of the difficulty continuum (i.e., had lower mean scores); this, again, was to be expected, as they were based on the operational definition of each element, which states the development of each skill/construct in its final stages. Nonetheless, the usefulness of these items should be further examined (i.e., whether they are essential to scale scalability and validity), since their removal might slightly increase the feasibility of subsequent applications of this questionnaire with no trade-off in content representation.
Items developed to assess Relational Thinking, the highest development stage in the Structure of Observed Learning Outcomes taxonomy (Biggs & Collis, 1982) – items P43–P46 and S40–S44 – did not fit the tested models, either for being unscalable or for forming locally dependent pairs; the exceptions were the items in the Motivation and Confidence scales. These items dealt with the degree to which skills developed in PA contexts are applied in other contexts of the student's life. We argue that this might be due to: 1) endorsement of these items being highly dependent on the capacity of respondents to draw a connection between their actual psychological and social skills in PA and their application in other contexts (e.g., being able to apply emotional regulation strategies developed, or recurrently applied, in PA contexts to daily stressful occurrences), which might itself be a different skill altogether – as is inferred from the most recent version of the Australian PL framework (Sport Australia, 2019); or 2) the wording not being clear enough to capture this phenomenon among adolescents. As such, further efforts should be made to refine these items and subsequently analyze their dimensionality – either as part of each of the scales, or as a separate latent trait.
Additional exploration of the dimensionality of the Motivation, Emotional Regulation, and Physical Regulation scales – using the Automated Item Selection Procedure (AISP) and Genetic Algorithms (GA) at lower-bound c = .45 – revealed an alternative cluster structure for these scales. Generally, at lower c values, these algorithms captured the higher-level constructs (i.e., unidimensional elements), while increasingly higher c values retrieved the lower-level constructs (i.e., Foundation and Mastery levels) and even specific subtraits within these (Straat et al., 2013). The clustering pattern of the Motivation scale across different lower-bound c values is coherent with previous research positing that different motivational regulations differ not only in degree but also in kind (Howard et al., 2020), with a general underlying continuum structure (Howard et al., 2016). Here, introjected regulation items were the exception, as they tended to cluster away from the remaining items at lower c values. These results suggest that this specific regulation might stand out from all others and, along with the clustering of autonomous motivations, are coherent with previous results on adolescents (Navarro et al., 2021; Vasconcellos et al., 2019).
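The selection logic behind the AISP can be sketched greedily: seed a scale with the best-scaling item pair, then repeatedly add the unselected item with the highest item-level H with the current scale, as long as that value stays at or above the lower bound c. Raising c therefore splinters broad scales into tighter sub-clusters, which is the pattern described above. This is a hypothetical simplification (the mokken implementation also tests coefficients for significance and handles ties more carefully), and it assumes no item is constant.

```python
import numpy as np

def _cov(x, y):
    return np.mean(x * y) - x.mean() * y.mean()

def _cov_max(x, y):
    # sorted (comonotone) pairing gives the largest covariance the marginals allow
    return np.mean(np.sort(x) * np.sort(y)) - x.mean() * y.mean()

def item_H(X, i, scale):
    """Scalability of item i with respect to the items already in `scale`."""
    num = sum(_cov(X[:, i], X[:, j]) for j in scale)
    den = sum(_cov_max(X[:, i], X[:, j]) for j in scale)
    return num / den

def aisp(X, c=0.30):
    """Greedy sketch of the Automated Item Selection Procedure: seed a scale
    with the best-scaling item pair, then add items while their item-H >= c."""
    X = X.astype(float)
    remaining = list(range(X.shape[1]))
    scales = []
    while len(remaining) >= 2:
        pairs = [(i, j) for i in remaining for j in remaining if i < j]
        i0, j0 = max(pairs, key=lambda p: item_H(X, p[0], [p[1]]))
        if item_H(X, i0, [j0]) < c:
            break  # no remaining pair scales at this lower bound
        scale = [i0, j0]
        remaining = [r for r in remaining if r not in (i0, j0)]
        while True:
            candidates = [r for r in remaining if item_H(X, r, scale) >= c]
            if not candidates:
                break
            best = max(candidates, key=lambda r: item_H(X, r, scale))
            scale.append(best)
            remaining.remove(best)
        scales.append(sorted(scale))
    return scales
```

On toy data with two unrelated item clusters, this sketch recovers the two clusters as separate scales, mirroring how higher c values split the Motivation scale into regulation-specific subtraits.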
For the Emotional Regulation scale, clustering patterns suggest that identifying emotions in oneself and in others might be two different skills, although we initially equated them as part of the same continuum. Finally, the Physical Regulation results conform with the interpretation of a continuum underlying the development of all its skills, with two strong lower-level clusters of dimensionality within it, coherent with the a priori construction of the Foundation and Mastery levels. We argue that, although these alternative dimensionality structures could be supported for these proposed unidimensional scales, given that the aim of these scales is to be integrated within an overarching assessment framework for PL – rather than to support specific theory development on their constructs – they possess sufficient unidimensionality to locate an individual on each of these latent traits, as evidenced by their total scalability coefficients.
Additionally, refinements to item difficulty are warranted in scales with below-standard or borderline IIO accuracy (HT ≈ .30), to better target different development stages across each construct. Parametric IRT models might support this effort, although their restrictive assumptions regarding item response functions might not fit the functions observed in this study.
For ease of interpretation and comparability between scales, we recommend that scores on each scale be transformed to a 0–100 metric, using the maximum possible summed score as the upper bound. Given that the scales mostly have a balanced number of items designed to measure Foundation and Mastery skills, the midpoint score (50%) can be used as a heuristic cut-score to identify students transitioning into a deeper phase of learning.
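The recommended rescaling amounts to dividing the summed score by the maximum possible summed score. In this sketch, `max_item_score` is an assumed parameter standing in for the questionnaire's actual response format (items scored from 0 up to a common maximum), not a value stated in the text:

```python
def percent_score(item_scores, max_item_score=4):
    """Rescale a summed scale score to a 0-100 metric, using the maximum
    possible summed score as the upper bound. The default max_item_score=4
    (items scored 0-4) is an assumption; set it to the actual format."""
    return 100 * sum(item_scores) / (len(item_scores) * max_item_score)

# A student halfway up every item lands exactly on the 50% heuristic cut
print(percent_score([2, 2, 2, 2, 2, 2]))  # 50.0
```

Because the metric is linear in the sum score, the 50% cut corresponds to the point where, on average, a respondent endorses the Foundation-level half of a balanced scale.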