We sought to gather evidence supporting the construct validity (internal structure and measurement invariance) and reliability (score reliability and test-retest) of the cognitive module of the PPLA-Q (content knowledge test) through the lens of IRT. Secondary aims of this study were to assess a) whether modelling data from distractors posed an advantage in locating students on the latent continuum; and b) whether the sum-score possessed enough accuracy for practice-oriented settings.
Model fit
Overall, the mixed-format (2PNL + GRM) model provided the best trade-off between model fit, total information of the test, and parsimony. This model also provides more readily interpretable item parameters than the pure 2PNL for the ordinal items (De Ayala, 2009; Desjardins & Bulut, 2018), since under the latter a separate discrimination parameter (category slope) is estimated for each scoring level of the item (treated as unordered nominal categories), and each level may, in practice, comprise different combinations of responses rather than a single discrete distractor.
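To make this contrast concrete, the sketch below (in Python, with entirely hypothetical parameter values) illustrates the standard forms assumed here: a GRM with a single slope and ordered thresholds per item, versus a nested logit that pairs a 2PL correct-response curve with a nominal model over distractors conditional on an incorrect response.

```python
# Illustrative sketch only (not the fitted model); all parameter values are hypothetical.
import numpy as np

def grm_category_probs(theta, a, b_thresholds):
    """GRM: one slope per item; ordered thresholds define cumulative curves."""
    b = np.asarray(b_thresholds, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= k), k = 1..K-1
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]                          # P(X = k), k = 0..K-1

def nested_logit_probs(theta, a, b, alphas, gammas):
    """2PNL: 2PL for the correct response; nominal model over distractors,
    conditional on an incorrect response (one slope/intercept per distractor)."""
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    z = np.asarray(alphas) * theta + np.asarray(gammas)
    p_distractors = (1.0 - p_correct) * np.exp(z) / np.exp(z).sum()
    return p_correct, p_distractors

print(grm_category_probs(0.0, a=1.2, b_thresholds=[-1.0, 0.5]))  # 3 ordered categories
print(nested_logit_probs(0.0, a=1.0, b=-0.5, alphas=[0.2, -0.3, 0.1], gammas=[0.4, 0.1, -0.2]))
```

Under the GRM, a single slope and a set of ordered thresholds describe the ordinal item; under the nested logit, the per-distractor slopes are only interpretable conditional on an incorrect response, which is why the mixed-format parameters are easier to read for ordinal items.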
Dimensionality analysis under this model suggested the existence of a possible second factor. Given the complexity and number of cognitive and personality factors at play during item response, tests are usually not strictly unidimensional (Hambleton et al., 1991), and the substantive consequences of a violation of this assumption must be analyzed according to the intended application of the test (Wells & Faulkner-Bond, 2016): in practice, small degrees of multidimensionality might not distort item parameters and score estimates as long as essential unidimensionality is assured (Harrison, 1986). Analysis of the residual correlations between items did not suggest any significant clustering pattern (> |.20|) between content duplets, which could happen due to sampling from the same specific subdomain (i.e., content theme). Some residual correlation (|.10 − .16|) did occur among items 2, 8, and 10, which, we surmise, could be due to similarity of the cognitive processes involved in response (i.e., analysis), or to a closer relationship between the content domains of these items (energy balance, health benefits of different types of training, and body composition and its effect on health). As such, these results seem compatible with a parsimonious stance: that a single essential latent trait is being measured in grade 10 to 12 students – general content knowledge in the context of PA. Nonetheless, further studies should test this idea using other methods (e.g., bifactor IRT modelling), as well as different stances on measurement – for example, assuming that content knowledge could be conceptualized under a composite-formative model (Stadler et al., 2021).
Score reliability & correlations
Regarding the reliability of the test score, both the 2PNL and mixed-format models outperformed the dichotomous models (1PL and 2PL), the nominal model (NRM), and the CTT-based estimates, as a consequence of providing more information across the latent continuum. These results show that modelling the information present in distractors is advantageous for estimating θ and increasing the reliability of scores, and are consistent with similar research (Storme et al., 2019). A similar inference can be drawn from the correlations between scores derived from the different models.
There was a perfect correlation between 1PL-derived scores and the simple sum-score, as expected, since under the 1PL model the scores are a simple transformation of raw scores, without weights assigned to different items (Wu et al., 2016). As parameterization increases, the correlation with the sum-score is attenuated, resulting in differences in estimated scores, especially for students with lower knowledge.
Marginal reliability for the mixed-format model did not achieve the generally accepted threshold of .70 (Nunnally & Bernstein, 1994), indicating that the test still lacks the capability to score students with the desired precision across the whole range of ability. However, conditional analysis at different ranges of θ reveals that this single estimate seems to underrepresent reliability around the peak of test information (θ between −2 and −1), while overrepresenting it at θ ≥ 0 (De Ayala, 2009). Taken together, these data lead to different implications regarding the intended uses of the test score (American Educational Research Association et al., 2014; Lane et al., 2015).
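As a minimal illustration of how a single marginal estimate can mask this conditional pattern, the sketch below uses a hypothetical information curve peaking around θ = −1.5 and the common convention that conditional reliability equals I(θ)/(I(θ) + 1) when θ is scaled to unit variance; none of these numbers are from our data.

```python
# Hypothetical illustration: conditional vs. marginal reliability from a test information curve.
import numpy as np

theta_grid = np.linspace(-3, 3, 121)
# Hypothetical information curve peaking between -2 and -1, as in our test
info = 4.0 * np.exp(-0.5 * ((theta_grid + 1.5) / 0.8) ** 2) + 0.4

cond_rel = info / (info + 1.0)              # conditional reliability at each theta
prior = np.exp(-0.5 * theta_grid ** 2)      # N(0, 1) prior weights
prior /= prior.sum()
marginal_rel = np.sum(cond_rel * prior)     # prior-weighted (marginal) reliability

print(f"reliability near theta = -1.5: {cond_rel[np.isclose(theta_grid, -1.5)][0]:.2f}")
print(f"reliability near theta =  1.0: {cond_rel[np.isclose(theta_grid, 1.0)][0]:.2f}")
print(f"marginal reliability: {marginal_rel:.2f}")
```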
The sum-score might serve a purpose when quick diagnosis and feedback to students is the chief concern, since students can score their own test and detect areas for improvement with little to no intervention from teachers. From a teacher's perspective, it might also be useful to consider the raw score by content theme, allowing for specific changes to the curriculum to promote learning in these areas.
The scores derived from the 2PNL + GRM model would be better used to obtain a fine-tuned score incorporating distractor information and to measure students' knowledge around the transition point from structural knowledge (foundation level) to relational knowledge (mastery level), as the test might provide precise enough information in this range – a hypothetical student scoring all foundational items (odd-numbered items) correctly would have an estimated θ of −1.21. This is specifically useful for creating class groups based on these general levels and providing appropriate learning tasks. To facilitate interpretation, we suggest a transformation so that these scores afford a 0 to 100 interpretation – like other scores in the PPLA. For this transformation, the maximum obtainable θ in the test (1.591; not shown) can be used as the upper bound, and the estimated θ score for a student with the least informative response pattern (selecting the least correct distractor on all items) as the lower bound (θ = −3.510; not shown). As such,
X = 100 × (θ + 3.510) / (1.591 + 3.510)
with X being the new 0-100 score, and θ the estimated latent trait score.
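A minimal Python sketch of this rescaling, using the bounds reported above (the −1.21 input is the hypothetical student scoring all foundational items correctly, mentioned earlier):

```python
# Suggested 0-100 rescaling of estimated theta scores; bounds taken from the text.
def rescale_theta(theta, lower=-3.510, upper=1.591):
    """Linearly map an estimated theta onto a 0-100 scale, clamped to the bounds."""
    x = 100.0 * (theta - lower) / (upper - lower)
    return max(0.0, min(100.0, x))

print(rescale_theta(-1.21))   # hypothetical all-foundational-items student: ~45
```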
For specific research in content knowledge about PA and healthy lifestyles, or high-stakes applications (summative assessment), the test needs further improvements so that items provide enough information across the whole spectrum of development.
One option would be to increase the number of items in the test, targeting higher θ ranges, as test length is related to the accuracy of its estimates (DeMars, 2010; Harrison, 1986). Some care should be taken, however, as one of the emphases of all PPLA measures during development was feasibility without compromising validity or reliability, to maximize application of the tool in PE contexts.
Another option would be to review both the plausibility and wording of distractors with flat trace curves in items providing low amounts of information / low discrimination (items 5, 7, and 9). This could lead to improved discrimination – approaching the guideline of 0.8 (De Ayala, 2009; Green et al., 1984) – by reducing guessing and confusion, and thus to higher information and reliability, especially for measuring higher-ability students (θ > 0). These choices can be further substantiated by estimating a guessing parameter (in a 3PNL model) to identify which items and distractors are more prone to guessing, and to remove parameter confounding. This will, however, require a larger sample (De Ayala, 2009).
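For reference, the change implied by estimating a guessing parameter is sketched below: the correct-response component gains a lower asymptote c, so that the probability of success never falls below the guessing floor. All parameter values here are hypothetical.

```python
# Sketch of a correct-response curve with a guessing (lower-asymptote) parameter c,
# as used in a 3PNL specification; values are hypothetical.
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """3PL: the probability of a correct response never drops below the guessing floor c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# A low-ability student retains a non-trivial chance of success on a guessable item
print(p_correct_3pl(theta=-3.0, a=0.8, b=0.0, c=0.25))   # ~0.31 rather than ~0.08 without c
```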
Item parameters
Regarding the items' estimated difficulty versus their intended difficulty, 3 out of 5 duplets behaved as expected (i.e., items evoking higher-order cognitive abilities were harder than their lower-order counterparts), with item pairs 5 and 6, and 7 and 8, not adhering to this pattern. In the first case, both are scored as multiple-selection items, and our data suggest that item 5 is only more difficult than item 6 at the maximum score (Table 4), while it is easier at intermediate scores (i.e., scoring points on the latter requires higher ability than on the former, except for the maximum score). This could be a result of the higher plausibility of distractors in item 5 (selected by ~60 to 65% of respondents; Table 20), and of the scoring penalization for wrong selections inflating the difficulty of achieving the maximum score. It is also plausible that our decisions regarding the coding of multiple-selection items (5 and 6) might have introduced a degree of bias in the results by restricting the range of possible combinations (i.e., 2 points in item 5 could be obtained by multiple combinations of right and wrong answers). In the future, different coding schemes might be considered and compared, as sketched below.
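As an illustration of the kind of comparison we have in mind, the sketch below contrasts two hypothetical coding schemes for a multiple-selection item; the actual keys and penalties used in the PPLA-Q may differ.

```python
# Two hypothetical coding schemes for a multiple-selection item (keys are illustrative).
def score_count_correct(selected, key):
    """Partial credit: one point per correct option selected; wrong selections ignored."""
    return len(selected & key)

def score_penalized(selected, key):
    """Partial credit with penalty: each wrong selection subtracts a point, floored at zero."""
    return max(0, len(selected & key) - len(selected - key))

key = {"A", "C", "E"}                      # hypothetical correct options
response = {"A", "C", "D"}                 # two correct selections, one incorrect
print(score_count_correct(response, key))  # 2
print(score_penalized(response, key))      # 1
```

Comparing model fit and information under such alternative codings would clarify whether the penalization itself is responsible for the inverted difficulty ordering.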
In the case of the second duplet, multiple factors might be at play: a) the ability to recall information which is not used daily (i.e., recommendations for physical activity, in item 7) might be confounding the intended difficulty, as students were not aware that they were going to be tested; b) despite being based on a lower-order cognitive ability (memorization), these guidelines require specific knowledge that cannot be inferred from an understanding of biology or general health literacy, and as such need to be taught explicitly during PE classes. These data are in accordance with previous research (Marques et al., 2015) suggesting that Portuguese students do not know the PA guidelines for health promotion. Nonetheless, a careful look at distractor popularity (δ; Table 23) suggests that students seem to be aware of the guidelines for children and adolescents, while not knowing the specific ones for adults (distractor D). This implies that more attention should be dedicated to explicitly teaching these guidelines, with more emphasis on those for adults, since, arguably, these will be of most importance in the near future of high-school students.
DIF and DTF
We found evidence of non-uniform DIF according to sex in item 1; however, this did not result in significant DTF. Although actual differences in interpretation of the item may exist between sexes, this finding might also be due to parameter inaccuracy stemming from sampling variability (as suggested by the magnitude of the standard errors of the distractor parameters, ranging from 62.866 to 94.708; not shown in tables), as there were no students with estimated θ in the difficulty range of this item (around −3). Similarly, the differential distractor functioning in this item might stem from sparse selection of distractors – due to it being a very easy item – resulting in difficulties in estimating distractor thresholds (Ostini et al., 2015). As such, if the total score is of chief interest, the bias in scores will likely be negligible, as the sDTF statistics imply; whereas if any specific inference is required at the item level, methods that account for DIF should be used so that the suggested sex bias is minimized. Furthermore, other methods specifically designed for exploring differential distractor functioning could be used (Suh & Bolt, 2011), along with a larger sample.
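For context, the intuition behind a signed DTF statistic can be sketched as the latent-density-weighted average difference between the expected test scores implied by the two groups' item parameters; the expected-score functions below are hypothetical stand-ins, not our estimates.

```python
# Conceptual sketch of a signed DTF computation with hypothetical expected-score curves.
import numpy as np

theta = np.linspace(-4, 4, 161)
expected_score_ref = 10.0 / (1.0 + np.exp(-theta))            # reference group (hypothetical)
expected_score_foc = 10.0 / (1.0 + np.exp(-(theta - 0.1)))    # focal group (hypothetical)

density = np.exp(-0.5 * theta ** 2)                           # N(0, 1) latent density weights
density /= density.sum()
sdtf = np.sum((expected_score_ref - expected_score_foc) * density)
print(f"signed DTF (in test-score units): {sdtf:.3f}")
```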
Test-retest reliability
Test-retest reliability of the estimated θ scores was poor to moderate (ICC = .51, [.32, .66]) (Koo & Li, 2016) over a 15-day interval. This might stem from a violation of the assumption of stability of the assessed trait that underlies the calculation of test-retest reliability (Polit, 2014), as learning between applications – either due to teacher intervention or to students' curiosity – is plausible; Longmuir et al. (2018) suggested as much in their assessment of a similar tool. Results from item-level analysis of agreement between the two time points lend some support to this idea. Out of the four items not achieving acceptable agreement (.70), disagreement in one (item 10) was mostly due to an increase in correct responses at the second time point; despite achieving the threshold for agreement, items 3 and 6 also displayed a similar pattern. As for the remaining three items (4, 7, and 9), disagreement was mostly due to individual variability, which could be indicative of guessing, carelessness, or low item quality resulting in different understandings of the item across time points. In the future, a 3PNL model (accounting for guessing) could further probe this assertion and clarify the role of individual variability.
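For transparency, the sketch below shows the two-way random-effects, absolute-agreement ICC(2,1) computation underlying this type of estimate (Koo & Li, 2016), applied to simulated scores rather than our data.

```python
# ICC(2,1): two-way random effects, absolute agreement, single measurement.
import numpy as np

def icc_2_1(scores):
    """scores: (n_subjects, k_occasions) array of estimated theta scores."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between-subjects mean square
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between-occasions mean square
    sse = np.sum((x - x.mean(axis=1, keepdims=True)
                    - x.mean(axis=0, keepdims=True) + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(0)
t1 = rng.normal(size=50)                                 # simulated time-1 scores
t2 = 0.7 * t1 + rng.normal(scale=0.7, size=50) + 0.2     # retest with drift and noise
print(f"ICC(2,1): {icc_2_1(np.column_stack([t1, t2])):.2f}")
```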
Strengths and limitations
To our knowledge, this study is the first to apply IRT to content knowledge in PA and healthy lifestyles. It exemplifies how nested logit models increase precision in estimating latent trait scores versus both a sum-score and dichotomous IRT models (1PL and 2PL), through the modelling of distractor information. It also provides an example of how to use these models to identify functional distractors. As such, the use of IRT benefits the test not only in the short term but also in the long term, as it opens the possibility of comparing different versions of the cognitive module of the PPLA-Q through test linking and equating, and of adaptive testing.
Despite the pandemic context imposed by COVID-19, we recruited a diverse sample, mimicking the relative composition of the grade 10 to 12 student population in Portugal according to both grade and course major. Nonetheless, given its convenience nature, we advise caution before generalizing any findings of validity or reliability outside of this population without further testing. A similar cautionary note should be made regarding the sample size used. Given the relative paucity of research using IRT nested logit models, no consensus guidelines regarding sample size exist. Even for commonplace models like the 2PL, NRM, or GRM, sample size recommendations vary widely across sources and seem to depend on complex interactions among test length, number of response categories per item, number of parameters to estimate (De Ayala, 2009), and estimation method (Şahin & Anıl, 2017). Another factor in determining the sample size is the intended level of precision in the estimated parameters: while high-stakes testing will require larger sample sizes to attain small standard errors on estimated scores, other less demanding contexts might require smaller ones (De Ayala, 2009; Nguyen et al., 2014). As such, further testing using a larger, more representative sample should try to replicate, and improve upon, our findings using a 3-parameter logistic model (3PL; Birnbaum, 1968), which accounts for the possibility of guessing. The same applies to DIF and DTF testing.
Another limitation pertains to the use of test-retest reliability. This type of reliability is essentially a CTT concept, conceptualizing measurement error as a single statistic, whereas IRT permits a detailed analysis of reliability at each θ point, as shown. Using IRT to model growth over time, or invariance across two time points, would be better suited to the general framework of this study and would allow for better inferences regarding the adequacy of scores over time; this, however, was not currently feasible due to sample size requirements.
Finally, concurrently with the modification of items to improve the information available across higher ranges of knowledge, a second round of content validity with an expert panel might provide further support for the adequacy of these items to the intended knowledge domain in grades 10 to 12 of Portuguese PE.