Sample
The full sample available after withdrawals comprised 502,591 UK Biobank participants aged 37 to 73 years (M = 56.53 years; SD = 8.05), of whom 54% were female. Models were estimated with listwise deletion, and the numbers of participants included in the analyses were: 401,648 (12-item); 434,693 (7-item); 473,940 (3-item).
IRT analysis
A 2-PL IRT model was estimated, yielding difficulty and discrimination parameters for each item (Table 1). The discrimination (item-information) parameters across the scale range from 1.34 to 2.28. The item ‘Does your mood often go up and down?’ exhibits the highest discrimination (2.28), suggesting that this ‘mood’ question provides the most information about the neurotic trait. In contrast, the item ‘Are you an irritable person?’ has the lowest discrimination (1.34), below the recommended level of 1.7 for items measuring trait values (20). The items ‘Are you a worrier?’, ‘Do you suffer from nerves?’, ‘Do you ever feel just miserable for no reason?’, ‘Do you often feel fed-up?’ and ‘Would you call yourself tense or highly strung?’ also have discrimination values above 1.7.
The difficulty parameter locates each item on the Ө continuum: it is the value of Ө at which a respondent has a 50% probability of endorsing the item. Figure 2 shows the item characteristic curves (ICCs) for each of the items, presenting both the steepness of the discrimination curve and the position of the difficulty value on the Ө continuum. For example, for the item ‘Does your mood often go up and down?’, there is a 50% probability that someone with a Ө of 0.21 (someone who does experience neurotic trait characteristics) would endorse the item; it is therefore considered characteristic of neuroticism, albeit at a low level. In contrast, for the item ‘Are you a worrier?’, there is a 50% probability of endorsement by someone with a Ө of -0.13, that is, someone who does not experience neurotic trait characteristics.
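As a minimal illustration of how the two parameters shape an ICC, the sketch below evaluates a 2-PL curve for the ‘mood’ item using the discrimination (2.28) and difficulty (0.21) reported above. It assumes the standard logistic form without the 1.7 scaling constant, which the text does not specify; the function and variable names are illustrative only.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of endorsing an item under a 2-PL model.

    ASSUMPTION: P(theta) = 1 / (1 + exp(-a * (theta - b))), i.e. a plain
    logistic curve with no additional scaling constant.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Reported estimates for 'Does your mood often go up and down?'
a_mood, b_mood = 2.28, 0.21

p_at_difficulty = icc_2pl(b_mood, a_mood, b_mood)  # theta equal to the difficulty
p_below = icc_2pl(-1.0, a_mood, b_mood)            # theta = -1
p_above = icc_2pl(1.0, a_mood, b_mood)             # theta = +1

print(p_at_difficulty, p_below, p_above)
```

At a trait level equal to the difficulty the endorsement probability is one half by construction; the discrimination controls how quickly the probability moves away from one half on either side of that point.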
Further detail on item discrimination is available from the item information function (IIF) curves (see Figure 3). The IIF curves display the relationship between difficulty (trait level) and discrimination (information); an important feature of the graph is the position on the Ө continuum directly below the apex of each item curve, which is where the item is most informative. Items whose apex lies in the positive half of the Ө continuum provide information about the neurotic trait when the trait characteristic is endorsed (present). For example, the item ‘Do you often feel lonely?’ has its apex in positive Ө (difficulty 1.41) and is therefore more likely to be endorsed by someone with a higher level of neuroticism than the item ‘Does your mood often go up and down?’, which also lies in positive Ө but has a lower difficulty value (0.21). Therefore, although the ‘mood’ item has the highest discrimination value (see above), it does not provide sufficient information about respondents who possess a high level (presence) of the trait (+1 to +4) or a low level (absence) of the trait (-1 to -4); instead, it provides the most information for respondents who possess an average (Ө = 0) to minimal amount of the neuroticism trait (see Table 1). The item with the lowest discrimination is ‘Are you an irritable person?’. Although the apex of its IIF curve lies over a positive Ө (0.95), so that the item may be endorsed by a respondent possessing some amount of the trait characteristic, its discrimination value is low (1.34).
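The link between discrimination, difficulty and information described here can be sketched with the textbook 2-PL item information function, I(Ө) = a²P(Ө)(1 − P(Ө)). The example uses the (discrimination, difficulty) pairs reported in the text for the ‘mood’ and ‘irritable’ items; the logistic form without a scaling constant and the function name are assumptions for illustration.

```python
import math


def item_information_2pl(theta, a, b):
    """2-PL item information a^2 * P * (1 - P) at trait level theta.

    ASSUMPTION: plain logistic ICC with no extra scaling constant.
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


# (discrimination, difficulty) pairs reported in the text
mood = (2.28, 0.21)
irritable = (1.34, 0.95)

# Information at each item's own difficulty, where its curve peaks
peak_mood = item_information_2pl(mood[1], *mood)
peak_irritable = item_information_2pl(irritable[1], *irritable)

# Information the 'mood' item provides far from its difficulty
mood_at_theta_3 = item_information_2pl(3.0, *mood)

print(peak_mood, peak_irritable, mood_at_theta_3)
```

Under this form the peak information equals a²/4 and occurs at Ө = b, which is why a highly discriminating item centred near Ө = 0 still contributes little at the extremes of the continuum.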
In summary, the overall pattern of item distribution across the Ө continuum suggests that the 12-item EPQ-R neuroticism scale contains no items that measure an extreme level of neurotic trait characteristics or an extreme level of non-neurotic trait characteristics. Instead, the questions mostly measure neurotic trait characteristics that have a higher probability of endorsement by individuals experiencing a minimal to no level of neuroticism (Ө = -0.13 to 1.41).
Reliability
In IRT, reliability may be calculated at multiple values of Ө along the continuum, rather than as a single reliability score as in classical test theory (CTT). With the mean of Ө fixed at 0 and the variance at 1 to identify the model, reliability can be evaluated at any point along the Ө continuum, showing how well respondents are distinguished at specific values of Ө (23). For the 12-item scale, there is reliable information to differentiate respondents who possess an average or just above average amount of the trait (Ө = 0: 0.87; Ө = 1: 0.88), which is considered very good reliability. However, reliability then decreases (Ө = 2: 0.76; Ө = -1: 0.71), suggesting that the neurotic trait is measured most reliably at a normal or minimal amount of neuroticism (Ө = 0 or 1). Thereafter, reliability reduces further, so that the extreme ends of the continuum (Ө = 3, 4, -2, -3 and -4) are no longer reliably measured (see Table 2).
Statistical assumptions
1. Item independence
A correlation analysis assessed initial item independence: all items were significantly correlated (p < .001), but the majority of correlations were lower than 0.50, suggesting basic local item independence. A residual coefficient matrix, requested after estimation of a single-factor model, showed that no residuals were too highly correlated (none exceeded R = 0.20) (24), again suggesting basic item independence.
2. Monotonicity
A Mokken analysis produced a Loevinger H coefficient (25), which measures the scalable quality of items, expressed as a probability measure, independent of a respondent’s Ө. These coefficients ranged between 0.35 and 0.47 (Table 3), suggesting weak (H = 0.3-0.4) to moderate (H = 0.4-0.5) monotonicity; no items reached strong scalability (H ≥ 0.5) (25).
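For readers unfamiliar with the coefficient, the following sketch computes item-level Loevinger H for dichotomous responses using the usual covariance-ratio definition (observed inter-item covariance divided by the maximum covariance attainable given the item endorsement rates). Whether the software used for Table 3 applies exactly this estimator is an assumption, and the small response matrices are made-up toy data, not UK Biobank data.

```python
def loevinger_item_h(responses):
    """Item-level Loevinger H for dichotomous (0/1) item scores.

    responses: list of rows, one per respondent, each a list of 0/1 scores.
    For item i, H_i = sum_j cov(i, j) / sum_j cov_max(i, j) over all j != i,
    where cov_max is the covariance when no Guttman errors occur.
    Items endorsed by nobody or by everybody would give a zero denominator.
    """
    n = len(responses)
    k = len(responses[0])
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    h_values = []
    for i in range(k):
        cov_sum = 0.0
        cov_max_sum = 0.0
        for j in range(k):
            if j == i:
                continue
            p_both = sum(row[i] * row[j] for row in responses) / n
            cov_sum += p_both - p[i] * p[j]
            cov_max_sum += min(p[i], p[j]) - p[i] * p[j]
        h_values.append(cov_sum / cov_max_sum)
    return h_values


# Toy data (hypothetical): a perfect Guttman pattern over three items
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_perfect = loevinger_item_h(guttman)

# Same data plus one respondent who endorses only the hardest item
h_with_error = loevinger_item_h(guttman + [[0, 0, 1]])

print(h_perfect, h_with_error)
```

A perfect Guttman pattern yields H = 1 for every item, and each response pattern that violates the item ordering pulls the coefficients below 1, which is the sense in which H indexes scalability.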
3. Unidimensionality
A principal component analysis (PCA) showed that a single major factor accounted for 36% of the variance and a second factor for 11%; the difference (25 percentage points) is above the suggested 20%, indicating that a single major factor is being measured (26). A post-estimation measure of unidimensionality was also computed using a semi-partial correlation controlling for Ө. This analysis provides each item's individual variance contribution after adjusting for all the other variables, including Ө. It demonstrates the relationship between local independence and unidimensionality and is a conservative assessment, whereby the R² should ideally be zero or as close to zero as possible (27). Item values ranged between 0.01 and 0.02, suggesting unidimensionality. To our knowledge, there is still no standardised cut-off criterion for assessing this value (i.e., how close to zero all items should be across a scale).
IRT revised analysis
To assess a revised scale, items were systematically removed according to discrimination value. The lowest discriminating item was removed first (‘Are you an irritable person?’, 1.34), whereafter an 11-item 2-PL IRT model was estimated with the remaining items and the process was repeated, each time removing the lowest discriminating item below 1.7. In order of removal, the items removed thereafter were: ‘Do you often feel lonely?’; ‘Are you often troubled by feelings of guilt?’; ‘Do you worry too long after an embarrassing experience?’ and ‘Are your feelings easily hurt?’. At this stage the 7 remaining items were retained, as most had discrimination values > 1.70 (estimated on 434,693 individuals).
The item parameters for the 7-item scale are presented in Table 4. Statistical assumptions were assessed on the revised scale of 7 items (Table 4) and, importantly, a Mokken analysis suggests improved scalability (monotonicity) compared to the full 12-item scale, with two items reaching values > 0.50 (Table 5). Reliability across the scale is marginally improved compared to the full scale, suggesting redundancy of the removed items (Table 6). Acceptable metrics for unidimensionality and item independence were achieved for this revised scale. The ICC and IIF graphs for the revised 7-item scale are presented in Figures 4 and 5, where improved item information over the 12-item scale is evident.
Further item reduction was explored to investigate a ‘minimal’ scale. After systematic item removal, three items remained when the 2-PL model was estimated on a sample of 473,940 individuals. The remaining items were those with high discrimination and positive difficulty values (discrimination; difficulty): ‘Does your mood often go up and down?’ (3.44; 0.14); ‘Do you ever feel just miserable for no reason?’ (2.79; 0.22) and ‘Do you often feel fed-up?’ (2.92; 0.28) (Table 7). A Mokken analysis suggests that scalability is strong (H ≥ 0.50) across all items (Table 8), and a semi-partial correlation analysis controlling for Ө showed that all values were 0.00. Reliability is only good at Ө = 0, suggesting that this scale is only reliable for measuring those with an average level of the trait (Table 9). The ICC and IIF graphs suggest that the three-item scale may offer an efficient, alternative and highly informative scale; however, the scale is narrow in range and does not possess items measuring neurotic traits above or below average Ө, at the extreme ends of the trait spectrum (Figure 6).
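The narrow range of the three-item scale can be illustrated by converting summed item information into a conditional reliability. The sketch below uses the three reported parameter pairs, read as (discrimination; difficulty), and one common convention, reliability = I/(I + 1), which treats the unit latent variance as one unit of prior information. The source does not state which reliability formula was used, so this convention is an assumption and the values are not expected to reproduce Table 9 exactly.

```python
import math


def item_information_2pl(theta, a, b):
    """2-PL item information a^2 * P * (1 - P) (plain logistic form assumed)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def conditional_reliability(theta, items):
    """Reliability at theta as I / (I + 1), with I the summed item information.

    ASSUMPTION: latent variance fixed at 1, so the prior adds one unit of
    information; other conventions (e.g. 1 - 1/I) exist.
    """
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    return info / (info + 1.0)


# (discrimination, difficulty) for the three retained items, as reported
three_item_scale = [(3.44, 0.14), (2.79, 0.22), (2.92, 0.28)]

reliability_by_theta = {
    theta: conditional_reliability(theta, three_item_scale)
    for theta in (-2, -1, 0, 1, 2)
}

for theta, rel in reliability_by_theta.items():
    print(theta, round(rel, 2))
```

Because all three difficulties sit close to Ө = 0 and the discriminations are high, the information curves are tall but narrow, so the computed reliability peaks near the average trait level and falls away quickly on both sides.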
Differential-Item Functioning (DIF) Analysis
To investigate gender differences in item functioning, a logistic DIF analysis was conducted across all three versions of the scale, with gender as the observed group. Uniform and nonuniform DIF tests assessed whether specific items favoured one group over the other (male vs. female) at all values of the latent trait (uniform) or only at selected values of the latent trait (nonuniform). The results of these analyses are presented in Table 10, where evidence of significant uniform DIF for gender was found across all three versions.
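The distinction between uniform and nonuniform DIF can be written as a sequence of nested logistic regression models for each item. The formulation below is the conventional logistic-regression DIF setup and is offered as a sketch: the symbols are introduced here for illustration, and the choice of matching variable S_i (for example the scale total score or an estimate of Ө) is an assumption, since the text does not specify it.

```latex
% Y_{ij}: response of person i to item j (0/1)
% S_i:    matching variable for person i (assumed: total score or trait estimate)
% G_i:    group indicator (e.g. 0 = male, 1 = female)
\begin{aligned}
\text{M}_0:\quad & \operatorname{logit} P(Y_{ij} = 1) = \beta_0 + \beta_1 S_i \\
\text{M}_1:\quad & \operatorname{logit} P(Y_{ij} = 1) = \beta_0 + \beta_1 S_i + \beta_2 G_i \\
\text{M}_2:\quad & \operatorname{logit} P(Y_{ij} = 1) = \beta_0 + \beta_1 S_i + \beta_2 G_i + \beta_3 S_i G_i
\end{aligned}
```

In this setup, uniform DIF corresponds to a nonzero group effect β₂ (comparing M₁ with M₀), meaning that one group is favoured by a constant amount on the logit scale at every trait level. Nonuniform DIF corresponds to a nonzero interaction β₃ (comparing M₂ with M₁), meaning that the size or direction of the group difference changes along the trait continuum.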