Multidimensional item response theory to assess psychometric properties of GHQ-12 in parents of school children

Background: Multidimensional item response theory (MIRT) models provide an ideal foundation for assessing the psychometric properties of a questionnaire designed with a multidimensional structure. This study aimed to present the first use of MIRT models to investigate the psychometric properties of the General Health Questionnaire (GHQ-12) in parents of school children. Methods: A total of 1104 parents of school children completed the Persian version of the GHQ-12. A unidimensional IRT model and MIRT models with two and three factors were applied to model the observed scores for each GHQ-12 item as a function of the subject's latent traits while taking the correlation between the dimensions of the questionnaire into account. Goodness-of-fit indices were reported for the three models, and item fit was assessed for the best model. Individual items were described in detail through item characteristic curves, and the amount of information carried by different items was presented using information curves. Results: The MIRT analysis with two factors, corresponding to psychological distress and social dysfunction, provided the best account of the GHQ-12 data. The model showed that all items fitted adequately. Item discrimination ranged from 0.86 to 2.35 for psychological distress and from 1.18 to 2.41 for social dysfunction. Moreover, items 8 and 2 provided the least information in the psychological distress and social dysfunction dimensions, respectively. Conclusions: The developed framework for evaluating the psychometric properties of the GHQ-12 can be a suitable alternative to traditional approaches and to unidimensional IRT models, the use of which has been restricted by the multidimensional structure of the questionnaire.


Background
The General Health Questionnaire (GHQ) is a self-report measure of minor psychiatric morbidity that has been widely used since its development by Goldberg in 1972 [1]. The original instrument consists of 60 items, but shorter versions, including the GHQ-30, GHQ-28 and GHQ-12, have also been adapted and validated in different studies [2]. The 12-item version of the questionnaire, the GHQ-12, has been used broadly because of its relatively good psychometric properties and its brevity [3,4]. Further, the GHQ-12 is recommended by the World Health Organization (WHO) as a well-validated and standard psychiatric screening instrument [5].
The GHQ-12 consists of 12 items, each of which is rated on a four-point scale, typically worded: less than usual, no more than usual, rather more than usual, or much more than usual. The two most commonly used scoring methods are bi-modal (0-0-1-1) and Likert scoring styles (0-1-2-3) [6].
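The two scoring styles can be illustrated with a short Python sketch. It assumes that each response is stored as a category index from 0 to 3 in questionnaire order; the function name `score_ghq12` and the example respondent are hypothetical and are not taken from the study data.

```python
from typing import Sequence

# Weights applied to the four response categories (indexed 0..3).
LIKERT_WEIGHTS = (0, 1, 2, 3)   # Likert scoring style (0-1-2-3)
BIMODAL_WEIGHTS = (0, 0, 1, 1)  # bi-modal scoring style (0-0-1-1)


def score_ghq12(responses: Sequence[int], weights: Sequence[int]) -> int:
    """Sum the weighted item responses for one respondent."""
    if len(responses) != 12:
        raise ValueError("GHQ-12 expects exactly 12 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be a category index 0..3")
    return sum(weights[r] for r in responses)


# Hypothetical respondent used only for illustration.
example = [0, 1, 2, 3, 1, 1, 0, 2, 3, 0, 1, 2]
likert_total = score_ghq12(example, LIKERT_WEIGHTS)    # possible range 0..36
bimodal_total = score_ghq12(example, BIMODAL_WEIGHTS)  # possible range 0..12
print(likert_total, bimodal_total)
```

Under Likert scoring the total ranges from 0 to 36, whereas bi-modal scoring collapses the two lower and the two upper categories and yields a total from 0 to 12.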
Since the GHQ-12 has considerable appeal as a quick and well-documented screening tool, it has been translated into different languages to study its reliability and validity and to explore its psychometric properties in various populations and countries [6][7][8][9][10][11][12]. The Persian version of the questionnaire was first prepared, and its psychometric properties assessed, by Montazeri et al. [13]. Since then, several studies have been conducted to assess its applicability among university students and the Iranian elderly population [14,5].
The questionnaire was designed as a unidimensional scale to capture a single trait, and some empirical studies supported this assumption [15,16]. However, studies have frequently revealed the existence of two- or three-factor solutions [12]. Most of the studies yielded a two-factor solution with factors named "anxiety/depression" and "social dysfunction" [17][18][19][7][20][21][22]. Some studies, however, revealed a third factor expressing "loss of confidence" [23][24][25]. For the Persian version of the questionnaire, a two-factor model provided the best explanation of the Iranian sample [13].
Traditionally, classical test theory (CTT), including construct validity, reproducibility and sensitivity to change, has been used to assess the psychometric properties of questionnaires [26]. Furthermore, confirmatory factor analysis (CFA), as a common method, can be used to evaluate hypotheses about the dimensionality of questionnaires [27]. Although CTT and CFA are popular methods, they do not consider measurement error, that is, the difference between an observed score and an individual's actual trait [28]. Item response theory (IRT) is able to consider measurement error and provides a more detailed assessment of a questionnaire's items. This theory, also known as latent trait theory, attempts to explain the relationship between an individual's responses to the items on the questionnaire and the latent trait [29,30]. It establishes a link between the properties of the items on a questionnaire, the individuals responding to these items and the underlying trait being measured.
Despite the benefits of IRT, most studies on the psychometric properties of the GHQ-12 used CTT methods, exploratory factor analysis and confirmatory factor analysis. However, several studies used a unidimensional IRT model to assess hypotheses about the factorial structure of the GHQ-12 [31][32][33]. Further, Alexandrowicz et al. [34] applied IRT models with a different aim, namely to compare the 30-, 20-, and 12-item versions of the GHQ under four different recoding schemes.
When questionnaires comprise multiple dimensions, the utility of unidimensional IRT is largely restricted. An extension of IRT models, named multidimensional IRT (MIRT) models, takes multiple latent traits into account simultaneously and also considers the correlation amongst the latent traits. MIRT models have rarely been applied to the GHQ-12, and the studies that did so had different aims [35,36].
Further, there appears to be no reported MIRT-based study of psychiatric morbidity in the parents of school children measured by the GHQ-12. Because children's quality of life is an important and complementary outcome in clinical studies, several studies have focused on this subject [37][38][39]. Health-related quality of life in children is, in turn, strongly influenced by the mental health of their parents. Therefore, it is crucial to evaluate the parents' psychiatric morbidity in a population.
The present study aimed to use MIRT models to investigate the properties of the questionnaire in more detail. The unidimensional IRT model and MIRT models with two and three factors were applied to the data, and the three models were compared with each other using several goodness-of-fit indices. Afterwards, in the best-fitting model, individual items were described in detail through item characteristic curves and item information curves.

Participants and instrument
The Persian version of the GHQ-12, translated and validated previously in Iran [13], was completed by 1104 parents of Iranian secondary school adolescents aged 13-18 years. A two-stage cluster random sampling technique was used to select the participants. At the first stage, four schools were selected at random from 60 secondary schools in each of the four educational districts in Shiraz, southern Iran. At the second stage, two classes from each school were chosen through simple random sampling, and all parents of the students in the chosen classes were considered as the study population. The students took the informed consent forms and the questionnaires home for their parents, and the completed questionnaires were then returned to the schools. The ethics committee of Shiraz University of Medical Sciences approved the study. The GHQ-12 includes 12 ordered categorical questions or items, which are rated in four categories 0, 1, 2 and 3, indicating less than usual, no more than usual, rather more than usual, or much more than usual, respectively. The GHQ-12 scoring protocol includes reverse-scored items, so that higher scores indicate a better psychological health state, and the model was fitted accordingly.

Multidimensional item response theory
Unidimensional IRT models assume that there is only one latent variable, $\theta$, to explain the relationship between the latent trait and the observed responses. MIRT, as an extension of IRT models, attempts to explain an item response according to an individual's standing on multiple latent dimensions [40]. Several forms of IRT models have been used for ordered categorical data, including the rating scale model, the partial credit model, the generalized partial credit model (GPCM), and the graded response model (GRM) [30]. The most common IRT-based approach for multiple-response questionnaires in patient-reported outcome studies has been the GRM [29]. In this study, in the first step, a unidimensional GRM was used to analyze the data. Thereafter, a multidimensional extension of the GRM was used to describe the probability of a given score as a function of two and three latent variables.
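The functional form of the multidimensional GRM is given by:
$$P(Y_{ij} \geq k \mid \boldsymbol{\theta} = \boldsymbol{\theta}_i) = \frac{\exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_{jk})}{1 + \exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_{jk})} \qquad (1)$$
where $P(Y_{ij} \geq k \mid \boldsymbol{\theta} = \boldsymbol{\theta}_i)$ is the probability that the observed score for item $j$ and subject $i$, given the subject's standing $\boldsymbol{\theta}_i$ on the latent traits, is greater than or equal to $k$, with $k = 0$ to $3$ and $P(Y_{ij} \geq 0) = 1$ by definition. In this equation, $\mathbf{a}_j$ and $d_{jk}$ denote, respectively, the item discrimination and intercept, where the intercepts are ordered and number one less than the number of response categories for each item. A high discrimination value shows that an item is able to differentiate between subjects at different latent trait levels. The intercept, $d_{jk}$, can be transformed into a difficulty parameter, $b_{jk}$, through the following formula:
$$b_{jk} = -\frac{d_{jk}}{\sqrt{\mathbf{a}_j^{\top}\mathbf{a}_j}}$$
which reduces to $b_{jk} = -d_{jk}/a_j$ when item $j$ loads on a single dimension. A low value of the difficulty parameter indicates an easy item and a high value indicates a difficult item. Further, in Eq (1), the latent traits are normally distributed, $\boldsymbol{\theta}_i \sim N(\mathbf{0}, \Omega)$, where $\Omega$ is the covariance matrix of individual $i$'s latent traits. The correlation between the dimensions is taken into account in the multidimensional GRM through $\Omega$ [27].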
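As an illustration of how the model turns item parameters into response probabilities, the following Python sketch computes the category probabilities of one item. It assumes the logistic slope-intercept parameterization of the multidimensional GRM, in which the cumulative probability of scoring k or higher is the logistic function of the weighted latent traits plus the k-th intercept. The function names and all parameter values are hypothetical and are not estimates from this study.

```python
import math
from typing import List, Sequence


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def grm_category_probs(theta: Sequence[float],
                       a: Sequence[float],
                       d: Sequence[float]) -> List[float]:
    """Return P(Y = k) for k = 0..K under a multidimensional GRM.

    theta: latent trait values of one subject
    a:     item discriminations, one per latent dimension
    d:     K intercepts in strictly decreasing order
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    if any(d[k] <= d[k + 1] for k in range(len(d) - 1)):
        raise ValueError("intercepts must be strictly decreasing")
    linear = sum(a_m * t_m for a_m, t_m in zip(a, theta))
    # Cumulative probabilities P(Y >= k); P(Y >= 0) = 1 and P(Y >= K + 1) = 0.
    cumulative = [1.0] + [logistic(linear + d_k) for d_k in d] + [0.0]
    # Category probabilities are differences of adjacent cumulative terms.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(d) + 1)]


def difficulties(a: Sequence[float], d: Sequence[float]) -> List[float]:
    """Convert intercepts to difficulties by dividing by the slope norm."""
    norm = math.sqrt(sum(a_m * a_m for a_m in a))
    return [-d_k / norm for d_k in d]


# Hypothetical item that loads on the first of two dimensions only.
a_item = (1.5, 0.0)
d_item = (2.0, 0.5, -1.0)
probs_average = grm_category_probs((0.0, 0.0), a_item, d_item)
probs_high = grm_category_probs((1.0, 0.0), a_item, d_item)
b_item = difficulties(a_item, d_item)
print(probs_average, probs_high, b_item)
```

Because the hypothetical item loads on one dimension only, the difficulty conversion here is simply the negative intercept divided by the single non-zero discrimination.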

Statistical analysis
All analyses were performed in the R programming environment with the multidimensional item response theory (mirt) package [41]. The unidimensional IRT model and the MIRT models with two and three factors were compared using the Akaike information criterion (AIC). Further, the goodness of fit of the models was evaluated by the comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root-mean-square error of approximation (RMSEA). The following cut-off values for good fit were suggested by Hu and Bentler [42]: CFI > 0.95, TLI > 0.95, RMSEA < 0.06. Item characteristic curves (ICCs) were provided to describe visually the probability of each score on each item. Furthermore, item information curves were included to investigate which items of the GHQ-12 carried the most information for detecting psychiatric morbidity of the parents. The information content of the items was calculated using the Fisher information, which is defined as minus the expectation of the second derivative of the log-likelihood of the model [29]. To evaluate item fit, the generalized Orlando and Thissen S-X² index for polytomous data was used [43], which compares the observed and expected response frequencies under the estimated MIRT model. Items with an S-X² p-value < 0.01 were considered poorly fitted [44,45].
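The item information described above can be sketched in Python for a single item that loads on one latent dimension. The sketch assumes the logistic slope-intercept form of the GRM and uses the standard result that, for a categorical item, the Fisher information equals the sum over categories of the squared derivative of the category probability divided by that probability. The intercepts are hypothetical; the two discrimination values are the smallest and largest estimates reported in this study and are used only to show how information scales with discrimination.

```python
import math
from typing import Sequence


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def grm_item_information(theta: float, a: float, d: Sequence[float]) -> float:
    """Fisher information of one GRM item at latent trait value theta.

    a: discrimination of the item on its single dimension
    d: intercepts in strictly decreasing order (hypothetical values here)
    """
    # Cumulative probabilities P(Y >= k), padded with 1 and 0 at the ends.
    cumulative = [1.0] + [logistic(a * theta + d_k) for d_k in d] + [0.0]
    information = 0.0
    for k in range(len(d) + 1):
        p_k = cumulative[k] - cumulative[k + 1]
        # Derivative of P(Y = k) with respect to theta.
        dp_k = a * (cumulative[k] * (1.0 - cumulative[k])
                    - cumulative[k + 1] * (1.0 - cumulative[k + 1]))
        if p_k > 0.0:
            information += dp_k ** 2 / p_k
    return information


# Hypothetical intercepts shared by both items, so only the slope differs.
d_example = (1.5, 0.0, -1.5)
info_low_slope = grm_item_information(0.0, 0.86, d_example)
info_high_slope = grm_item_information(0.0, 2.41, d_example)
print(info_low_slope, info_high_slope)
```

With identical intercepts and the latent trait fixed at zero, the information is proportional to the squared discrimination, which is why items with low slopes contribute little to the information curves.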

Results
In this study, there were 13248 observations from 1104 parents of school children. A unidimensional IRT model and MIRT models with two and three factors were fitted to the GHQ-12 data set. Table 1 summarizes the goodness of fit of the models and shows that the MIRT model with two factors, named psychological distress and social dysfunction, reflected the data better than the other models. This model had the lowest AIC and met the cut-off values for a good fit. Thus, the MIRT model with two factors was considered for further evaluation.
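The selection rule used here, namely the lowest AIC combined with the Hu and Bentler cut-offs, can be sketched in Python as follows. All log-likelihoods and fit indices below are placeholder values chosen for illustration and are not the values reported in Table 1. The parameter counts assume a confirmatory simple-structure GRM (12 slopes, 36 intercepts, plus the factor correlations), which is an assumption rather than a detail stated in the text.

```python
from dataclasses import dataclass


@dataclass
class FitSummary:
    """Fit summary of one candidate model (all values hypothetical)."""
    name: str
    log_lik: float
    n_params: int
    cfi: float
    tli: float
    rmsea: float

    @property
    def aic(self) -> float:
        # AIC = 2 * (number of parameters) - 2 * (log-likelihood).
        return 2 * self.n_params - 2 * self.log_lik

    def meets_cutoffs(self) -> bool:
        # Cut-offs for good fit: CFI > 0.95, TLI > 0.95, RMSEA < 0.06.
        return self.cfi > 0.95 and self.tli > 0.95 and self.rmsea < 0.06


# Placeholder values for illustration only, not the study's results.
models = [
    FitSummary("unidimensional", -15000.0, 48, 0.93, 0.92, 0.08),
    FitSummary("two-factor", -14800.0, 49, 0.97, 0.96, 0.05),
    FitSummary("three-factor", -14799.0, 51, 0.97, 0.96, 0.05),
]

best = min(models, key=lambda m: m.aic)
for m in models:
    print(f"{m.name}: AIC = {m.aic:.0f}, good fit = {m.meets_cutoffs()}")
print("selected:", best.name)
```

In this hypothetical example the three-factor model has a marginally higher log-likelihood, but the penalty for its additional parameters gives the two-factor model the lowest AIC.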
The distributions of the observed item responses for the psychological distress and social dysfunction dimensions are shown in Fig 1. The response frequencies showed different patterns in the two dimensions. In the psychological distress dimension, most items were skewed toward high scores (2 and above), indicating a better psychological health state, while the social dysfunction items were more symmetrically distributed. In the MIRT model with two factors, the item-specific parameters and the correlation between the two factors were estimated successfully. Table 2 displays the estimates of the item discrimination and item difficulty parameters and their standard errors for the two dimensions. Across all items in the two dimensions, discrimination estimates ranged from 0.86 to 2.41, indicating that all items discriminated well between low and high levels of the GHQ-12 latent traits (or psychological health state) of the parents. Further, the estimated correlation between the two factors was 0.85, showing that parents with higher levels on the psychological distress latent trait also tended to have higher levels on the social dysfunction latent trait. Fig 2 shows the ICCs obtained for all GHQ-12 items. This figure indicates that a person with a better psychological health state (a higher latent trait, where the latent trait is either psychological distress or social dysfunction) has a higher probability of higher scores on each item. The lowest slope, 0.86 for face up to problems (item 8), indicates lower discrimination power in the psychological distress of parents. In other words, a large increment in health state yields only a small increment in the probability of a higher score on this item. In contrast, the high slope parameters of 2.41 and 2.35 for feeling unhappy and depressed (item 9) and losing confidence (item 10) indicate higher discrimination power in the social dysfunction and psychological distress latent traits, respectively.
For all items, as the psychological health state score increases, the probability of a score of 0 decreases. The item information curves show which items carry the most information and where on the latent trait they are most informative. The information content carried by the items differed. In social dysfunction, feeling unhappy and depressed (item 9) was the most informative over the moderate range of the latent trait, while lost much sleep (item 2) was the least informative over a broad range of the latent trait. Moreover, in psychological distress, losing confidence (item 10) and thinking of self as worthless (item 11) carried the most information in the moderate range of the latent trait. However, face up to problems (item 8) carried little to almost no information in this study. Table 3 shows the full results for the item fit statistics. Based on the S-X² p-values, all items fitted the model adequately.
Fig 1. Distributions of observed item responses (0 = much more than usual, 1 = rather more than usual, 2 = no more than usual, 3 = less than usual) for each dimension. The names of the items are provided in Table 2.

Discussion
The present study is the first to apply MIRT models to evaluate the psychometric properties of the GHQ-12 in parents of school children. This study included 1104 parents to measure their minor psychiatric morbidity. Since maternal and paternal psychological health affects children's development and health during school, assessment of the parents' psychiatric morbidity is essential. The analysis of questionnaires and the assessment of their psychometric properties through the CTT approach, which focuses on summated scores, disregards the underlying nature of the data. Traditionally, CFA has been used widely to assess the dimensionality or underlying latent variable structure of a questionnaire.
An IRT model provides some advantages over CFA in assessing hypotheses about the dimensionality of questionnaires. The most important is that it takes measurement error into account, while CFA treats the observed scores as the individual's actual latent trait [46]. Moreover, IRT-based models are sample independent; that is, the estimated latent trait is not seriously affected by the population [30]. Although the GRM is mathematically equivalent to factor analysis of the estimated polychoric correlation matrix, with item difficulty and discrimination being functions of the item intercept and factor loading [47,48], IRT-based models provide a deeper insight into the measurement properties of a questionnaire and its items. In this approach, ICCs visually present the discrimination power and difficulty of individual items. Further, item information functions are obtained through IRT models and estimate the precision and reliability of individual items independently of the other items on the questionnaire. In addition, item information curves indicate the amount of information carried by individual items. As a result, a subset of items can be selected, and a reduced questionnaire can be developed by omitting uninformative items.
Notwithstanding the advantages of IRT over CFA, it suffers from one limitation, which is the need for large samples. A summary of the recommended sample sizes for various IRT models is provided by Yen and Fitzpatrick [49]. MIRT, as an extension of the IRT approach, models multiple dimensions simultaneously to take the correlation amongst the dimensions into account. Since these correlation parameters are also estimated, MIRT models need a larger sample than unidimensional IRT models. In this study, a sufficiently large sample was employed to obtain stable parameter estimates in the MIRT model. In the present study, the MIRT model with two factors reflected the data better than the other models. Our findings were in line with other studies that reported a two-dimensional structure including psychological distress and social dysfunction, although they used CTT and CFA [14,13,5]. Smith et al. [31] applied a Rasch model and CFA to the 12-item GHQ and identified 6 misfitting items. That study focused more on differential item functioning by age, gender, and treatment aims, and the discrimination and difficulty parameters, ICCs and information curves were not reported [31]. Our analysis found no misfitting items, which is not in line with that study. This inconsistency may be explained by the difference between MIRT models, which consider the correlation amongst dimensions, and unidimensional IRT models. Further, in our study, a graded response model was used within the MIRT framework, while Smith et al. [31] applied the Rasch model within the IRT approach. Since graded response models make fewer assumptions than Rasch models, they are more flexible and more likely to fit data generated from patient-reported outcomes [50].
As noted before, MIRT models are seldom applied to the GHQ-12. Stochl et al. [35] combined the GHQ-12 and Affectometer-2 in an item bank through a computerized adaptive testing method for public mental health research. They applied the MIRT model to the pooled items and reported that the proposed item bank was more efficient than the use of either measure alone. Our findings are not comparable with that study because the MIRT model was not applied to the two questionnaires separately. Further, in another study, MIRT models were applied to the GHQ-12, the Warwick-Edinburgh Mental Well-being Scale (WEMWBS) and EQ-5D items (Health Survey for England) [36]. It was reported that a model with two factors provided the best account of the GHQ-12 data, which is in line with our findings. As mentioned before, an advantage of IRT-based models is the amount of item information calculated from the item characteristic curves. The item information functions give the relative contribution of different items to the total information across different regions of the latent trait. Consequently, item information curves play a significant role in describing the items, selecting the most informative subset of items, and comparing efficiency between different tests [28,29]. In the psychological distress dimension, two items, face up to problems (item 8) and capable of making decision (item 4), were found to carry the least information. Furthermore, in the social dysfunction dimension, lost much sleep (item 2) carried less information over a broad range of the latent trait than the other items. Hence, a subset of more informative items can be selected, and a shortened version of the GHQ-12 can be developed.
The present study had a number of limitations that should be taken into consideration. First, the participants were from a general population. Thus, the results obtained cannot be extended to subgroups suffering from a serious chronic illness. Second, in this study, participants consisted of fathers or mothers of school children. Fathers and mothers probably perceive specific items of the GHQ-12 differently and, methodologically, combining them may be misleading [37]. Therefore, measurement invariance of the GHQ-12 across fathers and mothers should be assessed in future studies. The third limitation of this study was that the estimation of the MIRT parameters was not adjusted for the cluster sampling. However, in this study, the number of participants was almost the same in each cluster, and a simulation study by Lee et al. [51] indicated that a two-stage cluster estimator should be used when the number of participants per cluster differs significantly. Therefore, this limitation is unlikely to have materially affected the results presented. Finally, it is recommended that future studies address these limitations and try to extend our GHQ-12 findings to different subgroups.

Conclusion
Based on GHQ-12 data from the parents of school children, a MIRT model with two factors, namely psychological distress and social dysfunction, was successfully developed to examine the psychometric properties of the questionnaire. Additionally, item fit statistics were used to assess individual items, and information curves described the amount of information carried by individual items. MIRT models can be adopted as a powerful tool to examine the psychometric properties of questionnaires designed with an intentional multidimensional structure. It is hoped that published articles on MIRT models will stimulate their increased use in the field of health psychology.

List of abbreviations
MIRT: multidimensional item response theory