Participants and procedures
This cross-sectional study of health care workers, conducted at Chiang Mai University (CMU) between February and June 2013, was approved by the Ethics Review Committee for Research in Human Subjects, Faculty of Medicine, Chiang Mai University. Health care workers were classified into three groups. The first group consisted of doctors, dentists, nurses and pharmacists (42.1%). The second group comprised “other health professionals” and other health-related positions (19.4%). The last group was “nonhealth professionals” and mainly consisted of workers (38.5%). A detailed description of the study has been published [34]. Briefly, of the 5,364 people working for the Faculty of Medicine of CMU, 4,022 (75.0%) responded to the survey and 3,532 (65.8%) consented to participate in the study. In the end, 3,204 participants (59.7% response rate) completed the self-rated online questionnaires, which comprised the PHQ-9 and demographic information: age, sex, education level and alcohol consumption.
Measure
Patient Health Questionnaire (PHQ-9)
The PHQ-9 is a self-report tool consisting of nine questions regarding depressive symptoms based on the DSM-IV criteria for a major depressive episode (Kroenke et al., 2001). The questions cover the following symptoms: lack of interest, depressed mood, sleeping difficulties, tiredness, appetite problems, concentration problems, psychomotor agitation/retardation, negative feelings about self and suicidal ideation. The respondent was asked how often he/she had experienced each symptom during the past two weeks. Items were administered on a 4-point Likert scale with the response options 0 “not at all”, 1 “several days”, 2 “more than one half of the days” and 3 “nearly every day”. The Thai version of the PHQ-9 was shown to have acceptable psychometric properties to screen for major depression in the primary care setting (Lotrakul et al., 2008).
Statistical analysis
Demographic data were described using mean, SD and frequency. The Rasch rating scale model was used to verify the construct validity of the PHQ-9.
Rasch analysis is a mathematical method for calibrating linear logit measures of item difficulty and person ability from ordinal data. To examine the PHQ-9 construct, a firmly established calibration of item measures was needed to make inferences about the construct. According to the Rasch model, the probability of an individual’s response depends on both “person ability” and “item difficulty” [35]. Herein, “person ability” refers to the extent to which the participant experiences depression and “item difficulty” refers to the severity of depression expressed by the item. The response probabilities of each person to each individual item are modeled as a logistic function of the latent depression trait. The model yields person and item depression estimates, as well as estimates of a set of thresholds between response categories that is common to all items. An item estimate below 0 (the mean) indicates an item that is easy to endorse; correspondingly, a person estimate below 0 indicates a lower level of depression. The opposite interpretation applies when item and person estimates are above 0.
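To make the logistic form concrete, the following is a minimal sketch of the category probabilities under a rating scale model with thresholds shared across items. The function name, the parameter names (theta, delta, taus) and the numeric values are illustrative assumptions; this is not the estimation routine used in the study.

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Rating scale model probabilities for categories 0..len(taus).

    theta: person measure (logits), delta: item difficulty (logits),
    taus: thresholds between adjacent categories, shared by all items.
    All names and values here are hypothetical.
    """
    # Cumulative sums of (theta - delta - tau_k); category 0 has log-numerator 0.
    log_numerators = [0.0]
    for tau in taus:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, delta, taus):
    """Model-expected item score for a person at theta."""
    probs = rsm_category_probs(theta, delta, taus)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical 4-category item (three thresholds), as for a PHQ-9 item.
probs = rsm_category_probs(theta=0.0, delta=0.5, taus=[-1.0, 0.0, 1.0])
```

In this sketch, a person whose measure lies further above the item difficulty has a higher expected item score, which is the ordering the model assumes.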
To test whether the data fit the Rasch model, the information-weighted fit (infit) mean square (MnSq) and the outlier-sensitive fit (outfit) MnSq were used. An item with an infit or outfit MnSq outside the 0.7–1.5 range was considered a misfit [36]. The performance of the scale was examined using Rasch fit statistics, and the dimensionality of the scale was examined using principal component analysis (PCA) of the standardized residuals. To indicate unidimensionality, there should be no meaningful pattern in the residuals. The first residual dimension is usually expected to have a value smaller than 2.0, a level that has been shown to arise from random variation alone [37]. In addition, fit statistics <0.6 indicate that items overfit the model, usually because they share some components of meaning with other items [22].
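As an illustration of how the two fit statistics differ, the sketch below computes infit and outfit MnSq for one item from observed scores, model-expected scores and model variances. The inputs are hypothetical and would normally come from a fitted model; the function names are assumptions made for this example.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item.

    observed: observed scores, one per person
    expected: model-expected scores for the same persons
    variance: model variances of those scores
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variance)) / len(observed)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit

def is_misfit(infit, outfit, low=0.7, high=1.5):
    """Flag an item when either mean square falls outside [low, high]."""
    return not (low <= infit <= high and low <= outfit <= high)

# Hypothetical values for three respondents.
infit, outfit = fit_mean_squares([1, 0, 2], [0.5, 0.5, 1.5], [0.25, 0.25, 0.25])
```

Because outfit divides each squared residual by its own variance, a single unexpected response from a person far from the item can inflate it, whereas infit gives more weight to persons located near the item.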
Local dependency, referring to items sharing a latent trait other than depression, was tested using the correlation (r) of the Rasch residuals between each pair of items; r ≤0.3 was considered acceptable [38].
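The pairwise check can be sketched as follows, assuming the standardized residuals of each item are available as a list with one value per person. The item labels, residual values and function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def dependent_pairs(residuals, cutoff=0.3):
    """List item pairs whose residual correlation exceeds the cutoff.

    residuals: dict mapping item label -> list of residuals (one per person).
    """
    flagged = []
    for item_a, item_b in combinations(residuals, 2):
        r = pearson(residuals[item_a], residuals[item_b])
        if r > cutoff:
            flagged.append((item_a, item_b, r))
    return flagged

# Hypothetical residuals for three items and four persons.
example = {
    "item1": [1.0, -1.0, 0.5, -0.5],
    "item2": [0.9, -1.1, 0.6, -0.4],
    "item3": [-1.0, 1.0, -0.5, 0.5],
}
flagged = dependent_pairs(example)
```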
Item ordering, meaning that a higher severity of a symptom should correspond to a higher response category, was examined using category functioning. The threshold estimates for the 4-category response options were examined to verify whether participants discriminated between the available ordered response categories. Threshold disordering was examined in two ways: first, by checking that the infit and outfit MnSq lay between 0.7 and 1.3; and second, by the ordering of the observed averages. For acceptable response categories, the average difficulties (average measures) and step difficulties (step measures) should increase monotonically.
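The monotonicity criterion can be expressed as a simple check, sketched here with hypothetical category statistics; the function names and values are assumptions for illustration only.

```python
def is_strictly_increasing(values):
    """True when every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

def categories_ordered(observed_averages, step_measures):
    """Check that average measures and step measures both advance with category."""
    return is_strictly_increasing(observed_averages) and is_strictly_increasing(step_measures)

# Hypothetical values: four categories (0-3) and three steps between them.
ordered = categories_ordered([-2.1, -0.8, 0.3, 1.2], [-1.5, 0.1, 1.4])
```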
We used a person-item (Wright) map to plot item difficulties and person abilities along the same logit axis, allowing us to evaluate how well the item difficulties matched the abilities of the individuals. Using the Wright map, we examined the extent to which the item positions matched the person positions (targeting). Targeting is best when the mean item measure is at the same level as the mean person measure. It has been suggested that the difference between the mean item measure and the mean person measure should be within one logit [39]. Floor or ceiling effects could also be visualized using this map.
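The one-logit criterion amounts to a comparison of two means, as in the following sketch. The measures shown are hypothetical and the function names are assumptions.

```python
def targeting_gap(person_measures, item_measures):
    """Mean person measure minus mean item measure, in logits."""
    mean_person = sum(person_measures) / len(person_measures)
    mean_item = sum(item_measures) / len(item_measures)
    return mean_person - mean_item

def is_well_targeted(person_measures, item_measures, max_gap=1.0):
    """True when the two means differ by no more than max_gap logits."""
    return abs(targeting_gap(person_measures, item_measures)) <= max_gap

# Hypothetical measures: persons sit well below the items on the logit scale.
gap = targeting_gap([-2.5, -1.5, -0.5], [-0.5, 0.0, 0.5])
```

A negative gap in this sketch means that, on average, the persons lie below the items, that is, the items describe more severe depression than the sample typically reports.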
We tested for differential item functioning (DIF) across sex, age, education and alcohol consumption. Both a statistical test and the DIF contrast were used; a DIF contrast >0.64 indicated substantial DIF [40].
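The sketch below treats the DIF contrast as the difference between an item's difficulty estimates in two groups and pairs it with a t-type statistic based on the joint standard error. The exact test used in the study is not specified here, so the statistic is an assumption, as are the function name and the numeric values; the absolute value of the contrast is compared with the cutoff so that DIF in either direction is flagged.

```python
import math

def dif_contrast(measure_a, se_a, measure_b, se_b, cutoff=0.64):
    """Return (contrast, t, substantial) for one item and two groups.

    measure_a, measure_b: item difficulty estimated separately in each group (logits)
    se_a, se_b: standard errors of those estimates
    """
    contrast = measure_a - measure_b
    # Assumed test statistic: contrast divided by the joint standard error.
    t = contrast / math.sqrt(se_a ** 2 + se_b ** 2)
    return contrast, t, abs(contrast) > cutoff

# Hypothetical estimates for one item in two groups.
contrast, t, substantial = dif_contrast(0.9, 0.3, 0.1, 0.4)
```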
Finally, reliability was evaluated using the person separation index (comparable to Cronbach’s alpha). The person separation index denotes how well the test is able to differentiate among groups of respondents with different levels of depression. An acceptable value for separation is at least 2. Item reliability was assessed using the item separation index. An item separation value below 3 together with an item reliability below 0.9 implies that the sample is not large enough to confirm construct validity, or that there is a problem with the item hierarchy of the instrument [40].
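The separation and reliability cutoffs are linked by the standard relation R = G²/(1 + G²), where G is the separation index and R the reliability. The short sketch below, with assumed function names, shows that a separation of 2 corresponds to a reliability of 0.8 and a separation of 3 to a reliability of 0.9.

```python
import math

def reliability_from_separation(separation):
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)

def separation_from_reliability(reliability):
    """Separation index implied by a reliability R: sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))

person_reliability_at_cutoff = reliability_from_separation(2.0)
item_reliability_at_cutoff = reliability_from_separation(3.0)
```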
All analyses were conducted using IBM SPSS for Windows, Version 22 (Chicago, IL, USA) and STATA, Version 14 (); the Rasch models were fitted using WINSTEPS [40].