A 13-item Health of the Nation Outcome Scale (HoNOS 13): Validation by Item Response Theory (IRT) in people with Substance Use Disorder

The Health of the Nation Outcome Scale (HoNOS) is a widely used 12-item tool to assess mental health and social functioning. The French version has an added 13th item measuring adherence to psychotropic medication. The aim of the current study is to uncover the unknown pattern of item 13 and to compare the unidimensional and multidimensional fit of both the original HoNOS 12 and the new HoNOS 13 using Item Response Theory (IRT) modelling. This research question was studied among inpatients with substance use disorder (SUD). Methods: IRT graded-response models were fitted, including a one-factor model and a two-factor model separating psychiatric/impairment-related from social-related issues.

In the same vein, several other factor structures have been proposed, but none of these has shown acceptable fit. Rasch analyses (8) demonstrate the absence of an underlying construct in the composite scale (9,10).
Moreover, perceptions of the value of the outcome measurement system seem to be mixed (11). For instance, Bebbington et al. (12) perceive HoNOS at best as a measure of social functioning, and Bender (13) reported a lack of studies on the ability of HoNOS to support the improvement of mental health services. Despite these limitations, the HoNOS continues to be widely used to evaluate mental health patients in inpatient and ambulatory settings (14,15).
Medication non-adherence is known to be an important factor influencing clinical outcomes (16). To take this factor into account, a 13th item concerning "Problems with psychotropic medication compliance" was added to the HoNOS (HoNOS-13) in its French version (17).
So far, the psychometric properties of HoNOS have been measured in patients with general psychiatric disorders. Only a few studies (18) have specifically measured these in patients with a main diagnosis of substance use disorder (SUD). Despite several controversies about the factorial structure of HoNOS, it has been suggested that the items could help to identify specific subgroups of patients with particular needs (19).
A well-established statistical approach for representing both item and test-taker characteristics is Item Response Theory (IRT) modelling. IRT can help to understand the impact of each item on the latent construct. Such an approach clearly contrasts with the methods used in previous HoNOS-related studies, which were based on reliability scores (focusing on the way each item relates to the total score), such as Exploratory Factor Analysis (EFA) used in the Classical Test Theory (CTT) context, or Confirmatory Factor Analysis (CFA).
An important feature of IRT compared to CTT is that item properties do not depend on a representative sample (20). IRT is based on the idea that the probability of a correct response to an item is a mathematical function of person and item parameters. Although its origins date back to the mid-20th century, it was not widely applied until the late 1970s and 1980s, following the work of pioneers like Rasch (21), Samejima (22) and Bock (23). IRT typically uses a logistic model to estimate the probability of various types of item responses and thus to describe item functioning along a continuum (24). Under IRT, the primary purpose of administering a psychometric test is to locate the person taking it on the latent trait scale. If such a latent trait measure can be obtained for each person taking the test, two goals can be achieved. First, the respondent can be evaluated for the severity of the characteristic of interest and, second, comparisons among respondents can be made to assign severity grades (25) under the appropriate IRT model. Within the IRT family, the logistic graded response model (GRM) is a cumulative probability model developed by Samejima (22) and designed for Likert-type items.
The aim of the current study is to uncover the unknown pattern of item 13 and to compare the fit of unidimensional and multidimensional models of both the original HoNOS 12 and the new HoNOS 13 using IRT modelling in a sample of patients with SUD.

Methods
The data for this study were extracted from the hospital electronic medical record system by experienced data extractors and cover the period from February 2015 to September 2019. They concern patients with SUD admitted to a specialized addiction unit of a large university hospital. The population consisted mainly of men (70.7%), with a mean age of 43.3 (SD 11.5) years. During the reported period, the number of hospitalizations ranged from 1 to 13, with a median length of stay of 15 days (2-690). The median HoNOS score was 16 at admission and 11 (0-37) at discharge. The questionnaire was administered by the psychiatrists working in the hospital unit, who had received a training session on the use of this tool. The Geneva ethics committee approved this study (ClinicalTrials.gov Identifier: NCT03551301). Six hundred and nine (609) valid HoNOS questionnaires were analyzed. The validated French version of HoNOS (HoNOS-F) (7) was used with an added 13th item (HoNOS 13) concerning medication adherence.

Statistical analysis
In this study, we used the multidimensional extension of IRT (MIRT) with the graded response model (GRM) because HoNOS is a polytomous, ordered categorical scale. The items are rated on a 5-point Likert scale from 0 (no problem) to 4 (severe to very severe problem). In the GRM, two types of parameters are estimated: the discrimination parameter and the difficulty parameter.
Because GRM is an ordered logistic model, parameters of each item were naturally estimated in increasing order.
For K-category (k = 0, …, K-1) ordinal items, the multidimensional GRM is written following (26), where k is the response category selected by individual j for item i. MIRT model parameters are estimated using the same procedures as for unidimensional models, the only difference being the number of dimensions included in the model.
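A common way of writing the multidimensional GRM, in slope-intercept form, is sketched below. The symbols a_i (vector of item discriminations), d_ik (category intercepts) and θ_j (vector of latent traits) are assumptions introduced here for illustration and may differ from the notation of (26).

```latex
P(X_{ij} \ge k \mid \boldsymbol{\theta}_j)
  = \frac{1}{1 + \exp\left[-\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_{ik}\right)\right]},
  \qquad k = 1, \dots, K-1,

P(X_{ij} = k \mid \boldsymbol{\theta}_j)
  = P(X_{ij} \ge k \mid \boldsymbol{\theta}_j) - P(X_{ij} \ge k+1 \mid \boldsymbol{\theta}_j),
```

with the conventions P(X_ij ≥ 0 | θ_j) = 1 and P(X_ij ≥ K | θ_j) = 0. In the unidimensional case, the intercept relates to the difficulty parameter through d_ik = -a_i b_ik, so that the cumulative probability equals 0.5 at θ_j = b_ik.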
These parameter estimates can be obtained using the mirt package (27) for the free R software (28).
A high discrimination parameter suggests that an item has a high ability to differentiate subjects. In practice, a high discrimination parameter value means that the probability of endorsing an item response increases more rapidly as the latent trait or severity increases (29).
When discrimination is high (and the item response function is steep), the item provides more information on the latent trait and the information is concentrated around item difficulty. Items with low discrimination parameters, albeit less informative, may provide information over a wider range of the latent trait. With a logistic model for the item characteristic curve (ICC), Baker (25) proposed the following ranges of values to better interpret the discrimination parameter: 0 = no discriminative power; 0.01-0.34 = very low; 0.35-0.64 = low; 0.65-1.34 = moderate; 1.35-1.69 = high; >1.70 = very high; and +infinity = perfect.
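The ranges listed above can be turned into a small lookup, sketched here in Python. The function name `baker_label` is hypothetical, and the treatment of values that fall between two published ranges (for example 0.345 or exactly 1.70) is an assumption: each range is extended up to the lower bound of the next one.

```python
import math

# Upper bounds (exclusive) and labels following the ranges attributed to Baker (25).
# Assumption: gaps between published ranges are assigned to the lower category.
BAKER_BOUNDS = [
    (0.35, "very low"),
    (0.65, "low"),
    (1.35, "moderate"),
    (1.70, "high"),
]


def baker_label(a: float) -> str:
    """Return the verbal label for a discrimination parameter a >= 0."""
    if math.isnan(a) or a < 0:
        raise ValueError("discrimination must be a non-negative number")
    if a == 0:
        return "none"
    if math.isinf(a):
        return "perfect"
    for upper, label in BAKER_BOUNDS:
        if a < upper:
            return label
    return "very high"


if __name__ == "__main__":
    # Discrimination values quoted in the Results section of this paper.
    for a in (0.33, 0.57, 0.70, 1.17, 1.75, 2.73):
        print(f"a = {a:.2f}: {baker_label(a)}")
```

Applied to the discrimination values reported in the Results, this lookup reproduces the verbal categories used there.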
In the GRM, the number of thresholds is equal to the number of response categories minus 1. In this study, we had five response alternatives, yielding four thresholds. The item threshold in the GRM refers to the level of the latent variable at which an individual has a 50% probability of responding in a given category or higher (30). Table 1 shows the distribution of HoNOS and its four dimensions as found by Wing et al. (1).
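To make the role of the discrimination and threshold parameters concrete, the following minimal Python sketch computes the category probabilities of a single item under a unidimensional GRM. The function name and the parameter values in the usage lines are hypothetical and are not estimates from this study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a unidimensional GRM.

    theta: latent trait value
    a: discrimination parameter
    thresholds: strictly increasing difficulties b_1 < ... < b_{K-1}
    Returns a list of K probabilities for categories 0..K-1.
    """
    if any(b2 <= b1 for b1, b2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative probabilities P(X >= k) for k = 1..K-1,
    # padded with P(X >= 0) = 1 and P(X >= K) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative ones.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


if __name__ == "__main__":
    # Hypothetical item with five categories (four thresholds).
    probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 0.0, 1.0, 2.0])
    print([round(p, 3) for p in probs])
```

In this sketch, a person located exactly at a threshold has a cumulative probability of 0.5 of responding in the corresponding category or higher, which is the interpretation of the threshold given above.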
Using the data at admission, we first fitted a four-factor (latent trait) model of HoNOS 12 as found by Wing et al. (1). Next, a one-factor model as found by Lauzon et al. (7) was fitted. For HoNOS 13, we fitted a one-factor and a two-factor model. This two-factor model was identified by expert consensus, which suggested grouping items 1 to 8 and 13 on one factor and items 9 to 12 on the other. The first factor would capture psychiatric/impairment-related issues, while the second factor would reflect social-related issues.
Nested models were compared with the anova function of the mirt package in R.
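A comparison of nested models of this kind typically rests on a likelihood-ratio test. The sketch below, in Python rather than R, shows the underlying computation using only the standard library. The function names are hypothetical, the log-likelihood values in the usage lines are invented for illustration, and the sketch is not a reproduction of the mirt implementation.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed forms for df = 1 (a = 1/2) and df = 2 (a = 1), then the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Return (statistic, p_value) for two nested models."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df_diff)


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a one-factor and a two-factor model.
    stat, p = likelihood_ratio_test(-9050.0, -9020.0, df_diff=1)
    print(f"chi-square = {stat:.1f}, p = {p:.3g}")
```

A small p-value indicates that the additional parameters of the larger model improve the fit more than would be expected by chance.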
Model fit was assessed with the root mean square error of approximation (RMSEA; values <0.08 indicating acceptable fit and <0.06 good fit) and the comparative fit index (CFI; values >0.90 indicating acceptable fit and >0.95 good fit) (31,32). Information criteria, specifically the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), and the sample-size-adjusted BIC (SABIC), were also used, given that AIC and BIC are specifically designed to penalize model complexity. Robust values for RMSEA and CFI were reported.
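The information criteria named above are simple functions of the maximized log-likelihood, the number of estimated parameters and the sample size. The following Python sketch uses their usual definitions; the function name is hypothetical, and the log-likelihood and parameter count in the usage lines are invented for illustration (only the sample size of 609 comes from this study).

```python
import math


def information_criteria(loglik: float, n_params: int, n_obs: int) -> dict:
    """Compute AIC, AICc, BIC and SABIC from a maximized log-likelihood."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    deviance = -2.0 * loglik
    aic = deviance + 2.0 * n_params
    # Small-sample correction of the AIC.
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = deviance + n_params * math.log(n_obs)
    # Sample-size-adjusted BIC: n is replaced by (n + 2) / 24.
    sabic = deviance + n_params * math.log((n_obs + 2) / 24.0)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "SABIC": sabic}


if __name__ == "__main__":
    # Hypothetical log-likelihood and parameter count; n_obs = 609 as in this study.
    for name, value in information_criteria(-9020.0, 66, 609).items():
        print(f"{name}: {value:.1f}")
```

For all four criteria, lower values indicate a better trade-off between fit and complexity, which is how they are read in the Results.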
A cross-validation study was performed on HoNOS data recorded at discharge.
All analyses and plots were obtained using R. More specifically, the mirt (multidimensional item response theory) package was used with the full-information maximum likelihood (FIML) estimator. The FIML estimator is recommended when estimating IRT models with relatively small sample sizes (33). The CFA was carried out with the lavaan package.

Sample size requirements
Forero and Maydeu-Olivares (33), cited by Depaoli et al. (34), found that sample sizes as small as 200 were sufficient for the parameter estimation of a graded response model. On the other hand, Jiang et al., also cited by Depaoli et al. (34), showed that a sample size of 500 provided accurate parameter estimates in the case of a three-dimensional GRM composed of 30 to 90 items, each with four response categories (35). The sample size at hand (609) is therefore adequate for a two-dimensional GRM with 13 items and fulfilled the necessary requirements.

Results
The results of the CFA performed in HoNOS 12 and HoNOS 13 respectively are presented in Table 2.

Model fit
We note that while the one-factor model of HoNOS 12 yielded a mediocre fit (RMSEA = 0.093, CFI = 0.899), the four-factor model advocated by Wing et al. (1) was not identified. Several reasons could explain this, the most likely being that the model was too complex.
As for HoNOS 13, the expert-consensus two-factor model yielded better goodness-of-fit values than the one-factor solution. AIC, AICc, BIC and SABIC were lower than in the one-factor model. Moreover, the two-factor model fulfilled the criteria for satisfactory RMSEA and CFI statistics (0.075 and 0.929, compared to 0.088 and 0.901 for the one-factor model). From these empirical findings, we conclude that the 13-item scale can be conceptualized as a two-factor model because it produces a significantly improved fit over the unidimensional model. Moreover, the significant p-value yielded by an anova test comparing the two nested mirt objects suggests that the two-factor model is superior to the competing one-factor model (p<0.001). The consensus across indices therefore made the two-factor model the best candidate for further evaluation. Table 3 presents the item loadings in the one-factor solution compared to the two-factor solution. Most of the items have a higher loading on their respective factor in the two-factor model than in the one-factor model. The standardized loadings of all items on their respective factor exceed 0.30, except for item 5, which has a loading of 0.20. The high loading of item 13 on its factor, the highest compared with the other items, strongly supports its inclusion in the conceptualization of the scale.
In Table 4 we present the GRM estimates. In terms of the ranges proposed by Baker (25), items 9, 10, 11 and 12 had very high discriminative power (range: 1.75 to 2.73), items 1, 2, 3, 4, 7, 8 and 13 had moderate discriminative power (range: 0.70 to 1.17), and items 5 and 6 showed very low to low discriminative power (0.33 and 0.57). Items with negative difficulty parameters are considered to be more frequently endorsed (items 3 and 7 to 13), and items with positive difficulty parameters are the less frequently endorsed ones (items 1, 2 and 4 to 6). Two other links can be established between the more and the less endorsed items: 1) the discrimination parameters are higher in the former than in the latter, and 2) the more endorsed items tend to cluster on Factor 2 while the less endorsed ones cluster on Factor 1.
Because the GRM is defined in terms of cumulative probabilities, it allows cumulative comparisons. The difficulties represent the point at which a person with θ = b_ik has a 50% chance of responding in category k or higher (36). For example, looking at the estimated parameters for item 13, we see that a person with θ = -0.14 has a 50% chance of answering 1 versus greater than or equal to 2, and a person with θ = 0.47 has a 50% chance of answering 1 or 2 versus greater than or equal to 3. Similarly, a person with θ = 1.48 has a 50% chance of answering 1, 2, or 3 versus greater than or equal to 4, and a person with θ = 2.64 has a 50% chance of answering 1, 2, 3, or 4 versus 5. We note that the ratings for item 13 span a broad range of the latent trait and that its discrimination parameter is relatively high.
Unlike the one-dimensional item characteristic curve (ICC) in IRT, for a general MIRT model with a latent variable of dimension p, the expected total score is a response surface in a p-dimensional space. In our case, Figure 1 (Expected Total Score) represents the expected total score in a two-dimensional space. After standardization, the first component (items 1 to 8 and 13) contributes less to the expected total score than the second one.
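The idea of an expected total score surface can be sketched in Python as follows, assuming a slope-intercept form of the multidimensional GRM in which the cumulative probability of category k or higher is a logistic function of a_i·θ + d_ik. The function names, the two items and all parameter values are hypothetical and serve only to show how the surface is built; they are not the estimates of Table 4.

```python
import math


def expected_item_score(theta, slopes, intercepts):
    """Expected score (0..K-1) of one item under a multidimensional GRM.

    theta, slopes: sequences with one entry per latent dimension
    intercepts: category intercepts d_1 > ... > d_{K-1}
    Uses E[X] = sum over k = 1..K-1 of P(X >= k).
    """
    if len(theta) != len(slopes):
        raise ValueError("theta and slopes must have the same length")
    linear = sum(a * t for a, t in zip(slopes, theta))
    return sum(1.0 / (1.0 + math.exp(-(linear + d))) for d in intercepts)


def expected_total_score(theta, items):
    """Sum of expected item scores; items is a list of (slopes, intercepts)."""
    return sum(expected_item_score(theta, s, d) for s, d in items)


if __name__ == "__main__":
    # Two hypothetical items: one loading on factor 1, one on factor 2.
    items = [
        ((0.9, 0.0), (1.0, 0.0, -1.0, -2.0)),
        ((0.0, 2.0), (2.0, 0.5, -0.5, -2.0)),
    ]
    # Evaluate the surface on a coarse grid of the two latent traits.
    for t1 in (-2.0, 0.0, 2.0):
        row = [expected_total_score((t1, t2), items) for t2 in (-2.0, 0.0, 2.0)]
        print(t1, [round(v, 2) for v in row])
```

In such a sketch, the dimension whose items have larger slopes produces the steeper change in the expected total score, which is the kind of contrast between components that the surface is meant to display.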
The cross-validation study conducted on HoNOS data at discharge confirmed that the expert-consensus model was superior to the one-component model (RMSEA = 0.062 and CFI = 0.955).

Discussion
The present study investigated the psychometric properties of the HoNOS 13 compared to HoNOS 12 in a large sample of inpatients with SUD. It is, to our knowledge, the first to examine such differences using an IRT model. We found that a one-dimensional instrument was not reliable to use as a primary outcome. The two-factor model of HoNOS 13, which resulted from expert consensus, seemed to reflect the data best. In this model, the first factor groups psychiatric/impairment-related issues (symptoms), while the second factor reflects social-related issues (problems). The lower loadings observed for factor 1 (especially for items 5 and 6) are likely due to the heterogeneity of the psychiatric symptoms (19,37) assessed by the HoNOS. This is probably the main reason for the heterogeneity of results of CTT-based studies (38,39), which give more importance to specific item scores than to a global score. The higher loadings observed for the social-related issues may reflect a form of commonality of such problems among individuals with SUD and/or psychiatric disorders. Similar figures for the social-related items were observed in another study using a sample with psychiatric disorders (19).
We also found that the discrimination estimates for the items ranged from 0.33 to 2.73, indicating that some items of HoNOS 13 show rather low discrimination ability whereas others have high levels (Figure 2 (Item Characteristic Curves (ICC)) and Table 4). However, the weakness of the factor loadings of items 5 and 6 in the two-component model is a matter of concern, which their item characteristic curves (cf. Figure 2) reflect. Item 5 measures physical impairment and item 6 hallucinations.
These items seem to be less important in our specific group of patients with SUD. As the sample was taken from a specialized addiction unit, patients were typically treated for substance withdrawal and were less commonly admitted for acute psychiatric disorders. This may explain the fewer problems with hallucinations (item 6), as found in a study by Andreas et al. (18). Even though comorbid substance use is common among patients with psychotic disorders (40), these patients are more likely to be treated in psychiatric units. In the present sample, 22.9% of the subjects scored higher than zero on this item, showing some symptoms, which were however not strongly linked to the overall severity of the latent trait (Table 4). A similar comment could be made for item 5 (physical illness or disability problems), on which 37.4% of the participants scored from 1 to 4, showing that such issues are common among patients with SUD (41,42) without contributing strongly to capturing the severity of the latent trait. Patients presenting important physical impairments are perhaps more often admitted to general hospital units for withdrawal and treatment of comorbid physical disorders. The removal of items 5 and 6 could yield stronger goodness-of-fit measures. However, the development of a scale is not solely a statistical matter: model modification based on modification indices may result in models that lack validity and are highly susceptible to capitalization on chance. Therefore, the modifications should be defensible from a theoretical point of view (43). For these reasons, a safe approach is to consider the scale in its entirety, that is, using all 13 items. In particular, removing such items could be problematic when considering other populations, such as those admitted to acute psychiatric wards. However, the present data suggest that loadings and IRT results can be expected to vary according to the specific population (especially for the Factor 1, symptom-related items).
For instance, in the present study, item 3 (problem drinking or drug taking) shows the highest discriminative ability among the Factor 1 items (moderate discriminative power), as expected for patients admitted to a specialized addictive disorders unit. A similar figure is observed for item 13 (Figure 2, Table 4). This item may contribute in a more transdiagnostic way to the latent construct. Further studies using IRT in other populations are needed to assess the role of this item.
By contrast, the issues assessed by the Factor 2 items were found to have very high discriminative power. These problems are common among patients with SUD as well as patients with other mental disorders (44,45) and were also observed in studies using HoNOS in inpatients admitted for psychiatric disorders (19). The importance of social problems among people with addictive disorders (46,47) and their influence on the rate of service use (48) have been repeatedly observed, especially for more severe forms and longer durations of substance use. Social problems seem to play an important role in the overall severity. This highlights the importance of community and recovery-oriented interventions (49,50) as well as of approaches focusing on transdiagnostic factors involved in such difficulties, such as theory of mind (51) or self-stigma (52).
HoNOS 13 can be recommended as a clinical evaluation tool to assess the problems and treatment needs for inpatients with SUD. It is necessary to assess the two-factor model suggested in this study in other patient groups. It could be hypothesized that loadings and discriminative power may change across items depending on the clinical characteristics of a given population. For people with psychiatric and addictive disorders, the items related to the second factor and probably item 13 may show more constant characteristics.
Nevertheless, IRT seems to be of particular interest when analyzing symptoms using the HoNOS scale.
This analysis presents one main limitation, as it used routinely collected administrative and clinical data. It was therefore not possible to have more detailed information about individual patients, such as specific measures of addiction severity, duration of treatment, and marital or family status.

Conclusions
The 13-item questionnaire including medication compliance was validated in this analysis. Despite the above limitation, the HoNOS-13 scale, which includes the question "Problems with psychotropic medication compliance", can be recommended as a valid clinical evaluation tool to assess the problems and treatment needs of inpatients with SUD. In IRT analyses, the items related to substance use and item 13 showed moderate discriminative ability to capture the severity of the latent construct, whereas the items related to the second factor (social problems) showed higher discriminative ability.
Funding: No funding was obtained for this study.

Abbreviations
Disclosure statement: The authors declare that they have no conflict of interest.
Competing interests: Not applicable.
Data availability: Data can be made available by the corresponding author upon request.
Authors' contribution: All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by AC. The first draft of the manuscript was written by AC and LP and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Figure 2: Item Characteristic Curves (ICC)