A series of web-based surveys was developed for administration to Australian adolescents (aged 11–17 years) living in the community through Pureprofile (an online survey panel company) following informed dyad (parent and adolescent) consent to participate. All consenting adolescents were asked to complete the CHU9D and one of the other five PROMs (AQoL-6D, HUI 2&3, EQ-5D-Y, KIDSCREEN-10 and PedsQL SF15). In addition, each respondent completed self-reported questions on general health, disability status and socio-demographic characteristics, including the Family Affluence Scale (FAS, a 4-item adolescent-specific measure of family socio-economic status; the total score ranges from 0 to 7 and was categorised into three groups: 0–3 low, 4–5 medium and 6–7 high) [26]. More details on the web-based surveys have been published elsewhere [23, 24].
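The FAS grouping described above can be illustrated with a minimal sketch. The function name `fas_category` is hypothetical and is not part of the study's analysis code; only the cut-points (0–3, 4–5, 6–7) come from the text.

```python
def fas_category(total_score: int) -> str:
    """Map a FAS total score (0-7) to the affluence group given in the text."""
    if not 0 <= total_score <= 7:
        raise ValueError("FAS total score must be between 0 and 7")
    if total_score <= 3:
        return "low"      # scores 0-3
    if total_score <= 5:
        return "medium"   # scores 4-5
    return "high"         # scores 6-7
```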
Generic Preference-based Instruments
The CHU9D is a generic preference-based measure of HRQoL specifically developed with and validated for use in paediatric populations aged 7–17 years [27, 28]. The CHU9D has nine dimensions (worried, sad, pain, tired, annoyed, schoolwork/homework, sleep, daily routine, and activities) each represented by a single item and has five levels of severity response categories.
The AQoL-6D Adolescent version is a generic preference-based utility instrument originally developed for application in adults but later adapted for use in adolescents [29]. The instrument has 20 items distributed across six dimensions (independent, mental health, coping, relationships, pain, and senses) with four to six levels of severity response categories.
The HUI 2/3 are generic preference-based utility instruments to measure health status and HRQoL [30]. The HUI2 was initially developed in the early 1990s for measuring and valuing long-term outcomes in patients with childhood cancer. The HUI3 represents a further development for use in both clinical and general populations. These two instruments are independent but complementary systems. The HUI2 has seven dimensions of HRQoL (sensation, mobility, emotion, cognition, self-care, pain and fertility) assessed on 4 or 5 levels and the HUI3 has eight dimensions (vision, hearing, speech, ambulation, dexterity, emotion, cognition and pain) on 6 distinct levels of response categories. Together, the 15-item HUI 2/3 instruments were included excluding the item related to fertility (not relevant to our study population).
The EQ-5D-Y is a paediatric adaptation of the widely applied generic instrument (EQ-5D-3L) originally developed for adults [31]. The EQ-5D-Y was developed by revising the content and wording of the adult version to ensure its relevance and clarity for children and adolescents. The EQ-5D-Y contains five dimensions (mobility, looking after myself, usual activities, pain or discomfort, and feeling worried, sad or unhappy) assessed on 3 levels of severity response categories.
Generic Non-preference-based Instruments
The KIDSCREEN-10 was developed to assess paediatric health and well-being in children and adolescents aged 8 to 18 years [32]. It was developed simultaneously across 13 European countries. The original instrument was a 52-item questionnaire covering 10 dimensions of HRQoL [33]. A shorter 27-item version was derived from the 52-item instrument, and a 10-item instrument was subsequently derived; the latter was used in this study.
The PedsQL SF15 was developed for use in paediatric health research [34]. It has four dimensions and consists of 15 items which measure physical (5 items), emotional (4 items), social (3 items) and school functioning (3 items), each rated on a 5-point Likert scale. In this study, the adolescent self-report version was used.
Item Banking
A total of 77 items from the 6 instruments were available and initially considered for creating an item bank. Two items on global health (KIDSCREEN-10 item-11, “In general, how would you say your health is?” and EQ-VAS, “How good is your health today?”) were not considered for further analysis as they did not measure a specific HRQoL domain leaving 75 items. After pooling raw response data from the 6 instruments, two approaches were taken:
- Overall item bank: All 75 items were pooled together to assess whether they could be calibrated on a single continuum to form a valid item bank. As all respondents completed the CHU9D, it was used as the anchoring instrument to link data from the other 5 instruments. For this, a separate Rasch analysis (described below in detail) was conducted on the CHU9D data only to obtain item parameter values for its 9 items, which were then anchored to build the 75-item item bank.
- Dimension-specific item banks: The 75 items were classified and binned across 9 dimensions of HRQoL by the authors separately. The separate classifications were then reviewed by the authors together and reconciled into one, with any discrepancies resolved through group discussion. The final dimensions were physical activities (13 items), mobility (6 items), mental health (17 items), social relationship (8 items), pain (7 items), fatigue & memory (5 items), coping (5 items), sensory (9 items) and school activities (5 items); see Supplementary Table 1.
Psychometric Properties Assessment Using Rasch Analysis
Rasch analysis is based on a probabilistic mathematical model which assumes that the probability of a given respondent affirming an item is a logistic function of the relative distance between the item’s location (i.e. item parameter) and the respondent’s health status [35]. By assessing these probabilities across the items, Rasch analysis estimates the health status of individual respondents (person parameters) and the values of the health states represented by items (weighted item parameters) on the same latent scale, expressed in log-odds units (logits, an interval-level scale). Rasch analysis is a widely accepted methodology to develop and validate PROMs, including item banks [13, 36]. An advantage of Rasch analysis is that it also provides detailed insights into whether an instrument could form a valid scale, based on its assessment of a series of psychometric properties [37]. The following Rasch-based psychometric properties were assessed for the item bank [20, 38].
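For reference, the logistic relationship described in this section can be written in its standard textbook form. The symbols are assumptions introduced here for illustration and do not appear in the original text: θ_n is the location of respondent n, δ_i is the location of item i, and τ_k is the k-th threshold of the shared rating scale in the Andrich rating scale model.

```latex
% Dichotomous Rasch model: probability that respondent n affirms item i
\begin{align}
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  &= \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)} \\
% Equivalent log-odds (logit) form
\ln\!\left(\frac{P(X_{ni} = 1)}{P(X_{ni} = 0)}\right)
  &= \theta_n - \delta_i \\
% Andrich rating scale model: adjacent categories k-1 and k of a polytomous item
\ln\!\left(\frac{P(X_{ni} = k)}{P(X_{ni} = k-1)}\right)
  &= \theta_n - \delta_i - \tau_k
\end{align}
```

The second line shows why person and item estimates share one scale: the log-odds of affirming an item is simply the difference between the two locations in logits.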
Measurement precision
Measurement precision is the ability of an instrument to discriminate between respondents with differing levels of the underlying construct. Precision is indicated by the person separation index (PSI) and person reliability (PR) coefficients; values of ≥ 2.0 and ≥ 0.8, respectively, are considered the minimally acceptable levels at which the instrument can differentiate people into three strata of the underlying construct. PR is also equivalent to Cronbach’s alpha (a traditional test of reliability); therefore, a PR ≥ 0.8 indicates acceptable internal consistency of the instrument [39].
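The link between the two thresholds and the "three strata" can be illustrated with the conventional Rasch relationships between separation (G), reliability (R) and strata (H): R = G² / (1 + G²) and H = (4G + 1) / 3. These formulas are not stated in the text and are assumed here as the standard definitions; the function names are hypothetical.

```python
import math


def reliability_from_separation(g: float) -> float:
    """Reliability R implied by a separation index G: R = G^2 / (1 + G^2)."""
    return g ** 2 / (1 + g ** 2)


def separation_from_reliability(r: float) -> float:
    """Separation index G implied by a reliability R (requires 0 <= R < 1)."""
    return math.sqrt(r / (1 - r))


def strata(g: float) -> float:
    """Number of statistically distinct strata: H = (4G + 1) / 3."""
    return (4 * g + 1) / 3
```

Under these assumed definitions, a separation of 2.0 corresponds to a reliability of 0.8 and to three strata, which is consistent with the paired cut-offs quoted above.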
Targeting
Ideally, there should be a good spread of items across the full range of respondents’ scores. When respondents have higher (ceiling) or lower (floor) construct than most of the items in the instrument, the range of construct coverage by the instrument may not be adequate. This leads to poor targeting. Targeting is estimated by determining the difference between the item mean (defined as 0 logits by default) and the mean of respondents’ measure; a difference of < 1.0 logits is desired.
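As a minimal sketch of the targeting check, the gap between the mean person measure and the mean item measure can be computed directly from the logit estimates. The function names are hypothetical, and taking the absolute value of the difference is an assumption (the text only states that a difference of < 1.0 logits is desired).

```python
from statistics import mean


def targeting_gap(person_measures, item_measures):
    """Absolute difference (in logits) between mean person and mean item measures."""
    return abs(mean(person_measures) - mean(item_measures))


def is_well_targeted(person_measures, item_measures, threshold=1.0):
    """True when the gap is below the threshold (1.0 logits in the text)."""
    return targeting_gap(person_measures, item_measures) < threshold
```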
Unidimensionality
Item fit statistics and principal component analysis (PCA) of residuals were used to assess whether the item bank attained the requirements of unidimensionality.
Item fit statistics
Chi-square fit statistics (mean square, MnSq) were used to assess how well the data fit the Rasch model. Misfitting items may be measuring a construct different from that of the other items in the instrument, indicating multi-dimensionality. There are two item fit statistics, infit and outfit. Infit is more sensitive to the pattern of responses to items closely targeted to the respondents’ ability, whereas outfit is more influenced by outliers (respondents with very high or low levels of the construct). A fit statistic value between 0.50 and 1.50 is considered acceptable and productive for measurement [40].
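The difference between infit and outfit can be sketched for the simplest case. This example assumes a dichotomous Rasch model with already-estimated person and item locations, which is a simplification of the polytomous rating scale model used in the study; it is not the Winsteps implementation, and all function names are hypothetical. Outfit is the unweighted mean of squared standardised residuals, while infit weights the squared residuals by the model variance.

```python
import math


def rasch_prob(theta: float, delta: float) -> float:
    """Model probability of affirming an item (dichotomous Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_fit(responses, thetas, delta):
    """Return (infit MnSq, outfit MnSq) for one item scored 0/1."""
    sq_residuals, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, delta)
        variances.append(p * (1 - p))        # model variance of the response
        sq_residuals.append((x - p) ** 2)    # squared raw residual
    # Outfit: mean of squared standardised residuals (sensitive to outliers)
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(sq_residuals)
    # Infit: information-weighted mean square (sensitive to well-targeted responses)
    infit = sum(sq_residuals) / sum(variances)
    return infit, outfit


def is_acceptable(mnsq: float, low: float = 0.50, high: float = 1.50) -> bool:
    """Apply the 0.50-1.50 acceptance range quoted in the text."""
    return low <= mnsq <= high
```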
Principal Component Analysis (PCA) of residuals
A PCA of residuals was conducted to assess for patterns in the data that did not accord with the Rasch requirements, suggesting that groups of items may be forming a secondary dimension. An instrument is considered unidimensional if the raw variance explained by the first factor (i.e. primary dimension) is ≥ 50%, and the unexplained variance in the first contrast (i.e. first component in the correlation matrix of the residuals) has an eigenvalue of < 2 [38].
Differential Item Functioning (DIF)
DIF determines whether item bias exists for sample subgroups (e.g. age group, gender, disability). A DIF contrast of < 1.0 logits and corresponding p-value of < 0.05 is acceptable [20]. DIF was assessed for age group (11–14 years vs 15–17 years), gender (boys vs girls) and disability (yes vs no).
Validity and Reliability Assessments of the HRQoL Item Bank
Known group validity
Known group validity (the extent to which the item bank could discriminate between groups known to be different) was assessed by demonstrating that respondents with different ratings of self-reported health, disability and affluence levels (measured by FAS) have significantly different HRQoL scores.
Construct validity
The item separation index (ISD) and item reliability (IR) coefficient were used to verify the validity of the item hierarchy in the item bank. An ISD value of > 3.00 and an IR of > 0.9 imply that the sample is large enough to establish a reproducible item calibration hierarchy. These statistics inform the construct validity (the extent to which an instrument measures what it purports to measure) of the item bank [41].
Statistical analysis
Rasch analysis was performed on the 75-item pool (with anchored CHU9D items) and on each of the 9 dimensions separately using Winsteps version 4.4.3 software (Chicago, USA), applying the Andrich rating scale model per question format (i.e. common item stem and response categories) [42].
Firstly, the 75 items in the pool were classified into 38 groups based on whether they shared a common rating scale (i.e. the same preceding statement and categories). Secondly, the response polarity was reversed for all 10 KIDSCREEN items to make them consistent with the other items, such that a higher response score meant worse HRQoL. Finally, the pooled data for the 75 items were subjected to a grouped Rasch analysis with 1 Andrich rating scale per question format (i.e. 38 rating scales) [43]. Rasch analysis was also used to assess the psychometric properties of the item pools without items from the 2 non-preference-based instruments (PedsQL and KIDSCREEN).
Descriptive data were analysed using STATA Version 15.1 (Stata Corp LLC, Texas, USA) [44]. To compare median HRQoL item bank scores, the Wilcoxon rank-sum test was used for comparisons between two groups (gender, age group and disability) and the Kruskal-Wallis test was used for comparisons between multiple groups (self-reported health and affluence levels). Dunn’s test was carried out following the Kruskal-Wallis test for multiple pairwise comparisons between the groups [45]. All statistical tests used a 2-sided significance level (alpha) of 0.05.
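The polarity reversal in the second step can be illustrated with a minimal sketch. It assumes each reversed item is coded on consecutive integers from 1 to 5; the actual coding used in the study data is not given in the text, and the function name is hypothetical.

```python
def reverse_score(response: int, minimum: int = 1, maximum: int = 5) -> int:
    """Flip response polarity so the lowest category becomes the highest."""
    if not minimum <= response <= maximum:
        raise ValueError("response outside the rating scale range")
    # Mirror the response around the midpoint of the scale
    return minimum + maximum - response
```

With the assumed 1 to 5 coding, a response of 1 becomes 5 and the midpoint (3) is unchanged, so the relative ordering of respondents on the item is simply inverted.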
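To make the multi-group comparison concrete, the Kruskal-Wallis H statistic can be computed from pooled ranks as sketched below. This is an illustrative pure-Python version with the usual tie correction; it is not the Stata routine used in the study, the function names are hypothetical, and it returns only the statistic (the p-value would come from a chi-square distribution with one fewer degrees of freedom than the number of groups).

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of this block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of scores."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction
```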