Exploring the potential for item banking in assessing quality of life for evaluating adolescent health interventions

To develop and validate an item bank using existing items from both preference and non-preference-based health-related quality of life (HRQoL) instruments targeted for application with adolescent populations.


Background
The use of patient-reported outcome measures (PROMs) as central components of quality assessment and as endpoints in the evaluation of health-specific interventions and products has been endorsed by regulatory, pricing and reimbursement authorities in several countries [1][2][3]. As a result, a plethora of PROMs has been developed and utilised to assess health-related quality of life (HRQoL) across the health, social care and public health sectors [4][5][6]. In the context of evaluating adolescent health programmes, several PROMs have gained prominence in recent years [4,7]. These PROMs can be broadly divided into preference-based and non-preference-based instruments, according to whether they produce (1) a single preference-based health state utility index or (2) a simple summative (unweighted) score and/or a profile of individual dimension scores, respectively [8]. Preference-based PROMs can be utilised to generate quality-adjusted life year (QALY) estimates that are useful for assessing clinical effectiveness and form an integral component of cost-utility analysis. The relative preferences/weights for health states defined by the respective PROM's descriptive system are typically obtained from large samples of the general public [9].
Generally, there is a lack of consensus regarding the selection of appropriate PROMs in specific contexts because a PROM developed for a specific population may not be directly applicable to other populations [10][11][12]. A relatively new approach called item banking has emerged to address many shortcomings of the existing PROMs [13]. An item bank is a large collection of calibrated items that measure different HRQoL dimensions. Whilst the application of item banks to health economics, and to preference-based HRQoL instruments in particular, is new [14,15], item banks have previously been created and are gaining prominence in health services research through the adoption of modern psychometric methods such as Rasch analysis and computer-based adaptive testing (CAT) [12,14,16,17]. Studies have shown that item banking may significantly improve the accuracy of PROM measurement, thereby reducing the sample sizes required to detect meaningful differences in HRQoL between groups in clinical trials and subsequently reducing the cost of large-scale clinical studies [18,19]. Item banks have been developed in a range of health fields, for example the PROMIS item banks and the Eye-tem Bank [13,20,21].
However, in health economics, which relies heavily on PROMs data for economic evaluations, the application of item banking is relatively novel. The approach has been applied recently in adult populations by the PROMIS group in the US [15,22].
Our group carried out a series of mapping studies for predicting preference-based CHU9D estimates from preference-based instruments (Assessment of Quality of Life (AQoL)-6D Adolescent version, AQoL-6D; Health Utilities Index (HUI) Mark 2 and HUI Mark 3; EQ-5D-Youth, EQ-5D-Y) [23,24] and non-preference-based instruments (KIDSCREEN-10 and Pediatric Quality of Life Inventory-Short Form, PedsQL SF15) [25]. For the mapping project, data were collected from over four thousand healthy adolescents from the community in Australia. Building on this database, the aim was to investigate the feasibility of developing an adolescent-specific item bank by pooling data from the existing PROMs for subsequent application in health economic evaluations. The study described in this manuscript is the first phase of a multi-phased programme of research which has an overarching aim to validate the HRQoL item bank in adolescents living in the community with and without health condition/s and to develop preference-based utility scores for the relevant item bank dimensions.

Methods
A series of web-based surveys were developed for administration to Australian adolescents (aged 11-17 years) living in the community through Pureprofile (an online survey panel company) following informed dyad (parent and adolescent) consent to participate. All consenting adolescents were asked to complete the CHU9D and one of the other five PROMs (AQoL-6D, HUI 2&3, EQ-5D-Y, KIDSCREEN-10 and PedsQL SF15). In addition, each respondent completed self-reported questions on general health, disability status and socio-demographic characteristics including the Family Affluence Scale (FAS, a 4-item adolescent-specific measure of family socio-economic status; the total score ranges from 0 to 7 and was categorised into three groups: 0-3 low, 4-5 medium and 6-7 high) [26]. More details on the web-based surveys have been published elsewhere [23,24].
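The FAS grouping described above can be written as a small lookup. The following is a minimal sketch in Python; the function name `fas_group` is our own illustrative choice and is not part of the survey software.

```python
def fas_group(total_score: int) -> str:
    """Map a Family Affluence Scale total (0-7) to the three groups used in this study."""
    if not 0 <= total_score <= 7:
        raise ValueError("FAS total score must lie between 0 and 7")
    if total_score <= 3:
        return "low"      # scores 0-3
    if total_score <= 5:
        return "medium"   # scores 4-5
    return "high"         # scores 6-7
```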

Generic Preference-based Instruments
The CHU9D is a generic preference-based measure of HRQoL specifically developed with and validated for use in paediatric populations aged 7-17 years [27,28]. The CHU9D has nine dimensions (worried, sad, pain, tired, annoyed, schoolwork/homework, sleep, daily routine, and activities), each represented by a single item with five levels of severity response categories.
The AQoL-6D Adolescent version is a generic preference-based utility instrument originally developed for application in adults but later adapted for use in adolescents [29]. The instrument has 20 items distributed across six dimensions (independent, mental health, coping, relationships, pain, and senses) with four to six levels of severity response categories.
The HUI 2/3 are generic preference-based utility instruments to measure health status and HRQoL [30]. The HUI2 was initially developed in the early 1990s for measuring and valuing long-term outcomes in patients with childhood cancer. The HUI3 represents a further development for use in both clinical and general populations. These two instruments are independent but complementary systems. The HUI2 has seven dimensions of HRQoL (sensation, mobility, emotion, cognition, self-care, pain and fertility) assessed on 4 or 5 levels and the HUI3 has eight dimensions (vision, hearing, speech, ambulation, dexterity, emotion, cognition and pain) on 6 distinct levels of response categories. Together, the 15-item HUI 2/3 instruments were included excluding the item related to fertility (not relevant to our study population).
The EQ-5D-Y is a pediatric adaptation of the widely applied generic instrument (EQ-5D-3L) originally developed for adults [31]. The EQ-5D-Y was developed by revising the content and wording of the adult version to ensure its relevance and clarity for children and adolescents. The EQ-5D-Y contains five dimensions (mobility, looking after myself, usual activities, pain or discomfort, and feeling worried, sad or unhappy) assessed on 3 levels of severity response categories.

Generic Non-preference-based Instruments
The KIDSCREEN-10 was developed to assess pediatric health and well-being and is specifically relevant for those aged 8 to 18 years [32]. It was simultaneously developed across 13 European countries. The original instrument was a 52-item questionnaire covering 10 dimensions of HRQoL [33]. A shorter 27-item version was derived from the 52-item instrument, and subsequently a 10-item instrument was derived, which was used in this study.
The PedsQL SF15 was developed for use in paediatric health research [34]. It has four dimensions and consists of 15 items which measure physical (5 items), emotional (4 items), social (3 items) and school functioning (3 items), rated on a 5-point Likert scale. In this study, the adolescent self-report version was used.

Item Banking
A total of 77 items from the 6 instruments were available and initially considered for creating an item bank. Two items on global health (KIDSCREEN-10 item-11, "In general, how would you say your health is?" and EQ-VAS, "How good is your health today?") were not considered for further analysis as they did not measure a specific HRQoL domain, leaving 75 items. After pooling raw response data from the 6 instruments, two approaches were taken:
1. Overall item bank: All 75 items were pooled together to assess whether they could be calibrated on a single continuum scale to form a valid item bank. As all the respondents completed the CHU9D, it was used as the anchoring instrument to link data from the other 5 instruments. For this, a separate Rasch analysis (described below in detail) was conducted on the CHU9D data only and item parameter values were obtained for its 9 items, which were then anchored to build a 75-item item bank.
2. Dimension-specific item banks: The 75 items were classified and binned across 9 dimensions of HRQoL by the authors separately. The separate classifications were reviewed by the authors together and reconciled into one, with any discrepancies resolved through group discussion. The final dimensions were physical activities (13 items), mobility (6 items), mental health (17 items), social relationship (8 items), pain (7 items), fatigue & memory (5 items), coping (5 items), sensory (9 items) and school activities (5 items); see Supplementary Table 1.
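The pooling design, in which every respondent answers the CHU9D plus one other instrument, produces a sparse person-by-item matrix whose only fully observed columns are the anchor items. The sketch below illustrates that data layout only; it is not the calibration itself, and the function names and item labels (such as `CHU9D_1`) are hypothetical.

```python
from typing import Dict, List, Optional


def pool_responses(
    records: List[Dict[str, int]], all_items: List[str]
) -> List[List[Optional[int]]]:
    """Build a person-by-item matrix; items a respondent never saw stay None."""
    return [[record.get(item) for item in all_items] for record in records]


def common_items(
    matrix: List[List[Optional[int]]], all_items: List[str]
) -> List[str]:
    """Return the items answered by every respondent (candidate linking anchors)."""
    return [
        item
        for col, item in enumerate(all_items)
        if all(row[col] is not None for row in matrix)
    ]
```

With one respondent who completed the CHU9D and EQ-5D-Y and another who completed the CHU9D and HUI, only the CHU9D columns are returned by `common_items`, which is why that instrument can link the others onto one scale.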

Psychometric Properties Assessment Using Rasch Analysis
Rasch analysis is a probabilistic mathematical model which assumes that the probability of a given respondent affirming an item is a logistic function of the relative distance between the item's location (i.e. item parameter) and the respondent's health status [35]. By assessing these probabilities across the items, Rasch analysis estimates the health status of individual respondents (person parameters) and the values of the health states represented by items (weighted item parameters) on the same latent scale, expressed in log-odds units (logits, an interval-level measure). Rasch analysis is a widely accepted methodology to develop and validate PROMs including item banks [13,36]. An advantage of Rasch analysis is that it also provides fine-grained insights into whether an instrument could form a valid scale, based on its assessment of a series of psychometric properties [37]. The following Rasch-based psychometric properties were assessed for the item bank [20,38].
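To make the logistic relationship concrete, the sketch below computes category probabilities under an Andrich rating scale model, the polytomous form of the Rasch model. It is an illustration under our own naming (`rsm_category_probs`, `theta`, `delta`, `thresholds`) and is not the Winsteps implementation.

```python
import math
from typing import List


def rsm_category_probs(theta: float, delta: float, thresholds: List[float]) -> List[float]:
    """Category probabilities under the Andrich rating scale model.

    theta: person measure in logits; delta: item location in logits;
    thresholds: Andrich thresholds tau_1..tau_m shared by items on one rating scale.
    """
    # Log-numerator for category k is the sum over j <= k of (theta - delta - tau_j);
    # category 0 has an empty sum, i.e. 0.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - delta - tau))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]
```

With a single threshold at 0 the model reduces to the dichotomous case, where the probability of affirming the item is exp(theta - delta) / (1 + exp(theta - delta)).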

Measurement precision
Measurement precision is the ability of an instrument to discriminate between respondents with differing levels of the underlying construct. Precision is indicated by the person separation index (PSI) and person reliability (PR) coefficients, where values of ≥ 2.0 and ≥ 0.8, respectively, are considered the minimally acceptable levels at which the instrument can differentiate people into three strata of the underlying construct. PR is also equivalent to Cronbach's alpha (a traditional test of reliability); therefore a PR ≥ 0.8 indicates acceptable internal consistency of the instrument [39].
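The two thresholds and the "three strata" statement are linked by commonly used Rasch relationships between separation (G), reliability (R) and strata: R = G² / (1 + G²) and strata = (4G + 1) / 3. The sketch below, with function names of our own choosing, shows that a separation of 2.0 corresponds to a reliability of 0.8 and three strata.

```python
import math


def reliability_from_separation(separation: float) -> float:
    """Reliability implied by a separation index: R = G^2 / (1 + G^2)."""
    return separation**2 / (1 + separation**2)


def separation_from_reliability(reliability: float) -> float:
    """Separation implied by a reliability coefficient: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))


def strata(separation: float) -> float:
    """Number of statistically distinct strata: (4G + 1) / 3."""
    return (4 * separation + 1) / 3
```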

Targeting
Ideally, there should be a good spread of items across the full range of respondents' scores. When respondents have a higher (ceiling) or lower (floor) level of the construct than most of the items in the instrument, the range of construct coverage by the instrument may not be adequate. This leads to poor targeting. Targeting is estimated by determining the difference between the item mean (defined as 0 logits by default) and the mean of the respondents' measures; a difference of < 1.0 logits is desired.
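The targeting check amounts to a difference of two means on the logit scale. A minimal sketch follows, assuming person and item measures are already available as lists of logits; the function names are illustrative.

```python
from statistics import mean
from typing import Sequence


def targeting_gap(person_measures: Sequence[float], item_measures: Sequence[float]) -> float:
    """Mean person measure minus mean item measure, in logits."""
    return mean(person_measures) - mean(item_measures)


def is_well_targeted(
    person_measures: Sequence[float], item_measures: Sequence[float], limit: float = 1.0
) -> bool:
    """True when the absolute gap is below the desired limit of 1.0 logits."""
    return abs(targeting_gap(person_measures, item_measures)) < limit
```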

Unidimensionality
Item fit statistics and principal component analysis (PCA) of residuals were used to assess whether the item bank attained the requirements of unidimensionality.

Item fit statistics
Chi-square fit statistics (mean square, MnSq) were used to assess how well the data fit the Rasch model. Misfitting items may be measuring a construct different from that of the other items in the instrument, indicating multi-dimensionality. There are two item fit statistics, infit and outfit. Infit is more sensitive to the pattern of responses to items closely targeted to the respondents' ability, whereas outfit is more influenced by outliers (respondents with a very high or low level of the construct). A fit statistic value between 0.50 and 1.50 is considered acceptable and productive for measurement [40].
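The difference between infit and outfit is easiest to see in code. The sketch below computes both mean squares for a single item under a simplifying assumption of dichotomous (0/1) responses, whereas the study itself used polytomous rating scales; all names are illustrative.

```python
import math
from typing import Sequence, Tuple


def rasch_prob(theta: float, delta: float) -> float:
    """Probability of affirming a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_fit(responses: Sequence[int], thetas: Sequence[float], delta: float) -> Tuple[float, float]:
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals, variances, squared_z = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, delta)
        variance = p * (1 - p)
        squared_residuals.append((x - p) ** 2)
        variances.append(variance)
        squared_z.append((x - p) ** 2 / variance)
    # Infit is information-weighted, so well-targeted respondents dominate.
    infit = sum(squared_residuals) / sum(variances)
    # Outfit is an unweighted mean, so unexpected responses from outliers dominate.
    outfit = sum(squared_z) / len(squared_z)
    return infit, outfit


def fit_acceptable(mnsq: float, low: float = 0.5, high: float = 1.5) -> bool:
    """Apply the 0.50-1.50 acceptance range."""
    return low <= mnsq <= high
```

A single unexpected response from a respondent located far from the item inflates outfit much more than infit, which is the behaviour described in the text.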

Principal Component Analysis (PCA) of residuals
A PCA of residuals was conducted to assess for patterns in the data that did not accord with the Rasch requirements, which would suggest that groups of items may be forming a secondary dimension. An instrument is considered unidimensional if the raw variance explained by the first factor (i.e. primary dimension) is ≥ 50% and the unexplained variance in the first contrast (i.e. first component in the correlation matrix of the residuals) is < 2 eigenvalue units [38].
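The first-contrast criterion can be sketched as the largest eigenvalue of the correlation matrix of standardised residuals. The example below uses plain Python and power iteration rather than a statistics package; it assumes every item column has non-zero variance, and the function names are our own.

```python
import math
from typing import List, Sequence


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pearson correlations between item columns of standardised residuals."""
    def corr(a: Sequence[float], b: Sequence[float]) -> float:
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        return cov / math.sqrt(var_a * var_b)
    return [[corr(a, b) for b in columns] for a in columns]


def largest_eigenvalue(matrix: Sequence[Sequence[float]], iterations: int = 500) -> float:
    """Largest eigenvalue of a symmetric positive semi-definite matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # deliberately uneven starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(vec, product))  # Rayleigh quotient


def first_contrast_ok(eigenvalue: float, limit: float = 2.0) -> bool:
    """Apply the criterion that the first contrast is below 2 eigenvalue units."""
    return eigenvalue < limit
```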

Differential Item Functioning (DIF)
DIF analysis determines whether item bias exists for sample subgroups (e.g. age group, gender, disability). A DIF contrast of < 1.0 logits is considered acceptable; a contrast of ≥ 1.0 logits with a corresponding p-value of < 0.05 indicates notable DIF [20]. DIF was assessed for age group (11-14 years vs 15-17 years), gender (boys vs girls) and disability (yes vs no).
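A DIF contrast is the difference between an item's location estimated separately in two subgroups. The sketch below is illustrative only: it assumes the subgroup locations and standard errors are already estimated, uses a large-sample normal approximation for the p-value (the software's own test may differ), and reads the criterion as flagging items whose contrast is at least 1.0 logits and statistically significant.

```python
import math
from statistics import NormalDist
from typing import Tuple


def dif_contrast(delta_a: float, se_a: float, delta_b: float, se_b: float) -> Tuple[float, float]:
    """Return (contrast in logits, two-sided p-value) for one item across two groups."""
    contrast = delta_a - delta_b
    z = contrast / math.sqrt(se_a**2 + se_b**2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # normal approximation (assumption)
    return contrast, p_value


def has_notable_dif(
    contrast: float, p_value: float, contrast_limit: float = 1.0, alpha: float = 0.05
) -> bool:
    """Flag an item only when the contrast is both large and statistically significant."""
    return abs(contrast) >= contrast_limit and p_value < alpha
```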

Validity And Reliability Assessments Of The HRQoL Item Bank
Known group validity
Known group validity (the extent to which the item bank could discriminate between groups known to be different) was assessed by testing whether respondents with different ratings of self-reported health, disability and affluence levels (measured by FAS) had significantly different HRQoL scores.

Construct validity
The item separation index (ISD) and item reliability (IR) coefficient were used to verify the validity of the item hierarchy in the item bank. An ISD value of > 3.00 and an IR of > 0.9 imply that the sample is large enough to establish a reproducible item calibration hierarchy. These statistics inform the construct validity (the extent to which an instrument measures what it purports to measure) of the item bank [41].

Statistical analysis
Rasch analysis was performed on the 75-item pool (with anchored CHU9D items) and on each of the 9 dimensions separately using Winsteps version 4.4.3 software (Chicago, USA), applying an Andrich rating scale model per question format (i.e. common item stem and response categories) [42].
Firstly, the 75 items in the pool were classified into 38 groups based on whether they shared a common rating scale (i.e. the same preceding statement and categories). Secondly, the response polarity was reversed for all 10 KIDSCREEN items to make them consistent with the other items, such that a higher response score meant worse HRQoL. Finally, the pooled data for the 75 items were subjected to a grouped Rasch analysis with one Andrich rating scale per question format (i.e. 38 rating scales) [43]. Rasch analysis was also used to assess the psychometric properties of the item pool without the items from the 2 non-preference-based instruments (PedsQL and KIDSCREEN).
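Reversing response polarity is a simple reflection of the response scale. The sketch below assumes responses coded 1 to 5, which is our assumption about the coding and should be adjusted to the actual data; the function name is illustrative.

```python
def reverse_score(response: int, lowest: int = 1, highest: int = 5) -> int:
    """Reflect a response so that the lowest category becomes the highest and vice versa."""
    if not lowest <= response <= highest:
        raise ValueError("response lies outside the rating scale")
    return lowest + highest - response
```

Applied to every KIDSCREEN item, this makes a higher score mean worse HRQoL, in line with the other instruments.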
To compare median HRQoL item bank scores, the Wilcoxon rank-sum test was used for comparisons between two groups (gender, age group and disability) and the Kruskal-Wallis test was used for comparisons between multiple groups (self-reported health and affluence levels). Dunn's test was carried out following the Kruskal-Wallis test for multiple pairwise comparisons between the groups [45]. All statistical tests used a 2-sided significance level (alpha) of 0.05.
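For readers unfamiliar with the two-group comparison, the sketch below implements the Wilcoxon rank-sum test with a normal approximation in plain Python. It is a simplified illustration (average ranks for ties, no tie correction to the variance) and is not the software used for the reported analyses; the Kruskal-Wallis and Dunn's tests are not covered.

```python
import math
from statistics import NormalDist
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(group_a: Sequence[float], group_b: Sequence[float]) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the Wilcoxon rank-sum test."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```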

Results
The study cohort included 4,352 Australian adolescents from the community. After exclusion of 246 (5.7%) respondents who demonstrated very high ceiling effects (i.e. they selected only the highest category, "no problem/no issue", for all items), there were 4,106 eligible individuals. The mean age of the respondents was 14.7 (± 1.88) years and there were approximately equal numbers of males and females. The majority of the respondents self-reported no disability (87.6%), had excellent or very good health ratings (72.1%) and were from moderate to high family affluence backgrounds (Table 1). Typical of a general population response, 58 out of 75 items in the item pool demonstrated a ceiling effect (> 15% of respondents reported no problems/issues).
[...] Fig. 1a), adolescents aged 16-17 years (z = -4.74, p < 0.001) and people with self-reported disability (z = 14.23, p < 0.001, Fig. 1b) reported worse QoL. Poorer self-reported health ratings were associated with poorer QoL scores (Chi-squared = 1390.2, df = 4, p < 0.001, Fig. 1c), with significant differences within and between all the groups (p < 0.001 between all the groups except between health ratings 3 and 5, p = 0.01).
Similarly, HRQoL scores were significantly better for the high (Chi-squared = 3.05, 2 d.f., P = 0.001) and medium (Chi-squared = 8.58, P < 0.001) socio-economic status groups (as approximated by FAS) than for those in the low affluence group (Fig. 1d). When compared between groups, there was a significant difference in HRQoL between the low and medium (Chi-squared = 8.62, P < 0.001) and the low and high affluence groups (Chi-squared = 3.07, P = 0.001). However, the high and medium affluence groups (Chi-squared = 0.72, P = 0.24) did not have significantly different HRQoL scores. These results also demonstrate known-group validity of the 73-item HRQoL item bank.

Rasch Analysis Of The Dimensions
As expected a priori, none of the 9 dimensions demonstrated adequate measurement precision (all had a PSI < 2.00 or PR coefficient < 0.80), indicating that they lacked enough sensitivity to form a standalone dimension; hence no further analysis was carried out.

Discussion
The main objective of this study was to demonstrate "proof of concept" of item banking in the development of adolescent-specific HRQoL instruments for subsequent application in health economic evaluations. This was achieved by pooling data from six PROMs suitable for application in adolescent populations: four (CHU9D, EQ-5D-Y, HUI2/3, AQoL-6D) are preference-based and two (PedsQL SF15 and KIDSCREEN-10) are non-preference-based but have established mapping algorithms which facilitate their application in economic evaluation [46,47]. By utilising Rasch analysis and linking pooled PROMs data, an item bank was constructed that contained a large volume of items calibrated on a single continuum scale of HRQoL.
Although the item bank contained items representing different QoL dimensions, the 73-item bank met all the Rasch-based psychometric requirements to qualify as a unidimensional scale. The PCA also demonstrated that four items referring to sports activities clustered together, suggesting that these items might form a secondary dimension. However, the removal of the four items reduced the precision of the item bank, which suggests that these items were adding more signal than noise; hence they were retained. Similar approaches were utilised to tackle item clustering and multi-dimensionality while developing item banks in other health fields [20,48]. We found that when a range of items representing different HRQoL dimensions were pooled together, a valid unidimensional latent scale was identified. This unidimensional scale represents a latent concept of HRQoL. This finding may look puzzling at first, but it is analogous to a mathematics test comprising different components (e.g. word problems, algebra, geometry and calculus) which are conceptually different but all contribute to the measurement of students' overall performance in mathematics.
One issue with the item bank was inadequate coverage towards the higher end of HRQoL. This is likely to be a by-product of the study population itself, which was drawn from the Australian general population. The majority of the respondents had good health, no disability and were from relatively affluent family backgrounds (Table 1); hence they were expected a priori to have relatively good HRQoL.
However, whilst the item bank may not differentiate well between adolescents who have a high HRQoL status, it may be well matched to those with low to moderate HRQoL. This is likely because the items were derived from generic instruments developed to assess HRQoL impact in populations with more health-related quality of life impairments (e.g. patient samples) rather than in healthy populations. Several disease-specific PROMs have demonstrated poor targeting when used on less affected or healthy individuals [49][50][51]. Similarly, none of the HRQoL dimensions formed a valid scale, owing to an inadequate number of items and/or inadequate content coverage for this largely healthy population.
The next phase of this programme of research will test the validity of these dimensions in adolescent patients with more regular and on-going engagement with health services due to the presence of chronic health conditions. Once valid dimensions are identified, we will take a similar methodological approach to that pursued by the PROMIS group to develop the PROMIS-Preference PROPr for application in adults (≥ 18 years), to produce adolescent-specific preference scores (or utility weights) for these HRQoL dimensions [22].
A key advantage of Rasch analysis is that it also assesses whether respondents who have approximately similar HRQoL status across different demographic groups perform in a similar way on each individual item, a test of item invariance. None of the 73 items had significant DIF, indicating that the item bank and its items were invariant across key socio-demographic differences. This is a key psychometric property for the item bank, which implies that items in the item bank work in similar ways for the different groups to be compared [52]. The absence of DIF also confirms that it is valid to compare QoL scores across different demographic groups. Further, the item bank was able to discriminate between adolescents with different health ratings, disability and affluence levels, demonstrating its validity. That is, better self-reported health ratings, no disability and higher affluence were associated with better QoL scores, and vice versa.
A clear advantage of an item bank is that it contains a large collection of calibrated items to provide a comprehensive measure of HRQoL suitable for a wide range of people. However, following its initial construction and due to its length, an item bank typically requires a CAT system to administer it. A computer-enabled "adaptive test" presents items that are more accurately targeted to the individual's status based upon their previous responses. This process provides highly accurate measurement, and the test can be continued until a desired level of measurement precision is achieved [49]. By tailoring the test to the individual, the problem of poor targeting which we observed in our item bank can be eliminated.
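The two core decisions in an adaptive test, which item to present next and when to stop, can be sketched as follows. The example assumes a dichotomous Rasch model for simplicity and an item bank stored as a mapping from hypothetical item names to calibrated locations in logits; it omits the re-estimation of the person measure after each response and is not a description of any existing CAT system.

```python
import math
from typing import Dict, Optional, Set


def rasch_prob(theta: float, delta: float) -> float:
    """Probability of affirming a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_information(theta: float, delta: float) -> float:
    """Fisher information of a dichotomous Rasch item at the current person estimate."""
    p = rasch_prob(theta, delta)
    return p * (1 - p)


def next_item(theta: float, bank: Dict[str, float], administered: Set[str]) -> Optional[str]:
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [name for name in bank if name not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda name: item_information(theta, bank[name]))


def standard_error(theta: float, bank: Dict[str, float], administered: Set[str]) -> float:
    """Standard error of the person estimate given the items administered so far."""
    information = sum(item_information(theta, bank[name]) for name in administered)
    return math.inf if information == 0 else 1 / math.sqrt(information)
```

In a full loop, the test would alternate between `next_item`, recording the response, updating `theta`, and checking `standard_error` against a chosen precision target.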
The CAT system not only streamlines the administration of item banks but also helps to expand them, by adding uncalibrated items to the item bank and determining their calibration with Rasch analysis against the calibrated items already in the bank [49]. The CAT also creates an opportunity for electronic implementation via digital portals, including smartphones, for real-time scoring and recording of data. Such systems have been widely developed across different health fields [20,48,51,53]. This study adds to the previous study reported by the PROMIS group in adult populations by providing early evidence that item banking is feasible in the development of HRQoL instruments that may subsequently be applied in a health economics context [14,22].
Leading from this feasibility study, the next steps are to test the validity of the item bank in adolescent patient groups who have regular engagement with health services due to the presence of one or more chronic health condition/s, generate adolescent-specific scoring algorithms for utility assessment and test the feasibility of a CAT to elicit tailored utility scores in real time for health economic evaluations.
Studies have shown that item banks implemented via CAT need fewer items to obtain superior precision and sensitivity compared to the traditional full-length paper-pencil PROMs, minimising respondent burden [53][54][55]. This is an important consideration for enhancing participation and completion rates in adolescent populations.
There are some limitations to this study. Unlike other item banks that utilised a common question format and response categories across all the items to improve measurement accuracy [11,56], the items in our item bank retained their original question formats and response scales, which might have introduced noise into the measurement. Further, the data were obtained from a web-based survey, which raises questions about data quality and whether the respondents provided accurate information. However, appropriate data checks (including data completeness, time taken to complete, and identification of respondents who selected perfect responses (flatliners)) were used to deal with this limitation [46].
In conclusion, an item bank has been developed by pooling data from six PROMs. The item bank has demonstrated adequate Rasch-based psychometric properties, demonstrating the feasibility of constructing an item bank in the field of health economics and for the development of instruments suitable for quality of life assessment with adolescents for economic evaluation. The addition of more targeted items for adolescents in the general population or the addition of respondents with more health impairments may improve the applicability and generalisability of the item bank for general population and patient cohorts. Generating adolescent-specific scoring algorithms and utilities and developing a CAT system to administer the item bank are the natural next steps.
