Development of the QLICD-PU
The QLICD-PU consists of a general module (QLICD-GM) and a module specific to PUD. The development of the QLICD-GM has been described in another paper [20]; here we briefly summarize the development steps and results. Items were selected through a programmed procedure comprising focus group discussions, in-depth interviews, pre-testing, and four quantitative statistical analyses. The final QLICD-GM has 30 items covering 3 domains and 10 facets. It has shown good psychometric properties (reliability, validity, and responsiveness) in an analysis of data from 620 patients with seven kinds of chronic diseases, such as coronary heart disease and hypertension [20].
For the specific module, 29 items reflecting the symptoms and side effects of PUD were selected from literature reviews and nominal/focus group discussions to form the initial item pool. The focus group members rated the importance of each item by ranking the items independently and then discussed the 9 lowest-ranked items, which were excluded. The remaining 20 items formed a preliminary questionnaire, which was used for the pilot test and for interviews with 29 PUD patients and 14 experienced clinicians and researchers. We focused on patient opinion, which is most important for assessing the acceptability of interventions and related compliance. Based on the pilot data, the items were re-screened using a development process similar to that of the general module (statistical procedures and focus group discussion). The final specific module consists of 14 items coded PU1-PU14 (see Table 1 for details), which can be classified into 6 facets.
Validation of the QLICD-PU
Data Collection and Scoring
In this study, we enrolled participants with PUD at any stage who were able to (1) provide written informed consent and (2) read and write, with assistance if necessary. There were no protocol requirements regarding specific clinical treatment; physicians treated the patients as they deemed clinically appropriate.
The survey was carried out at the First Affiliated Hospital of Kunming Medical University after approval by the ethics committee of the university. Each participant was asked to complete the questionnaire upon admission. Researchers, including doctors and medical graduate students, explained the purpose of the study and obtained informed consent before the assessment. Participation was voluntary, and all respondents provided written consent.
To assess test-retest reliability, a randomly selected subsample completed a second assessment on the second day of hospitalization. To assess the responsiveness of the questionnaire, all patients available at the scheduled third assessment completed it again at discharge.
In addition, because there is no agreed-upon gold standard for PUD, the Chinese version of the SF-36 [24] was administered to provide data for assessing the criterion-related validity, as well as the convergent and discriminant validity, of the QLICD-PU. Baseline socio-demographic and clinical characteristics, including age, gender, education level, marital status, clinical history, and treatment, were taken from hospital medical records. Each investigator checked the answers immediately to ensure that they were complete.
Each item uses a five-point Likert format (not at all, a little bit, somewhat, quite a bit, and very much). Positively stated items are scored directly from 1 to 5, while negatively stated items are reverse scored. Domain, facet, and overall scores are obtained by summing the scores of the relevant items, and all are linearly converted to standardized scores on a scale of 0-100. For both raw and standardized scores, a higher QLICD-PU score indicates better quality of life.
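The scoring rule can be illustrated with a minimal Python sketch. It assumes the usual linear conversion, standardized score = (raw score - minimum possible) × 100 / (maximum possible - minimum possible); the text states only that the conversion is linear, so this formula, the function names, and the example item positions are illustrative assumptions rather than the published scoring manual.

```python
import numpy as np

def score_items(responses, negative_items, low=1, high=5):
    """Reverse-score negatively stated items so that higher always means better.

    responses: (n_patients, n_items) array of Likert answers coded low..high.
    negative_items: column indices of negatively stated items (hypothetical here).
    """
    scored = np.asarray(responses, dtype=float).copy()
    neg = np.asarray(negative_items, dtype=int)
    # A reverse-scored answer x becomes (low + high) - x, e.g. 5 -> 1 on a 1..5 scale.
    scored[:, neg] = (low + high) - scored[:, neg]
    return scored

def standardized_score(scored, low=1, high=5):
    """Sum item scores and rescale linearly to 0-100 (assumed conversion)."""
    scored = np.asarray(scored, dtype=float)
    n_items = scored.shape[1]
    raw = scored.sum(axis=1)
    raw_min, raw_max = n_items * low, n_items * high
    return (raw - raw_min) * 100.0 / (raw_max - raw_min)
```

For a facet of three items in which the second item is negatively stated, `standardized_score(score_items(answers, [1]))` returns one 0-100 score per patient; the same two functions apply to domain and overall scores by passing the corresponding item columns.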
Psychometric Analysis
The validity, reliability, and responsiveness of the QLICD-PU were evaluated in this study. Construct validity was evaluated by the Pearson correlation coefficients (r) between items and domains and by factor analysis, while criterion-related validity was assessed by correlating the corresponding domains of the QLICD-PU and the SF-36. Multi-trait scaling analysis [25] was used to test convergent and discriminant validity, with two criteria: (1) convergent validity is supported when the correlation between an item and its own domain is 0.40 or higher; (2) discriminant validity is supported when the correlation between an item and its own domain is higher than its correlations with the other domains.
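The two multi-trait scaling criteria can be sketched in Python as follows. This is an illustration only: the function names and the `domains` mapping are hypothetical, and the sketch assumes that an item is removed from its own domain score before correlating (correction for overlap), a common convention that the text does not specify.

```python
import numpy as np

def item_domain_correlations(items, domains):
    """Pearson r between every item and every domain score.

    items: (n_patients, n_items) array of scored items.
    domains: dict mapping a domain name to its item column indices.
    Assumption: an item is dropped from its own domain sum (overlap correction).
    """
    items = np.asarray(items, dtype=float)
    names = list(domains)
    n_items = items.shape[1]
    corr = np.zeros((n_items, len(names)))
    for d, name in enumerate(names):
        for j in range(n_items):
            cols = [c for c in domains[name] if c != j]
            domain_score = items[:, cols].sum(axis=1)
            corr[j, d] = np.corrcoef(items[:, j], domain_score)[0, 1]
    return names, corr

def check_scaling(names, corr, domains, threshold=0.40):
    """Apply the convergent (r >= 0.40) and discriminant criteria to each item."""
    results = {}
    for d, name in enumerate(names):
        for j in domains[name]:
            own = corr[j, d]
            others = np.delete(corr[j], d)  # correlations with the other domains
            results[j] = {
                "domain": name,
                "convergent": bool(own >= threshold),
                "discriminant": bool(np.all(own > others)),
            }
    return results
```

Each entry of the result reports whether an item meets the 0.40 threshold with its own domain and whether that correlation exceeds its correlations with all other domains.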
For reliability, internal consistency was assessed for each domain/facet and the overall scale by Cronbach's alpha coefficient, using the data from the first measurement (at admission), which had the largest sample size. Test-retest reliability was evaluated by the Pearson correlation coefficient and the intra-class correlation coefficient (ICC) [26-27] between the first and second assessments. Responsiveness (sensitivity to detect change) was assessed by using a paired t-test to compare the mean scores before and after treatment, and by an effect size, the standardized response mean (SRM) [28-29].
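As a minimal sketch of two of these indices, the Python functions below compute Cronbach's alpha from an item-score matrix and the paired t statistic and SRM from scores before and after treatment. The function names are hypothetical, and the SRM is taken as the mean change divided by the standard deviation of the change, which is the usual definition; the ICC is left out because the text does not state which ICC form was used.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_patients, n_items) matrix of scored items."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)      # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # variance of the summed score
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)

def paired_t_and_srm(before, after):
    """Paired t statistic and standardized response mean for two assessments."""
    change = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    mean_change = change.mean()
    sd_change = change.std(ddof=1)
    t_stat = mean_change / (sd_change / np.sqrt(change.size))
    srm = mean_change / sd_change
    return t_stat, srm
```

The t statistic has n - 1 degrees of freedom, where n is the number of patients assessed at both time points; a p-value can be obtained from any t-distribution table or statistics package.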
Generalizability Theory Analysis
In addition to the classical test theory analysis, we applied Generalizability Theory (GT) to study the reliability of QLICD-PU scores. GT is a modern test theory that combines experimental design with analysis of variance. It was proposed as a method for improving the design of measurement procedures in order to obtain reliable data [30-33]. To control measurement error, GT introduces into the measurement model the independent variables or factors that interfere with test scores, such as the objects of measurement, item difficulty, scoring criteria, and the interactions among these factors. Analysis of variance is then used to assess the impact of these factors on test scores, with the variance components as the index. GT comprises a generalizability study (G-study) and a decision study (D-study). The G-study quantifies the amount of variance attributable to each facet (factor) examined, while the D-study indicates which protocol is best for a particular measurement by generating a generalizability (G) coefficient.
In our research, the G-study and D-study were carried out within one measurement model to estimate the variance components and dependability coefficients in a one-facet crossed design (person-by-item design, i.e., p × i design). We defined the patient's quality of life as the object of measurement and the item as the facet of measurement error. Specifically, for the G-study we defined the universe of admissible observations, composed of the object of measurement and the measurement-error facet, and estimated the variance components. For the D-study, we defined the universe of generalization, i.e., the conditions of the measurement facet over which we wished to generalize. The generalizability coefficients, the reliability indices, and the variance components of the facets and their interactions were also calculated.
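For a one-facet crossed p × i design, the variance components can be estimated from the mean squares of a two-way analysis of variance without replication. The Python sketch below follows the standard textbook estimators and is not necessarily the software or procedure used in this study; the function names are hypothetical, and negative variance estimates are truncated to zero, which is a common convention assumed here.

```python
import numpy as np

def g_study(scores):
    """Estimate variance components for a crossed person-by-item (p x i) design.

    scores: (n_persons, n_items) matrix with one observation per cell.
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_i = x.shape
    grand = x.mean()
    ss_p = n_i * np.sum((x.mean(axis=1) - grand) ** 2)  # persons
    ss_i = n_p * np.sum((x.mean(axis=0) - grand) ** 2)  # items
    ss_res = np.sum((x - grand) ** 2) - ss_p - ss_i     # interaction + error
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),  # object of measurement
        "i": max((ms_i - ms_res) / n_p, 0.0),  # item facet
        "pi,e": ms_res,                        # person-by-item interaction and error
    }

def d_study(var, n_items):
    """G coefficient (relative) and dependability index Phi (absolute)."""
    relative_error = var["pi,e"] / n_items
    absolute_error = (var["i"] + var["pi,e"]) / n_items
    g_coef = var["p"] / (var["p"] + relative_error)
    phi = var["p"] / (var["p"] + absolute_error)
    return g_coef, phi
```

Calling `d_study` with different values of `n_items` shows how the coefficients would change if the scale were lengthened or shortened. When `n_items` equals the number of items actually administered and no component is truncated, the G coefficient of this design coincides with Cronbach's alpha.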