Development of the QLICD-PU
The QLICD-PU consists of a general module, the QLICD-GM, and a module specific to PUD. The development of the QLICD-GM has been described in another paper. Here, we briefly summarize the development steps and results. The QLICD-GM was developed with a programmed procedure comprising focus group discussions, in-depth interviews, pre-testing, and four quantitative statistical analyses. The final QLICD-GM has 30 items covering 3 domains and 10 facets. In data from 620 patients with seven chronic diseases, including coronary heart disease and hypertension, the QLICD-GM showed good psychometric properties (reliability, validity, and responsiveness).
For the specific module, 29 items reflecting symptoms were selected to form the initial item pool. These items focus on the side effects and mental health issues particular to PUD. We selected them from literature reviews and from nominal group and focus group discussions. Focus group members rated the importance of each item by ranking the items independently; the 9 lowest-ranked items were then discussed and excluded. The remaining 20 items formed a preliminary questionnaire, which was used for the pilot test and for interviews with 29 PUD patients and 14 clinicians and researchers with extensive experience. We gave priority to patients' opinions, which matter most for assessing the acceptability of interventions and the related compliance. Based on the pilot data, the items were re-screened using a process similar to that used for the general module (statistical procedures and focus group discussion). The final specific module consists of 14 items, coded PU1-PU14 (see Table 1 for details), classified into 6 facets.
Validation of the QLICD-PU
Data Collection and Scoring
In this study, we enrolled participants with PUD at any stage who were (1) able to provide written informed consent and (2) able to read and write, with assistance if needed. The protocol placed no requirements on the clinical treatment of patients; physicians could treat patients as they deemed clinically appropriate.
The survey was carried out at the First Affiliated Hospital of Kunming Medical University after approval by the ethics committee of Kunming Medical University. Investigators, including doctors and medical graduate students, explained the purpose of the study and obtained informed consent before the questionnaire was administered. Each participant completed the questionnaire on admission. To assess test-retest reliability, a randomly selected subsample completed it a second time on the second day of hospitalization. All patients available at the scheduled third assessment, at discharge, completed it again so that the responsiveness of the questionnaire could be assessed. In addition, because there is no agreed-upon gold standard for assessing QOL in PUD, the Chinese version of the SF-36 was administered to provide data for assessing the criterion-related, convergent, and discriminant validity of the QLICD-PU. Baseline socio-demographic characteristics were taken from hospital medical records, including age, gender, education level, marital status, clinical history, and treatment. Investigators checked each completed questionnaire immediately to ensure completeness.
Each item uses a five-point Likert format (not at all, a little bit, somewhat, quite a bit, and very much). Positively stated items are scored directly from 1 to 5, while negatively stated items are reverse-scored. Domain/facet and overall scores are obtained by summing the relevant item scores and are then linearly converted to standardized scores on a scale of 0-100. For both raw and standardized scores, a higher QLICD-PU score indicates a better quality of life.
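The scoring rule above can be illustrated with a minimal Python sketch. It rests on two assumptions that the text does not state explicitly: reverse scoring is taken to be 6 minus the raw response, and the linear conversion is taken to be min-max rescaling, (raw sum - minimum possible sum) × 100 / range. The function names and the choice of which items are negatively worded are hypothetical and serve only as an example.

```python
from typing import Dict, Iterable, Sequence


def reverse_score(raw: int, low: int = 1, high: int = 5) -> int:
    """Reverse a Likert response so that a higher value always means better QOL.

    Assumption: 'opposite score' means low + high - raw (6 - raw on a 1-5 scale).
    """
    if not low <= raw <= high:
        raise ValueError(f"response {raw} is outside {low}-{high}")
    return low + high - raw


def standardized_score(item_scores: Sequence[int], low: int = 1, high: int = 5) -> float:
    """Sum recoded item scores and rescale the sum linearly to 0-100.

    Assumption: the linear conversion is min-max rescaling of the raw sum.
    """
    n_items = len(item_scores)
    if n_items == 0:
        raise ValueError("at least one item score is required")
    raw_sum = sum(item_scores)
    minimum = n_items * low                 # lowest possible raw sum
    score_range = n_items * (high - low)    # highest minus lowest possible sum
    return (raw_sum - minimum) * 100.0 / score_range


def score_facet(responses: Dict[str, int], negative_items: Iterable[str]) -> float:
    """Recode negatively stated items, then return the 0-100 standardized score."""
    negative = set(negative_items)
    recoded = [
        reverse_score(value) if item in negative else value
        for item, value in responses.items()
    ]
    return standardized_score(recoded)


# Hypothetical example: PU1 and PU2 are treated as negatively stated items.
example = score_facet({"PU1": 2, "PU2": 4, "PU3": 5}, negative_items={"PU1", "PU2"})
```

Because the recoding is applied before summation, a higher standardized score means a better quality of life whatever the direction of the item wording.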
The validity, reliability, and responsiveness of the QLICD-PU were then evaluated. Construct validity was evaluated by Pearson correlation coefficients (r) between items and domains and by factor analysis. Criterion-related validity was assessed by correlating the corresponding domains of the QLICD-PU and the SF-36. Multi-trait scaling analysis was used to test convergent and discriminant validity, with two criteria: (1) an item-domain correlation of 0.40 or higher supports convergent validity; (2) discriminant validity is supported when an item correlates more strongly with its own domain than with the other domains. For reliability, internal consistency was assessed for each domain/facet and for the overall scale with Cronbach's alpha coefficient, using data from the first measurement (at admission). Test-retest reliability was evaluated by the Pearson correlation coefficient and the intra-class correlation coefficient (ICC) [26-27] between the first and second assessments. Responsiveness (sensitivity to change) was assessed with a paired t-test comparing mean scores before and after treatment, and with an effect size, the standardized response mean (SRM) [28-29].
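Two of the indices named above, Cronbach's alpha and the SRM, can be computed as in the following Python sketch. It uses the textbook definitions (alpha from item and total-score variances; SRM as the mean change divided by the standard deviation of the change) and is not a reproduction of the software used in the study. The function names and the numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev, variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for one scale.

    rows holds one sequence of item scores per respondent, with the same
    items in the same order for everyone.
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(n_items)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


def _changes(before: Sequence[float], after: Sequence[float]) -> List[float]:
    """Within-patient change scores (after minus before)."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired")
    return [a - b for b, a in zip(before, after)]


def standardized_response_mean(before: Sequence[float], after: Sequence[float]) -> float:
    """SRM: mean change divided by the standard deviation of the change."""
    diffs = _changes(before, after)
    return mean(diffs) / stdev(diffs)


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """Paired t statistic: mean change divided by its standard error."""
    diffs = _changes(before, after)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Illustrative numbers only, not study data.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
srm = standardized_response_mean([50, 60, 70, 80], [60, 65, 85, 90])
t_value = paired_t_statistic([50, 60, 70, 80], [60, 65, 85, 90])
```

The paired t statistic equals the SRM multiplied by the square root of the sample size, so the SRM conveys the size of the change independently of how many patients were assessed.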
Generalizability Theory Analysis
In addition to the classical test theory analysis, we applied Generalizability Theory (GT) to study the reliability of QLICD-PU scores. GT is a modern test theory that combines classical test theory with analysis of variance. It was proposed as a way to improve the design of measurement procedures so that they yield reliable data [30-33]. To control measurement error, GT brings the factors that affect test scores into the measurement model, such as differences between the objects of measurement, item difficulty, scoring criteria, and the interactions among these factors. Analysis of variance is then used to assess the influence of each factor on test scores, with the variance component as the index. GT comprises a generalizability study (G-study) and a decision study (D-study). The G-study quantifies the amount of variance attributable to each facet (factor) under examination. The D-study indicates which protocol is best for a particular measurement by generating a generalizability (G) coefficient, which can be interpreted as a reliability coefficient that takes into account all facets of the current study.
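For the simplest case, a one-facet design in which persons (p) are crossed with items (i), the variance decomposition and the resulting coefficients can be written as follows. The notation is the conventional one from the GT literature and is not taken from the original text: \sigma^2_p is the person variance, \sigma^2_i the item variance, \sigma^2_{pi,e} the person-by-item interaction confounded with residual error, and n'_i the number of items assumed in the D-study.

```latex
\begin{align}
X_{pi} &= \mu + \nu_p + \nu_i + \nu_{pi,e} \\
\sigma^2(X_{pi}) &= \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e} \\
E\rho^2 &= \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e} / n'_i} \\
\Phi &= \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_i + \sigma^2_{pi,e}\right) / n'_i}
\end{align}
```

Here E\rho^2 is the generalizability coefficient, which uses only relative error, and \Phi is the index of dependability, which also counts item variance as error.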
In our research, both the G-study and the D-study were carried out in one measurement model to estimate the variance components and dependability coefficients in a one-facet crossed design (person-by-item design, i.e., p × i design). We defined the patient's quality of life as the object of measurement and the item as a facet of measurement error. For the G-study, we defined the universe of admissible observations, composed of the object of measurement and the facet of measurement error, and estimated the variance components. For the D-study, we defined the universe of generalization, based on the object of measurement and the measurement facet over which the researchers wished to generalize. The generalizability coefficients, the reliability indexes, and the variance components of each facet and their interaction were then calculated.
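A p × i analysis of this kind can be sketched in Python as follows. The sketch assumes a complete persons-by-items score matrix and estimates the variance components from the expected mean squares of a two-way ANOVA without replication, setting negative estimates to zero (a common convention, assumed here). It is not the software used in the study; the function names and the data are hypothetical.

```python
from typing import Dict, Sequence


def g_study(scores: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Estimate variance components for a crossed p x i design.

    scores[p][i] is the score of person p on item i (no missing values).
    """
    n_p, n_i = len(scores), len(scores[0])
    if n_p < 2 or n_i < 2:
        raise ValueError("need at least two persons and two items")
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Mean squares for persons, items, and the residual (pi,e) term.
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_res = sum(
        (scores[p][j] - p_means[p] - i_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations; clamp negative estimates to 0.
    return {
        "p": max(0.0, (ms_p - ms_res) / n_i),
        "i": max(0.0, (ms_i - ms_res) / n_p),
        "pi,e": ms_res,
    }


def d_study(components: Dict[str, float], n_items: int) -> Dict[str, float]:
    """G coefficient and dependability index for a test of n_items items."""
    var_p = components["p"]
    relative_error = components["pi,e"] / n_items
    absolute_error = (components["i"] + components["pi,e"]) / n_items
    if var_p + relative_error == 0:
        raise ValueError("scores show no variance; coefficients are undefined")
    return {
        "g_coefficient": var_p / (var_p + relative_error),
        "phi": var_p / (var_p + absolute_error),
    }


# Hypothetical data: 3 persons answering 2 items.
components = g_study([[1, 2], [2, 3], [3, 4]])
coefficients = d_study(components, n_items=2)
```

Calling `d_study` with different values of `n_items` shows how the coefficients would change if the facet were lengthened or shortened, which is the kind of protocol comparison a D-study is meant to support.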