3.1 Sample characteristics
Of the 20,226 questionnaires received, 798 had missing responses on one or more SF-36 items, leaving 19,428 respondents in the study. The mean age of respondents was 14.78 years (standard deviation [SD] = 1.77), and 49.4% (9,595) were boys. Among the SF-36 and SF-12 domains, the PF mean score was the highest and the RE mean score was the lowest. The PCS score was higher than the MCS score. The largest mean difference in scores between the two instruments was in the SF domain. Among corresponding domains, the RE domains were the most strongly correlated (r = 0.923) and the VT domains the least (r = 0.670), which means that domains of the SF-12 could reflect from 67.0% to 92.3% of the information in the corresponding domains of the SF-36 (Table 1).
3.2 The reliability and validity in classical test theory
3.2.1 Factor analysis by EFA
The construct validity of the SF-36 was good in adolescents, as indicated by the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (0.884). The communalities of all variables were over 0.5. Factors with eigenvalues greater than 1 were extracted and rotated by the varimax method. Eight components were produced, which explained 69.21% of the total variance. The structure loadings of the extracted factors and the component score coefficient matrix are presented in Table 2. The EFA did not support the hypothesized structure of 8 domains (PF, RP, BP, GH, VT, SF, RE, and MH): the BP, SF, VT, and MH domains did not separate into the expected components, owing to the strong correlations between BP and SF and between VT and MH.
Similarly, the construct validity of the SF-12 was good in adolescents; the Kaiser-Meyer-Olkin Measure of Sampling Adequacy was 0.732. Eight components were extracted, which explained 63.50% of the total variance. Owing to the strong correlations between MH and SF and between VT and MH, the SF, VT, and MH domains did not separate into the expected components in the SF-12 (Table 3).
3.2.2 Factor analysis by CFA
We tested two conceptual models. Conceptual Model I assumed that PCS was associated with PF, RP, BP, and GH, whereas MCS was associated with VT, SF, RE, and MH. Conceptual Model II assumed that PCS and MCS were each associated with most of the 8 domains. The fit indices showed that, for both the SF-36 and the SF-12, Conceptual Model I fit better than Conceptual Model II (Table 4). The structure of Model I has been widely used in studies in China. In our study, we therefore adopted the structure of Model I for the two summary scales (PCS and MCS) of the SF-36 and the SF-12. Standardized parameter estimates for each CFA path are shown in Figure 1.
3.2.3 Validity and reliability of domains of SF-36 and SF-12
As mentioned above, the standardized parameter estimates from the CFA of Model I were used as factor loadings. CR and AVE were calculated according to Formulas 4 and 5.
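To make the step from factor loadings to CR and AVE concrete, the following is a minimal sketch that assumes the commonly used definitions based on standardized loadings, CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²)) and AVE = Σλ² / n. Formulas 4 and 5 are not reproduced in this section and may differ in detail. The function names and the example loadings are hypothetical and are not values from Figure 1.

```python
def construct_reliability(loadings):
    """CR from standardized loadings (assumed formula, see lead-in).

    Each item's error variance is taken as 1 - loading**2, which only
    holds for standardized loadings.
    """
    total = sum(loadings)
    error = sum(1 - l ** 2 for l in loadings)
    return total ** 2 / (total ** 2 + error)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings (assumed formula)."""
    return sum(l ** 2 for l in loadings) / len(loadings)


# Hypothetical standardized loadings for a three-item domain.
example_loadings = [0.7, 0.8, 0.6]
cr = construct_reliability(example_loadings)
ave = average_variance_extracted(example_loadings)
print(f"CR = {cr:.3f}, AVE = {ave:.3f}")
```

Under these definitions both quantities rise with the size of the loadings, and CR also rises with the number of items.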
Except for the SF domain in the SF-36 (Cronbach’s alpha = 0.211), domains composed of multiple items had generally acceptable internal reliability (Table 2). The low internal reliability of the SF domain probably reflects inconsistent understanding of its only two items, which might be biased or difficult for adolescents to parse (“To what extent has your physical health or emotional problems interfered with…” and “How much of the time has your physical health or emotional problems interfered with…”). Moreover, consistent with related studies, the internal reliability of the MH domain in the SF-12 was low (Cronbach’s alpha = 0.369). In each domain, the internal reliability of the SF-36 was better than that of the corresponding domain of the SF-12, which is consistent with internal reliability increasing with the number of items. The domains of PF, RP, BP, GH, and PCS in the SF-36 had good construct reliability (CR > 0.6). Except for RP and PCS, the domains in the SF-12 had poor construct reliability, especially GH, VT, and SF.
Criterion validity was calculated against the self-reported health item (“In general, would you say your health is…”). Notably, the criterion validities of all domains of the two instruments were low, especially for PF, RP, and SF, which suggests that the correlation between physical health and self-perceived health was weak. Moreover, for PCS, the criterion validity of the SF-12 was much higher than that of the SF-36. Although the criterion validities of the SF-36 were higher in the other corresponding domains, the gaps were small.
PF, RP, and PCS had generally acceptable convergent validity in both the SF-36 and the SF-12. Moreover, in the RP and PCS domains, the convergent validities of the SF-12 were higher than those of the SF-36, while there was little difference in the other domains except BP, GH, and VT (Table 5).
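For reference, Cronbach's alpha for a domain can be computed from item-level scores as alpha = k/(k − 1) × (1 − Σ item variances / variance of the total score). The sketch below uses this standard formula; the function name and the response data are hypothetical and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one domain.

    items: list of per-item score lists, one score per respondent,
    all lists the same length. Uses sample variances throughout.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score of each respondent across the k items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses of five respondents to a two-item domain.
item_a = [1, 2, 3, 4, 5]
item_b = [2, 1, 4, 3, 5]
alpha = cronbach_alpha([item_a, item_b])
print(f"alpha = {alpha:.3f}")
```

For a two-item domain, alpha depends entirely on how strongly the two items covary, so a value near zero arises when respondents answer the two items almost independently of each other.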
3.3 Validity and reliability in item response theory
The parameter values and information content of the items under the Samejima graded response model are shown in Table 6. Item discrimination varied widely, ranging from 0.45 to 2.73. Within each item, the difficulty parameters increased monotonically from the lowest level to the highest, which met the difficulty assumptions of the model. The average amount of information per item ranged from 0.07 to 1.02.
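As an illustration of how discrimination and ascending difficulty parameters enter the graded response model, the sketch below computes category probabilities for a single item. It assumes the usual logistic form, in which P(X ≥ k | θ) = 1 / (1 + exp(−a(θ − b_k))) and each category probability is the difference between adjacent cumulative probabilities. The function name and the parameter values are hypothetical and are not taken from Table 6.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta: latent trait level of the respondent.
    a: item discrimination.
    thresholds: difficulty parameters b_1 < ... < b_{m-1}, strictly ascending.
    Returns m probabilities, one per response category, summing to 1.
    """
    if any(b2 <= b1 for b1, b2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly ascending")
    # Cumulative P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cumulative = [1.0]
    cumulative += [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical four-category item with discrimination 1.5.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

The ascending order of the thresholds is what keeps every category probability positive, which is why the monotonic increase in difficulty matters for the model.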
In the SF-36, the domains of PF, RP, GH, and RE had acceptable item discrimination (> 1), whereas the remaining domains were less well differentiated, especially BP and SF, probably because adolescents are relatively homogeneous in terms of physical pain and social function. In the SF-12, by contrast, BP, SF, RP, and VT had higher item discrimination than in the SF-36.
According to the relevant literature, a total amount of information > 25 on a scale indicates that the quality of the evaluation items is good, whereas an amount < 16 indicates that the evaluation items are poor [33]. Given the 36 items of the SF-36, we divided 25 and 16 by 36 to obtain per-item criteria: items with an average information amount > 0.69 (25/36) were judged to be excellent, and items with an average information amount < 0.44 (16/36) were judged to be poor. Similarly, for the SF-12, items with an average information amount > 2.08 (25/12) were judged to be excellent, and items with an average information amount < 1.33 (16/12) were judged to be poor. Except for PF05 and PF09, the items of the PF domain in the SF-36 were excellent, as were the items of the GH domain, whereas the items of BP, VT, SF, RE, and MH were poor. In the SF-12, the average amounts of information of the items were poor (Table 6).
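The per-item criteria above are simple divisions of the scale-level cut-offs (25 and 16) by the number of items, and can be reproduced as follows. The function names and the "intermediate" label for values between the two cut-offs are assumptions made for illustration; the text names only the excellent and poor categories.

```python
def per_item_cutoffs(n_items, good_total=25, poor_total=16):
    """Divide the scale-level information cut-offs by the number of items."""
    return good_total / n_items, poor_total / n_items


def classify_item(mean_information, n_items):
    """Label an item by its average information amount.

    "intermediate" is a hypothetical label for values between the cut-offs.
    """
    good, poor = per_item_cutoffs(n_items)
    if mean_information > good:
        return "excellent"
    if mean_information < poor:
        return "poor"
    return "intermediate"


for n_items in (36, 12):  # SF-36 and SF-12
    good, poor = per_item_cutoffs(n_items)
    print(f"{n_items} items: excellent > {good:.2f}, poor < {poor:.2f}")
```

Because the cut-offs scale inversely with the number of items, the same average information amount can be rated differently on the two instruments: a hypothetical value of 1.02 exceeds the SF-36 criterion of 0.69 but falls below the SF-12 criterion of 1.33.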