Participants and procedure
Using SoSciSurvey (30), participants were invited to take part in an online survey consisting of various questionnaires. Participants were asked about their cancer diagnosis and selected applicable types of cancer from a list. This question was designed as a multiple-choice task with several answer options plus an open, descriptive category "other", so that several cancer diagnoses could be named at the same time. Social media platforms, a forum for cancer patients, and mailing lists of self-help groups were used to advertise the study as part of another validation study (31). All participants gave their informed consent online after being informed about the study's content and aims, procedures, and planned publications. Inclusion criteria were age ≥ 18 years and at least one current or past cancer diagnosis. No exclusion criteria were defined. In total, N = 357 cancer patients (n = 288 women (80.7%), n = 68 men (19.0%), n = 1 gender-diverse person (0.3%)) completed the PMH-scale.
The study was approved by the Ethics Commission of the University's Faculty of Medicine (reference number 18-098). All procedures contributing to this work comply with the relevant national and institutional committees' ethical standards on human experimentation and the Helsinki Declaration of 1975, as revised in 2008.
Assessment instrument
PMH. The German version of the PMH-scale (4) was used, a self-report instrument consisting of nine items rated on a four-point Likert-type scale ranging from 0 ("do not agree") to 3 ("agree"). It assesses the emotional, psychological, and social aspects of positive mental health; higher scores reflect greater positive mental health. In a series of six studies that included samples of students, patients, and the general population, the scale showed good psychometric properties, e.g., high internal consistency (Cronbach's alpha = .93) and satisfactory retest reliability (r = .74 to .81). Convergent validity was confirmed, e.g., with the Satisfaction With Life Scale (32) (r = .75) and the Subjective Happiness Scale (33) (r = .81) (4), and the scale demonstrated strong cross-cultural measurement invariance in student samples from Germany, Russia, and China (19).
Statistical analyses
Data were analyzed using SPSS version 26.0 (34) and RUMM2030 software (35).
To assess the psychometric properties of the PMH-scale in an oncological context, item analysis according to the Rasch model was used. IRT models, including the Rasch model, allow the psychometric properties of an instrument to be analyzed in detail because they focus on individual items and how people respond to them. The probability of an item response is a function of the difference between person parameters and item difficulty parameters on the latent trait, which in this case is PMH (4). Performing a Rasch analysis involves examining how well the data meet the expectations of the measurement model and whether certain requirements are met. A defining principle of Rasch models is that the data must fit the model, not the other way around (36). As with other IRT models, the requirements relate to unidimensionality, local independence, and absence of differential item functioning (DIF). Specific to Rasch analyses is the requirement of homogeneity. A Rasch analysis can be understood as an iterative process in which the model assumptions are checked and any deviations found are resolved, if possible. If model fit is established, ordinal raw scores can be transformed into interval-level parameters. The Rasch model uses a logistic transformation to convert ordinal scores into linear measures expressed in "logits" (i.e., log-odds units) (29).
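To illustrate the core idea, the dichotomous Rasch model (the simplest case; the polytomous extension used in this study is introduced below) can be sketched in a few lines of Python. All numeric values here are illustrative, not estimates from the study data.

```python
import math

def rasch_probability(theta, beta):
    """Probability of endorsing a dichotomous item under the Rasch model:
    P(X = 1) = exp(theta - beta) / (1 + exp(theta - beta)),
    where theta is the person parameter and beta the item difficulty,
    both on the same logit (log-odds) scale."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# When the person parameter equals the item difficulty, the endorsement
# probability is exactly 0.5 -- the defining property of the logit metric.
print(rasch_probability(0.0, 0.0))               # 0.5
# A person 1 logit above the item's difficulty endorses it with ~0.73 probability.
print(round(rasch_probability(1.0, 0.0), 2))     # 0.73
```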
Because the PMH items are polytomous, the Partial Credit Model (PCM) (37) was used. Overall model fit was evaluated using the chi-square item-trait interaction statistic; a good level of overall fit is characterized by a non-significant chi-square probability, p > 0.01 (29, 38, 39). For a good fit, the mean values of the residuals should be around 0 with a standard deviation of 1. Beyond the overall fit, the fit of individual items (item fit) and persons (person fit) can be evaluated; both are expressed as residuals, which are expected to lie within a range of ± 2.5 (29, 40). The second fit statistic is a chi-square statistic, whose probability should be non-significant.
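As a minimal sketch of the PCM, the category probabilities for a single four-category item (scored 0 to 3, as in the PMH-scale) can be computed from its step parameters; the threshold values below are hypothetical, not estimates from the study.

```python
import math

def pcm_probs(theta, thresholds):
    """Category probabilities under Masters' Partial Credit Model.
    thresholds: step parameters tau_1..tau_m in logits.
    P(X = k) is proportional to exp(sum_{j<=k} (theta - tau_j)),
    with the empty sum (category 0) contributing exp(0) = 1."""
    numerators = [1.0]       # category 0
    cumulative = 0.0
    for tau in thresholds:
        cumulative += theta - tau
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical, ordered thresholds for one 0-3 item.
probs = pcm_probs(theta=0.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])   # four probabilities summing to 1
```

For a person slightly above the item's centre (theta = 0.5), the middle-high category 2 is the most probable response, as expected.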
One fundamental requirement of the Rasch model is unidimensionality, i.e., the items of a scale should capture only one underlying construct. This was tested with a principal component analysis (PCA) of the residuals (29, 38). The items with the highest positive and negative loadings on the first component are used to create two subsets of items. Separate person estimates derived from these two subsets are then compared with independent t-tests; the proportion of significant t-tests should not exceed 5% to reject multidimensionality and infer unidimensionality (41).
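The per-person t-test step of this procedure can be sketched as follows; the simulated person estimates and standard errors are purely illustrative and stand in for the subset estimates that RUMM2030 produces.

```python
import math
import random

def unidimensionality_ttests(est_a, se_a, est_b, se_b):
    """For each person, compare the estimates from the two item subsets
    (highest positive vs. highest negative loadings on the first residual
    principal component): t = (a - b) / sqrt(se_a^2 + se_b^2).
    Returns the proportion of persons with |t| > 1.96."""
    n_sig = sum(
        1 for a, sa, b, sb in zip(est_a, se_a, est_b, se_b)
        if abs((a - b) / math.sqrt(sa ** 2 + sb ** 2)) > 1.96
    )
    return n_sig / len(est_a)

# Simulated unidimensional case: both subsets estimate the same trait.
random.seed(1)
theta = [random.gauss(0, 1) for _ in range(357)]
est_a = [t + random.gauss(0, 0.4) for t in theta]
est_b = [t + random.gauss(0, 0.4) for t in theta]
prop = unidimensionality_ttests(est_a, [0.4] * 357, est_b, [0.4] * 357)
print(round(prop, 3))   # typically close to the 5% nominal rate
```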
Another assumption is that of local independence. This assumption implies that there should be no residual correlations between items once the trait factor has been extracted (42). Local dependence (LD) can occur when items are linked such that the response to one item determines the response to another (38, 42). Because LD can lead to overestimation of reliability, bias in parameter estimation, and corrupted construct validity (43), adequate handling of it is critical. Local independence was investigated using a residual correlation matrix of the items; items with a residual correlation of 0.2 above the average were considered locally dependent (43, 44). One strategy to deal with LD without deleting scale items is to combine the locally dependent items into 'superitems' by adding them together. The 'superitem' strategy results in a bi-factor equivalent solution; the proportion of explained common variance (ECV) (45–47) of the general factor should be >0.9 for the scale to be considered unidimensional (45).
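The ECV criterion can be illustrated with a short computation from (entirely hypothetical) bifactor loadings; a strong general factor and weak superitem-specific factors yield an ECV above the 0.9 benchmark.

```python
def explained_common_variance(general_loadings, group_loadings):
    """ECV = sum of squared general-factor loadings divided by the sum of
    squared loadings on all factors (general + group/'superitem' factors)."""
    g = sum(l ** 2 for l in general_loadings)
    s = sum(l ** 2 for factor in group_loadings for l in factor)
    return g / (g + s)

# Hypothetical loadings: a strong general factor and two weak group factors
# (e.g., from pairs of locally dependent items combined into superitems).
general = [0.8, 0.75, 0.7, 0.8, 0.72, 0.78]
groups = [[0.2, 0.15], [0.1, 0.12]]
ecv = explained_common_variance(general, groups)
print(round(ecv, 3))   # 0.975, above the 0.9 unidimensionality benchmark
```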
A specific assumption of the Rasch model is that the items are assumed to be homogeneous in the sense that the ranking of the item parameters should be the same for all respondents, regardless of their expression of the latent trait. This requirement is reflected in tests of item-trait interaction based on group residuals, i.e., differences between observed and expected scores in groups matched by their total person-parameters scores (40, 42, 48).
Another assumption is the absence of DIF. If DIF is present, the difficulty of an item differs between groups (e.g., men and women); in other words, the item indicates the latent trait in different ways in different groups (29, 42). DIF was examined using analysis of variance (ANOVA). Uniform DIF is indicated by a significant main effect of the person factor, meaning that the groups show a consistent difference in their responses to an item across the whole range of the assessed dimension. Non-uniform DIF is indicated by a significant interaction effect (person factor x class interval), meaning that the differences between groups vary across the levels of the assessed dimension. In this study, we tested the items for DIF in relation to gender (woman, man), age (median split of the sample: below and above 54), type of cancer (breast, other forms of cancer, multiple cancers), presence of metastases (yes, no, unknown), psycho-oncological support (yes, no), and duration of disease (median split of the sample: below and above 3.9 years). To avoid overly small subgroups in the ANOVA, we excluded the one gender-diverse person and the metastasis category 'unknown' from the DIF analysis, and combined the cancer diagnoses with lower frequencies into one category 'other forms of cancer' for the cancer-type DIF analysis. If DIF is found, several strategies can be used to deal with it. One possibility is to remove or reformulate items showing DIF; another is to split the item with regard to the respective DIF variable. We used the latter strategy: when DIF was found, the item was split and the impact of DIF was subsequently evaluated by computing equated scores (26). Following this method, the item for which DIF was found is split by the respective DIF variable (e.g., gender). For each DIF subgroup (e.g., males vs. females) a score-to-measure transformation is performed, so that for each person parameter the equated scores of, e.g., males and females can be compared and the size of the score differences evaluated (49, 50).
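The structure of the DIF ANOVA can be sketched with the sums of squares for a balanced two-way layout of standardized item residuals (group x class interval); the toy residuals below are invented to mimic uniform DIF, and the F-tests that RUMM2030 reports are omitted for brevity.

```python
from statistics import mean

def dif_anova_ss(residuals):
    """Sums of squares for a balanced two-way layout of standardized item
    residuals: rows = person factor (e.g., gender), columns = class intervals
    along the trait, cells = lists of residuals. A large group SS signals
    uniform DIF; a large group-x-interval SS signals non-uniform DIF."""
    grand = mean(v for row in residuals for cell in row for v in cell)
    n = len(residuals[0][0])                      # persons per cell (balanced)
    n_groups, n_intervals = len(residuals), len(residuals[0])
    row_means = [mean(v for cell in row for v in cell) for row in residuals]
    col_means = [mean(v for row in residuals for v in row[j])
                 for j in range(n_intervals)]
    ss_group = n * n_intervals * sum((m - grand) ** 2 for m in row_means)
    ss_interval = n * n_groups * sum((m - grand) ** 2 for m in col_means)
    cell_means = [[mean(cell) for cell in row] for row in residuals]
    ss_inter = n * sum(
        (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_groups) for j in range(n_intervals))
    return ss_group, ss_interval, ss_inter

# Toy data: 2 groups x 3 class intervals, 2 residuals per cell; group 0's
# residuals sit uniformly above group 1's, mimicking uniform DIF.
res = [[[0.5, 0.7], [0.6, 0.4], [0.5, 0.6]],
       [[-0.5, -0.6], [-0.4, -0.5], [-0.6, -0.5]]]
ss_g, ss_i, ss_gi = dif_anova_ss(res)
print(ss_g > ss_gi)   # True: the group main effect dominates the interaction
```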
Moreover, to assess the category functioning of each item, threshold ordering was analyzed using the category probability curves. Item thresholds are the transition points between two adjacent response categories. Disordered thresholds can affect the interpretation and validity of scale scores (51). Threshold disorder can have several causes, such as respondents having difficulty differentiating consistently among the response options, or LD causing the disorder. If the disorder is due to problems with category differentiation, one option is to collapse the disordered response categories.
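A simple check for threshold ordering can be sketched as follows; the threshold values are hypothetical examples for a 0-3 item (three transitions), not estimates from the PMH data.

```python
def disordered_thresholds(thresholds):
    """Return the indices at which a threshold is lower than its predecessor,
    i.e., where the transition points between adjacent response categories
    are not monotonically increasing along the latent trait."""
    return [i for i in range(1, len(thresholds))
            if thresholds[i] < thresholds[i - 1]]

# Hypothetical thresholds for two 0-3 items (three transitions each):
print(disordered_thresholds([-1.2, 0.1, 1.4]))   # [] -> ordered
print(disordered_thresholds([-0.8, 0.9, 0.3]))   # [2] -> third transition disordered
```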
The reliability of the scale was estimated using the Person Separation Index (PSI). The PSI indicates how well a set of items can discriminate between the individuals being measured. PSI values of .70 are considered appropriate for group-level applications and .85 for individual applications (29, 38, 40, 42).
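The PSI can be sketched as a separation-reliability coefficient computed from person parameters and their standard errors; the estimates and errors below are hypothetical illustration values.

```python
from statistics import pvariance, mean

def person_separation_index(person_estimates, standard_errors):
    """Rasch person separation reliability: the share of observed variance in
    the person parameters (logits) not attributable to measurement error,
    PSI = (var(theta_hat) - mean(SE^2)) / var(theta_hat)."""
    observed_var = pvariance(person_estimates)
    error_var = mean(se ** 2 for se in standard_errors)
    return (observed_var - error_var) / observed_var

# Hypothetical person estimates (logits) and standard errors:
est = [-1.5, -0.8, -0.2, 0.0, 0.4, 0.9, 1.3, 1.9]
ses = [0.35] * len(est)
psi = person_separation_index(est, ses)
print(round(psi, 2))   # 0.89, above the .85 benchmark for individual use
```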
Targeting describes the extent to which a scale is appropriate for a given sample in terms of scale difficulty. Targeting was assessed graphically using the person-item threshold distribution graph. Person-item maps show how person parameters and item thresholds are distributed along the measured dimension (29); they indicate whether the item thresholds are located in the same range as the person parameters. If a scale is poorly targeted for a sample, measurement precision is low in those ranges of the assessed dimension where the persons are located. In the case of the PMH-scale, the scale would be poorly targeted if respondents reported either less or more well-being than the scale covers. Additionally, pronounced floor and ceiling effects and a mean person parameter deviating substantially from zero (which usually is the mean value of the item difficulty) can be indicators of poor targeting (29, 42).
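The numeric targeting indicators mentioned above can be sketched as a small summary; the person parameters and extreme-score counts below are invented to mimic a sample sitting above the item centre, as may happen when a well-being scale is "too easy" for the sample.

```python
from statistics import mean

def targeting_summary(person_params, min_score_count, max_score_count, n):
    """Simple targeting indicators: the mean person parameter in logits
    (a value near 0 indicates good targeting, since item difficulties are
    centred at 0) and the proportions of extreme (floor/ceiling) raw scores."""
    return {
        "mean_person_logit": mean(person_params),
        "floor": min_score_count / n,
        "ceiling": max_score_count / n,
    }

# Hypothetical person parameters with a clearly positive mean:
params = [0.2, 0.8, 1.1, 0.5, 1.4, 0.9, 0.7, 1.0]
summary = targeting_summary(params, min_score_count=0, max_score_count=1, n=8)
print(summary["mean_person_logit"] > 0.5)   # True: sample sits above item centre
```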