Steps 5-7
Appraisal of the methodological quality of the included studies (risk of bias assessment for each individual study)
A new order is proposed for evaluating the measurement properties, as shown in Table 2. Content validity is considered the most important measurement property, because, first of all, it should be clear that the instrument items are relevant, comprehensive, and comprehensible in relation to the construct of interest and the target population(21). The instruments with high-quality evidence of inadequate content validity can be excluded from further assessment in the systematic review.
The methodological quality of each study will be assessed using the COSMIN Risk of Bias Checklist, which examines whether the results are trustworthy given the methodological quality of the study. The main outcome of this study is the assessment of the psychometric properties of postpartum quality of life questionnaires (23). First, it should be determined which measurement properties are assessed in each study.
The COSMIN Risk of Bias checklist contains ten boxes (Table 2), each corresponding to one measurement property in accordance with the COSMIN taxonomy (16). The quality of each study should be assessed separately for each measurement property using the relevant box in COSMIN, so it may not be necessary to complete the whole checklist when evaluating the studies described in an article. Each item will be rated on a four-point scale as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’, and it will be determined whether appropriate statistical tests have been used. The overall quality score for a particular measurement property will be taken as the lowest rating given to any item in the box (26).
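The rule that the overall score equals the lowest item rating can be sketched in a few lines of Python. This is only an illustration: the function name `box_quality` and the example ratings are hypothetical and are not part of the COSMIN checklist.

```python
# Four-point rating scale, ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def box_quality(item_ratings):
    """Return the overall rating of one box: the worst rating among its items."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # The rating with the highest position in RATING_ORDER is the worst one.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for the reliability box of a single study.
print(box_quality(["very good", "adequate", "doubtful", "very good"]))  # doubtful
```

A single ‘doubtful’ item is enough to lower the overall rating of the box, even when every other item is rated ‘very good’.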
Table 2
Boxes of the COSMIN Risk of Bias checklist
Mark the measurement properties that have been evaluated in the article*.

| Category | Box |
|---|---|
| Content validity | Box 1. PROM development** |
| Content validity | Box 2. Content validity |
| Internal structure | Box 3. Structural validity |
| Internal structure | Box 4. Internal consistency |
| Internal structure | Box 5. Cross-cultural validity/measurement invariance |
| Remaining measurement properties | Box 6. Reliability |
| Remaining measurement properties | Box 7. Measurement error |
| Remaining measurement properties | Box 8. Criterion validity |
| Remaining measurement properties | Box 9. Hypothesis testing for construct validity |
| Remaining measurement properties | Box 10. Responsiveness |

* If a box needs to be completed more than once, two or more marks can be placed.
** Not considered a measurement property, but taken into account when evaluating the content validity (26)
In this study, the boxes will be evaluated in the following order: content validity, internal structure, and the remaining measurement properties (Table 2).
The findings pertaining to the development of the measurement instruments will be described narratively due to the lack of universally accepted quality standards. Before the quality assessment and synthesis, the primary studies on reliability and validity will be stratified based on the methodological approach.
The results of each study will be assessed against the criteria for good measurement properties and rated as ‘sufficient’ (+), ‘insufficient’ (−), or ‘indeterminate’ (?) (20, 27). For example, the internal consistency of a QOL score is rated as sufficient if there is at least low evidence for sufficient structural validity AND Cronbach’s alpha(s) ≥ 0.70 for each unidimensional scale or subscale (Table 3).
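As an illustration of how such a criterion can be applied, the sketch below rates the internal consistency of each subscale separately. The function name, the subscale names, and the alpha values are hypothetical. Rating each subscale on its own is an assumption made here to avoid deciding how mixed results across subscales should be combined.

```python
def rate_internal_consistency(has_structural_validity_evidence, alpha):
    """Rate one unidimensional (sub)scale as '+', '-' or '?'.

    has_structural_validity_evidence: True if there is at least low evidence
        for sufficient structural validity.
    alpha: Cronbach's alpha reported for this (sub)scale.
    """
    if not has_structural_validity_evidence:
        return "?"  # the structural validity criterion is not met
    return "+" if alpha >= 0.70 else "-"


# Hypothetical alphas for two subscales of a postpartum QOL questionnaire.
alphas = {"physical": 0.82, "emotional": 0.64}
ratings = {name: rate_internal_consistency(True, a) for name, a in alphas.items()}
print(ratings)  # {'physical': '+', 'emotional': '-'}
```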
The results of all the studies will be summarized to determine whether each measurement property of an instrument is sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?). When the studies are inconsistent, the results from the pertinent subgroups of the studies will be summarized to explain the inconsistency. If this is not possible, the overall quality will be determined based on the majority of the studies, and the inconsistency will be accounted for in the third step, when the quality of the evidence is graded.
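The summarizing step can be sketched as follows. Both function names are hypothetical. Two choices are assumptions of this sketch rather than rules stated in the protocol: indeterminate results are ignored when at least one study gives an informative result, and a tie in the majority vote is reported as inconsistent.

```python
from collections import Counter


def summarize_ratings(study_ratings):
    """Combine per-study ratings ('+', '-', '?') into one overall rating."""
    informative = {r for r in study_ratings if r in ("+", "-")}
    if not informative:
        return "?"  # no study gave an informative result
    if len(informative) == 1:
        return informative.pop()  # all informative studies agree
    return "±"  # both '+' and '-' were found


def majority_rating(study_ratings):
    """Fallback for unexplained inconsistency: follow the majority of studies."""
    counts = Counter(r for r in study_ratings if r in ("+", "-"))
    if counts["+"] == counts["-"]:
        return "±"  # assumption: a tie stays inconsistent
    return "+" if counts["+"] > counts["-"] else "-"


print(summarize_ratings(["+", "+", "?"]))  # +
print(summarize_ratings(["+", "-", "+"]))  # ±
print(majority_rating(["+", "-", "+"]))    # +
```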
The overall quality of the evidence will be rated using the modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach(26). Finally, the quality of the evidence will be downgraded when there is a risk of bias based on the COSMIN Risk of Bias Checklist.
Table 3
The criteria for the evaluation of the quality of the results
| Measurement property | Rating¹ | Criteria |
|---|---|---|
| Structural validity | + | CTT: CFA: CFI or TLI or comparable measure > 0.95 OR RMSEA < 0.06 OR SRMR < 0.08². IRT/Rasch: no violation of unidimensionality³ (CFI or TLI or comparable measure > 0.95 OR RMSEA < 0.06 OR SRMR < 0.08) AND no violation of local independence (residual correlations among the items after controlling for the dominant factor < 0.20 OR Q3's < 0.37) AND no violation of monotonicity (adequate looking graphs OR item scalability > 0.30) AND adequate model fit (IRT: χ² > 0.01; Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5 OR Z-standardized values > −2 and < 2) |
| Structural validity | ? | CTT: not all information for ‘+’ reported. IRT/Rasch: model fit not reported |
| Structural validity | − | Criteria for ‘+’ not met |
| Internal consistency | + | At least low evidence⁴ for sufficient structural validity⁵ AND Cronbach's alpha(s) ≥ 0.70 for each unidimensional scale or subscale⁶ |
| Internal consistency | ? | Criteria for “At least low evidence⁴ for sufficient structural validity⁵” not met |
| Internal consistency | − | At least low evidence⁴ for sufficient structural validity⁵ AND Cronbach’s alpha(s) < 0.70 for each unidimensional scale or subscale⁶ |
| Reliability | + | ICC or weighted Kappa ≥ 0.70 |
| Reliability | ? | ICC or weighted Kappa not reported |
| Reliability | − | ICC or weighted Kappa < 0.70 |
| Measurement error | + | SDC or LoA < MIC⁵ |
| Measurement error | ? | MIC not defined |
| Measurement error | − | SDC or LoA > MIC⁵ |
| Hypotheses testing for construct validity | + | The result is in accordance with the hypothesis⁷ |
| Hypotheses testing for construct validity | ? | No hypothesis defined (by the review team) |
| Hypotheses testing for construct validity | − | The result is not in accordance with the hypothesis⁷ |
| Cross-cultural validity/measurement invariance | + | No important differences found between group factors (such as age, gender, language) in multiple group factor analysis OR no important DIF for group factors (McFadden's R² < 0.02) |
| Cross-cultural validity/measurement invariance | ? | No multiple group factor analysis OR DIF analysis performed |
| Cross-cultural validity/measurement invariance | − | Important differences between group factors OR DIF was found |
| Criterion validity | + | Correlation with gold standard ≥ 0.70 OR AUC ≥ 0.70 |
| Criterion validity | ? | Not all information for ‘+’ reported |
| Criterion validity | − | Correlation with gold standard < 0.70 OR AUC < 0.70 |
| Responsiveness | + | The result is in accordance with the hypothesis⁷ OR AUC ≥ 0.70 |
| Responsiveness | ? | No hypothesis defined (by the review team) |
| Responsiveness | − | The result is not in accordance with the hypothesis⁷ OR AUC < 0.70 |
Adapted from Prinsen et al. (Prinsen et al., 2016a) under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/) (20, 21, 27).
Abbreviations: AUC, area under the curve; CFA, confirmatory factor analysis; CFI, comparative fit index; CTT, classical test theory; DIF, differential item functioning; ICC, intraclass correlation coefficient; IRT, item response theory; LoA, limits of agreement; MIC, minimal important change; RMSEA, root mean square error of approximation; SEM, standard error of measurement; SDC, smallest detectable change; SRMR, standardized root mean residuals; TLI, Tucker–Lewis index.
¹ + = sufficient, − = insufficient, ? = indeterminate
² To rate the quality of the summary score, the factor structures should be equal across the studies
³ Unidimensionality refers to a factor analysis per subscale, while structural validity refers to the factor analysis of a (multidimensional) PROM
⁴ As defined by grading the evidence according to the GRADE approach
⁵ This evidence may come from different studies
⁶ The criterion “Cronbach alpha < 0.95” was deleted, as this is relevant in the development phase of a PROM and not when evaluating an existing PROM
⁷ The results of all the studies should be taken together, and it should then be decided if 75% of the results are in accordance with the hypotheses
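Two of the criteria in Table 3 can be written as short functions, shown here only as a sketch. The function names and example values are hypothetical. Reading the 75% rule as "at least 75%" is an assumption of this sketch.

```python
def rate_reliability(icc_or_kappa):
    """Rate reliability from an ICC or weighted Kappa (None if not reported)."""
    if icc_or_kappa is None:
        return "?"
    return "+" if icc_or_kappa >= 0.70 else "-"


def rate_hypotheses(n_confirmed, n_tested):
    """Rate construct validity from the share of results that match the hypotheses."""
    if n_tested == 0:
        return "?"  # no hypothesis was defined
    # Assumption: "75% of the results" is read as "at least 75%".
    return "+" if n_confirmed / n_tested >= 0.75 else "-"


print(rate_reliability(0.81))   # +
print(rate_reliability(None))   # ?
print(rate_hypotheses(6, 8))    # +
print(rate_hypotheses(5, 8))    # -
```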
Data collection process (data extraction):
The data to be extracted include the study objectives or questions; publication data, including the authors’ names and the year; the psychometric properties assessed; the country; the language of measurement; the study design; the study setting; the suitability of the assessment; the sampling method; and the instruments used in studies on postpartum quality of life. Some studies report a measurement property several times, for different subgroups or with more than one metric. For these studies, the following information will be extracted:
- Subgroups' details (including clinical details, women's mean age, and interval of measurement of postpartum quality of life from one hour to one year postpartum).
- The study results and analysis details (correlation, validity, and reliability).
- Reported results (values related to an appropriate metric or a narrative statement of results).
It will then be determined whether the subgroups' results were pooled and, if so, how the pooling was performed and whether there was evidence of heterogeneity.
Data will be extracted by two independent reviewers. In the case of disagreement, the matter will be referred to a third reviewer, and the reviewers will be contacted for further clarification.
Data Synthesis:
First, data including comparisons, differences, and results will be collected chronologically from the studies in narrative form, and the best evidence on each instrument will be identified using the COSMIN recommendations. The overall rating of each measurement property will be summarized as ‘sufficient’ (+), ‘insufficient’ (−), ‘inconsistent’ (±), or ‘indeterminate’ (?). After all the evidence on the measurement properties has been collected and summarized, its quality will be rated as ‘high’, ‘moderate’, ‘low’, or ‘very low’ using the modified GRADE approach.
Moreover, where the studies provide sufficient data for a meta-analysis, the analysis will be performed in Stata.
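The four-level grading can be illustrated with the sketch below. It assumes that the evidence starts at ‘high’ and is lowered by a number of levels for each downgrading factor, as is usual in GRADE-style grading. The starting level, the factor names, and the number of levels per factor are assumptions and are not specified in this protocol.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def grade_evidence(downgrades):
    """Return the quality of evidence after downgrading.

    downgrades: mapping of factor name -> number of levels to downgrade
        (for example 1 for serious risk of bias). Hypothetical input format.
    """
    if any(levels < 0 for levels in downgrades.values()):
        raise ValueError("downgrade levels must be non-negative")
    total = sum(downgrades.values())
    # The quality cannot fall below the lowest level, 'very low'.
    return LEVELS[min(total, len(LEVELS) - 1)]


# Hypothetical example: one level for risk of bias, one for inconsistency.
print(grade_evidence({"risk_of_bias": 1, "inconsistency": 1}))  # low
```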
Study strengths and limitations
The processes of selection, data extraction, and quality assessment will be carried out independently by two reviewers experienced in systematic review methods in order to minimize personal bias.
To the researchers’ knowledge, only one systematic review (with no meta-analysis), conducted in 2012, has examined the general and specific instruments used in pregnancy and the postpartum period; that study found that none of the instruments matched the current health model (Mogos et al., 2013).
No language restrictions will be applied to the included studies, which reduces the risk of language bias.
In the present study, the modified GRADE approach will be used to assess the quality of the evidence.
A meta-analysis may not be possible if only a few primary studies are found or if their methods for assessing validity and reliability differ substantially.