Registration and Protocol
The protocol for this review was registered prior to the start of the study on the Open Science Framework (https://doi.org/10.17605/OSF.IO/27EDF). We followed the working procedure developed by COSMIN for conducting a systematic review of validity, reliability, and measurement error [9, 76]. This report follows the most recent PRISMA guideline [87].
Eligibility Criteria
Population
This review targeted studies that included adults (aged ≥ 18 years). Studies including both a clinical group and a healthy group were included for the assessment of discriminative validity, while studies including a clinical group, healthy group, or heterogeneous group were included for the assessment of reliability and responsiveness. A clinical group was defined in this review as participants currently experiencing an episode of non-specific spinal pain (i.e., neck, upper, or low back pain) of any duration (i.e., acute or persistent/chronic) who were included in the study based on self-identification or an outcome measure of pain (e.g., visual analogue scale, numeric pain rating scale) or function/disability (e.g., Oswestry Disability Index). A healthy group was defined as participants with no current pain or recent history of non-specific spinal pain. We excluded data from clinical groups consisting of participants with a specific cause of spinal pain (e.g., infection, malignancy, fracture, history of spine surgery), or from studies whose authors did not screen for these potential causes of pain.
Outcome
We included studies that calculated the FRR from surface electromyography (sEMG) of spine muscles (i.e., neck, upper, or lower back) as a dependent measure. We deviated from our protocol to accept studies with or without a concurrent measure of spine angle (e.g., motion capture, inertial motion sensor, or accelerometer/inclinometer), so long as the instructions for the FRR trial defined the motion of the trial (flexion and extension phases). Studies that employed any method of calculating the FRR were accepted (e.g., maximum root mean square (RMS) sEMG during the flexion phase divided by maximum RMS sEMG at full flexion; maximum RMS sEMG during the extension phase divided by maximum RMS sEMG at full flexion). These variants share a common general form, shown below.
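As a generic illustration only (individual studies differed in their sEMG processing windows and in how the flexion, full-flexion, and extension phases were defined):

```latex
\mathrm{FRR} \;=\; \frac{\max\left(\mathrm{RMS\ sEMG}\right)_{\text{flexion or extension phase}}}{\max\left(\mathrm{RMS\ sEMG}\right)_{\text{full flexion}}}
```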
Included studies must have assessed the reliability or responsiveness of the FRR and/or the discriminative validity of the FRR through hypothesis testing, or must have presented enough data (mean and standard deviation per group) to assess validity or responsiveness against our review hypotheses. We used the data available that matched our inclusion criteria (i.e., if only the healthy/asymptomatic population matched our review inclusion criteria and could be used for reliability or responsiveness, it was included in those parts of the review, respectively). Both populations were considered for studies of reliability or measurement error. Discriminative validity was assessed by comparing populations that are known to be different [4]. Studies of responsiveness included either an experimental or clinical intervention applied to one or both populations.
Validity
Validity refers to the extent to which a measure assesses the construct it is supposed to measure [5]. Several aspects of validity need to be addressed when assessing the suitability of a measurement outcome. Construct validity refers to the degree to which the scores of a measurement are consistent with hypotheses that align with the construct of interest [5]. For this review, the form of construct validity that we explored was discriminative validity, which refers to the ability of a measurement score to distinguish between predictably different individuals or groups. We hypothesized that clinical groups (i.e., individuals with non-specific spinal pain) would have significantly different FRR values (95% confidence intervals (CI) for mean group differences not overlapping zero) compared to healthy/asymptomatic groups.
Reliability
Reliability refers to the extent to which scores are the same for repeated measurements, whether observed by different persons on the same occasion (inter-rater), over time (test-retest), or by the same person on different occasions (intra-rater), given that the value of the construct has remained stable [5]. The construct must therefore be stable for evaluations of test-retest and intra-rater reliability. Measurement error, a component of reliability, refers to the systematic and random error of an observed score that is not attributable to true changes in the construct being measured [5]. Any measure of reliability was accepted for this review.
Responsiveness
Responsiveness refers to the ability of a measurement instrument to detect change over time in the construct being measured, when such change has occurred, for example in response to treatment or during progression of disease [5]. To confirm responsiveness of the FRR outcome measure, we hypothesized that significant differences in FRR (95% CI for the mean pre/post-exposure difference not crossing zero) would be found before and after exposures/interventions.
Types of studies
We deviated from our protocol to include only articles published in peer-reviewed journals or full papers published in peer-reviewed conference proceedings. We included randomized controlled trials, cohort studies, case-control studies, cross-sectional studies, quasi-experimental studies, and laboratory experiments. No language limits were set. Attempts were made to translate studies into English for inclusion; where this was not possible, the identified studies were listed for future reviews to use. We excluded the following types of studies: feasibility studies, pilot studies, systematic and non-systematic reviews, protocols, theses/dissertations, commentaries, reports, and any other non-peer-reviewed studies.
Context
Studies conducted in either a clinical or laboratory setting were included.
Information Sources and Search Strategy
Six databases (MEDLINE via Ovid, Embase via Embase.com, CINAHL via EBSCO, SPORTDiscus via EBSCO, Web of Science Core Collection, and Scopus) were searched for published studies from inception to June 1, 2022. Search terms consisted of subject headings specific to each database (e.g., MeSH in MEDLINE) and free-text words relevant to the search concepts, such as "flexion relaxation" and "spine". The search string was developed by content experts (DDC, SH, SM, MF) together with a health services librarian (KR), and the search strategy was peer-reviewed by a second librarian according to the PRESS guidelines [88]. The complete search strategies for all databases are included in Supplementary Information S3 online.
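To illustrate the general structure of these strategies (the complete, peer-reviewed versions are in Supplementary Information S3), a hypothetical Ovid MEDLINE fragment combining subject headings with the free-text concepts named above might look like the following; the line set and exact terms are illustrative only:

```
1. exp Spine/ or exp Back Pain/
2. (spine or spinal or lumbar or neck or "low back").ti,ab.
3. "flexion relaxation".ti,ab.
4. (1 or 2) and 3
```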
Selection Process
Results from each database were combined and imported into Covidence (Veritas Health Innovation, Melbourne, Australia), where duplicates were removed prior to screening. Results for each stage of the review were tracked in Microsoft Excel (Microsoft Corporation, Redmond, USA). Two pairs of reviewers independently screened titles and abstracts and reviewed full texts for inclusion in the review. The reviewers met at each stage to reach consensus and resolve any discrepancies through discussion; a third reviewer was consulted when necessary. Backward citation tracking was conducted on all included studies, and the reference lists of systematic reviews, pilot studies, feasibility studies, and protocols were screened for articles missed by our search. A forward search of all included articles was also performed (articles citing the included articles were identified and screened).
Data Collection Process and Data Items
Data were extracted from the included articles by one reviewer and independently checked by a second reviewer. Discrepancies between reviewers were resolved through a consensus meeting, and a third reviewer was available to resolve any discrepancies that remained. Available supplementary files were consulted during data extraction for any relevant data not directly presented in the original study, and study authors were contacted for clarification where necessary. Data extracted on the study populations included author, year, country, setting (i.e., clinical or laboratory), sample size, patient characteristics (e.g., age, location of pain, duration of pain, outcome measure used for inclusion into the study), and healthy population characteristics (e.g., age, definition). Information pertaining to the characteristics of the study investigators (e.g., professional background, level of training, and/or years of experience) was extracted where possible. Relevant methodological information included data collection and processing methods (e.g., equipment, preparatory actions/instructions to participants, preparation of patients, unprocessed data collection, data processing and storage, and session information), description of the FRR calculation, measurement properties assessed, components repeated (for reliability), source(s) of variation (i.e., days, raters), and classification thresholds (if used). Data on the description of the FRR calculation, the FRR result (mean and variance), and the statistical analysis and results for each measurement property assessed in each relevant study were extracted, and the criteria for good measurement properties were applied [9]. Data displayed only in graphs were extracted by one reviewer and checked by a second using WebPlotDigitizer (Version 4.3, https://automeris.io/WebPlotDigitizer). Inter-rater reliability of the digitized data was assessed using an ICC2,1 calculated with the psych package [89] in R [90]. Final data tables were checked by a fourth person for errors.
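As a sketch of this digitization check, the ICC2,1 corresponds to the 'Single_random_raters' row returned by the psych package; the data below are hypothetical values invented for illustration, not taken from the review:

```r
library(psych)

# Hypothetical digitized values: each row is one graph-extracted data point,
# each column is one reviewer (rater).
digitized <- data.frame(
  reviewer1 = c(9.8, 5.2, 7.1, 3.4, 6.0),
  reviewer2 = c(9.6, 5.3, 7.0, 3.6, 6.1)
)

icc_results <- ICC(digitized)  # computes ICC(1,1), ICC(2,1), ICC(3,1), and average-rater forms
icc_results$results["Single_random_raters", ]  # ICC2,1: two-way random effects, single rater
```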
Risk of Bias Assessment and Quality of Reporting
Two pairs of reviewers independently assessed the quality of each included study using the COSMIN Risk of Bias (RoB) tools/checklists to assess reliability and measurement error, construct validity (Box 9b, discriminative validity), and responsiveness (Boxes 10b, 10c, 10d) [9]. The COSMIN RoB tool and checklist are modular, meaning that the boxes in the tool and checklist were completed based on the measurement properties evaluated in each study. If a study reported multiple outcomes of one measurement property (e.g., inter-rater and intra-rater reliability), the corresponding box in the COSMIN RoB tool/checklist was completed more than once. Each standard within a COSMIN RoB box was rated as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’ according to the criteria outlined by the tool. We followed the “worst score counts” principle, where the overall rating of the quality of each study was determined by taking the lowest rating of any of the standards in the COSMIN boxes used [9, 76] (illustrated below). Reviewers met for consensus, and a third reviewer helped to resolve discrepancies that could not be resolved through discussion. Quality of reporting was assessed through Part A of the COSMIN RoB tool, which is the same for all measurement properties and focuses on the reporting of parameters specific to the instructions to participants, data collection, processing, and analysis; it was rated with the same levels and criteria presented above. The results of Part A were not used in the judgement of RoB (Part B); however, we present a summary of the Part A results (percentage of each rating by domain) together with our RoB results, and traffic-light plots (results of individual domains by study) were prepared as Supplementary Fig. S2A-E using ROBVIS [91].
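The “worst score counts” principle reduces to taking the lowest rating across the standards in a box. A minimal sketch in R, with hypothetical standard ratings:

```r
# Hypothetical ratings for three standards within one COSMIN RoB box
ratings <- c(standard1 = "very good", standard2 = "adequate", standard3 = "doubtful")

# Ratings ordered from worst to best
levels_ord <- c("inadequate", "doubtful", "adequate", "very good")

# Overall box rating = lowest (worst) rating across standards
overall <- levels_ord[min(match(ratings, levels_ord))]
overall  # "doubtful"
```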
Effect Measures
Standardized mean difference and 95% confidence intervals were calculated where possible (and not reported by the study authors) for discriminative validity and responsiveness. A standard effect size was also calculated (change in the mean score divided by the standard deviation of the baseline) if responsiveness was not explicitly reported in a study but enough information was provided. There was no effect measure used for the synthesis of reliability and/or measurement error.
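Written out, these effect measures take the following general forms (the pooled-standard-deviation expression shown for the SMD is the conventional two-group form and is an assumption here, as the exact variant is not stated above):

```latex
\mathrm{SMD} = \frac{\bar{x}_{1} - \bar{x}_{2}}{SD_{\text{pooled}}},
\qquad
SD_{\text{pooled}} = \sqrt{\frac{(n_{1}-1)\,SD_{1}^{2} + (n_{2}-1)\,SD_{2}^{2}}{n_{1}+n_{2}-2}},
\qquad
\mathrm{ES} = \frac{\bar{x}_{\text{post}} - \bar{x}_{\text{pre}}}{SD_{\text{baseline}}}
```

Here groups 1 and 2 correspond to the clinical and healthy groups for discriminative validity, and ES is the standard effect size used when responsiveness was not explicitly reported.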
Synthesis Methods
Results of individual studies for each measurement property (e.g., the range of values, percentage of confirmed hypotheses) were summarized according to the COSMIN methodologies for systematic reviews. Specifically, the methodology for patient-reported outcomes [76] was followed for discriminative validity, and the methodology for clinician-reported outcome measurement instruments, performance-based outcome measurement instruments, and laboratory values [9] was followed for reliability, responsiveness, and measurement error. We checked study results against our review hypotheses for the assessment of construct validity and responsiveness.
Explanations for inconsistent results between studies for a measurement property (e.g., test-retest reliability) were explored, and subgroups of homogeneous studies were summarized (e.g., by study population or study quality). If no explanation for the inconsistency was found, we concluded that the results were inconsistent. The overall results were then compared to the criteria for good measurement properties to determine whether the FRR has sufficient (+), insufficient (-), or indeterminate (?) construct validity, reliability, measurement error, and/or responsiveness [9]. Results were reported by two reviewers for each measurement property. The reviewers met for consensus through discussion, and a third reviewer helped to resolve persistent discrepancies.
Where possible, results from studies on reliability and discriminative validity were statistically pooled in a random-effects meta-analysis using the meta package in R [92]. Statistical heterogeneity was assessed with the I² statistic. Only studies that reported confidence intervals (or from which we could calculate confidence intervals) and that used the same population, context, study design, FRR calculation, and statistical model/formula were quantitatively pooled and visualized with forest plots. The results of the remaining studies were presented in tables, and a synthesis without meta-analysis was conducted in adherence with the SWiM Reporting Guidelines [93].
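A minimal sketch of this pooling step with the meta package is shown below; the study labels and summary statistics are invented for illustration and do not come from the review:

```r
library(meta)

# Hypothetical group-level FRR summaries for three studies
frr_data <- data.frame(
  study        = c("Study A", "Study B", "Study C"),
  n.pain       = c(25, 40, 30),
  mean.pain    = c(5.1, 4.3, 6.2),
  sd.pain      = c(2.0, 1.8, 2.4),
  n.healthy    = c(25, 38, 32),
  mean.healthy = c(9.4, 8.1, 10.3),
  sd.healthy   = c(3.1, 2.6, 3.5)
)

m <- metacont(n.e = n.pain, mean.e = mean.pain, sd.e = sd.pain,
              n.c = n.healthy, mean.c = mean.healthy, sd.c = sd.healthy,
              studlab = study, data = frr_data,
              sm = "SMD", random = TRUE)  # random-effects pooling of standardized mean differences

summary(m)  # pooled SMD with 95% CI and the I² statistic
forest(m)   # forest plot of individual and pooled estimates
```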
Certainty Assessment: Grading the quality of cumulative evidence
Two reviewers independently assessed the overall quality of the evidence on the validity, reliability, measurement error, and responsiveness of the FRR using the modified GRADE approach outlined in the COSMIN methodology for systematic reviews of patient-reported outcome measures [9, 76]. The quality of the evidence was graded as high, moderate, low, or very low, reflecting our confidence in the measurement property estimates. Four factors were considered when evaluating the quality of the evidence: risk of bias (methodological quality of the studies); inconsistency (unexplained inconsistency of results across studies); imprecision (total sample size of the available studies); and indirectness (evidence from populations other than the population of interest). The evidence for each overall result was assumed to be of high quality initially and could be downgraded by one to three levels based on each of these four factors. The rules for downgrading were presented a priori in our protocol. The two reviewers met for consensus, and a third reviewer helped resolve any persistent discrepancies. The final grading of the quality of the evidence was recorded in a Summary of Findings Table together with the rules (table footnotes) and justifications for any decisions to downgrade.