The initial search retrieved 2,744 publications. After removing duplicates and limiting the results to ‘English’ and ‘human’, 2,019 remained. Of these 2,019 results, 1,997 studies were judged, on the basis of the title and abstract, as not relevant to the scope of this review. The flow diagram of study selection, including the reasons that studies were excluded, is presented in Figure 1.
The remaining 22 full-text articles were assessed for eligibility. Only five studies fulfilled our inclusion criteria. Seventeen studies were excluded; details of these studies and the reasons for exclusion are described in Additional file 2.
1.3.1 Study characteristics
Five studies were included in the review; two studies were conducted in Norway (19, 20), two in the Netherlands (21, 22) and one in the UK (23). These studies were published between 2002 and 2014. The characteristics of the included studies are summarised in Table 1. The total number of participants in the five studies was 1,023, with the sample size ranging from 134 to 349 people. Two of the studies only included adults in employment taking sick leave due to CLBP (20, 21). The mean age of the participants ranged from 41 to 45.5 years, with similar gender distributions. The mean duration of CLBP symptoms ranged between 5 and 8 years.
Table 1 Characteristics of the included studies
A wide range of outcome measures were used in the studies included in this review. Return-to-work (RTW) was the primary outcome in three studies (20, 21, 23). Four studies used two measures for disability (19, 22-24). Regarding the baseline measures of disease severity, the mean functional disability score on the Roland Morris Disability Score (RMDS) (25), which ranges from 0 to 24, was 14 (21, 22), whereas the mean score on the Oswestry Disability Index (ODI) (26), which ranges from 0 to 100, was 43 (19, 23). The mean pain intensity score from the three studies was 5.8 on a scale of 0 to 10 (19, 22, 24). For all of these measures, higher scores indicate greater disability or pain. The mean EQ-5D-3L (27) score ranged from 0.26 to 0.49 (19, 22, 23). The EQ-5D-3L is a generic health status measure whose scores generally range between 0 and 1, with higher scores indicating better health status. A further generic health status measure, the Short Form 6D (28), was also used in conjunction with the EQ-5D-3L in one study (19).
One study looked at costs from the patient and healthcare provider perspectives (23), while the remaining studies were conducted from the societal perspective (19-22). The length of follow-up was between 12 and 24 months.
The PMS consisted of combinations of cognitive behavioural therapy, physical therapy and workplace interventions. Two studies compared PMS with usual care (20, 21), two with surgery (19, 23) and one with two comparators, physical treatment and graded activity (a treatment that includes behavioural and cognitive methods to improve activity endurance) (22). The outcomes were return to work (20, 21), quality-adjusted life-years (QALYs) using EQ-5D-3L (19, 22, 23) and disability using RMDS (22) and ODI (19, 23).
1.3.2 Design and description of pain management services
The studies were delivered in a secondary care setting only (19, 20, 23), primary care only (22), or a combined setting (21). All studies clearly described the service in terms of treatment modalities and the staff involved in delivering these services. However, in some studies the duration of treatment varied between people within the study (19, 21). In another study, the intensity of treatment in terms of the total hours provided was not consistent between individuals in the study (23). Study participants were generally working-age adults; two studies focused on employees with CLBP (20, 21) and no studies included people above 65 years old.
The description of services provided in the included studies is summarised in Table 2.
Table 2 Service description in the included studies
1.3.3 The comparator
Two studies, which were conducted in the Netherlands and Norway, compared PMS with “standard care” (20, 21). In Lambeek et al., standard care consisted of family physician visits, in addition to occupational therapist consultations, provided in a primary care setting (21). In Skouen et al., standard care consisted of examination in the spine outpatient clinic by a physician, followed by referral back to the GP (20).
Two studies compared the effect of PMS with surgery (19, 23). The surgical procedures were total disc replacement (19) and spinal stabilisation (23). As surgical options are usually reserved for the severest cases, the patient populations in these studies are likely to be different from those where GP/medical management is offered as standard care (20, 22, 24). This is demonstrated by the higher pain intensity and lower quality of life at baseline in the surgical studies (19, 23). The mean baseline utility scores (EQ-5D) were 0.26 and 0.49 in studies assessing surgery (19) and non-surgical treatments (22) as comparators, respectively. Similarly, the baseline pain intensity score was 6.9 in the study assessing surgery (19), while in the study assessing standard care, the baseline score was 5.7 (21). Therefore, the outcomes achieved from referral to PMS in the surgical studies are not directly comparable with those in the other studies, owing to the higher baseline pain and disability levels.
1.3.4 Methodological design
All of the economic evaluations were conducted alongside RCTs. The risk of bias assessment of the included studies is described in Table 3.
Table 3 Risk of bias assessment according to the Cochrane Back Review Group (CBRG)
Two of the five studies were considered to have low risk of bias (21, 22). The major strengths in all of the included studies were that the methods of randomisation and allocation were clear. Moreover, intention-to-treat analysis was considered in the statistical analysis for missing data.
High risk of bias was identified in three studies (19, 20, 23). Two studies reported that the intervention group received extra visits to physiotherapists and other healthcare professionals compared with the standard care arm (22, 23). Hence, the intervention group might have had better outcomes, compared with standard care, because of these additional visits. One of the important aspects in assessing the quality of RCTs is the sample size and statistical power (29). Four out of five studies were sufficiently powered to detect a difference in functional disability using the ODI (19, 23), RMDS (22) or return to work (21).
Although all included studies incorporated RCTs, randomisation by itself does not ensure that the baseline characteristics of the study participants in the comparator groups are similar (29). Knowing this information is essential to demonstrate that the participant response to treatment is directly attributable to the intervention effect and not to other patient-related factors. Adjusting effect size for baseline characteristics should be performed using statistical methods, generally regression. In our review, only one study (22) performed regression to adjust effect size for baseline characteristics. Furthermore, two studies clearly reported that there was a significant difference between the study participants at baseline, for which they did not then adjust (19, 20).
The quality of reporting economic evaluations in terms of costs and outcomes is reported in Table 4, while the details of sensitivity analysis and the results are summarised in Table 5.
Table 4 Assessment of economic evaluations based on CHEERS criteria (inputs to economic evaluation: costs and outcomes)
Table 5 Economic evaluation based on CHEERS criteria (statistical analysis and results)
1.3.5 Healthcare resource use and cost
In this review, all of the studies included direct medical and indirect costs. Four studies included direct non-medical costs, such as complementary and alternative therapists, travel expenses and over-the-counter medicines (19, 21-23). Four studies took the societal perspective (19-22) and one study took the healthcare provider perspective (23). Although the last study stated that they conducted their evaluation from a healthcare provider perspective, indirect costs were calculated. Although Skouen et al. stated that their study took a societal perspective (20), direct non-medical costs were not collected.
There are two methods of assessing service costs: the top-down and the bottom-up (micro-costing) approaches (30). The top-down approach divides the total budget of a health intervention by the total number of people to give an “average” estimate of cost per patient, whereas the bottom-up approach uses patient-level resource use data to generate costs. The latter is the preferred method in economic evaluations because it accounts for variations in costs between study participants (30). In this review, three studies used the top-down approach (19, 20, 22), while two studies used the bottom-up method (21, 23). The method of collecting costs was implied, rather than clearly stated, in three studies (20, 22, 23). Only two studies clearly reported all resource use and their unit costs (21, 22). In two studies, some unit costs were missing (19, 23) and Skouen et al. did not report unit costs (20).
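The contrast between the two costing approaches can be illustrated with a minimal sketch. The resource items, unit costs and quantities below are hypothetical values chosen for illustration and are not taken from any of the included studies.

```python
def top_down_cost_per_patient(total_budget, n_patients):
    """Top-down: total service budget divided by the number of patients."""
    return total_budget / n_patients


def bottom_up_cost(resource_use, unit_costs):
    """Bottom-up: sum of (quantity used x unit cost) for one patient."""
    return sum(qty * unit_costs[item] for item, qty in resource_use.items())


# Hypothetical unit costs and patient-level resource use (illustration only).
unit_costs = {"physiotherapy_session": 40.0, "gp_visit": 30.0}
patients = [
    {"physiotherapy_session": 10, "gp_visit": 2},
    {"physiotherapy_session": 2, "gp_visit": 1},
]

# Bottom-up costing keeps the variation between patients.
per_patient = [bottom_up_cost(p, unit_costs) for p in patients]

# Top-down costing assigns every patient the same average figure.
average = top_down_cost_per_patient(sum(per_patient), len(patients))
```

In this example the two patients cost 460 and 110 under the bottom-up approach, while the top-down approach assigns both the same average of 285, which hides the between-patient variation.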
In this review, two studies used postal questionnaires to collect resource use data from people (21, 23), which might be subject to recall bias, especially if the recall period is more than three months (31). In Lambeek et al., the recall period was three months (21) while, in Rivero-Arias et al., the recall periods were six months and one year (23). Two studies used costing diaries to collect resource use data (19, 22). Costing diaries aim to collect data prospectively, which reduces the risk of recall bias. To minimise the risk of incomplete diaries, regular telephone reminders are recommended, but neither of the studies using diaries reported providing reminders (19, 22).
Productivity loss due to illness can be accounted for by absenteeism, the inability to attend work, and presenteeism, the reduced functionality in terms of quality and quantity while working (32, 33). Productivity loss can be measured either objectively, by using attendance records, or subjectively, using self-report by the employee (32). These methods have some limitations; objective measures might be inaccurate for assessing presenteeism, as they only record employee attendance, with no emphasis on productivity levels in terms of quality.
All studies assessed the effect of PMS on productivity loss. Absenteeism was the only work outcome evaluated. Four studies clearly reported their methods of collecting productivity loss (20-23). Although there is no consensus on the appropriate recall period, three months’ recall for absenteeism and one week for presenteeism is recommended (32). Two studies used “monthly” self-reported methods, utilising calendars (21) and diaries (22). In another study (23), the employment status was self-reported over relatively long periods of six months and one year, with insufficient information about the measurement method to assess quality. An objective measure was used in one study (20), which utilised the national health insurance registry to assess sickness absence. Johnsen et al. (19) did not report the method of data collection.
In order to value productivity loss among employees, the “human capital approach” and the “friction cost method” can be used (32). As the friction cost approach can produce lower estimates of cost, it is recommended to use both approaches to determine any methods-derived difference. Four studies used the human capital approach alone to value productivity loss (19-22). In the study by Rivero-Arias et al., productivity was assessed by calculating the total hours worked by each patient at baseline and at six, twelve and twenty-four months (23).
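The reason the friction cost method can yield lower estimates may be sketched as follows. This is a simplified illustration that assumes the friction cost method counts lost production only up to a fixed friction period (the time taken to replace the absent worker); the wage, absence duration and friction period are hypothetical, and further adjustments used in practice are omitted.

```python
def human_capital_cost(days_absent, daily_wage):
    """Human capital approach: every day of absence is valued at the wage."""
    return days_absent * daily_wage


def friction_cost(days_absent, daily_wage, friction_period_days):
    """Simplified friction cost method: absence is valued only up to the
    friction period, after which the worker is assumed to be replaced."""
    return min(days_absent, friction_period_days) * daily_wage


# Hypothetical long-term absence (illustration only).
hc = human_capital_cost(days_absent=120, daily_wage=150.0)
fc = friction_cost(days_absent=120, daily_wage=150.0, friction_period_days=80)
```

With these values the human capital estimate is 18,000 and the friction cost estimate is 12,000. For absences shorter than the friction period the two methods coincide, so the difference arises only for long-term absence.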
1.3.6 Statistical analysis
The statistical analysis of patient-level cost data needs to depart from standard approaches because cost data are generally “positively skewed”, as a small number of people usually require more healthcare resources (30, 34). Non-parametric tests rely on medians and distributional shape. Non-parametric bootstrapping with replacement is the preferred method to analyse cost data because it compares arithmetic means while avoiding distributional assumptions. Standard parametric tests can be used to analyse cost data only if the sample size is large, where skewness will not affect the validity of the analysis. Barber et al. (35) reported that, for sample sizes larger than 150 participants, the t-test is usually robust and valid, as parametric assumptions will generally hold. In this review, two studies used non-parametric bootstrapping to test the difference in cost (21, 22), whereas two studies with larger sample sizes, 349 (23) and 173 (19), used parametric t-tests for cost analysis. Skouen et al. did not analyse differences in costs (20).
Discounting is used to estimate the present value of future outcomes and costs, on the assumption that present outcomes and costs are considered more valuable than those in the future (30). Future costs and outcomes should be discounted where follow-up is longer than one year, using nationally preferred discount rates. Lack of discounting can lead to overestimating cost-effectiveness. In this review, three studies had interventions that continued for two years (19, 20, 23), two of which reported the discount rate according to the country-specific rates (20, 23).
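A minimal sketch of non-parametric bootstrapping for a difference in mean costs is given below. It uses a simple percentile interval, which is one of several possible interval methods and is not necessarily the one used in the included studies; the function name and default settings are illustrative assumptions.

```python
import random
from statistics import mean


def bootstrap_mean_cost_difference(costs_a, costs_b, n_resamples=2000, seed=0):
    """Return the observed difference in mean cost (a minus b) and a
    95% percentile bootstrap interval for that difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping the original arm size.
        resample_a = rng.choices(costs_a, k=len(costs_a))
        resample_b = rng.choices(costs_b, k=len(costs_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return mean(costs_a) - mean(costs_b), (lower, upper)


# Hypothetical, positively skewed cost data (illustration only).
intervention = [200, 250, 300, 320, 400, 450, 500, 5200]
control = [150, 180, 220, 260, 300, 340, 380, 3900]
difference, interval = bootstrap_mean_cost_difference(intervention, control)
```

The comparison is made on arithmetic means, which is the quantity of interest for budgeting, without assuming that costs are normally distributed.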
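The effect of discounting over a two-year horizon can be sketched as below. The sketch assumes that amounts arising in the first year are left undiscounted and that later years are discounted annually; the 3.5% rate is a hypothetical example, not the rate used in any included study.

```python
def present_value(amounts_by_year, discount_rate):
    """Present value of a stream of costs or outcomes.

    amounts_by_year[0] is the first year (undiscounted, by assumption);
    amounts_by_year[t] is discounted by (1 + rate) ** t.
    """
    return sum(
        amount / (1 + discount_rate) ** year
        for year, amount in enumerate(amounts_by_year)
    )


# Hypothetical costs of 1,000 in each of two years (illustration only).
undiscounted = present_value([1000.0, 1000.0], 0.0)
discounted = present_value([1000.0, 1000.0], 0.035)
```

Here the undiscounted total is 2,000, whereas the discounted total is lower because the second-year cost is divided by 1.035.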
1.3.7 Dealing with uncertainty
The incremental cost effectiveness ratio (ICER) is the main summary measure of an economic evaluation and is the difference in cost divided by the difference in effect (outcome) between two interventions (30). The base case analysis generates the ICER from the preferred outcome and cost data. Sensitivity analysis is used to test the sensitivity of the ICER to variation in cost and outcome parameters used in the base case analysis (30, 34). In one-way sensitivity analysis, one parameter is changed at a time to test the results. Multiple-way analysis changes multiple parameters at the same time. Although one-way sensitivity analysis is easy and understandable, it can underestimate total uncertainty in the ICERs (36).
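The ICER calculation and a one-way sensitivity analysis can be sketched as follows. All costs and effects are hypothetical, and the helper names are illustrative assumptions rather than part of any standard package.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost-effectiveness ratio: difference in cost divided by
    difference in effect between two interventions."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the effects are equal")
    return (cost_new - cost_old) / delta_effect


def one_way_sensitivity(base_params, name, values):
    """Vary a single parameter while holding the others at base-case values."""
    return {v: icer(**dict(base_params, **{name: v})) for v in values}


# Hypothetical base case: costs in currency units, effects in QALYs.
base = {"cost_new": 6000.0, "cost_old": 4000.0,
        "effect_new": 0.75, "effect_old": 0.5}
base_case = icer(**base)  # 2000 / 0.25 = 8000 per QALY gained

# One-way analysis: only the cost of the new intervention is varied.
varied = one_way_sensitivity(base, "cost_new", [5000.0, 7000.0])
```

Because only one parameter moves at a time, this kind of analysis does not capture the joint uncertainty that arises when several inputs vary together.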
Probabilistic sensitivity analysis (PSA) assumes that the values of input cost and outcome variables have a probability distribution. Probabilistic incremental economic analysis is usually carried out using bootstrapping to generate credibility intervals that provide a quantitative measure of uncertainty around ICER point estimates (“expected value”). For the graphical representation of ICERs, cost effectiveness planes are used to present the distribution of bootstrapped ICERs (30). Another common graphical presentation used in economic evaluation is the cost effectiveness acceptability curve (CEAC) (30). The CEAC is a technique for representing information on uncertainty in cost-effectiveness. A CEAC demonstrates the probability that an intervention is cost-effective compared with the substitute, given the observed data, for a range of maximum monetary thresholds that policy makers are willing to pay for a specific unit change in effect (37).
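A CEAC can be derived from bootstrapped pairs of incremental cost and incremental effect, as in the following minimal sketch. It assumes the common net monetary benefit formulation, in which a replicate counts as cost-effective at a given threshold when threshold × incremental effect minus incremental cost is positive. The replicates and thresholds are hypothetical.

```python
def ceac(delta_costs, delta_effects, thresholds):
    """For each willingness-to-pay threshold, return the proportion of
    bootstrap replicates with a positive net monetary benefit."""
    curve = []
    for wtp in thresholds:
        n_cost_effective = sum(
            1
            for d_cost, d_effect in zip(delta_costs, delta_effects)
            if wtp * d_effect - d_cost > 0
        )
        curve.append(n_cost_effective / len(delta_costs))
    return curve


# Four hypothetical bootstrap replicates (illustration only); a real
# analysis would typically use thousands.
delta_costs = [500, -200, 1000, 300]
delta_effects = [0.05, 0.01, 0.02, -0.01]
probabilities = ceac(delta_costs, delta_effects, [0, 20000, 60000])
```

In this example the probability of cost-effectiveness rises from 0.25 to 0.5 to 0.75 as the threshold increases. Plotting these probabilities against the thresholds gives the curve.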
In this review, all studies carried out one-way sensitivity analysis. Four studies generated ICERs using bootstrapping (19, 21-23) and three of them presented ICERs on cost effectiveness planes (19, 21, 22). CEACs were used in these studies to present the probability of cost effectiveness (19, 21-23).
1.3.8 Cost-effectiveness of PMS
The ICERs generated by the studies are summarised in Table 6. Only one study concluded that PMS dominates usual care (more effective and less costly) (21). Skouen et al. concluded that multidisciplinary services are cost-effective in men only (20). However, this conclusion needs to be interpreted with caution, given that top-down costs were used and no sensitivity or statistical analyses were reported. Two studies reported that PMS are cheaper and less effective than surgery (19, 23). Therefore, a trade-off between cost and effect needs to be considered. Smeets et al. compared PMS with active physical treatment (APT) and graded activity plus problem solving (GAP) (22). In this study, the PMS was dominated when compared with GAP, while it was cheaper and less effective when compared with APT.
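The terms used above (dominant, dominated, and a trade-off between cost and effect) correspond to the quadrants of the cost-effectiveness plane. A minimal sketch of that classification is given below; the labels and numerical inputs are illustrative and are not the results reported by the included studies.

```python
def classify(delta_cost, delta_effect):
    """Classify an intervention against its comparator from the signs of
    the incremental cost and the incremental effect."""
    if delta_cost < 0 and delta_effect > 0:
        return "dominant"  # less costly and more effective
    if delta_cost > 0 and delta_effect < 0:
        return "dominated"  # more costly and less effective
    if delta_cost < 0 and delta_effect < 0:
        return "cheaper and less effective (trade-off)"
    if delta_cost > 0 and delta_effect > 0:
        return "more costly and more effective (trade-off)"
    return "no difference in cost or effect on at least one axis"


# Hypothetical incremental values (illustration only).
examples = [classify(-500.0, 0.02), classify(800.0, -0.01), classify(-300.0, -0.03)]
```

Only in the two trade-off quadrants does the ICER need to be weighed against a willingness-to-pay threshold. In the dominant and dominated cases the decision follows directly from the signs.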
Table 6 Summary of incremental cost effectiveness analysis results