Validation of a Back Pain Severity Prediction Algorithm: A Cross-Sectional Study with Updated Healthcare Costs for Back Pain Patients Based on the Graded Chronic Pain Scale

Background: Treatment of chronic lower back pain (CLBP) should be stratified for best medical and economic outcome. To improve targeting of potential participants for exclusive therapy offers by payers, Freytag et al. developed an algorithm to identify back pain chronicity classes (CC) based on claims data. The aim of this study was the external validation of the algorithm, as this was previously lacking. Methods: Administrative claims data and self-reported patient information of 3,506 participants of a health management programme of a private health insurance in Germany were used to validate the algorithm. Sensitivity, specificity and Matthews correlation coefficient (MCC) were computed comparing the prediction with actual grades based on von Korff's Graded Chronic Pain Scale (GCPS). Secondary outcome was an updated view on direct health care costs (€) of back pain (BP) patients grouped by GCPS. Results: Results showed a fair correlation between predicted CC and actual GCPS grades. A total of 69.7 % of all cases were classified correctly. Sensitivity and specificity rates of 54.6 % and 76.4 % underlined the accuracy. Correlation between CC and GCPS with an MCC of 0.304 also indicated a fair relationship between prediction and observation. Cost data could be clearly grouped by GCPS: the higher the grade, the higher the costs and health care usage. Conclusions: This was the first study to compare the predicted BP severity using claims data with actual BP severity by GCPS. Based on the results, the usage of the CC as a single tool to determine who receives treatment of CLBP cannot be recommended. The CC is a good tool to segment candidates for BP-specific types of intervention. However, it cannot replace a medical screening at the beginning of an intervention as the rate of false negatives is too high. Trial registration: which

improvement will potentially result in an economic advantage. Targeted client selection would make it possible to implement and offer interventions that are dominant in terms of health economics, i.e., an improvement in health status on the one hand and a reduction in direct medical costs on the other [11,33,34].
Selecting the target group of chronic clients from purely administrative data is, however, not straightforward, as there is no classification by chronicity in the ICD-10 system. No distinction is made between acute, subacute and chronic pain. Most BP is simply classified as 'Low back pain' (an M54.5 diagnosis) [35].
To address this deficit, Freytag et al. developed an algorithm based on routine data from an SHI in Germany with 5.2 million beneficiaries to classify BP patients before the invitation. They performed a secondary data analysis and used the data, which was originally intended for medical settlements, to build a classification tool [36].
As a result, they were able to divide patients into three chronicity classes (CC): 1: without evidence of chronicity, 2: evidence of risk of chronicity, 3: evidence of chronicity. However, they did not validate their assessment with actual patient feedback or patient-reported outcome, so that external validity is missing to a certain degree [37]. This issue led to the research question of the accuracy of the prediction model.
Does the application classify patients correctly? Can it predict the actual chronicity according to von Korff's GCPS from administrative data? And is a targeted client selection and thus more effective and efficient care possible with this tool?
For the investigation of this research question, routine data from the private health insurance (PHI) provider Generali Deutschland Krankenversicherung AG (Generali Germany Health Insurance, formerly known as "Central Krankenversicherung") was analysed. Since 2014, Generali has been running a proactively offered, multidisciplinary biopsychosocial rehabilitation (MBR) intervention for clients with CLBP [24,38]. GCPS at enrolment, as well as the classification in CC according to Freytag et al., was available for participants.
The primary objective of this study was to assess the criterion validity of the predictive model. In doing so, medical and economic differences between the two models (GCPS vs. CC) were identified. This led to the secondary objective of an updated representation of the costs of care for CLBP patients in Germany. With that, decision-makers could obtain a meaningful estimate of the annual expected costs of the treatment of CLBP patients in a PHI setting in Germany. Thus, they are able to prioritise actions and treatment decisions in a more informed way to determine how to best use limited resources.

Study Design
This was a cross-sectional study. Participants of the study were insured members of the Generali Deutschland Krankenversicherung AG, who signed up for long-term MBR against CLBP between July 2014 and March 2021.
All MBR participants underwent a digital assessment at the beginning of the intervention; among other things, the information to calculate the GCPS was collected here. There were two ways to sign up for the programme. The standard way (I) was an invitation sent out by the insurance company based on the specific disease history as stated below. The alternative path (II) was based on the clients' initiative, where they directly requested participation in the health programme (further referred to as self-selected). For the invited insurance holders (I), the CC was available at the date of invitation. It was calculated on the basis of the medical bills submitted in the 12 months before the invitation. The CC of insured persons who enrolled within three months of invitation was compared with the GCPS at the point of enrolment. For the secondary outcome of the cost analysis, participants who proactively requested participation (II) were additionally taken into account. Data management and statistical analyses were carried out using the software R [39] and the listed packages [40][41][42][43].

Participants
Invited participants (I) were selected according to CC [37]. For the calculation of the CC, the following routinely collected data from the 12 months prior to invitation were taken into account:
Number of BP-specific ICD-10 diagnoses (M40-M54)
Incapacity to work due to BP and its duration
The three chronicity classes were assigned as:
1) Without evidence of chronicity: Two M40 to M54 diagnoses and not CC group 2 or 3.
2) Evidence of risk of chronicity: Two M40 to M54 diagnoses combined with less than two opioid prescriptions and either a) incapacity to work due to an M40 to M54 diagnosis of less than six weeks or b) at least two F diagnoses.
3) Evidence of chronicity: Two M40 to M54 diagnoses combined with either a) incapacity to work of at least six weeks or b) at least two opioid prescriptions within six months.
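The three assignment rules can be sketched in code. This is only an illustration under stated assumptions, not the implementation of Freytag et al. or of the insurer: the class and field names are hypothetical, "two diagnoses" is read as "at least two", six weeks are counted as 42 days, "incapacity of less than six weeks" is assumed to require at least one day of incapacity, and a single opioid count is used for both rules.

```python
from dataclasses import dataclass
from typing import Optional

SIX_WEEKS_DAYS = 42  # assumption: six weeks counted as 42 calendar days


@dataclass
class ClaimsSummary:
    """Hypothetical per-person summary of claims before the invitation."""
    bp_diagnoses: int          # number of ICD-10 M40-M54 diagnoses
    incapacity_days: int       # days of incapacity to work due to M40-M54
    opioid_prescriptions: int  # opioid prescriptions (simplified to one count)
    f_diagnoses: int           # number of ICD-10 F diagnoses


def chronicity_class(c: ClaimsSummary) -> Optional[int]:
    """Return CC 1, 2 or 3, or None if fewer than two BP diagnoses exist."""
    if c.bp_diagnoses < 2:
        return None  # minimum requirement for any class is not met
    # Class 3: long incapacity or repeated opioid prescriptions
    if c.incapacity_days >= SIX_WEEKS_DAYS or c.opioid_prescriptions >= 2:
        return 3
    # Class 2: fewer than two opioid prescriptions plus short incapacity
    # or at least two F diagnoses
    short_incapacity = 0 < c.incapacity_days < SIX_WEEKS_DAYS
    if c.opioid_prescriptions < 2 and (short_incapacity or c.f_diagnoses >= 2):
        return 2
    # Class 1: two BP diagnoses, but neither class 2 nor class 3
    return 1
```

Checking class 3 first mirrors the wording of class 1 ("not CC group 2 or 3"): a person falls into the lowest class only after the stricter rules have been ruled out.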
As the insurance company subdivided participants by BP severity in the digital assessment and assigned a suitable programme variant, persons with all three CC levels were invited. Therefore, the minimum requirement to be invited was the presence of two ICD-10 diagnoses in the range of M40 to M54 within the last 12 months. Excluded from invitation were individuals with any condition that precluded participation in an intensive physical intervention (e.g., stroke or need for care). The complete exclusion list can be consulted elsewhere [24].
The aim of the study was to validate the CC algorithm. To achieve this, the GCPS was used [10] and compared with the CC. Participants were questioned about the duration, intensity and impairment due to their BP within the six months prior to the date of enrolment. Depending on the answers to the seven questions, every participant was assigned a GCPS grade. Grades ranged from:
Grade I: low disability-low intensity
Grade II: low disability-high intensity
Grade III: high disability-moderately limiting
Grade IV: high disability-severely limiting

Primary Outcome
The primary outcome was the criterion validity of the CC, i.e. an evaluation of the accuracy of the prediction model of BP chronicity classes using claims data as developed by Freytag and colleagues. The GCPS was used as a reference value for the classification of chronicity. The predicted chronicity class (CC) was compared with the actual chronicity grade (GCPS) for all invited participants of the MBR. To compare the four-level GCPS with the three-level CC, the GCPS needed to be reduced by one level. GCPS grades I and II were combined and compared with CC 1 ("without evidence of chronicity"). GCPS grade III was compared with CC 2 ("evidence of risk of chronicity") and GCPS grade IV with CC 3 ("evidence of chronicity").
In a first step, the correlation between CC and the newly categorised GCPS was assessed using Spearman's rho rank correlation coefficient with 95 % confidence intervals (CI). Strength of correlation was interpreted as weak (rho < 0.1), modest (rho 0.1-0.3), moderate (rho 0.31-0.5), strong (rho 0.51-0.8) or very strong (rho > 0.8) [44]. The second step included the assessment of the agreement between CC and the categorised GCPS by using Cohen's weighted Kappa. The agreement was interpreted as poor (Kappa < 0.2), fair (Kappa 0.21-0.4), moderate (Kappa 0.41-0.6), substantial (Kappa 0.61-0.8) or almost perfect (Kappa 0.81-1) [45]. Furthermore, the GCPS and the CC were both dichotomised into severe and non-severe BP cases. Grades I and II were previously defined as functional chronic pain, and Grades III and IV as non-functional chronic pain [10]. To allow easier comparability and interpretation, GCPS grades I and II, which were already combined, were relabelled as non-severe and grades III and IV as severe cases. Likewise, CC 1 and 2 were labelled as non-severe and CC 3 as severe BP cases. The results were presented in a 2x2 confusion matrix.
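The regrouping, the dichotomisation and the interpretation bands described above can be written down compactly. The following is a minimal sketch with hypothetical names; GCPS grades I to IV are encoded as 1 to 4, and values that fall into the small gaps between the published bands (e.g. between 0.3 and 0.31) are assigned to the lower band, which is an assumption.

```python
# GCPS grade (1-4) mapped to the three-level scale used for comparison with CC
GCPS_TO_THREE_LEVEL = {1: 1, 2: 1, 3: 2, 4: 3}


def gcps_is_severe(grade: int) -> bool:
    """GCPS grades III and IV count as severe BP."""
    return grade >= 3


def cc_is_severe(cc: int) -> bool:
    """Only CC 3 counts as severe BP; CC 1 and 2 are non-severe."""
    return cc == 3


def interpret_rho(rho: float) -> str:
    """Interpretation bands for Spearman's rho as listed in the text."""
    if rho < 0.1:
        return "weak"
    if rho <= 0.3:
        return "modest"
    if rho <= 0.5:
        return "moderate"
    if rho <= 0.8:
        return "strong"
    return "very strong"


def interpret_kappa(kappa: float) -> str:
    """Interpretation bands for Cohen's weighted Kappa as listed in the text."""
    if kappa <= 0.2:
        return "poor"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"
```

Keeping the two dichotomisation rules as separate functions makes the asymmetry explicit: GCPS grade III is treated as severe, whereas its three-level counterpart CC 2 is treated as non-severe.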
The confusion matrix matched the actual chronicity class of each MBR participant with its predicted class (severe BP or non-severe BP). As a result, every sample belonged to one of the following four classes:
True positive (TP): actual severe BP cases that were correctly predicted as severe
True negative (TN): actual non-severe BP cases that were correctly predicted as non-severe
False positive (FP): actual non-severe BP cases that were wrongly predicted as severe
False negative (FN): actual severe BP cases that were wrongly predicted as non-severe
Sensitivity (i.e. the proportion of participants with severe BP who were correctly classified by the model), specificity (i.e. the proportion of participants without severe BP correctly classified as not having severe BP by the model) and Matthews correlation coefficient (MCC) [46] (i.e. the correlation between actual and predicted severity grades) were estimated to evaluate the model's performance. MCC was chosen instead of accuracy and F1 score as it is more reliable, taking into account all four confusion matrix categories [47]. As MCC is a discrete case of the Pearson correlation coefficient, the strength of correlation was interpreted equally, meaning: very weak relation (

Participant characteristics potentially associated with the grade of BP chronicity
Demographic information of the participants (e.g. age, sex), overall health (e.g. weighted Charlson Comorbidity Index Score (CCI) [49], self-assessed overall health status using the first item of SF-12), possible psychological comorbidities (PHQ-4 score and its subscales [50], ICD-10 F-diagnoses) and direct effects of BP (ICD-10 M-diagnoses, everyday impairment, average pain level, number of days restricted in everyday activities within the last six months) were selected. These variables were descriptively compared across CC and GCPS grades.
Not every participant was enrolled in a daily sickness benefit insurance in addition to their regular PHI policy at this provider. It is likely that most participants were insured against sick leave at another provider. However, no information was available on the insurance status. Therefore, in contrast to the SHI system, there was no general incentive for the insured to report incapacity to work to Generali. Since the days of incapacity to work played a major role in the calculation of the CC, the daily sickness allowance insurance status of the insured was regarded as a possible confounder and analysed separately in a sensitivity analysis. However, it was assumed that those insured against sick leave at this provider also reported absence.

Secondary outcome
The secondary outcome was an updated representation of the costs of care for CLBP in the German PHI setting. Overall health costs and BP-specific inpatient, as well as outpatient costs in the last 12 months before enrolment, were considered. Costs were descriptively compared across CC and GCPS grades.
Included were costs from the following areas: General hospital services, GP and specialist care, medicines, remedies, alternative practitioners (e.g., chiropractor), aids and private medical treatment.
Additional elective services (e.g., one or two-bedroom supplement) and the entire costs of dental care treatment were excluded.
In a PHI setting, the reimbursement procedure follows the principle of refund of expenses, i.e., clients pay the health care bill in advance, submit the bill afterwards to their insurance company and receive the reimbursement according to the insurance tariff they have concluded. Reimbursement of the health care bills depends on the respective tariff. The study population consisted of fully insured participants with different levels of deductible as well as policyholders eligible for governmental aid. Therefore, the cost component was defined as the total bill amount instead of the refund amount paid. Thus, the actual costs were compared with each other without taking into account which payer (health insurance, subsidy or individual supplementary) reimbursed the costs. As costs were presented for a period of 12 months, no discounting was executed. All costs were converted to 2020 Euros (€) using consumer price indices.
As healthcare costs tend to be highly skewed and heavily right-tailed [51], a truncated mean was calculated in addition to the average costs per category. For this, all upper outliers (high-cost cases) were identified using Tukey's method with 1.5 × interquartile range (IQR) [52]. Low-cost cases were defined as participants who did not submit an invoice from the presented area in the last 12 months before enrolment.
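A conversion to 2020 Euros with consumer price indices usually scales each amount by the ratio of the 2020 index to the index of the billing year. The sketch below assumes this standard approach; the function name and the index values in the usage line are hypothetical and are not taken from the study or from official statistics.

```python
def to_2020_euros(amount: float, cpi_billing_year: float, cpi_2020: float) -> float:
    """Scale a bill amount by the ratio of the 2020 index to the billing-year index."""
    if cpi_billing_year <= 0:
        raise ValueError("consumer price index must be positive")
    return amount * cpi_2020 / cpi_billing_year


# Hypothetical index values for illustration only (not official statistics)
example = to_2020_euros(1000.0, cpi_billing_year=95.0, cpi_2020=100.0)
print(round(example, 2))
```

Because only the ratio of the two indices enters the calculation, the base year of the index series does not matter as long as both values come from the same series.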
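The truncated mean can be illustrated as follows. This is a minimal sketch, not the authors' R code: the function name is hypothetical, low-cost cases are represented as a cost of zero, the upper fence is assumed to be the third quartile plus 1.5 × IQR, and the quartiles are assumed to be computed after the low-cost cases have been removed.

```python
import statistics
from typing import Sequence


def truncated_mean(costs: Sequence[float]) -> float:
    """Mean cost after dropping zero-cost cases and Tukey upper outliers."""
    # Low-cost cases: no invoice submitted, represented here as a cost of zero
    positive = [c for c in costs if c > 0]
    # Quartiles of the remaining cases (assumption: computed without zeros)
    q1, _, q3 = statistics.quantiles(positive, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    # High-cost cases: everything above the upper Tukey fence
    kept = [c for c in positive if c <= upper_fence]
    return statistics.mean(kept)
```

Only the upper fence is applied because cost data cannot be negative and the skew is to the right; the lower end is handled separately through the low-cost definition.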

Data Source/Measurement
For this study, two data sources were used. The information to calculate the CC and its connected variables (e.g., diagnoses, sick-days, opioid use, CCI), as well as all cost data, were obtained from claims data of the insurance. The information to calculate the GCPS was obtained through participants' responses in the standardised, self-administered digital assessment during enrolment. Participants were questioned about their current health status to a) assign the best type of intervention and b) control for individual developments with follow-up measurements.

Bias
The routinely collected data did not yield a potential source of bias. For the data collected within the standardised, self-administered questionnaire, there were two potential sources of bias: a) recall bias and b) demand characteristics. A recall bias was possible since the GCPS was calculated using the development of BP within the last six months. However, the GCPS is widely used to assess CLBP [8,11,53,54] as well as other types of anatomically defined pain conditions [55,56]. It has been validated several times [57,58] and is an internationally recognised tool in self-administered pain assessment [59], so that a possible effect of recall bias was considered negligible.
A second possible source of bias was demand characteristics [60], i.e. that respondents answer the questionnaire tactically in order to receive the most comprehensive care possible. However, participants were asked to answer truthfully in order to receive an intervention tailored to their individual needs. Since all participants were pain patients who volunteered for the intervention, which is always free of charge, it could be assumed that their answers were rather accurate. Moreover, the specific steering logic was not mentioned in writing. Therefore, a bias due to demand characteristics also seemed unlikely.

Study Size
Different samples were required to answer the two research questions. The selection criteria are shown in Table 1. The study population consisted of 3,629 participants for whom the GCPS grade at enrolment was available. As the data was provided by a PHI, there were participants with an individual yearly threshold of costs before payment of expenses (deductible). Insurees with a fixed deductible usually only hand in their invoices of a year if they exceed that amount. To reduce the potential bias introduced by the tariff, 122 participants who did not hand in any invoice in the 12 months before enrolment (yearly average of invoices = 27) were excluded.
To answer the first research question of the criterion validity, all participants of the MBR who enrolled in the standard way (n = 2,722) were taken into account. The time between the initial invitation and enrolment was calculated. As CC was only available at the date of the invitation, participants who took longer than 90 days to register were excluded (n = 326) in order to rule out the temporal effect and thus potential changes of the CC. The final group size for the first research question was 2,396.
For the estimation of the cost of CLBP, participants who signed up on their own initiative were additionally considered (n = 872). The size of the group used to answer the second research question increased to 3,506.

Participants
Characteristics of the study population are presented in Table 2. Participants' mean age was 54.74 years; 65.9 % were male and 34.1 % female. The mean CCI was 0.8, indicating a population in a healthy state. This was confirmed at the assessment. More than two-thirds self-reported an overall health status of "moderate" or better. The average PHQ-4 sum score was 2.82. The PHQ-4 subscales averaged 1.56 in the depression and 1.26 in the anxiety part. The average pain intensity was 4.48 and the average disability was 3.98, both assessed using the GCPS questionnaire. Most participants reported being disabled due to their BP for less than 14 days within the last six months. Nearly half (46.1 %) of the study population was insured against sick leave at this provider. In total, 8.7 % handed in a claim for sick leave due to BP.
A clear division of the characteristics by the GCPS could be observed. There was a clear negative correlation between GCPS and health status. The higher the grade, the lower the health indicators. This was also true for variables that were not used to calculate the GCPS (i.e., PHQ-4, overall health, CCI and sick leave due to BP).
The majority of participants were grouped in GCPS Grades I (42.8 %) or II (24.7 %). Grades III (17.7 %) and IV (14.8 %) made up one-third of the participants. For comparison, characteristics were also broken down by CC (Appendix 1). It could be observed that participants in CC 2 (evidence of risk of chronicity) reported on average a higher PHQ-4 than those in CC 1, CC 3 or self-selected participants.
Further, the GCPS and the CC were dichotomised into severe and non-severe BP cases. Table 3 shows the confusion matrix, which matches the assigned CC of each MBR participant with its predicted class.
Table note: [37] based on administrative claims data. TP = true positive (actual severe BP cases correctly predicted as severe); TN = true negative (actual non-severe BP cases correctly predicted as non-severe); FP = false positive (actual non-severe BP cases wrongly predicted as severe); FN = false negative (actual severe BP cases wrongly predicted as non-severe).
Results show a sensitivity of 54.6 % and a specificity of 76.4 %. In total, 69.7 % were correctly predicted.
With an MCC of 0.304, the strength of correlation was classified as fair. This was in agreement with Cohen's weighted Kappa of 0.304 (95 % CI: 0.260-0.348), which also indicated a fair agreement between CC and GCPS.
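For reference, sensitivity, specificity, the share of correct predictions and the MCC all follow from the four cells of the 2x2 confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the usage line are invented for illustration and are not the counts of Table 3.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard performance metrics from the four confusion matrix cells."""
    sensitivity = tp / (tp + fn)  # share of actual severe cases found
    specificity = tn / (tn + fp)  # share of actual non-severe cases found
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is set to 0 when a row or column of the matrix is empty
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Invented counts for illustration only (not the study's Table 3)
print(confusion_metrics(tp=40, tn=30, fp=20, fn=10))
```

Unlike accuracy, the MCC uses all four cells in both numerator and denominator, so it cannot be inflated by a large non-severe group alone.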

Sensitivity Analysis
The capacity to work played a pivotal role in the calculation of CC. Only 46.1 % of the study population were insured against sick leave at this provider, which meant that information about working ability was available only for that subpopulation. To exclude the possibility of confounding due to the insurance status, a sensitivity analysis was run including only participants who were insured against sick leave (n = 1,114). In a similar fashion as shown above, a 3x3 confusion matrix (appendix) was created and afterwards dichotomised. Spearman's rho (0.405 (0.355-0.453, p < 0.001)) and Cohen's weighted Kappa (0.358 (CI 0.298-0.418)) increased. The 2x2 confusion matrix (Table 4) shows that sensitivity rose to 63.9 %, whereas specificity decreased slightly to 73.4 %. Overall, 70.7 % of all cases were correctly predicted. An MCC of 0.348 and a weighted Kappa of 0.341 also indicated a fair relationship between prediction and observation of BP severity for the subgroup.
Table note: [37] based on administrative claims data. TP = true positive (actual severe BP cases correctly predicted as severe); TN = true negative (actual non-severe BP cases correctly predicted as non-severe); FP = false positive (actual non-severe BP cases wrongly predicted as severe); FN = false negative (actual severe BP cases wrongly predicted as non-severe).

Healthcare Costs of chronic BP patients
The secondary outcome was an updated representation of the costs of care for CLBP in the German PHI setting. Overall health costs and BP-specific inpatient, as well as outpatient costs in the last 12 months before enrolment, were presented. Costs were descriptively compared across CC and GCPS grades of participants who either were invited by the insurance or took part upon self-selection. Participants whose participation was initiated upon self-selection (n = 803) could be grouped between CC 2 and CC 3.
Table note: Truncated mean: exclusion of high-cost cases and cases who did not submit a BP invoice in the last 12 months before enrolment.
[1] To be selected by CC, two ICD-10 M-diagnoses within the last 12 months were the minimum requirement. In the presented table, the time difference between initial invitation by insurance and enrolment was not taken into consideration. 624 participants had two BP-specific diagnoses (and connected invoices) within the last 12 months of invitation, but not within the last 12 months before enrolment. Additionally, 227 participants were self-selected and also did not hand in a BP-specific invoice in the last 12 months before enrolment.

Summary
This study used administrative claims data and self-reported patient information to validate a claims-based algorithm (CC) identifying the severity of CLBP. A functioning algorithm would enable payers to select and invite participants for targeted, expensive treatment programmes without the need for additional screening. Results showed a fair correlation between predicted CC and actual GCPS grades. A total of 69.7 % of all cases were classified correctly. Sensitivity and specificity rates of 54.6 % and 76.4 % underlined the accuracy of the prediction. A sensitivity analysis with participants insured against sick leave showed similar results. The correlation between CC and GCPS with an MCC of 0.348 and a weighted Kappa of 0.341 also indicated a fair relationship between prediction and observation of BP severity for the subgroup.
Cost data could be clearly grouped by GCPS grades. It could be stated that the higher the grade, the higher the cost and health care usage. Overall, the average total direct health cost was €7,279.78.
Participants with GCPS grade I had mean costs of €5,967.94, whereas participants with GCPS grade IV had mean costs of €10,619.29. The average BP-specific cost for the overall group was €1,082.13.
Participants with GCPS grade IV (€2,312.12) had more than three times higher BP-specific costs than GCPS Grade I participants (€650.53).
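The cost ratios between GCPS grades IV and I can be reproduced from the mean costs quoted in this summary. The short calculation below is only a worked check; the variable names are hypothetical.

```python
# Mean costs in Euros as quoted in the summary above
mean_total_cost = {"I": 5967.94, "IV": 10619.29}
mean_bp_cost = {"I": 650.53, "IV": 2312.12}

# Ratio of grade IV to grade I for total and BP-specific direct costs
total_ratio = mean_total_cost["IV"] / mean_total_cost["I"]
bp_ratio = mean_bp_cost["IV"] / mean_bp_cost["I"]

print(f"total direct costs, grade IV vs. I: {total_ratio:.2f}")
print(f"BP-specific costs, grade IV vs. I: {bp_ratio:.2f}")
```

The gradient is markedly steeper for BP-specific costs than for total direct costs, which is what the statement "more than three times higher" refers to.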

Limitations
The study had two limitations: the data used (I) and the outcome (II).
The administrative data used had the primary purpose of settling claims and was provided by a PHI, which in general has the freedom to implement and offer health programmes without any restrictions due to national regulations. However, PHI data do not include all health-related billing data. Tariff-related peculiarities (e.g., deductibles and co-payments) mean that in practice not all medical invoices are submitted [61]. By excluding participants with no invoices submitted in the last 12 months, possible tariff biases were reduced. Still, only a minority of participants held a daily sickness benefits insurance with this provider, so an underreporting of sick leave was likely. The sensitivity analysis, which focused on participants who were insured against sick leave, showed an improvement in the strength of correlation. However, it could be the case that the algorithm is better suited for a sickness fund where complete information about sick leave for all participants would be available (e.g., the German SHI). In further research, participants should be questioned about sick leave directly to cross-validate claims data.
The second limitation was the reduction of classes in the outcome of criterion validity. The original GCPS had four grades, whereas CC had only three classes. Therefore, GCPS grades I and II were combined in order to reduce the number to three grades. Taking cost information and demographic characteristics into consideration, it could be stated that this was a legitimate operation, as characteristics between grades I and II differed only slightly. A dichotomisation of the GCPS into severe and non-severe BP cases was also unproblematic, as this is inherently contained in the grading, which separates grades I/II and III/IV by disability into functional and non-functional chronic pain. A dichotomisation of the three CC classes could prove to be difficult, as CC 2 was between chronic and non-chronic. However, confusion matrices were run for 3x3 and 2x2 comparisons and outcomes differed only marginally. The strength of the relationship remained in the range of a fair correlation, so that a reduction to two categories did not influence the overall results.

Interpretation
To our knowledge, this was the first study to compare the predicted BP severity by claims data with the actual BP severity by GCPS. In other types of diseases, predictive models from administrative data were often used to estimate disease severity. severe consequences based on GCPS would not be selected by the model. In addition, the insured persons' preferences and personal life circumstances should be considered to a reasonable extent when selecting the suitable programme component. When creating further prediction algorithms, future research should include a criterion validity assessment in the validation study. Only then is it possible to actually determine whether the identified algorithm also performs well in practice.
It was also shown in this study that participants with high GCPS grades are not only suffering the most but also causing the highest costs. In previous studies it was clearly established that especially participants with high GCPS grades profit from multimodal, long-term interventions [11,24,53]. The payer therefore has an interest in focusing on this subgroup to reach a cost-effective intervention. The second outcome of the study was an updated view of direct healthcare costs of patients with CLBP.
For the German insurance market in which the study was conducted, two studies existed that depicted healthcare costs of CLBP patients by chronicity grades [11,12]. Wenig et al. [12] used a postal survey and asked SHI-insured participants with BP (n = 5,650) about their healthcare usage in the previous 3 months.
From this, they estimated and extrapolated BP-specific direct (46 %) and indirect costs (54 %) for 12 months. They estimated that a patient with CLBP would on average create direct costs of €612.50. They found that the most influential predictor of high costs was a high GCPS grade. Participants with GCPS grade IV (€7,115.7) were said to have more than 17 times the total costs (direct and indirect BP-specific) of participants with grade I (€414.4).
In this study, which only focused on direct health costs, we also saw a sharp increase in BP-specific costs based on GCPS grade. We used actual claims data from a 12-month period before the enrolment date in a health programme against BP. Participants with grade IV (€2,321.12) had about 3.5 times higher BP-specific direct healthcare costs than participants with grade I (€650.53). With an average of €1,082.13 in BP-specific total costs, the expenses in the PHI were higher than identified by Wenig et al. The difference between grades I and IV was, however, not as high as reported by Wenig and colleagues.
Müller et al. presented a study in 2019 in which they compared the therapeutic and economic effects of a multimodal back exercise programme. They also presented direct medical costs for a study population of 2,324 participants using routine data supplied by an SHI. Participants with GCPS grade IV (€5,310) had 2.2 times higher overall direct healthcare costs than participants with grade I (€2,391) over a time period of two years.
In the presented study we identified 1.8 times higher overall direct health costs (€10,619 vs. €5,968) in the last 12 months. Total direct costs of the privately insured were, however, considerably higher than costs in the SHI system. This can be explained by the fact that reimbursement schemes and provider spending in the outpatient setting tend to be two to three times higher in the PHI setting [70,71].
Nevertheless, the presented study gives a good overview and updated cost information on direct overall and BP-specific costs by GCPS grade. As the relationship of the cost differences between different GCPS grades was in accordance with previous studies, it can be assumed that the figures presented are a good representation of the costs to be expected in the PHI system. Decision-makers should use these findings to match effective interventions with limited funds. Participants with GCPS grade IV are suffering greatly and produce the highest costs through the current treatment of their pain. As care is often not guideline-based [14,15,22,23], a great deal of attention should be paid to targeting this cohort. Involving them in appropriate and effective treatment programmes is essential.

Generalisability
The presented study had three strengths that played a role in the generalisability of the results: I: study sample, II: target group and III: availability of the data. Due to the long time period of seven years in which data was collected, the study reached a large size with 3,506 participants. Furthermore, only data from participants who felt their BP was so pressing that they were willing to participate in a long-term intervention was used. The monetary key figures shown thus reflect expected costs of people participating in an intervention against their BP.
One other advantage was the availability of the data. In order to participate, it was mandatory to carry out the digital assessment, so that a lot of information about BP and its consequences could be obtained. In addition, the cost data routinely collected from the insurance company could be used purposefully.
Due to the setting in a PHI, the generalisability of the presented cost data is nonetheless limited. All participants were insured with the same PHI company. Even though recruitment took place nationwide, outpatient cost data and therefore also the overall cost was probably still higher than could be expected in a comparable study focused on the SHI. This was due to the systemic differences between SHI and PHI and cannot be remedied. With the knowledge of the two to three times higher costs in the outpatient sector, trends could albeit also be gained for the SHI. The trend in the spending between different GCPS grades was however highly comparable between PHI and SHI. Moreover, inpatient spending was highly comparable between both systems, as additional private elective bene ts, such as supplements for treatment by a chief physician or accommodation in a single room, were excluded from the cost consideration. Besides, the inpatient reimbursement scheme of diagnosis-related groups (DRG) is identical in both systems.
It is also possible that in an SHI setting, with complete data on sick leave coverage, more than 70 % of the participants would be correctly categorised by the CC. However, the sensitivity analyses in the subgroup of participants insured against sick leave showed that the predictive ability improved only slightly. The strength of correlation remained in the range of a fair agreement between CC and GCPS, so that the influence of the system can be regarded as low here. Overall, the strengths of the study outweigh the inherent disadvantages of PHI claims data, so that the results can be interpreted and transferred to other settings. If the prediction algorithm of Freytag et al. is to be used in other settings, care should be taken to ensure that information on diagnoses and medications, as well as on work absences due to BP and their duration, is available and reliable.

Conclusion
This was the first study to compare predicted BP severity based on claims data with actual BP severity measured by the GCPS. The comparison yielded a sensitivity of 54.6 % and a specificity of 76.4 %. In total, 69.7 % of cases were correctly predicted. With an MCC of 0.304, the strength of correlation was classified as fair. Based on the findings of this study, using the CC as the sole tool to determine who is treated for CLBP, and with which intervention, is not recommended. Healthcare spending can clearly be separated by GCPS grade. A predictive algorithm that could abolish the need for medical screening on site would need to reach a very high sensitivity to identify the patients who would profit most from a targeted intervention (GCPS grades III and IV). The CC by Freytag et al. is a good tool to segment candidates for BP-specific interventions, especially to identify possible participants who need additional psychological components in the intervention. However, it cannot replace a self-reported medical screening instrument, as the rate of false negatives would be too high.
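For readers who wish to reproduce such figures from their own data, the following is a minimal sketch of how sensitivity, specificity, the share of correct predictions and the MCC can be derived from a 2x2 confusion matrix. It assumes a dichotomised comparison of prediction and observation; the function name `binary_metrics` and the cell counts in the example are hypothetical and are not the counts of this study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard agreement metrics from a 2x2 confusion matrix.

    tp/fn: observed positives predicted as positive/negative
    tn/fp: observed negatives predicted as negative/positive
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of observed positives that were detected
    specificity = tn / (tn + fp)   # share of observed negatives that were detected
    accuracy = (tp + tn) / total   # share of all cases classified correctly
    # Matthews correlation coefficient; set to 0.0 if any marginal sum is zero
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not taken from the study data)
example = binary_metrics(tp=50, fp=20, tn=80, fn=50)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this sketch, false negatives (`fn`) lower the sensitivity directly, which is why a high false-negative rate argues against using a prediction as a replacement for screening.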

Declarations
Ethics approval and consent to participate: The study used routinely collected data from an intervention that was previously evaluated and retrospectively registered in the German Clinical Trials Register under DRKS00015463 (4 Sept 2018). Consent to participate and the self-reported questionnaire have remained unchanged since that study and are therefore still valid in accordance with the ethics proposal.
The independent research ethics committee of the University of Lübeck gave approval for the medical evaluation study (Re.-No. 14-249, dated 20.11.2014). As the participants had already consented to the use of the data for further analysis, no new ethics vote was sought for the present analysis.
Written informed consent was obtained from all study participants.
Consent for publication: not applicable, no individual data.
Availability of data, material and code: The data and code that support the findings of this study are available from Generali Deutschland Krankenversicherung AG, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. They are, however, available from the authors upon reasonable request and with permission, including a signed data access agreement, of Generali Deutschland Krankenversicherung AG.
Competing interests: Martin Hochheim (MH) works part-time for Generali Health Solutions GmbH (GHS), which is affiliated with Generali Deutschland Krankenversicherung AG. Max Wunderlich (MW) is managing director of GHS. Philipp Ramm (PR) is currently responsible for the BP programme. They declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Volker Amelung (VA) declares no conflicts of interest.
Funding: In this study, data from the medical digital enrolment questionnaire of a health programme for insured persons with BP were linked with administrative data to carry out the analysis. The medical programme was funded by Generali Deutschland Krankenversicherung AG. The study itself received no funding from any party.
Authors' contributions: MH planned the analyses, analysed the data, interpreted the results and wrote the manuscript. VA supervised the project. MW contributed to the implementation of the research. PR contributed to the interpretation of the results. All authors provided critical feedback and helped shape the research, analysis and manuscript. All authors read and approved the final manuscript.
Figure: Comparison of actual BP severity with prediction. Comparison of GCPS at baseline with the prediction from the claims data algorithm according to Freytag et al.
