SORG algorithm to predict 3- and 12-month survival in metastatic spinal disease: a cross-sectional population-based retrospective study

In this study, we sought to compare statistically the novel SORG algorithm with currently used methods for predicting survival in spinal metastatic disease. We recruited 40 patients with spinal metastatic disease who underwent surgery at Geneva University Hospitals by the Neurosurgery or Orthopedic teams between 2015 and 2020. We performed a receiver operating characteristic (ROC) analysis to determine the accuracy of the SORG ML algorithm and nomogram versus the Tokuhashi original and revised scores. The analysis of data from our independent cohort shows a clear advantage in predictive ability of the SORG ML algorithm and nomogram in comparison with the Tokuhashi scores. The SORG ML had an AUC of 0.87 for 90 days and 0.85 for 1 year. The SORG nomogram showed a predictive ability at 90 days and 1 year with AUCs of 0.87 and 0.76 respectively. These results showed excellent discriminative ability as compared with the Tokuhashi original score, which achieved AUCs of 0.70 and 0.69, and the Tokuhashi revised score, which had AUCs of 0.65 and 0.71 for 3 months and 1 year respectively. The predictive ability of the SORG ML algorithm and nomogram was superior to currently used preoperative survival estimation scores for spinal metastatic disease.


Introduction
Epidemiological data of metastatic spinal disease remain difficult to estimate as spinal metastases may be underdiagnosed if mild and asymptomatic, or missed due to short life expectancy as a result of systemic disease burden. Nevertheless, it is thought that 5-10% of cancer patients suffer from spinal metastasis [20,22], with at least half of these presenting a metastatic epidural invasion or even spinal cord compression [6,15,20,22] and approximately 10% eventually benefitting from spinal surgery [6,15,20,22,30].
To determine the optimal treatment for each individual patient while taking into account all the important factors influencing such decision-making, several decision frameworks were proposed in recent years. Among these, NOMS (neurologic, oncologic, mechanical, and systemic) [14] is probably the most used and reproducible worldwide as it overcomes the shortcomings of fixed algorithmic scoring systems. Thus, clinicians are asked to balance acute neurological status and risk of mechanical instability against life expectancy and risk of treatment-related morbidity and mortality [4,15,21,30]. Added to these previous considerations, recent evidence also proposes more aggressive treatment even if life expectancy is inferior to 3 months for patients with a good baseline condition [8]. All in all, a proper estimation of postoperative survival plays an important role in tumor boards' decisions and thus should strive to be as accurate as possible.
This article is part of the Topical Collection on Spine-Other.
To date, there is no standardized estimation tool for overall postoperative prognosis. Nater et al. [16] performed a full external validation of eight scores (Tokuhashi [23], Tomita [26], van der Linden [28], and Bollen [2] among others) and concluded that calibration was poor overall. It is suggested that clinicians should use these tools with caution, especially if they are applied in a population different from the development population. More recently, the SORG (Skeletal Oncology Research Group) nomogram (Fig. 1) [10,18], which was developed with patients from Massachusetts General Hospital and Brigham and Women's Hospital, has proven to be a promising prognostic tool for surgically treated spinal metastatic patients. To be established in clinical practice, this tool still lacks external validation using an international independent data set performed by a different group of investigators (external validations were made in high-volume hospital centers in North America, such as The Johns Hopkins Hospital [11] and Memorial Sloan Kettering Cancer Center [3]). Furthermore, Ahmed et al. [1] found that the SORG nomogram (Fig. 1) demonstrated the highest accuracy at predicting 30-day and 90-day survival, whereas the original Tokuhashi [23] was the most accurate at predicting 365-day survival.
The SORG classic model was published as an online application (Table 1) [31], incorporating additional parameters (alkaline phosphatase, neutrophil-to-lymphocyte ratio, platelet-to-lymphocyte ratio), and still awaits validation in large patient samples from a more international setting. The aim of this study is to assess whether this SORG machine-learning (SORG ML) algorithm is able to (1) provide decisional support and match the decisions of expert multidisciplinary teams on whether or not to operate on spinal metastases in an international cohort distinct from the development population and (2) identify patients who could benefit from surgical care despite a presumed life expectancy shorter than 3 months (based on systemic disease burden).
Fig. 1 The SORG nomogram: to be used for patients with operable spinal metastatic disease [18,19]. For each parameter, the point on the corresponding axis is determined and a vertical line is drawn downward so that the number of points given (e.g., a patient with a preoperative hemoglobin of 10 g/dL will receive between 65 and 70 points) may be read. This process is repeated for every parameter. The sum of points can be located on the total point axis, from which a vertical line is drawn downward so that the 30

Materials and methods
Our hypothesis is that, as partially demonstrated previously [1,3,18,19], the SORG nomogram and ML are better predictive tools than the Tokuhashi original or revised scores for spinal metastatic disease survival at 3 months and 1 year. This international validation followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) [7]. All methods were performed in accordance with the relevant guidelines and regulations according to our institutional review board.

Inclusion/exclusion criteria
All patients diagnosed with spinal metastasis between January 2015 and May 2020 were prospectively and consecutively recruited and discussed in our multidisciplinary tumor board. Among these, the following inclusion criteria were applied: (1) age between 18 and 90 years at the time of surgery; (2) surgery performed for a metastatic spine lesion; (3) primary tumor histology confirmed by pathology; (4) availability of electronic medical records (EMR) with imaging, operative notes, and laboratory values; (5) at least 30 days of follow-up (FU) after surgery; (6) date of death recorded in the EMR, or most recent follow-up available in case of survival until the present day.

Source of data and consent
EMR review was approved by our institutional review board for retrospective chart analysis on patients who underwent surgery for spinal metastatic disease at our University tertiary care center. Informed consent for clinical research is systematically obtained from each patient after diagnosis and prior to surgery.

Data collection and outcome
The following variables were assessed based on the factors needed as inputs to the SORG ML algorithm (Figure and Table 1) [18]: primary tumor histology (based on groupings by Katagiri et al.) [12], Eastern Cooperative Oncology Group (ECOG) performance status [17], American Spinal Injury Association (ASIA) Impairment Scale [13], preoperative presence of any Charlson comorbidity other than metastatic disease [5], presence of visceral metastases (metastases in liver or lung), presence of brain metastases, previous
Our primary outcome measure was postoperative survival and its correlation with survival prediction tools: the SORG nomogram, SORG ML algorithm [18], and the Tokuhashi score (original and revised) [23,24]. Survival as well as data contributing to the predictive survival metrics was determined through manual chart review. The date of last review was January 5th, 2021.
Among the inputs, data were missing only for BMI, in 4 patients (10%). The 1-year survival outcome was also missing for 5 patients (12%) due to recent operations without sufficient follow-up time. Otherwise, the data were complete.

Statistical analysis
Continuous variables are reported as means with the corresponding range or standard deviation (SD), and categorical variables as absolute numbers and percentages. Baseline patient and tumor characteristics were compared to the developmental cohort [18] and the validation cohort [3] (Table 2) with the Fisher exact test and the Mann-Whitney U test.
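As an illustration of how a categorical baseline characteristic might be compared between two cohorts, the following is a minimal sketch of a two-sided Fisher exact test on a 2x2 table, written with the Python standard library only. The function name `fisher_exact_two_sided` and the counts in the usage example are hypothetical and do not come from the study data; the study itself does not state which software ran these tests.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

if __name__ == "__main__":
    # Hypothetical counts: rows = cohort, columns = characteristic present/absent.
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

For continuous variables such as age or laboratory values, the Mann-Whitney U test named above would be applied instead; it is not sketched here.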
Table 2 Baseline characteristics and differences between the patients from the development data set [3,18]
Individual predicted survival probabilities were calculated for each patient by inputting the variables in the SORG machine-learning algorithm [31] (https://sorgapps.shinyapps.io/spinemetssurvival/). Discrimination was measured using the c-statistic (which is also known as the area under the receiver operating characteristic curve [AUC] for binary classification) and visualized by plotting the receiver operating characteristic curve (Table 3). The ROC curve plots sensitivity against 1-specificity for all potential cutoffs of a test, and the AUC ranges from 0.5 (no better than chance) to 1.0 (perfect discriminative ability). We used bootstrap standard errors (1000 replications) to calculate 95% confidence intervals of the AUC. Each prediction tool was considered sufficiently accurate if the AUC was greater than 0.70 [9]. Receiver operating characteristic (ROC) curves were produced using JMP Statistical Software (JMP®, Version 15, SAS Institute Inc., Cary, NC, 1989-2019) and allowed accuracy comparison between predictive tools.
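The discrimination measure described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the analysis in this study was run in JMP, the function names `auc` and `bootstrap_auc_ci` are hypothetical, the example data are invented, and the interval is formed as the point estimate plus or minus 1.96 bootstrap standard errors, which is one common reading of "bootstrap standard errors" rather than a confirmed detail of the original analysis.

```python
import random

def auc(scores, labels):
    """c-statistic: probability that a randomly chosen case with label 1
    has a higher score than a randomly chosen case with label 0
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, seed=0):
    """Return (AUC, lower, upper) using a bootstrap standard error."""
    point = auc(scores, labels)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the classes
            estimates.append(auc([scores[i] for i in idx], ys))
    mean = sum(estimates) / n_boot
    se = (sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5
    return point, max(0.0, point - 1.96 * se), min(1.0, point + 1.96 * se)

if __name__ == "__main__":
    # Invented example: predicted survival probability and observed survival (1 = alive).
    predicted = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2]
    survived = [1, 1, 0, 1, 1, 0, 1, 0]
    print(bootstrap_auc_ci(predicted, survived))
```

With a cohort of 40 patients, an interval computed this way will be wide, which is worth keeping in mind when comparing tools against the 0.70 threshold.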
The SORG ML algorithm, both for 3 months and 1 year, was determined to give a correct prediction in our ROC analysis if the predicted survival estimate was greater than or equal to 0.5 and the patient did in fact survive at 3 months or 1 year. Equally, if the estimate was less than 0.5, and the patient did not survive, then the prediction was determined correct. Otherwise, the prediction was determined to be incorrect.
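The rule in the preceding paragraph amounts to a single comparison, sketched below. The function names `prediction_correct` and `fraction_correct` are hypothetical, and the 0.5 threshold is the one stated in the text.

```python
def prediction_correct(predicted_survival, survived, threshold=0.5):
    """True when the thresholded estimate agrees with the observed outcome.

    predicted_survival: estimated probability of being alive at the time point.
    survived: whether the patient was in fact alive at that time point.
    """
    return (predicted_survival >= threshold) == bool(survived)

def fraction_correct(predictions, outcomes, threshold=0.5):
    """Share of patients for whom the thresholded prediction was correct."""
    hits = [prediction_correct(p, s, threshold)
            for p, s in zip(predictions, outcomes)]
    return sum(hits) / len(hits)
```

Note that an estimate of exactly 0.5 counts as a prediction of survival under this rule.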
For the Tokuhashi original score, a score of 0-5 predicted ≤ 3 months of survival, a score of 6-8 predicted ≤ 12 months of survival, and a score of 9-12 predicted > 12 months of survival [25]. For the Tokuhashi revised score, a score of 0-8 predicted < 6 months of survival, a score of 9-11 predicted ≥ 6 months of survival, and a score of 12-15 predicted ≥ 12 months of survival. Estimates from the Tokuhashi original and revised scores [23-25] were determined to be correct or incorrect based on whether the actual survival fell into the predicted range.
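The score bands above can be written as a small lookup, sketched here under two assumptions: the bands are taken literally as stated in the text (so a score of 6-8 on the original scale is counted correct for any survival of 12 months or less), and survival is an observed duration in months, with patients still alive at last follow-up left out of scope. The names `ORIGINAL_BANDS`, `REVISED_BANDS`, and `band_prediction_correct` are hypothetical.

```python
# Each entry: (scores in the band, label, test applied to observed survival in months).
ORIGINAL_BANDS = [
    (range(0, 6), "<= 3 months", lambda months: months <= 3),
    (range(6, 9), "<= 12 months", lambda months: months <= 12),
    (range(9, 13), "> 12 months", lambda months: months > 12),
]

REVISED_BANDS = [
    (range(0, 9), "< 6 months", lambda months: months < 6),
    (range(9, 12), ">= 6 months", lambda months: months >= 6),
    (range(12, 16), ">= 12 months", lambda months: months >= 12),
]

def band_prediction_correct(score, survival_months, bands):
    """True when observed survival falls inside the range predicted by the score."""
    for band_scores, _label, holds in bands:
        if score in band_scores:
            return holds(survival_months)
    raise ValueError(f"score {score} is outside the scale")
```

For example, `band_prediction_correct(10, 7, REVISED_BANDS)` evaluates the revised band for a score of 10 (survival of at least 6 months) against an observed survival of 7 months.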
Baseline characteristics between the developmental and validation cohorts and our own cohort differed significantly on the following measures: age, absolute lymphocyte count, absolute neutrophil count, neutrophil-lymphocyte ratio, albumin, alkaline phosphatase, and creatinine.

External validation and comparison to Tokuhashi
The SORG ML algorithm for 3-month mortality prediction in spinal metastatic disease achieved an AUC of 0.87. The SORG ML algorithm for 12-month mortality prediction in spinal metastatic disease achieved an AUC of 0.85. The SORG nomogram had a similar predictive ability at 3 months and 1 year with AUCs of 0.87 and 0.76 respectively. These results showed excellent discriminative ability as compared with the Tokuhashi original score, which achieved AUCs of 0.70 and 0.69, and the Tokuhashi revised score, which had AUCs of 0.65 and 0.71 for 3 months and 1 year respectively (Table 3, Figs. 2 and 3).
These results confirm our hypothesis that the SORG ML and SORG nomogram scores are better predictive tools than the Tokuhashi original or revised scores for spinal metastatic disease survival at 3 months and at 12 months.

Discussion
Effective counseling of patients with metastatic spinal disease and their families in regard to surgical or conservative treatment options requires a reliable and validated scoring system to preoperatively predict survival at key postoperative time points such as 3 months and 12 months. It can be summarized as a balance between systemic disease burden and life expectancy against potential surgical complications and the weight of the rehabilitation period. Since most classic survival prediction tools, including the Tokuhashi score, analyze 3- and 12-month survival, we performed our analyses at these time points. However, it should be acknowledged that the 3-month survival time point is generally used in clinical practice to decide between surgical and conservative treatment, with the 12-month time point having less importance in decision-making. Currently, this decision is based on multidisciplinary tumor board discussions and still relies mainly on the revised Tokuhashi score. This external population-based study was able to (1) provide newer and independent data confirming that the SORG ML and nomogram are better than the Tokuhashi classic and revised scores at predicting survival at 3 and 12 months; (2) confirm a specificity higher than 70% for both SORG ML and nomogram, compared with only around 50% for the Tokuhashi classic and revised scores; and (3) point towards different and more conservative decisions in at least 10% of the operated cases.
According to the TRIPOD guidelines [7], algorithms should be repeatedly validated among different independent populations to assess possible performance inadequacies. The SORG ML and classic algorithm were previously developed and externally validated but needed to face an international validation to see how they would perform outside the setting of a US tertiary care center. This step is also essential in order to implement such a score in central European countries. The population of patients used in this study showed few significant differences from the validation and developmental cohort with regard to age and blood sample results. Despite these differences, we have shown that the SORG ML algorithms showed very high performance in predicting 3- and 12-month mortality according to the receiver operating characteristic curves. In fact, the AUC in our international cohort actually outperformed that of the previously published validation cohort (0.87 vs. 0.75 for 90 days and 0.85 vs. 0.77 for 1 year) [11]. The SORG nomogram also performed very well in our international cohort with an AUC of 0.87 for 90 days and 0.76 for 1 year. These findings are even more encouraging when compared with the results from Nater et al. [16], which showed an AUC of 0.70 for 3 months and 0.78 for 12 months for the best predictive tool among nine different scoring systems, or even with those from Paulino Pereira et al. and Bongers et al. [3,19]. These facts also mean that this study is the first to show a clear advantage of the SORG at 1 year, compared with the Tokuhashi.
This better discriminative power of both the SORG ML and nomogram is also accompanied by a good overall calibration, despite the small sample size, which prevented the production of reliable calibration plots. The percentage of "good decisions" based on score predictive value (Fig. 4) shows the same trend, although it should be interpreted with caution because of the small sample size.
The objective evaluation showed that the utilization of the SORG ML or nomogram could have prevented 4 "bad surgical indications" in patients that died before 3 months. This represents 50% of the overall 3-month mortality rate of 20% and reinforces the belief that these tools are better than the ones currently used.
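The percentages in the preceding paragraph can be checked with a short worked calculation. This assumes that the 20% 3-month mortality rate refers to the full cohort of 40 operated patients reported in this study:

```latex
\begin{align*}
\text{deaths within 3 months} &= 0.20 \times 40 = 8,\\
\text{share of early deaths potentially avoided} &= \frac{4}{8} = 50\%,\\
\text{share of all operated cases} &= \frac{4}{40} = 10\%.
\end{align*}
```

Under that assumption, the last line agrees with the figure of at least 10% of operated cases in which a more conservative decision might have been reached.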
There are several limitations that must be acknowledged in order to appropriately contextualize these findings. We recognize that our sample size is limited after a recruitment time of 5 years from a single institution. A minimum of 200 events and non-events is recommended in order to obtain reliable calibration measurements, a number that we were not able to meet [27]. This sample size issue gains more importance when targeting a multivariate subpopulation analysis, for instance accuracy of predictions according to primary tumor histology. The SORG scores inherently take into account tumor histology. However, the influence on survival of more or less radio- or chemo-sensitive tumors is not explicitly explained in the score, but rather implicitly taken into account in giving the final survival percentage predictions. It is important to analyze these adjuvant therapeutic options, which are important in prognostication despite initial biochemical parameters, in a multidisciplinary tumor board setting. On the other hand, due to the reliable data within the cohort, we were able to evaluate many items contributing to the different survival prediction measures which contribute to overall better performance. In order to improve the validity of our findings, we propose to continue recruiting patients into the cohort, and eventually to compare our findings with fellow European institutions, which could, on the other hand, add some population heterogeneity to the cohort and negatively influence the precision of future findings. Also, the study design was retrospective. The baseline characteristics differed between the validation and the developmental cohort on some disease factors (laboratory findings and mean age).
In addition, although many data are included in the SORG survival score, certain data points that would likely influence patient survival are missing: the frailty index, the role of co-morbid conditions, and the number and site of extra-spinal metastases. Also, patients who refused surgery were not included in this study, which could have influenced survival periods and the accuracy of the algorithms. Furthermore, our analyses are limited to the impact of survival prediction scores on surgical decision-making; there are evidently other important quality-of-life measures to take into account when counseling patients and their families. Due to the overall morbidity of metastatic cancer, which can be quite considerable, an overall quality-adjusted life year analysis can be very useful in these clinical situations, though it is not the focus of this study.
This external validation cohort constitutes the first European test of the SORG ML and SORG nomogram predictive value and shows promising results that should encourage multidisciplinary teams to use these tools on a daily basis. Although our cohort demonstrates minor statistical significance favoring the SORG scores over the Tokuhashi scores, it may be argued that this minor difference is not sufficient to prefer one score over the other. Adding our findings to the original cohort, and taking into account similar findings already published in a Taiwanese cohort, we believe that our findings are valid [29]. These algorithms performed better than the currently utilized Tokuhashi revised score, being more sensitive and specific in estimating 3- and 12-month survival in patients with metastatic spine disease. The utility and applicability of these tools in populations treated with modalities other than surgery remain to be determined.

Conclusions
Initial results from external validation of the SORG ML algorithm in an international patient cohort are very encouraging. Our findings contribute to the already published data on the superiority of the SORG ML algorithm and nomogram in predicting survival for patients with spinal metastatic disease in comparison with other existing models and support their upfront use as at least one of the tools helping clinicians to decide for or against surgery. Despite our study's limitations stated above, the contribution of our study remains valid as it represents the first European cohort to validate the SORG scores, already validated in the USA and in Taiwan. Our results contribute to a growing literature which supports the international validation of the SORG scores and their superiority over the traditionally used Tokuhashi score. Further studies are still needed to consolidate the use of these algorithms in larger patient samples, from prospective and multi-institutional trials. The ML algorithm proved easy to use and allows for potential updates in the future.

Simple summary
The purpose of this study is to validate both a nomogram and a machine-learning algorithm produced by the Skeletal Oncology Research Group (SORG) at Harvard Medical School in an international cohort of patients. The SORG nomogram and machine-learning (ML) algorithm provide preoperative survival estimation at 3 months and 1 year in spinal metastatic disease and help in determining the appropriateness of surgical management. The analysis of data in our independent cohort shows a clear advantage in terms of predictive ability of the SORG machine-learning algorithm and nomogram in comparison with the Tokuhashi scores. We conclude that the predictive ability of the SORG ML algorithm and nomogram is superior to currently used preoperative survival estimation scores for spinal metastatic disease.

Prior publishing in conference proceedings
This work has been published by the same authors in the conference proceedings of Brain and Spine, Volume 1, Supplement 1, 2021 [32].

Declarations
Institutional review board statement
EMR review was given ethical approval by the Geneva University Hospitals institutional review board for retrospective chart analysis on patients who underwent surgery for spinal metastatic disease at our University tertiary care center.

Consent to participate
Informed consent for clinical research is systematically obtained from each patient after diagnosis and prior to surgery.

Conflict of interest
The authors declare no competing interests.