External validation and comparison of the Glasgow-Blatchford bleeding score, the Rockall score and the AIMS65 score in upper gastro-intestinal hemorrhage: A cross-sectional observational study in Western Switzerland

Background: Upper gastro-intestinal bleeding (UGIB) has a high incidence in the Emergency Department (ED). This study aims to externally validate and compare the performance of the Rockall score, the Glasgow-Blatchford score (GBS), the modified Glasgow-Blatchford score (mGBS) and the AIMS65 score in an Emergency Department. Methods: We performed a retrospective cross-sectional observational study between January 1, 2015 and December 31, 2019. We used Receiver-Operating Characteristic curves and the area under the curve (AUROC) to compare the discrimination of each score. The primary outcome was need for intervention (transfusion, endoscopic treatment or surgery) or death. The secondary outcome was in-hospital death. Results: We enrolled 1,521 patients with UGIB. Mean age was 68 [52 – 81] years; 62% were men. Melena and/or hematemesis was the most common complaint at ED presentation (73%). The primary outcome was positive for 422 patients (27.7%), and 76 patients (5%) were positive for the secondary outcome. The Glasgow-Blatchford score and the modified Glasgow-Blatchford score showed the highest AUROC, 0.774 (95% CI=0.750-0.798) and 0.782 (95% CI=0.759-0.805), respectively. AIMS-65 and the Pre-endoscopic Rockall score showed lower discrimination, 0.684 (95% CI=0.657-0.711) and 0.647 (95% CI=0.618-0.675), respectively. Conclusion: Regarding our primary outcome, the modified Glasgow-Blatchford score and the Glasgow-Blatchford score performed well. A GBS or mGBS of 0 is safe to rule out patients with UGIB from the ED. The performance of the AIMS-65 score and the Pre-endoscopic Rockall score was moderate.


Background
Gastro-intestinal bleeding (GIB) is a common and life-threatening condition that requires careful evaluation and risk stratification at initial admission to the Emergency Department (ED). Upper gastro-intestinal hemorrhages (UGIB) are responsible for about half of GIB [1]. The annual incidence of GIB in the ED is estimated at about 120 cases per 100,000 population. The main symptoms include hematemesis, melena and, less often, hematochezia and anemia [2].
Patients admitted to the ED for UGIB are at risk of recurrent bleeding and death, which justifies hospital admission. Death occurs in 5-10% of patients hospitalized with GIB [3].
Identification of acute significant bleeding is challenging, and estimating the risk of recurrent bleeding or death can be difficult in the ED. Patients with UGIB without obvious bleeding are frequently admitted for surveillance and to undergo an esophagogastroduodenoscopy (EGD).
Consequently, most of these in-hospital stays might lead to overtriage and overuse of specialized facilities.
Prognostic models make it possible to predict the risk of an adverse outcome. Many prognostic models and clinical scores have been developed to predict in-hospital death or need for intervention and to discriminate between high-risk and low-risk UGIB patients [4]. Some observational studies showed that most low-risk patients could be safely discharged with outpatient care and a scheduled EGD [5].
Among the clinical scores developed, the Rockall score (RS), the AIMS65 score and the Glasgow-Blatchford score (GBS) are the most frequently used.
External validation of these scores is often performed in the same population at a different time period and without prospective validation. Consequently, transportability represents a major limitation for using these scores in a different population.
Our study aims to externally validate and compare these scores' ability to predict death and need for intervention in a tertiary hospital in Western Switzerland.

Study design
We performed a retrospective cross-sectional observational study based on clinical data collected by the data warehouse of a single tertiary center.

Setting and study population
We included all patients older than 16 years admitted to the Emergency Department of Lausanne University Hospital (CHUV) for UGIB. Lausanne University Hospital is a tertiary hospital in Western Switzerland with around 65,000 ED visits per year. UGIB was identified by symptoms at ED admission: hematemesis, melena, hematochezia, or other symptoms associated with an ED final diagnosis of gastrointestinal bleeding (syncope, hypotension, anemia, hemorrhagic shock, or asthenia) between January 1, 2015 and December 31, 2019. Exclusion criteria were pregnancy, age below 16 years and refusal of authorization to analyze personal data.

Data collection
We extracted all variables from the hospital data warehouse, which collects data from medical record files, administrative files, and diagnosis and surgery coding databases. We collected demographic data (age, sex), date and time of admission, duration of hospitalization, past medical history, physiological data at admission (blood pressure, heart rate, respiration rate, level of consciousness according to the AVPU scale), surgical interventions (type), endoscopic interventions, need for blood transfusions, level of triage priority, and final diagnoses. We also collected laboratory data (hemoglobin, lactate, base excess, urea, albumin, blood transaminases, prothrombin time and INR (International Normalized Ratio), platelet count, plasma fibrinogen, aPTT (activated partial thromboplastin time)) and therapeutics applied (norepinephrine, blood products, tranexamic acid, octreotide, esomeprazole).

Clinical scores compared
The Rockall score (RS) predicts mortality and was developed in 1996 from a study of 4,185 patients with UGIB in the UK during the period from 1993 to 1996 [6]. Since the full score requires endoscopic findings, its initial application for risk stratification is limited. An adaptation of the Rockall score based only on "pre-endoscopic" clinical data (PERS) also predicted in-hospital death and allowed early risk stratification.
Considering the aim of this study, only PERS is analyzed. PERS is obtained from 3 clinical variables: age at presentation, signs of shock, and comorbidities (such as congestive heart failure, ischemic heart disease, any major comorbidity, renal failure, liver failure and disseminated malignancy). The minimum value of the score is 0, the maximum is 7.
The AIMS-65 score predicts mortality and was developed in 2011 in the USA, based on a retrospective study of 29,222 patients admitted for UGIB between 2004 and 2005 in 187 hospitals. The same authors externally validated the AIMS-65 one year later on 32,504 patients from the same national database used for development. AIMS-65 includes 5 clinical or laboratory variables: age at presentation, albumin, INR, alteration in mental status and systolic blood pressure. This score has a narrow range: the minimum score is 0, the maximum is 5 [7].
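As an illustration of how such an additive score is built, the sketch below sums one point per criterion met. The cut-offs shown are the commonly cited ones and are not stated in this article; they are assumptions, as are the function and parameter names, and they should be checked against the original publication [7] before any use.

```python
def aims65(albumin_g_dl, inr, altered_mental_status, systolic_bp_mmhg,
           age_years, age_cutoff=65):
    """Return an AIMS65-style score from 0 to 5 (one point per criterion).

    All cut-offs are assumptions (commonly cited values), not values
    reported in this study.
    """
    criteria = [
        albumin_g_dl < 3.0,            # A: low albumin (assumed cut-off, g/dL)
        inr > 1.5,                     # I: prolonged INR (assumed cut-off)
        bool(altered_mental_status),   # M: altered mental status
        systolic_bp_mmhg <= 90,        # S: low systolic pressure (assumed, mmHg)
        age_years >= age_cutoff,       # 65: sources differ on whether 65 counts
    ]
    # True counts as 1 and False as 0, so the sum is the number of criteria met.
    return sum(criteria)
```

A patient meeting no criterion scores 0 and a patient meeting all five scores 5, which matches the narrow range described above.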
The Glasgow-Blatchford Score (GBS) was developed in 2000 in the UK, based on 1,748 patients, with the objective of identifying a patient's need for intervention (defined as blood transfusion, endoscopic treatment, or surgery). GBS includes 8 clinical or laboratory variables: blood urea nitrogen, hemoglobin (adapted to sex), systolic blood pressure, heart rate, presentation with melena, presentation with syncope, presence of hepatic disease (known history or clinical/laboratory evidence) and presence of cardiac disease (known history or clinical/laboratory evidence). The minimum value of the score is 0, the maximum is 23 [8]. According to the European Society of Gastrointestinal Endoscopy (ESGE), patients with a GBS of 0-1 are considered to be at very low risk and require neither early endoscopy nor hospital admission; they can be managed as outpatients, informed of the risk of recurrent bleeding and advised to maintain contact with the discharging hospital [9].
The modified Glasgow-Blatchford Score (mGBS) predicts need for intervention and is a simplified version of the GBS. It omits the anamnestic variables that may require interpretation or subjective judgment and can therefore increase the risk of bias, i.e. syncope, hepatic disease and cardiac disease. For mGBS, the minimum value is 0 and the maximum is 16 [10].

Outcomes
The primary outcome was the need for intervention or death. It comprised blood transfusion, endoscopic treatment, surgery, or in-hospital death.
Our secondary outcome was in-hospital death.

Statistical analysis
All variables are presented as either mean with Standard Deviation (SD) or median with interquartile range, as appropriate. Qualitative variables are expressed as numbers and proportion (percentage).
We assessed overall performance, discrimination and calibration of the scores.
Overall performance of the models was assessed by the Brier score, quantifying the distance between the predicted outcome and the actual outcome. We scaled the Brier score by its maximum to standardize for low incidence. The scaled Brier score ranges from 0-100% and indicates the degree of error in prediction; a scaled Brier score of 0% shows a perfect performance.
The Brier score was not estimated for GBS and mGBS because the original authors did not report the predicted probability of the outcome.
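The Brier score and its scaled version can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the maximum is assumed to be the Brier score of a non-informative model that predicts the observed incidence for every patient. The ratio convention follows the text (0% is perfect); a common alternative reports 1 minus this ratio, so that higher is better.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(observed)


def scaled_brier(predicted, observed):
    """Brier score divided by its assumed maximum (0 = perfect prediction).

    Assumption: the maximum is incidence * (1 - incidence), the Brier score
    obtained when every patient is assigned the observed incidence.
    """
    incidence = sum(observed) / len(observed)
    brier_max = incidence * (1.0 - incidence)
    if brier_max == 0:
        raise ValueError("scaling is undefined when all outcomes are identical")
    return brier_score(predicted, observed) / brier_max
```

Because the maximum shrinks as the outcome becomes rarer, the same raw Brier score corresponds to a larger scaled error at low incidence, which is the reason for scaling.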
The discrimination of a model relates to how well a prediction model can discriminate those with the outcome from those without the outcome. We estimated sensitivity, specificity, positive likelihood ratio (PLR) and negative likelihood ratio (NLR) for each threshold of the 4 scores.
The PLR is the ratio of sensitivity to 1-specificity. A PLR of 10 or above will result in a large increase in the probability of the outcome. The NLR is the ratio of 1-sensitivity to specificity. An NLR of 0.1 or less will result in a large decrease in the probability of the outcome.
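These definitions translate directly into code. The sketch below assumes that a score at or above the threshold counts as a positive test; the function name and the returned dictionary are hypothetical choices made for illustration.

```python
def diagnostic_metrics(scores, outcomes, threshold):
    """Sensitivity, specificity, PLR and NLR for the rule `score >= threshold`.

    `outcomes` holds 1 for patients with the outcome and 0 for those without.
    """
    tp = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s < threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient with and one without the outcome")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # PLR = sensitivity / (1 - specificity); infinite when there are no false positives.
    plr = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    # NLR = (1 - sensitivity) / specificity; infinite when there are no true negatives.
    nlr = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {"sensitivity": sensitivity, "specificity": specificity,
            "plr": plr, "nlr": nlr}
```

Evaluating the function at every possible threshold of a score gives the set of sensitivity and specificity pairs from which a ROC curve is drawn.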
The discrimination of the risk scores was compared by plotting the Receiver-Operating Characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1-specificity), and estimating the area under the receiver-operating characteristic curve (AUROC). The AUROC can be interpreted as the probability that a randomly chosen patient with the outcome is given a higher probability of the outcome by the model than a randomly chosen patient without the outcome. For a binary outcome the concordance (C-statistic) is identical to the AUROC. A perfect model has a C-statistic of 1.0.
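The probabilistic interpretation of the AUROC can be made concrete by comparing every patient with the outcome to every patient without it. The sketch below is illustrative only; counting tied scores as half-concordant is a conventional choice assumed here, and the function name is hypothetical.

```python
def auroc(scores, outcomes):
    """C-statistic computed as the proportion of concordant pairs.

    A pair is one patient with the outcome (1) and one without (0).
    It is concordant when the patient with the outcome has the higher score;
    ties count as half (assumed convention).
    """
    with_outcome = [s for s, y in zip(scores, outcomes) if y == 1]
    without_outcome = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_outcome or not without_outcome:
        raise ValueError("need at least one patient in each outcome group")
    concordant = 0.0
    for s_pos in with_outcome:
        for s_neg in without_outcome:
            if s_pos > s_neg:
                concordant += 1.0
            elif s_pos == s_neg:
                concordant += 0.5
    return concordant / (len(with_outcome) * len(without_outcome))
```

The pairwise loop is quadratic in the number of patients, which is acceptable for illustration; rank-based formulas give the same value more efficiently on large cohorts.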
Calibration relates to the agreement between observed and predicted outcomes. Calibration-in-the-large was assessed as the difference between the mean predicted and observed probabilities and as the ratio of the predicted to the observed number of events (P/O). For AIMS65 and PERS, we plotted a calibration graph of the observed probability of death against the predicted probability of death by decile of score, combined with a local polynomial regression (LOESS algorithm).
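The two calibration-in-the-large quantities described above can be computed as in the following sketch. The function name and the dictionary keys are hypothetical; predicted values are assumed to be probabilities of the outcome and observed values to be coded 0 or 1.

```python
def calibration_in_the_large(predicted, observed):
    """Return the mean predicted minus mean observed probability, and P/O."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be non-empty and equal in length")
    n = len(observed)
    predicted_events = sum(predicted)   # expected number of events under the model
    observed_events = sum(observed)     # number of events actually recorded
    if observed_events == 0:
        raise ValueError("P/O is undefined when no event was observed")
    return {
        "difference": predicted_events / n - observed_events / n,
        "p_over_o": predicted_events / observed_events,
    }
```

A difference above 0, or equivalently a P/O ratio above 1, indicates that the model predicts more events than were observed on average.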
We assessed the calibration intercept and calibration slope as a measure of the spread between predicted and observed probability of death. A perfect model has an intercept of zero and a slope of 1, indicating that predictions are neither too low nor too high [11].
We managed missing values using multiple imputation by chained equations (MICE) and generated 20 imputed datasets. We included all variables with missing values needed to estimate the different scores (urea, hemoglobin, blood pressure, heart rate, albumin, INR, AVPU scale). Outcome variables (death, surgical or endoscopic intervention and transfusion) had no missing values. Missing values are reported for each variable in Table 1.
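One common way to obtain a calibration intercept and slope is to regress the observed outcome on the logit of the predicted probability with a logistic model. The sketch below does this with a hand-written Newton-Raphson fit so that it needs only the standard library. It is an assumption about the estimation approach, not a description of the software used in the study, and the function names are hypothetical. Some authors estimate the intercept with the slope fixed at 1 instead.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_intercept_slope(predicted, observed, iterations=25):
    """Fit logit(P(y = 1)) = a + b * logit(predicted) and return (a, b)."""
    x = [logit(p) for p in predicted]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, observed):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))  # fitted probability
            w = mu * (1.0 - mu)                          # logistic variance weight
            g_a += yi - mu                               # gradient, intercept
            g_b += (yi - mu) * xi                        # gradient, slope
            h_aa += w                                    # information matrix terms
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0:
            raise ValueError("cannot fit: predictions take a single value")
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b
```

A slope below 1 suggests predictions that are too extreme (too high for high-risk and too low for low-risk patients), whereas a slope above 1 suggests predictions that are not extreme enough.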

Ethics
The study was approved by the Ethics committee of the Canton Vaud (CER-VD) (project-ID: 2020-00515).

Results
Patients' characteristics are summarized in Table 1.
Principal outcomes are summarized in Table 2. Our primary outcome, need for intervention or death, was positive for 422 patients (27.7% of the entire population). A total of 277 patients (18.2%) needed a blood transfusion, and 364 patients (23.9%) needed an intervention (endoscopy, surgery or blood transfusion). The average time spent in hospital was 7 days. We recorded 76 in-hospital deaths (5%).
Table 3. Global performance was analyzed for the secondary outcome (in-hospital death). The Brier score was 0.046 for AIMS-65 and 0.0647 for PERS. The scaled Brier score was 20% for both scores.
Discrimination was analyzed for the primary outcome (need for intervention and/or death) and the secondary outcome (in-hospital death). The C-statistic (AUROC) for the primary outcome is summarized in Figure 1. For the primary outcome, the AUROC was 0.78 for mGBS and 0.77 for GBS, followed by 0.68 for AIMS-65 and 0.65 for PERS. Sensitivity and specificity for different thresholds are summarized in the supplementary files.

Main findings
Performance was poor for AIMS-65 and PERS. Regarding our primary outcome, mGBS and GBS performed well, with an area under the curve of 0.78 and 0.77, respectively. PERS showed better calibration than AIMS-65. A GBS or mGBS of 0 safely identified patients with no need for intervention. The diagnostic performance of AIMS-65 and PERS was unsatisfactory; therefore, their clinical use cannot be recommended.

Strengths and limitations
Our study has several strengths as well as some limitations. First, we included patients at initial presentation in the ED, representing an inception cohort. Second, there were no missing values for in-hospital death. However, due to the retrospective design, we cannot exclude missing values for intervention (transfusion, surgical or endoscopic intervention) if these interventions were not reported or identified in the data warehouse. Third, 126 patients were not considered for inclusion because they refused the use of their health-related data. Even if this represents a small number of patients selected at random, it might have led to selection bias. Fourth, we used the first reported value for estimating each score. Measurement error might have led to regression dilution bias, decreasing the C-statistic of each score. Fifth, we used rigorous methods to assess each score, covering all characteristics of a prognostic model (overall performance, discrimination and calibration). Sixth, the large sample size of more than 1,500 patients provides precise and reliable estimates. Seventh, the single-center design decreases external validity. PERS and GBS were developed almost two decades ago. Mortality from UGIB has changed slightly since then, which may affect calibration in current practice and limit external validity.
However, the aim of the study was to assess external validity in a population different from the derivation cohort of each score. Data from our ED population in Western Switzerland may not be applicable to other geographic regions.

Comparison with other publications
The authors of the GBS, Blatchford et al., enrolled 1,748 patients to develop the score [8]. They reported an AUROC of 0.92, a markedly better discrimination than we found in our study. They prospectively performed an external validation in the same study and reported high discrimination [8]. Their validation cohort differed chronologically but not geographically from the derivation cohort, so the study assessed reproducibility rather than true external validity.
On the contrary, our study population differs both geographically and chronologically, and represents a true external validation. In a recent external validation study, Chang et al. reported an AUROC of 0.8 for GBS, which is similar to the value we found in our study [12]. In another external validation study, conducted in 2010 on 432 patients in the UK, Chan et al. obtained an AUROC of 0.82 for GBS and 0.83 for mGBS, similar to the values of 0.77 and 0.78, respectively, found in our study [13]. The AUROCs found for our primary outcome were 0.65 for PERS and 0.68 for AIMS-65. Of note, these scores were developed to predict in-hospital death and not need for intervention. [...] for AIMS-65 related to need for intervention or death, similar to those found in our study [14]. The results of our study regarding the secondary outcome, in-hospital death, are similar to those of other external validation studies. GBS discrimination for predicting death is lower (AUROC of 0.7) than for predicting need for intervention. Similar results were found in other external validations [...].

Clinical implications
mGBS and GBS seemed more accurate than the other scores in predicting the need for intervention or death. GBS was developed to predict need for intervention, while PERS and AIMS-65 were developed to predict mortality. It is clinically more relevant to predict need for intervention than in-hospital death, as the main objective of risk stratification is to rule out patients from the ED. A GBS or mGBS of 0 is safe to rule out patients with UGIB from the ED, with only 1 event for a GBS of 0 (0.06%). A post-test probability of less than 1% is an acceptable threshold for a rule-out decision. The performance of PERS and AIMS-65 in predicting the need for intervention or death was too weak for clinical use. Of note, mGBS showed performance indicators similar to those of GBS and is more suitable for clinical use.
The percentage of patients needing an intervention or dying increases progressively from a GBS or mGBS threshold of 1 to 5. This range includes low-to-moderate risk. Clinical expertise and shared decision-making with the patient are crucial in these groups. In contrast, AIMS-65 and PERS did not show similar characteristics and had a significant number of unfavorable events at low values, which limits their clinical usefulness.

Future study
These scores were developed to predict generic or composite outcomes (death, or need for intervention or death). Regarding mortality, death in patients with UGIB may result from causes other than the bleeding itself (cancer, pulmonary embolism during hospitalization, etc.). Consequently, overall mortality is not a reliable outcome. Decision-making for immediate intervention in the ED is more closely related to the initial bleeding than to later death due to cancer progression. A score predicting early death or death due to bleeding could improve decision-making in the ED.
Prospective validation studies of clinical decision rules that incorporate a predictive score are needed to establish the clinical usefulness of these scores in decision-making.

Conclusion
In everyday clinical practice, the use of these scores must be considered with caution. mGBS and GBS seem to be the most accurate scores to exclude low-risk patients. Taking into account costs and specific features of the public health system, different thresholds can be selected according to the risk that is considered acceptable. Clinical judgment, especially in low-to-moderate-risk patients, is essential. As many factors need to be considered, these scores can help in decision-making but do not constitute a binding rule. A shared decision with the patient must be considered wherever possible.

Declarations
Ethical approval and consent to participate
This study has ethical approval from the Commission Cantonale d'Ethique de la Recherche sur l'être humain (CER-VD), approval number: 2020-00515.

Consent for publication
Not applicable.

Availability of data and materials
Data are available on reasonable request.

Competing interests
All authors declare no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Funding
This study did not receive any funding.

Authors' contributions
SR, FXA, PNC designed the study. SR, FXA designed and monitored the data collection from which this paper was developed. SR, FXA analysed the data. SR, FXA wrote the first draft. SR, FXA, AS, PNC contributed to writing and revising the paper.

Figure 1
Discrimination of scores predicting need for intervention (AUROC).

Figure 2
Number of events (need for intervention) by score.
