Study design and data sources
For this retrospective cohort study, we accessed the population-wide health administrative data for all publicly funded services provided to the residents of Ontario, Canada from the Institute for Clinical Evaluative Sciences16 data repository. We combined the records for ED visits with acute care visits, gathered from National Ambulatory Care Reporting System (NACRS) and Discharge Abstract Database (DAD) datasets respectively. These datasets contained primary and secondary diagnoses recorded using ICD-10-CA codes (up to 10 codes per record in NACRS, and up to 25 codes in DAD) as well as clinical, demographic, and socio-economic information about each person. We only included the first incidence of a TBI-related visit, defined as the “first TBI event”, for persons who were discharged from the ED or acute care hospitals with a TBI-specific diagnostic code (S020, S021, S023, S027, S028, S029, S040, S071, S06) between April 1st, 2002, and March 31st, 202017. We restricted the cohort to persons who were aged 16 - 64 years in order ensure homogeneity of gender attributes within the adult group (versus pediatric or senior population). Data on age, sex, and calendar year specific death rates in general population were extracted from Statistics Canada life tables18. Data on discharge locations were derived from DAD.
The combined dataset was randomly split into 50% for training, 25% for validation, and 25% for testing to prevent overfitting and to ensure model validation. Training and validation sets were used for model building and internal validation, whereas the reported results were based on the test set performance.
Table 1. Characteristics of datasets used in each analysis.
Parameter
|
All subjects
|
Mortality model1, Test set
|
Discharge model2, Test set
|
N=276,812
|
N=4,389
|
N= 7,343
|
Records source (%)
|
|
|
|
Acute Care
|
12.7
|
63.2
|
100.0
|
Emergency Dept.
|
87.3
|
36.8
|
-
|
Sex (%)
|
|
|
|
Female
|
44.5
|
29.0
|
27.3
|
Age (years)
|
|
|
|
Median
|
31
|
45
|
42
|
Q1
|
21
|
28
|
26
|
Q3
|
47
|
56
|
54
|
Rural (%)
|
|
|
|
Yes
|
15.5
|
12.6
|
16.4
|
Income quintile (%)
|
|
|
|
1 (lowest)
|
21.5
|
23.4
|
24.7
|
2
|
19.9
|
20.4
|
20.4
|
3
|
19.7
|
19.0
|
18.6
|
4
|
19.7
|
19.2
|
19.0
|
5 (highest)
|
19.2
|
18.1
|
17.4
|
Length of Stay* (days)
|
|
|
|
Median
|
2.9
|
5.8
|
5.0
|
Q1
|
1.7
|
3.0
|
2.0
|
Q3
|
4.8
|
14.0
|
12.0
|
ADG score*
|
|
|
|
Median
|
2
|
2
|
2
|
Q1
|
1
|
1
|
1
|
Q3
|
3
|
4
|
4
|
TBI severity (%)
|
|
|
|
Unknown
|
48.6
|
-
|
19.6
|
Mild
|
41.5
|
-
|
39.0
|
Moderate
|
2.8
|
-
|
7.3
|
Severe
|
7.1
|
100.0
|
34.0
|
Abbreviations: TBI = Traumatic Brain Injury; Dept. = Department; ADG = Johns Hopkins’ Aggregated Diagnosis Groups; Q1 = 1st quartile; Q3 = 3rd quartile. Data given as median (Q1, Q3) for continuous variables or (%) for categorical variables.
1 Subjects with severe TBI who had recorded survival status at day 30 after the first TBI event
2 Subjects who had a record of discharge location from acute care hospitals and non-missing ADG score
Statistical approaches
1. Gender score derivation
We used logistic regression approach to derive gender score reflecting a probability of each person being male or female based on a set of indicator variables of diagnostic codes that reflect biological (associated with binary sex, such as diseases) and social (associated with behavioral and other socially defined characteristics considered as male-like or female-like) attributes of people.
Each person’s sex was compiled from the Registered Persons Database19. The ICD-10-CA diagnostic codes recorded in each TBI visit were converted into a matrix of indicator variables for each distinct diagnostic code using natural language processing tools (creating document-term matrix using R package “tm”20). Diagnostic codes that were not common, i.e., present in a single person in training and/or validation datasets, as well as codes that occurred only in males or females were removed from both sets; the latter was done to ensure derived gender characteristics were relevant to both sexes. To select the subset of diagnostic codes to include in the gender score model, we assessed the significance of each unique diagnostic code in predicting the sex (Female = 1) of persons who were diagnosed with that code by fitting univariate logistic regression models. All diagnostic code indicators that were significant at 5% level after Benjamini-Hochberg correction21,22 in both training and validation sets were subsequently included into the gender score model predicting the probability of sex reported as female in the training set. Consequently, model coefficients obtained from training were used to calculate the final gender scores in the test set. Therefore, the final gender score was a continuous variable ranging from 0 (male-like) to 1 (female-like), estimating the probability of a person being male or female.
2. Predicting TBI-related excess mortality
Following our previous research8, we defined the acute phase of mortality due to injury sustained during a TBI event (in some studies it was called TBI-related mortality8,23) as death within a 30-day window. Exploratory analysis showed that 64% of the people who died within 30 days had a severe TBI diagnosis (Supplementary Table S1), therefore, the analysis was restricted to this subpopulation. In addition, people with unknown survival status 30 days following their first TBI event, or with unknown injury severity were excluded from this analysis.
The outcome was therefore defined as time-to-death within 30 days of the first TBI event and persons who were alive on the 30th day after the first TBI event were censored. Covariates in the model were selected based on previous research8, which included age as a continuous variable, mechanism of injury (determined using major external cause of injury group codes24: falls, struck by/against object, motor vehicle collisions, cyclist collisions, other), rurality indicator, income quintile (linear predictor), and Aggregated Diagnosis Groups (ADG) score, which is a weighted score representing the presence or absence of 32 ADG diagnosis groups as an indicator of comorbidities25 (Supplementary Table S2). To control for population death rates, we extracted the age, sex, and calendar year specific death rates for each person from the Statistics Canada life tables18 and used it as an offset term in the model. Excess mortality rate was modelled using a Poisson regression model8,26–28; details are presented in the Supplementary Text S1.
3. Predicting discharge location
Discharge location prediction was restricted to acute care visits. Persons who were alive when discharged, with a recorded discharge location and non-missing baseline ADG score were included in this analysis (Table 1). The outcome variable was discharge location from acute care, categorized into six groups: discharged home, discharged home with support, inpatient complex continuing care (CCC), long term care (LTC), rehab, and other. The category “other” was composed of smaller subgroups including transfer to another inpatient care/hospital/acute care facility, long term/continued care, other ambulatory care/palliative care/hospice/addiction treatment centers/jails, died in facility, left against medical advice, and signed out against medical advice29,30. Covariates identified from previous study included age, length of stay (LOS), ADG score, rurality indicator, and income quintiles29. The most common discharge location (Supplementary Table S3) was “discharged home”, which was used as the reference level in the baseline category logistic regression models31.
4. Measuring effects of sex versus gender score
We compared the predictive performances of gender score versus biological sex in predicting TBI-related outcomes (early mortality and discharge location) using the test set. To achieve this, we considered the following three models for each outcome: Model 1 with binary sex and control variables as covariates, Model 2 with gender score and control variables, and Model 3 with both sex and gender score in addition to the control variables. We used two metrics/statistics to assess the effects of sex and gender score in predicting TBI-related outcomes: 1) profile likelihood based confidence intervals (CI) and 2) likelihood-ratio tests (LRT) while controlling for any relevant variables. We reported the p-value (p-val) and the degrees of freedom (df) for LRT, p-val < 0.05 was considered statistically significant. All model-derived estimates are reported with 95% CI, and if the CI contains 1, it is not significant at 5% level. Effect estimates were reported to compare the unit difference in gender score and sex. Sex is a binary variable coded as Female = 1 and Male = 0, and gender score is a continuous variable ranging from 0 to 1 (towards 0 is more male-like and towards 1 is more female-like).
To assess whether sex or gender score is more informative in predicting TBI-related outcomes, we can use Model 3 to assess whether Model 1 or Model 2 is preferred using an indirect and a direct approach for comparison. An indirect approach is to test the significance of the sex effect and the gender effect, if one effect is significant and the other effect is not significant, then that is an evidence that the model with just the significant effect (and the control variables) is the preferred model32 for that outcome. We can also use Bayes factors (BF)33 to directly measure the strength of evidence that the data supports one model versus the other, therefore, prefers one effect over the other. Since Model 1 and 2 have the same degrees of freedom, their LRT statistic is equal to the difference in Bayesian information criterion (BIC), which is approximately equal to a function of the BF (Supplementary Text S2)33,34.
5. Missing data
All analyses were based on complete records with all variables relevant to a particular analysis recorded. Three variables used in the analysis had missing values: ADG score (10.1% missing), LOS (10.6% missing), and TBI severity (48.6% had “unknown” injury severity). The last variable was only used in TBI-related excess mortality analysis and exploratory analysis showed that the death rate of persons with “unknown” injury severity was similar to that of persons who sustained a mild TBI (Supplementary Table S4). Considering that excess mortality analysis was restricted to persons with a severe TBI and much higher mortality rate, excluding persons with unknown severity status should not result in biased estimates.
6. Statistical software
All analyses were performed in R (version 3.6.3, R Foundation for Statistical Computing; www.R-project.org).