Study design, participants, data and setting
We developed a population-based multilevel cohort study of adult patients diagnosed with DLBCL or FL between 1st January 2005 and 31st December 2013 in England. Patients were followed up until death or the end of the study at the 31st December 2015, whichever occurred first.
Information on patients with DLBCL and FL was collected from the linkage of English cancer registry data, the Cancer Analysis System21 (CAS) and Hospital Episode Statistics22 (HES) datasets within the national cancer registry and analysis service (NCRAS). These datasets contained information on the patient’s cancer, subtype (morphology), diagnosis date, comorbidity, admissions, accident and emergency presentations, and outpatient appointments at an NHS hospital.
DLBCL and FL were defined according to the 10th revision of the International Statistical Classification of Diseases and Related Problems (ICD-10 codes C82.0-C85.9) (Supplementary table S2).23 Morphology (cell type) and topography (tumour site) were defined using renewed updates of the ICD for Oncology (ICD-O); ICD-O-324 was used for diagnoses up to 2010, and ICD-O-3.125 for diagnoses after 2011.
We obtained the statutory approvals required for this research from the Confidentiality Advisory Group (CAG) of the Health Research Authority (HRA): PIAG 1–05(c) 2007. Ethical approval was obtained from the Research Ethics Committee (REC) of the Health Research Authority (HRA): 07/MRE01/52. Informed consent from participants was waived by the ethics committee. We used anonymised National Cancer Registry and Hospital Episode Statistics data. All methods were carried out in accordance with relevant guidelines and regulations.
Outcome, exposure, and patients’ sociodemographic characteristics
The outcome of the study was the time to death or censoring among DLBCL and FL patients 5 years after cancer diagnosis. We were interested in the estimation of net survival to then derive the excess mortality only due to cancer. Hence, we used England life tables stratified by deprivation, sex, age and calendar year (2005-2013) to account for the overall mortality rate from the background population.26
Comorbidity status was the main exposure. We defined comorbidity as the existence of other chronic medical disorders, in addition to cancer, the primary disease of interest, which are causally unrelated to the primary disease.27,28 Records from HES were used to identify patients’ comorbidity status based on a computational algorithm published elsewhere.29 The algorithm seeks for the presence of comorbidities retrospectively and defines a time window of 6 to 24 months prior to cancer diagnosis where comorbidities are not recorded to avoid bias due to the presence of comorbidities related to cancer (i.e., cardiological comorbidities due to DLBCL or FL cancer treatment). Patient’s comorbidity status was adapted from the original Charlson comorbidity index30 (CCI). We used the Royal College of Surgeons (RCS) modified Charlson Score (Supplementary Table S1).31 The score removes patients with a previous malignancy to avoid bias, does not weight differently comorbidities, and categorises comorbidities as: no comorbidities, one comorbidity and two or more comorbidities (multimorbidity).
Socio-demographic and economic characteristics were collected from the HES dataset. Ethnicity was recorded as white or other. Area-level deprivation, classified into one of five quintiles, was determined by the Index of Multiple Deprivation32 (IMD), which was based on the Lower Super Output Area33 (LSOA) residence of the patient at the time of cancer diagnosis. LSOA is a geographical location with a median of 1500 inhabitants. We also include the information regarding patients’ diagnosis path (route to diagnosis), a UK specific programme, classified as: accident and emergency room diagnosis, general practitioner referral (routine and urgent referrals where the patient was not referred under two-week-wait), two-week-wait (urgent GP referral with a suspicion of cancer), and secondary care diagnosis (other outpatient and inpatient elective routes).34
Statistical Analysis
We tabulated the sociodemographic characteristics by DLBCL and FL. To derive patients excess mortality we used a multilevel excess hazard model (EHM) consisting of a smooth function of the baseline hazard ( ) to derive the net survival estimate, and accounting for heterogeneity across LSOAs via the inclusion of a random intercept with mean zero.35 The statistical contribution of the random effect to the overall goodness of fit of the model was tested using a likelihood ratio test statistic with a Chi-square mixture distribution.36 The was modelled by a cubic B-spline function of time with two knots placed at 1 year and at 3 years after diagnosis where the hazard plateaus.
We include in the model the following variables: age, sex, comorbidities, deprivation, lymphoma subtype, ethnicity, route of cancer diagnosis, and the overall mortality rate from the background population by levels of deprivation, sex, age and calendar year, included as an offset in the model. We included age as a smooth function consisting of a standardized cubic B-spline centred on 70 years and with one knot at the same age. Furthermore, we assumed a time-dependent effect of age at diagnosis, defined as a B-spline function of time, and represented by the interaction between time and age. The parameter estimates for the variables were interpreted conditionally on the random effect, i.e., they have a cluster-specific interpretation, where a cluster refers to a given LSOA. From the model we derived the excess mortality hazard ratios (EMHR) and their respective 95% confidence intervals (CI) for all the categorical variables, and the variance of the random effect for the LSOA.
Missing data analysis
We explored the missing data mechanism for each of the three variables with missing data. Due to clustered data and partially observed categorical variables, we used the latent normal joint modelling multiple imputation approach, under a missing at random assumption (MAR). The imputation model included all fully- and partially-observed variables, vital status indicator, the Nelson-Aalen estimate of the cumulative hazard, and accounted for clustering of patients within lower-super output areas. We generated 10 imputed datasets. The multilevel EHM was fitted to each of these datasets, and results combined using Rubin’s rules.37,38 Overall tests for the effects of age after multiple imputation were done using the F-based procedure for the test of multiple parameters after multiple imputation.39
We used R software for all data analyses; the mexhaz35 package was used for excess hazard modelling and the jomo40 package for multiple imputation.