Using propensity for pre-diagnosis behavior as a predictor of cancer survival time: an example in esophageal cancer.


Background: Information on the associations between pre-diagnosis health behavior and post-diagnosis survival time in esophageal cancer could assist in choosing treatments and planning health services but can be difficult to obtain using established study designs. We postulated that, with a large data set, using estimated propensity for a behavior as a predictor of survival times could provide useful insight as to the impact of actual behavior. Methods: Data from a national health survey and logistic regression were used to calculate the propensity of selected health behaviors from participants' demographic characteristics for each esophageal cancer case within a large cancer registry data base. The associations between survival time and the propensity of the health behaviors were investigated using Cox regression. Results: Observed associations include: a 0.1 increase in the probability of smoking one year prior to diagnosis was detrimental to survival (Hazard Ratio (HR) 1.21, 95% CI 1.19,1.23); a 0.1 increase in the probability of hazardous alcohol consumption 10 years prior to diagnosis was associated with decreased survival in squamous cell cancer (HR 1.29, 95% CI 1.07,1.56) but not adenocarcinoma (HR 1.08, 95% CI 0.94,1.25); a 0.1 increase in the probability of physical activity outside the workplace was protective (HR 0.83, 95% CI 0.81,0.84). Conclusions: We conclude that propensity for health behavior estimated from demographic characteristics can assist in determining the existence of an association between pre-diagnosis health behavior and post-diagnosis health outcomes, allowing some sharing of information across otherwise unrelated data collections.


Background
With an incidence of 9.3/100,000 males and 3.5/100,000 females per year, esophageal cancer led to more than half a million deaths worldwide in 2018 (1). The majority of these deaths arise from modifiable lifestyle factors. In the US in 2014 it was estimated that 71% of male and 59% of female esophageal cancer deaths arose from modifiable lifestyle factors and that cigarette smoking, alcohol consumption and excess body weight could account for up to 50%, 17% and 27% of deaths respectively (2).
While there is considerable documentation of associations between health behavior and onset of esophageal cancer (3), the impact of health behavior on survival times is less well understood (4). A more thorough understanding of predictors of survival time is needed to assist in choosing treatments, for anticipating health service needs and for health services planning.
Health behavior prior to a cancer diagnosis is often different from health behavior post-diagnosis. Behavior prior to diagnosis can be influenced by public health activity but post-diagnosis behavior is strongly influenced by the diagnosis itself (5) and by treatment (6, 7). As esophageal cancer has relatively short survival times (in the US, just 19% of cases survive 5 years (8)), pre-diagnosis behavior could have a strong carry-over effect on survival time.
Unfortunately, investigating the effect of pre-diagnosis behavior on post-diagnosis survival can be difficult and expensive. As the disease is relatively rare, a prospective cohort study would be inefficient (on the figures above, surveillance of 100,000 men for 10 years would be expected to yield just 93 new esophageal cancer cases). Retrospective studies which enroll newly diagnosed cancer patients and ask them to recall their prior health behavior still involve considerable expense and are fraught with recall and survivor biases. In one example, an Australian study enrolling newly diagnosed esophageal cancer patients reported that patients with late-stage disease were difficult to enroll and under-represented (9).
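The cohort yield quoted above follows directly from the male incidence rate. A minimal sketch of that arithmetic, assuming a constant incidence rate and no loss to follow-up (both simplifications introduced here, not stated in the text):

```python
# Expected new cases in a hypothetical prospective cohort of men.
# Assumes constant incidence and no attrition (illustrative simplifications).
incidence_per_100k_per_year = 9.3  # male incidence quoted in the Background
cohort_size = 100_000              # men under surveillance
years = 10                         # length of follow-up

expected_cases = incidence_per_100k_per_year / 100_000 * cohort_size * years
print(round(expected_cases))
```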
Secondary analyses of already existing data can provide alternate, cost-effective opportunities. It is now common for governments to sponsor both regular health behavior surveys and mandatory cancer registries. For those cancer cases who contributed to a survey prior to diagnosis, their health behavior and cancer outcomes can be linked to produce a retrospective cohort. Data linkage avoids recall and survivor biases and is cost-efficient (as the required data are already collected, compiled and cleaned).
But data linkage may not be feasible either. Confidentiality is one issue. But more fundamentally, as esophageal cancer is relatively rare, the number of cancer cases who happened to have previously participated in the health survey is likely to be very small. If data linkage cannot be applied, is there any other way in which these rich (and expensive) data sets can be used to help provide insights into the association between pre-diagnosis behavior and post-diagnosis survival times?
Often the only measures in common between cancer registries and national health surveys are the demographic characteristics of participants. It is known that demographically similar people are more likely to display similar health behavior than people from different demographic groups (10). That is, different demographic groups have a different propensity for particular behaviors. Propensity calculated from demographic variables may be a weak indicator of actual behavior, but with large data sets even weak signals are detectable. This study investigated whether or not useful information on the association between pre-diagnosis health behaviors and post-diagnosis survival times could be obtained by analyzing cancer cases' propensity to engage in these behaviors. The analyses used US data and focused mainly on the three modifiable lifestyle factors identified above: cigarette smoking, alcohol consumption and excess body weight.

Methods
The data sets
Unit record data on esophageal cancer cases and their outcomes were extracted from the Surveillance, Epidemiology, and End Results Program (SEER) cancer registry (11). The SEER system is administered by the National Cancer Institute. SEER currently compiles data from cancer registries covering about 28% of the US population across 13 States. Most cancers, including esophageal cancers, are recorded. De-identified unit record data made available for research include demographic measures, medical details of the cancer, treatment and outcomes (including survival time).
Data on health behavior were extracted from the Behavioral Risk Factor Surveillance System (BRFSS) health survey (12). The BRFSS is an annual national survey of health. It commenced in 1984 and now collects data from more than 400,000 telephone interviews each year covering adult residents of all US States and three Territories. The de-identified unit record information made available for research included demographic and health behavior measures, and State population sampling weights.
Both collections provided access to cleaned, de-identified unit record data at no cost to the researcher. Although both data collections are large, with less than 0.2% of American adults participating in BRFSS and around 4,000 esophageal cancer cases being recorded in the SEER data set each year, we could only expect about eight new esophageal cancer cases each year to have participated in the previous BRFSS survey.
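The figure of about eight linkable cases per year can be reproduced as a simple product. The sketch below assumes that survey participation is independent of later cancer status and treats 0.2% as an upper bound on participation; the function name is hypothetical.

```python
def expected_linked_cases(survey_fraction, annual_cases):
    """Expected number of registry cases who also took part in the survey.

    Assumes participation is independent of later cancer diagnosis
    (an illustrative assumption, not a claim made in the text).
    """
    return survey_fraction * annual_cases


# Less than 0.2% of adults surveyed, about 4,000 registry cases per year.
upper_bound = expected_linked_cases(0.002, 4000)
print(upper_bound)
```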

Inclusions and exclusions
This analysis focusses on the 15-year period from 2001 to 2015. Data prior to 2001 are excluded due to changes in the definitions of some health behavior variables and because earlier data may be less relevant to current behavior and outcomes. 2015 was the most recent year of SEER cancer registry data.
As esophageal cancer is rare in young ages, all cancer cases who were less than 35 years of age are excluded as being atypical. 201 of 57025 (0.3%) cases are excluded. For the BRFSS health survey, all data records from respondents 25 or more years of age who lived in one of the 13 US States represented in the SEER cancer registries are included. Including the younger respondents allows information on health behavior up to 10 years prior to cancer diagnosis to be retained.

Outcome variable
The outcome of interest is post-diagnosis survival time in months as recorded in the SEER cancer registry data set. That is, all cases with survival less than 30.4 days after diagnosis (including cancers detected post-mortem) have a survival time of 0 months, those who died between 30.4 and 60.8 days have a survival time of 1 month, etc. The maximum possible survival time is 179 months. For those who are still alive and those who are lost to follow-up, survival time is censored at the date of last follow-up.
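The month coding described above amounts to dividing days survived by 30.4 and discarding the remainder. A minimal sketch of that rule as we read it from the description; the registry's own computation may differ in detail, and the function name is hypothetical.

```python
import math

DAYS_PER_MONTH = 30.4  # month length implied by the description above


def survival_months(days_survived):
    """Whole months survived after diagnosis, truncated towards zero."""
    if days_survived < 0:
        raise ValueError("days_survived must be non-negative")
    return math.floor(days_survived / DAYS_PER_MONTH)


# A case dying 45 days after diagnosis is recorded as surviving 1 month.
print(survival_months(45))
```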

Health behavior variables
The research focused mainly on measures relating to cigarette smoking, alcohol consumption and excess body weight. The choice of variables was restricted to measures available through the BRFSS health survey. The following variables, all recording self-reported behavior, were included:
Current smoker (yes/no), which includes those who smoke daily or less than daily;
Alcohol - heavy drinking (yes/no), which is defined as more than two standard drinks per day for men and more than one standard drink per day for women in the month prior to survey;
Alcohol - binge drinking (yes/no), which is defined as males reporting having five or more standard drinks or females reporting four or more standard drinks on one occasion in the month prior to survey;
Current smoking and alcohol consumption (yes/no), which is defined as both current smoker and an average consumption of ≥1 standard drink of alcohol per day in the past month;
Obese (yes/no), which is BMI ≥ 30 kg/m²;
Undertook physical activity or exercise in the past 30 days other than regular job (yes/no).
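The thresholds in these definitions can be written as simple indicator functions. The sketch below is illustrative only: the function and argument names are hypothetical, and the BRFSS derives these indicators from its own questionnaire items rather than from inputs of this form.

```python
def is_heavy_drinker(sex, avg_drinks_per_day):
    """More than 2 drinks/day (men) or more than 1 drink/day (women)."""
    limit = 2 if sex == "male" else 1
    return avg_drinks_per_day > limit


def is_binge_drinker(sex, max_drinks_one_occasion):
    """Five or more drinks (men) or four or more (women) on one occasion."""
    threshold = 5 if sex == "male" else 4
    return max_drinks_one_occasion >= threshold


def smokes_and_drinks(current_smoker, avg_drinks_per_day):
    """Current smoker who also averages at least 1 drink per day."""
    return current_smoker and avg_drinks_per_day >= 1


def is_obese(weight_kg, height_m):
    """Body mass index of 30 kg/m^2 or more."""
    return weight_kg / height_m ** 2 >= 30


print(is_heavy_drinker("female", 1.5), is_obese(100, 1.80))
```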

Demographic variables
As the cancer registry data did not include information on pre-diagnosis health behavior we estimated the propensity for each pre-diagnosis health behavior for each cancer case using the available demographic variables.
Of the variables in common between the SEER cancer registry and the BRFSS health surveys we hypothesized that year, age, sex, race, marital status and State of residence could be helpful for predicting health behavior. For example, race is known to be associated with smoking (13) and alcohol dependence (14) in the US. Also, living as married ameliorates social isolation and social isolation is associated with adverse health behaviors such as smoking, higher BMI, and lower desire for exercise (15).
As age was recorded in 5-year age groups in the SEER cancer registry data, we applied the same categories to the BRFSS health survey data. Race was categorized as White; Black; Asian or Pacific Islander; and American Indian or Alaskan native. Participants in the BRFSS health survey who self-reported as mixed race (n = 44,670, 3.1% of total) were omitted as there was no corresponding code in the SEER cancer registry data set. Marital status was categorized as married or living as married; divorced or separated; widowed; and single.

Other factors considered
Post-diagnosis survival time is sensitive to a range of factors, some of which could potentially confound associations with pre-diagnosis health behavior and survival time. For example, the association between health behaviors and incidence of esophageal cancer is known to differ by histological type (3,16) and these differences appear to carry over into survival time (17,18). Therefore, we have conducted sub-group analyses for squamous cell carcinoma (ESCC) and adenocarcinoma (EAC). Also, age is associated with survival time (19) and health behavior can change with age. Age, recorded in 5-year age groups but treated as a continuous variable, is included in the final models as a potential confounder.
Somewhat more difficult was how to address cancer stage. Cancer stage at diagnosis is an important predictor of survival time (19) and could perhaps be associated with health behavior, although this association may be an intermediary step between health behavior and survival time rather than a true confounder. For completeness we opted to adjust for cancer stage in our models. Recording of cancer stage at diagnosis was incomplete in the SEER cancer registry data, being unavailable from 2001 to 2003 and having 18% missing data across the other years. We have excluded cancer stage prior to 2004 and categorized it into 5 categories (stage I, stage II, stage III, stage IV, not specified) from 2004 onwards.
Eligible data records
56,824 SEER esophageal cancer cases and 1,450,775 BRFSS health survey respondents met the eligibility criteria. Additional file 1 summarizes the characteristics of the two samples. Among the cancer cases, median time until death was 7 months and median follow-up time of censored observations (18.6%) was 30 months. 52.9% of cases were EAC and 33.7% ESCC. 16.1% of the BRFSS respondents were current smokers and 4.8% were judged to be heavy drinkers of alcohol. The BRFSS respondents included higher proportions of younger people and females than the SEER cases.

Statistical analysis
The characteristics of eligible cancer registry cases and health survey respondents are summarized using counts and percentages, with the exception of survival time which is summarized using medians, quartiles and maximums.
Propensities for health behaviors were estimated from the BRFSS health survey data using logistic models, with a separate model for each health behavior. Each modelled the probability of having the behavior of interest based on year of survey, age, sex, race, marital status and State of residence. We also allowed for differences in the propensity for health behaviors between sexes and between marital statuses at different ages by including age by sex, age by marital status and marital status by sex interaction terms in each logistic model.
To correct for the complexities in the BRFSS health survey sampling and non-response we weighted the logistic models by the sampling weights provided. In 2011, the BRFSS introduced a new method of calculating sampling weights which improved the weighting of some variables including race and marital status. However, as both systems weight to the State totals, we do not differentiate between the different types of weights in this analysis. We excluded data records with extreme sampling weights: those which fell in either the top or bottom 0.5% of the distribution. To assist the models to converge we used Firth's bias-reduced penalized likelihood when fitting the models, using the logistf package (version 1.23) in R software (version 3.5.2). The fitted models are summarized in Attachment 5.
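The exclusion of extreme sampling weights can be illustrated as a two-sided percentile trim. This is a sketch only: the text does not state which percentile definition was used, so the default ("exclusive") method of Python's `statistics.quantiles` is an assumption, and the function name is hypothetical.

```python
import statistics


def trim_extreme_weights(weights, tail=0.005):
    """Keep weights lying between the `tail` and `1 - tail` quantiles.

    The quantile definition is an assumption; the published analysis
    may have used a different rule for the 0.5% cut-offs.
    """
    n = round(1 / tail)  # tail=0.005 gives 200 equal-probability groups
    cuts = statistics.quantiles(weights, n=n)
    lower, upper = cuts[0], cuts[-1]
    return [w for w in weights if lower <= w <= upper]


# Toy example: 1,000 distinct weights; roughly 1% of records are dropped.
toy_weights = list(range(1, 1001))
kept = trim_extreme_weights(toy_weights)
print(len(kept))
```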
Year and age category were fitted as numeric variables while sex, race, marital status and State of residence are categorical. Preliminary investigations (not reported) confirmed that a linear model was reasonable for both year and age category. Year is coded as 0 for 2001 through to 14 for 2015 for analysis.
We confirmed that the chosen risk profiling variables were indeed predictors of each health behavior by visual inspection of odds ratios from logistic regression models. To help gauge the predictive ability of each demographic variable we present areas under the curve (AUC) of the receiver operating characteristic (ROC) curve for each predictor alone and for the full logistic model using the pROC package (version 1.13.0) in R software. The higher above 0.5 the AUC, the greater the ability of the model to predict the health behavior.
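The AUC can be read as the probability that a randomly chosen respondent with the behavior receives a higher predicted propensity than a randomly chosen respondent without it. The sketch below computes it from that definition by comparing every such pair. It is not the pROC implementation used in the analysis, and the function name and toy data are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: predicted propensities and observed behavior (1 = yes).
print(auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))
```

A value of 0.5 corresponds to a predictor with no discriminating ability, which is why the text uses 0.5 as the reference point.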
For each esophageal cancer case in the SEER cancer registry, we estimated their propensity of participating in each health behavior by substituting their demographic characteristics into the logistic predictive model for that behavior. As we were specifically interested in health behavior prior to diagnosis we trialed three pre-diagnosis time points: 1, 5 and 10 years prior to diagnosis. This entailed substituting diagnosis year minus 1, 5 or 10 as the year variable of the logistic model and the 5-year age group minus 0, 1 or 2 categories as the age variable. To avoid extrapolating earlier than the observed data, the 5-year lag analysis was restricted to esophageal cancer cases from 2006 to 2015 and the 10-year lag model was restricted to cases from 2011 to 2015.
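The substitution step can be sketched as follows. The coefficients are hypothetical placeholders, not the fitted values, and the sketch uses main effects only: it omits the interaction terms and the categorical coding of the actual models. Function names and the age-group index are also assumptions made for illustration.

```python
import math

BASE_YEAR = 2001                   # year is coded 0 for 2001 ... 14 for 2015
AGE_SHIFT = {1: 0, 5: 1, 10: 2}    # 5-year age groups to step back per lag


def lagged_inputs(diagnosis_year, age_group_index, lag_years):
    """Year code and age group to substitute for a given lag.

    The text restricts the 5- and 10-year lag analyses to later diagnosis
    years so that the lagged year does not fall before 2001.
    """
    year_code = (diagnosis_year - lag_years) - BASE_YEAR
    return year_code, age_group_index - AGE_SHIFT[lag_years]


def propensity(intercept, coefs, covariates):
    """Inverse-logit of a linear predictor (main effects only)."""
    eta = intercept + sum(coefs[name] * covariates[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, for illustration only.
coefs = {"year": -0.02, "age_group": -0.10, "male": 0.30}
year_code, age_group = lagged_inputs(2015, 8, lag_years=10)
p = propensity(-0.5, coefs, {"year": year_code, "age_group": age_group, "male": 1})
print(year_code, age_group, round(p, 3))
```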
The relationship between the estimated probability of each behavior and survival was investigated using Cox regression models using the survival package (version 2.43-3) in R software. Separate models were fitted for each behavior. Results are presented as hazard ratios (HRs) with associated 95% confidence intervals.

Results
Table notes: (a) The hazard ratio describes the impact of a 0.1 increase in the probability of having the specified health behavior. (b) Adjusted for age and cancer stage at diagnosis.
Smoking one year prior to diagnosis appears to be unrelated to survival until adjustment for age and disease stage at diagnosis. In the adjusted model, each 0.1 increase in the probability of pre-diagnosis smoking is associated with a 20% (HR 1.20, 95% CI 1.18-1.22) increase in post-diagnosis hazard with no discernible difference in results for ESCC and EAC subgroups.
Results for alcohol consumption are mixed. When using behavior one year prior to diagnosis as the predictor, a 0.1 increase in the probability of heavy drinking appears to be protective of survival even after adjustment for age and cancer stage at diagnosis (HR 0.82, 95% CI 0.76-0.88). However, when looking at behavior 10 years prior to diagnosis, the adjusted model finds heavy drinking to be detrimental to post-diagnosis survival in ESCC (HR 1.30, 95% CI 1.08-1.57) and with no discernible association in EAC (HR 1.10, 95% CI 0.95-1.26). The pattern of results for binge drinking is quite similar.
A 0.1 increase in the probability of concurrently smoking and drinking ≥1 standard drink per day in the year prior to diagnosis is associated with double the risk of death (HR = 1.93, 95% CI 1.72-2.16), after adjustment for age and cancer stage with no difference between ESCC and EAC.
After adjustment, a 0.1 increase in the probability of being obese one year prior to diagnosis is associated with an apparently trivial increase in post-diagnosis hazard (HR 1.04, 95% CI 1.02-1.06). A slightly larger hazard ratio (HR 1.10, 95% CI 1.07-1.14) was recorded for a 0.1 increase in the probability of being obese 10 years prior to diagnosis. A 0.1 increase in the probability of exercise outside employment one year prior to diagnosis is associated with improved survival (HR 0.82, 95% CI 0.81-0.84) with little difference between ESCC and EAC.

Discussion
The results above appear to support the proposition that demographic-derived estimates of the propensity of health behaviors can assist in identifying associations between pre-diagnosis health behavior and post-diagnosis survival time in esophageal cancer. The hazard ratios quoted in this paper show the increased hazard of death associated with each additional 0.1 probability of the health behavior of interest. This is quite different from the direct measure of impact of the health behavior on survival time and more difficult to interpret. Nevertheless, there is consistency between the results of the present study and previously published results, especially in the presence and direction of associations.
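One way to read these hazard ratios is sketched below, assuming the propensity p enters the Cox model as a single linear term alongside the adjustment variables z. The symbols β, γ and Δ are notation introduced here for illustration and do not appear in the fitted models as reported.

```latex
h(t \mid p, z) = h_0(t)\,\exp\!\left(\beta p + \gamma^{\top} z\right), \qquad
\mathrm{HR}_{0.1} = \exp(0.1\,\beta), \qquad
\mathrm{HR}_{\Delta} = \exp(\beta\,\Delta) = \mathrm{HR}_{0.1}^{\,\Delta/0.1}.
```

Under that assumption, the adjusted smoking estimate of HR 1.20 per 0.1 increase would imply a hazard ratio of about 1.20^3 ≈ 1.73 between two demographic groups whose smoking propensities differ by 0.3. This remains a comparison of propensities, not of smokers with non-smokers.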
We have found that a 0. The unexpectedly protective result for alcohol consumption one year prior to diagnosis could indicate insufficient adjustment for confounding (such as comorbidities or health symptoms) or weaknesses in the measurement tool (such as biases in the self-reporting of alcohol consumption in standard drinks).
Previous authors have found that pre-diagnosis smoking and alcohol consumption combined produce a disproportionately high risk to post-diagnosis survival (for example, HR 3.84, 95% CI 2.02,7.32 for ESCC (17)). We have also found that a 0.1 increase in the probability of concurrent daily smoking and consuming one or more alcoholic drinks per day one year prior to diagnosis, adjusted for age and cancer stage at diagnosis, had a relatively high estimated HR of 1.93 (95% CI 1.79,2.07).
We observed that a 0.1 increase in the probability of being obese one year prior to diagnosis was associated with a slightly higher risk of death (adjusted HR 1.04, 95% CI 1.03,1.06), mainly associated with ESCC (HR 1.07, 95% CI 1.04,1.10). The association seems small and the literature on obesity is sparse with mixed findings. One review found pre-diagnosis obesity could be associated with higher risks of death in cancer (specifically breast, prostate and colorectal cancers) (26) but a later study reported that pre-diagnostic obesity increased hazard for all cancers except cancers of the upper digestive tract (obese compared to normal weight HR 0.87, 95% CI 0.62,1.22) (27). More recently a North American study (23) found recalled obesity in early adulthood was associated with lower survival times than normal weight (HR 1.77, 95% CI 1.25,2.51). The measure of obesity available in this study may not be optimal.

Exercise
We found that a 0.1 increase in probability of pre-diagnosis physical activity outside of the workplace was associated with improved survival (adjusted HR = 0.82, 95% CI 0.81,0.84). This is consistent with a recent review (28) which found the relative risk of death between the highest versus lowest category of physical activity to be 0.71 (95% CI 0.57,0.89) for esophageal cancer.

Strengths and weaknesses
Our analyses using propensity for health behaviors have produced results which have some face validity. A strength of this example is that the data sets used are large, public domain and well understood. Any interested researcher can reproduce, refine and/or extend these analyses using the same data sets.
Both the data sets and the analysis technique used have some limitations and weaknesses. In relation to the data sets, there are response biases within the BRFSS (29) which the sampling weights may not have fully addressed. Further, the measures of behavior available are limited and are dictated by the existing data base which was designed for other purposes and is not optimized for our research question.
For the model, estimating the propensity of a behavior is less accurate than a direct measure of behavior and conveys less information about that behavior: so will have less power for detecting associations.
There may be residual confounding from unmeasured variables (such as education, socio-economic status or comorbidities). Finally, omitting interactions with year may have contributed to the apparent lack of difference in outcomes between behavior one, five and ten years prior to diagnosis.

Conclusion
The rarer the disease, the less feasible it is to conduct either prospective cohort studies or record linkage (retrospective cohort) studies. Retrospective data collections (including case-control studies) are fraught with recall and survivor biases. Exploiting existing data provides cost-effective opportunities for investigations but may require different methodologies.
Analyses of the associations between propensity for pre-diagnosis health behavior (based on demographic characteristics) and survival time in esophageal cancer produced results with some face validity. Expressing associations in units of changes in the probability of the health behavior was cumbersome. However, the required data are already available, allowing relatively quick and inexpensive investigations of possible associations between pre-diagnosis behavior and post-diagnosis outcomes for relatively rare diseases. And of course, most diseases are relatively rare.

List Of Abbreviations
AUC, area under the curve; BMI, body mass index; BRFSS, Behavioral Risk Factor Surveillance System; CI, confidence interval; EAC, esophageal adenocarcinoma; ESCC, esophageal squamous cell carcinoma; HR, hazard ratio; OR, odds ratio; ROC, receiver operating characteristic curve; SEER, Surveillance, Epidemiology, and End Results Program.