Data Description and Case Definition
We analyzed primary data from anonymized electronic medical records of 250,000 individuals between 2007 and 2017, collected from Maccabi Healthcare Services (Maccabi). Maccabi is the second-largest health maintenance organization (HMO) in Israel, serving about 25% of the population (2,215,000 clients). Maccabi’s clients are representative of the Israeli population and reflect all demographic, ethnic, and socioeconomic groups [26].
In order to avoid biases that may arise from death or changes in healthcare provider, we randomly selected 250,000 individuals who were members of Maccabi during the entire period from 2007 to 2017 or who were born during this period and remained members until 2017. We chose 2008 as the earliest influenza season because it was the first season in which the influenza vaccine was offered free of charge to members of all health-care providers in Israel [27].
For each member, we compiled demographic characteristics, influenza vaccination history, respiratory diagnoses, prescriptions, encounters with the healthcare system, hospitalizations, chronic illnesses, and family connections to other members in the data set (SI Appendix, Table S1). The data were approved for use by Maccabi’s sub-Helsinki institutional review board, signed by Dr. Yosef Azuri, protocol number 0048-17-BBL.
Season: Influenza is a seasonal disease, with the highest prevalence between October and March [28]. Therefore, we defined each “season” as the period from June 1 until May 31 of the following year and named it after the year in which it ends. For example, “season 2016” was defined as the period between June 1, 2015 and May 31, 2016.
Age group: Our data specify the year of birth of each member. We therefore divided the population into seven age groups (0-4, 5-16, 17-25, 26-35, 36-50, 51-64, and 65+). This division was done with respect to the season in question; if a specific analysis included several seasons, the population was divided into age groups separately in each season. Therefore, in some cases, a patient belonged to one age group in a specific season and moved to another age group in the following season.
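As an illustration of how these two definitions interact, the following minimal Python sketch (with hypothetical field names; it is not part of the original analysis pipeline) assigns a date to a season and a member to a season-specific age group:

import datetime

AGE_BINS = [(0, 4), (5, 16), (17, 25), (26, 35), (36, 50), (51, 64), (65, 200)]

def season_of(date):
    # Seasons run from June 1 to May 31 and are named after the end year,
    # e.g., June 1, 2015 - May 31, 2016 is "season 2016".
    return date.year + 1 if date.month >= 6 else date.year

def age_group(birth_year, season):
    # Age is computed with respect to the season in question, so a patient
    # may belong to different age groups in consecutive seasons.
    age = season - birth_year
    for low, high in AGE_BINS:
        if low <= age <= high:
            return "65+" if low == 65 else f"{low}-{high}"
    return None

# A visit on October 20, 2015 falls in season 2016; a member born in 1999
# belongs to the 17-25 age group in that season.
print(season_of(datetime.date(2015, 10, 20)), age_group(1999, 2016))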
Respiratory Illness: We define a respiratory illness as a respiratory diagnosis reported by a physician, according to several codes from the International Classification of Diseases, Ninth Revision (ICD-9; SI Appendix, Table S2). Given that perceptions, rather than the actual cause of infection, govern an individual’s decision to get vaccinated against influenza [18–20], we analyzed this considerably broader definition rather than using ICD-9 codes limited to influenza or influenza-like illness.
Family: Family members are defined as a set of patients having the same “family code” in the surveillance systems of Maccabi. Among the 250,000 patients included in our study, 55,749 patients shared a “family code” with at least one other patient within the data, creating 25,999 families.
To accomplish the two goals of this study, we established three discrete steps. The first step was personal pattern analysis, in which we identified behavioral patterns that relate to the personal aspects of an individual’s vaccination decision. The second step was social pattern analysis, in which we evaluated the environmental factors that may influence an individual’s vaccination decision. Finally, in the predictive modeling step, we developed a machine learning model that converted the data and the insights from the previous steps into an individual-level prediction of a future vaccination decision. The third step was conducted in accordance with the TRIPOD statement [29] for multivariable prediction models.
A. Descriptive personal patterns
We examined the behavior of a random variable representing the probability that a patient becomes vaccinated in a given season, as a function of the patient’s vaccination decision and respiratory illnesses in the previous season. For each season (among the 2009-2017 seasons), we divided the population into seven age groups and then divided each age group into four mutually exclusive sub-groups: 1) patients who were vaccinated and became infected in the previous season, 2) patients who were vaccinated and did not become infected in the previous season, 3) patients who were not vaccinated and became infected in the previous season, and 4) patients who were not vaccinated and did not become infected in the previous season. We calculated the proportion of patients who became vaccinated in the given season for each of the 28 resulting groups.
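As a rough sketch of this stratification (in Python with pandas, assuming a hypothetical patient-season table whose column names are illustrative only), the 28 proportions per season can be computed with a single grouped aggregation:

import pandas as pd

# Assumed columns: 'season', 'age_group', 'vaccinated' (0/1 in the given
# season), 'prev_vaccinated' and 'prev_infected' (0/1 in the previous season).
df = pd.read_csv("patient_seasons.csv")

proportions = (
    df.groupby(["season", "age_group", "prev_vaccinated", "prev_infected"])
      ["vaccinated"]
      .mean()                        # proportion vaccinated in the given season
      .rename("vaccination_proportion")
      .reset_index()
)
# 7 age groups x 4 previous-season sub-groups = 28 rows per season.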
We then created a measurement that encodes when during the season each patient became vaccinated, and named it the patient’s Average Vaccination Rank. In each season, each patient received a value in the range [0, 1] representing the timing of the vaccination within the season relative to other patients. The first patient to become vaccinated received the value 1, and the last patient to become vaccinated received the value 0. The patient’s Average Vaccination Rank was defined as the average of these values across the seasons in which the patient became vaccinated. To examine the distribution of the Average Vaccination Ranks among patients, we divided the population into 11 mutually exclusive groups according to the number of vaccinations in the 10-season period. For this purpose, we excluded patients under 10 years old. We then used boxplot analysis to depict the distribution of the Average Vaccination Rank within each group, separately by age group.
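A minimal sketch of the Average Vaccination Rank computation, assuming a hypothetical table of administered vaccinations (the column names are illustrative only):

import pandas as pd

# Assumed columns: 'patient_id', 'season', 'vacc_date' (one row per vaccination).
vacc = pd.read_csv("vaccinations.csv", parse_dates=["vacc_date"])

# Within each season, order patients by vaccination date so that the first
# patient to become vaccinated receives 1 and the last receives 0.
vacc["order"] = vacc.groupby("season")["vacc_date"].rank(method="first")
n = vacc.groupby("season")["vacc_date"].transform("count")
vacc["scaled_rank"] = 1 - (vacc["order"] - 1) / (n - 1).clip(lower=1)

# Average Vaccination Rank: the mean of these values across the seasons in
# which the patient became vaccinated.
avg_rank = vacc.groupby("patient_id")["scaled_rank"].mean()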
We also examined the relationship between healthcare consumption and the vaccination decision. We used two measurements for this purpose: 1) the average number of prescribed medications (of any kind) per season, and 2) the vaccination proportion, calculated as the number of seasons in which the patient became vaccinated out of 10 seasons (or fewer, for patients under 10 years old). We then divided the population into 10 equally sized groups by deciles of the average number of prescriptions and examined the distribution of the vaccination proportion for each group separately using boxplot analysis. In addition, we conducted a similar analysis for the average number of encounters with the healthcare system (of any kind).
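A comparable sketch for the healthcare-consumption analysis, assuming a hypothetical per-patient summary table:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: 'avg_prescriptions_per_season', 'vaccination_proportion'.
per_patient = pd.read_csv("per_patient_summary.csv")

# 10 equally sized groups by deciles of the average number of prescriptions.
per_patient["decile"] = pd.qcut(per_patient["avg_prescriptions_per_season"],
                                q=10, labels=False, duplicates="drop")

# Distribution of the vaccination proportion within each decile.
per_patient.boxplot(column="vaccination_proportion", by="decile")
plt.show()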
B. Descriptive social patterns
Family analysis – We analyzed the vaccination decision of patients with respect to the vaccination decisions of other members of the patients’ families. We then calculated a relative risk for becoming vaccinated, defining the exposed group as the group of patients with at least one other family member who became vaccinated in a given season. This analysis was conducted for all seasons combined. We also examined the similarities between family members over time and compared them to the similarities between randomly sampled patients (SI Appendix, Figure S1).
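A minimal sketch of the relative-risk calculation, assuming a hypothetical patient-season table restricted to the 55,749 patients with family members in the data:

import pandas as pd

# Assumed columns: 'vaccinated' (0/1 in the given season) and
# 'family_member_vaccinated' (1 if at least one other family member became
# vaccinated in that season, 0 otherwise); rows are pooled over all seasons.
fam = pd.read_csv("family_patient_seasons.csv")

exposed = fam.loc[fam["family_member_vaccinated"] == 1, "vaccinated"]
unexposed = fam.loc[fam["family_member_vaccinated"] == 0, "vaccinated"]

# Relative risk of becoming vaccinated given a vaccinated family member.
relative_risk = exposed.mean() / unexposed.mean()
print(relative_risk)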
Geographic analysis – Each patient was associated with one of 137 Maccabi clinics, where they received most of their medical care. We used the vaccination proportion, calculated as the number of seasons in which the patient became vaccinated out of 10 seasons (or fewer, for patients under 10), and grouped the patients by clinic. We then conducted an ANOVA test to examine the variance in vaccination proportion between clinics. In addition, we calculated the proportion of patients who became vaccinated at each clinic in each season. We then took all 3,070 statistical areas in Israel (according to the Israeli Central Bureau of Statistics) and associated each statistical area with its nearest clinic. Using these data, we created a heat-map that displays the variation in vaccination proportion across these areas, based on the average vaccination proportion across all seasons in the associated clinics.
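The between-clinic comparison can be sketched with a one-way ANOVA (hypothetical column names; the mapping of statistical areas to clinics is omitted):

import pandas as pd
from scipy import stats

# Assumed columns: 'clinic_id' (one of 137 clinics), 'vaccination_proportion'.
patients = pd.read_csv("patients_by_clinic.csv")

# One-way ANOVA on the vaccination proportion across clinics.
groups = [g["vaccination_proportion"].values
          for _, g in patients.groupby("clinic_id")]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)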
Socioeconomic analysis – We performed an analysis of the vaccination proportion with respect to the socioeconomic score, ranging from 1 to 10, that Maccabi assigned to each patient. We calculated the average vaccination proportion for each socioeconomic score across all seasons and displayed the variation in vaccination proportion across socioeconomic scores.
C. Predictive modeling
We developed a model to predict the influenza vaccination behavior of a patient in a future influenza season. The development of the predictive model was guided by and conducted in accordance with the TRIPOD statement [29] for multivariable prediction models. This is a classification problem with a binary label, as the patient may or may not become vaccinated (positive or negative label, respectively). We created a time-free model, which does not attempt to predict a patient’s behavior in a specific season, but rather allows predicting behavior in any future season given the relevant previous data. Based on the results of an entropy analysis (SI Appendix), we used the data of three consecutive seasons to predict the behavior in the subsequent fourth season. For example, we used data from seasons 2008 through 2010 to predict each patient’s behavior in 2011, data from seasons 2009 through 2011 to predict behavior in 2012, and so on. Therefore, the ten seasons of data for each patient produced up to seven labeled records (seasons 2011-2017), depending on the patient’s age, creating a dataset of 1,553,907 records in total.
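The construction of these sliding-window records can be sketched as follows (hypothetical column names; the actual per-season features are listed in Table S3):

import pandas as pd

# Assumed columns: 'patient_id', 'season', 'vaccinated' (one row per
# patient-season), plus any per-season features to be aggregated.
history = pd.read_csv("patient_seasons.csv")

records = []
for pid, g in history.groupby("patient_id"):
    g = g.sort_values("season").set_index("season")
    for target in range(2011, 2018):               # label seasons 2011-2017
        window = [target - 3, target - 2, target - 1]
        if all(s in g.index for s in window) and target in g.index:
            feats = {f"vaccinated_minus_{3 - i}": g.loc[s, "vaccinated"]
                     for i, s in enumerate(window)}
            feats.update(patient_id=pid, label=g.loc[target, "vaccinated"])
            records.append(feats)

dataset = pd.DataFrame(records)   # up to seven labeled records per patient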
The features we used for the prediction were based on demographic attributes of the patient and on calculated or aggregated medical characteristics. In order to build a comprehensive model, we kept the features of the dataset rather basic. Because 55,749 of the patients had family members within the data, we could extract family-related features in addition to the basic features. We therefore divided the 1,553,907 records into two datasets: a basic dataset containing 1,225,032 records, where each record had 27 features and a label, and a ‘family’ dataset containing 328,875 records, where each record had 30 features and a label. The features are described under Sociodemographic & EMR in Table 1 (a detailed description can be found in Table S3 in the SI Appendix).
We trained several models using the following algorithms: 1) Logistic Regression [30], 2) Naïve Bayes [30], 3) XGBoost Random Forest [31], 4) LightGBM Random Forest [32], and 5) Artificial Neural Network [30]. During preprocessing, we used min-max scaling where relevant. The Random Forest algorithms were trained using 500 decision trees.
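The training setup can be sketched as follows (synthetic stand-in data; the exact implementation and hyper-parameters are not reproduced here):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRFClassifier

# Synthetic stand-in for the features and labels of one of the two datasets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 27))
y_train = rng.integers(0, 2, size=1000)

# Min-max scaling is applied where relevant (e.g., for logistic regression).
X_scaled = MinMaxScaler().fit_transform(X_train)
logreg = LogisticRegression(max_iter=1000).fit(X_scaled, y_train)

# Tree ensemble with 500 decision trees (here, XGBoost's random forest mode).
xgb_rf = XGBRFClassifier(n_estimators=500).fit(X_train, y_train)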
We split both the basic dataset and the family dataset into training sets (70%), validation sets (15%), and test sets (15%). We trained each algorithm on both training sets, creating two models per algorithm, one for each dataset (basic and family), using 4-fold cross-validation to maximize the ROC AUC score. When training the models, we tuned hyper-parameters using grid search to find the combination of parameters that performed best on the training set. A detailed description of the hyper-parameter tuning is presented in Table S4.
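A sketch of the split and grid search, again with synthetic stand-in data (the grid shown is illustrative only; the actual grids are in Table S4):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 27))
y = rng.integers(0, 2, size=2000)

# 70% / 15% / 15% split into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.70, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# 4-fold cross-validated grid search maximizing the ROC AUC score.
param_grid = {"num_leaves": [31, 63], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(LGBMClassifier(n_estimators=500), param_grid,
                      cv=4, scoring="roc_auc")
search.fit(X_train, y_train)
best_model = search.best_estimator_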
After the training phase, we obtained two trained models for each algorithm. We predicted the labels of the records in the validation sets and evaluated the performance of the models using the ROC AUC score, precision, recall, and F1-score.
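These metrics can be computed from each model's validation predictions roughly as follows (toy arrays for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

# Hypothetical validation labels, predicted probabilities, and binary predictions.
y_val = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.5, 0.9, 0.1])
preds = (probs >= 0.5).astype(int)

auc = roc_auc_score(y_val, probs)
precision, recall, f1, _ = precision_recall_fscore_support(y_val, preds, average="binary")
print(auc, precision, recall, f1)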
As a benchmark, we created four additional simple models. The first is the Age model, which relies on the patient’s age alone. The second is the Sociodemographic (SD) model, which does not use personal medical records but instead uses a subset of demographic features of the patient and the vaccination coverage in the patient’s clinic. The third is the Vaccination decision in the previous season (PS) model, which relies solely on the patient’s behavior in the previous season; this model predicts that an individual who became vaccinated in the last season will become vaccinated in the predicted season, and vice versa. The fourth simple model, the Vaccination decision in the previous season & Sociodemographic (PS&SD) model, uses the patient’s behavior in the previous season in addition to the features used by the SD model. The features of these models are described in Table 1. Because none of the simple models uses family-based features, they were all trained on the entire training population and predicted the labels for all validation samples. The simple models were trained using only the algorithm that performed best on the basic and family datasets.
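For the PS benchmark in particular, the rule can be written out explicitly (hypothetical toy arrays for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score

# Whether each validation patient was vaccinated in the previous season, and
# the true label for the predicted season (toy values).
prev_season = np.array([1, 0, 1, 1, 0, 0])
true_label = np.array([1, 0, 1, 0, 0, 1])

# The PS model simply carries the previous season's decision forward.
ps_prediction = prev_season
print(roc_auc_score(true_label, ps_prediction))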
Finally, we evaluated the most successful models and the simple models on the test sets, and analyzed the feature importance of these models.