Taiwan myocardial infarction data
The data source was Taiwan's National Health Insurance Research Database (NHIRD) [9], from which we drew two datasets: a patient demographic dataset covering people in Taiwan diagnosed with MI from 2010 to 2013, and a hospitalization dataset containing the corresponding inpatient records from 1996 through 2013. Both datasets were de-identified to meet the obligations of the privacy act.
The hospitalization dataset contained 18,875 inpatient records, each including the date of diagnosis and one to five diagnoses encoded in ICD-9-CM, the International Classification of Diseases, Ninth Revision, Clinical Modification, a hierarchical system for assigning diagnostic and procedure codes [10]. A diagnosis code consists of three digits identifying an individual disease, optionally followed by up to two digits after the decimal point specifying its details. Since we focused only on individual diseases, we discarded the digits after the decimal point, leaving three-digit ICD-9 codes denoting disease diagnoses. The patient demographic dataset was drawn from the 1-million-person population sample and included gender, birth date, and death date (the death date was missing if the patient was still alive on the final collection date, 2013-12-31).
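For illustration, a minimal sketch of this truncation step; the helper name and the assumption that codes arrive as strings with a decimal point are ours, not the NHIRD storage format:

```python
# Hypothetical helper: keep only the three-digit ICD-9 disease category.
def to_three_digit(icd9_code: str) -> str:
    """Drop everything after the decimal point, e.g. "410.71" -> "410"."""
    return icd9_code.split(".")[0][:3]

assert to_three_digit("410.71") == "410"  # 410 = acute myocardial infarction
assert to_three_digit("250") == "250"     # codes without a decimal pass through
```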
Study Design
Figure 1 illustrates the stages of the study. The initial stage involved merging the two datasets and retrieving the corresponding hospitalization records. We first performed a survival analysis for the patients and defined death within 1 year of the first diagnosis as short-term death. Then, to prepare for the subsequent odds ratio calculation, we removed subjects diagnosed in the last data collection year, 2013, to ensure that short-term death was observable for all remaining subjects. A total of 730 subjects were removed, leaving 2,123 subjects with 13,707 related inpatient records. We then transformed these data into trajectories of diagnoses and dates for the subsequent mining process. In the final stage, we applied interval sequential pattern mining and filtered significant patterns by odds ratio. By combining these patterns, we constructed a disease trajectory diagram to demonstrate the disease progression.
Figure 1. Stages of the study. The data comprise two datasets from Taiwan's health insurance database: the patient demographic dataset and the hospitalization records.
Merging The Hospitalization And Patient Demographic Datasets
This study comprised a patient demographic dataset and a hospitalization dataset. We merged the two and retrieved the first diagnosis date for each patient. The first diagnosis date of our target disease, myocardial infarction, was defined as the date the patient was first hospitalized with the disease. This yielded 2,853 patients with 18,875 hospitalization records covering each patient's hospitalization history from 1996 to the first diagnosis date.
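A sketch of this merge step is shown below; the file names and column names (patient_id, admission_date, icd9, and so on) are assumptions for illustration, not the actual NHIRD schema:

```python
import pandas as pd

# Assumed inputs: one row per patient (demographics) and one row per
# inpatient diagnosis (hospitalizations).
demo = pd.read_csv("patient_demographics.csv", parse_dates=["birth_date", "death_date"])
hosp = pd.read_csv("hospitalizations.csv", parse_dates=["admission_date"])

# First MI diagnosis date per patient: the earliest admission carrying
# the three-digit code 410 (myocardial infarction).
mi_rows = hosp[hosp["icd9"].str[:3] == "410"]
first_dx = mi_rows.groupby("patient_id")["admission_date"].min().rename("first_dx_date")

# Attach the first diagnosis date and keep only the history up to that date.
merged = hosp.merge(first_dx, on="patient_id").merge(demo, on="patient_id")
history = merged[merged["admission_date"] <= merged["first_dx_date"]]
```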
Survival Analysis And 1-year Short-term Death
Survival analysis provides tools to estimate the time from an index event to death for a group of people. Some problems are typical of survival datasets. One is censoring: a study has a data collection window, but not all events of interest occur before the end of that window, so right censoring is necessary. For instance, in our dataset the last collection date was 2013-12-31 and the event of interest was death, but not all patients had died by that date, so mortality could not be observed for every individual. Discarding these individuals, however, would waste information and introduce bias. Another problem is the skewed distribution of survival times, which invalidates most statistical tools that assume a Gaussian distribution. A popular solution is the Kaplan–Meier estimator, which incorporates censored observations and constructs the survival function as a step function. We were also interested in the hazard at each point in time; unfortunately, the Kaplan–Meier estimator does not work well for constructing the hazard function, so we also used the Nelson–Aalen estimator [11, 12]. We applied both estimators to our dataset to obtain survival and hazard values from the first MI diagnosis to death [13].
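The following sketch reproduces this step with the lifelines library on a toy stand-in for the cohort; the dataframe and its columns are assumptions for illustration:

```python
import pandas as pd
from lifelines import KaplanMeierFitter, NelsonAalenFitter

# Toy stand-in: duration_years is the time from first MI diagnosis to death
# or to the last collection date (2013-12-31); observed is 1 if the patient
# died and 0 if right-censored.
df = pd.DataFrame({
    "duration_years": [0.1, 0.4, 1.2, 2.5, 3.0, 3.8],
    "observed":       [1,   1,   1,   0,   1,   0],
})

kmf = KaplanMeierFitter()
kmf.fit(df["duration_years"], event_observed=df["observed"])
kmf.plot_survival_function()      # step-function estimate of survival

naf = NelsonAalenFitter()
naf.fit(df["duration_years"], event_observed=df["observed"])
naf.plot_hazard(bandwidth=1.0)    # hazard smoothed with a 1-year bandwidth
```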
The analysis (Fig. 2) showed a steep drop in the survival rate at the beginning of the curve, indicating that many people die soon after the first diagnosis. The hazard function, smoothed with a 1-year bandwidth, likewise showed a high risk of death right after diagnosis; after this 1-year peak, the hazard became lower and stable. Accordingly, we based our investigation on this 1-year peak.
Figure 2. Survival and hazard functions of the myocardial infarction patients in Taiwan from 2010 to 2013 after the first diagnosis. Both functions share a horizontal timeline originating from year zero, when the patient was first diagnosed.
Transformation Into Disease Trajectories
Before applying the mining algorithm, we removed subjects diagnosed in the final data collection year (2013) to guarantee the observability of short-term death. We then labeled each patient according to whether he or she died within 1 year of the first diagnosis. Of the 2,123 patients, 592 (27.89%) died in the short term and 1,531 (72.11%) did not. Next, the hospitalization records for each patient were extracted and transformed into sequences comprising all records up to the date of the first MI diagnosis (Fig. 3).
Figure 3. Transformation of hospitalization records into a disease trajectory. The top line represents a person's timeline, for example, birth on 1997-01-01 and first diagnosis on 1997-01-20. The timeline contains hospitalization records with dates and diagnoses, which are transformed into a disease sequence with the intervals noted from birth.
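A sketch of this transformation, reusing the hypothetical history table from the merge sketch above (column names are again illustrative):

```python
import pandas as pd

def to_trajectory(records: pd.DataFrame) -> list:
    """One (days-from-birth, {three-digit diagnoses}) pair per hospitalization."""
    records = records.sort_values("admission_date")
    birth = records["birth_date"].iloc[0]
    return [
        ((date - birth).days, set(group["icd9"].str[:3]))
        for date, group in records.groupby("admission_date")
    ]

# `history` comes from the merge sketch above.
trajectories = history.groupby("patient_id").apply(to_trajectory)

# Short-term death label: death within 1 year of the first MI diagnosis.
# Patients still alive have a missing death_date and evaluate to False.
per_patient = history.drop_duplicates("patient_id").set_index("patient_id")
short_term = (per_patient["death_date"] - per_patient["first_dx_date"]).dt.days <= 365
```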
Interval Sequential Pattern Mining And Odds Ratio Calculation
Yu Hirate and Hayato Yamana proposed an algorithm in 2006 that generalizes sequential pattern mining to interval-extended sequences with constraints. The algorithm takes seven inputs: an interval-extended sequence database, an interval itemization function that converts item intervals into pseudo items, the minimal support of a pattern, and four types of constraints (the minimum/maximum interval between items and the minimum/maximum interval of the whole sequence). Let \(I=\{i_1, i_2, \dots, i_m\}\) be the set of all items, and let \(X\) be a subset of items sorted alphabetically. A sequence is denoted by \(\langle (t_{1,1}, X_1), (t_{1,2}, X_2), \dots, (t_{1,n}, X_n)\rangle\), where \(t_{\alpha,\beta} = X_{\alpha}.time - X_{\beta}.time\) and \(X_{\alpha}.time\), \(X_{\beta}.time\) are the transaction occurrence times. The output is the set of sequential patterns with intervals that satisfy the minimal support and the constraints. A pattern is denoted by \(\langle (\Delta t_1, x_1), (\Delta t_2, x_2), \dots, (\Delta t_n, x_n)\rangle\), where \(\Delta t\) is an itemized interval and \(x \in I\) [14]. We fed our disease trajectories into this algorithm with the logarithm itemization function (Eq. 1, Table 1), which after some trials we found produced better-converging patterns, and set a minimal support of 100 to obtain patterns.
Equation 1. Logarithm itemization function
$$I\left(t\right)=\left\lfloor \log_{2}\left(\frac{days}{7}+1\right)\right\rfloor$$
Table 1
Lookup table for the logarithm itemization function, from item interval to pseudo item

| Item interval | Pseudo item |
| --- | --- |
| 0 days–7 days | 0 |
| 7 days–21 days | 1 |
| 21 days–35 days | 2 |
| 1.17 months–1.63 months | 3 |
| 1.63 months–3.50 months | 4 |
| 3.50 months–7.23 months | 5 |
| 7.23 months–14.70 months | 6 |
| 1.23 years–2.47 years | 7 |
| 2.47 years–4.95 years | 8 |
| 4.95 years–9.94 years | 9 |
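To make the itemization concrete, below is a minimal implementation of Eq. 1, a check of the interval boundaries it implies, and a simplified brute-force containment test for interval patterns under the notation above. This is an illustrative sketch, not Hirate and Yamana's implementation: it ignores the four interval constraints, assumes one item per pattern element, and all names are ours:

```python
from math import floor, log2

def itemize(days: float) -> int:
    """Eq. 1: map an interval in days to a pseudo item."""
    return floor(log2(days / 7 + 1))

# Boundaries implied by Eq. 1: pseudo item k covers 7*(2**k - 1) days up to,
# but not including, 7*(2**(k + 1) - 1) days.
for k in range(10):
    lo, hi = 7 * (2**k - 1), 7 * (2**(k + 1) - 1)
    assert itemize(lo) == k and itemize(hi - 1) == k

def supports(seq, pattern):
    """Brute-force check that `pattern`, a list of (pseudo_item, code) pairs with
    intervals measured from the first matched element, occurs in `seq`, a list of
    (days, {codes}) pairs sorted by time."""
    def extend(t0, i, j):
        if j == len(pattern):
            return True
        return any(
            pattern[j][1] in codes
            and itemize(t - t0) == pattern[j][0]
            and extend(t0, k + 1, j + 1)
            for k, (t, codes) in enumerate(seq[i:], start=i)
        )
    # The pattern's first element has interval 0 from itself.
    return any(
        pattern[0][1] in codes and pattern[0][0] == 0 and extend(t, k + 1, 1)
        for k, (t, codes) in enumerate(seq)
    )

# Support of a pattern = number of trajectories containing it; patterns below
# the minimal support (100 in this study) are discarded.
example = [(0, {"401"}), (30, {"410"})]             # hypertension, MI 30 days later
assert supports(example, [(0, "401"), (2, "410")])  # 30 days itemizes to pseudo item 2
```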
For each identified pattern, we then calculated the odds ratio, dividing patients into four groups (exposed/not exposed to the pattern, crossed with short-term death/no short-term death) to measure the strength of association. The odds ratio indicates the effect size for categorical outcomes: an odds ratio of 1 means the event is equally likely in the exposed and unexposed (control) groups, while an odds ratio greater than 1 means the event is associated with an increased risk in the exposed group. We required an odds ratio greater than 2, which is generally considered clinically significant [15], as a necessary condition for keeping a pattern.
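A sketch of this filter, assuming a per-pattern boolean exposure vector aligned with the short-term-death labels (names and toy data are illustrative):

```python
def odds_ratio(exposed, died):
    """2x2 odds ratio: (exposed & died)(unexposed & survived)
    over (exposed & survived)(unexposed & died)."""
    a = sum(e and d for e, d in zip(exposed, died))           # exposed, short-term death
    b = sum(e and not d for e, d in zip(exposed, died))       # exposed, survived
    c = sum(not e and d for e, d in zip(exposed, died))       # unexposed, short-term death
    d_ = sum(not e and not d for e, d in zip(exposed, died))  # unexposed, survived
    return (a * d_) / (b * c)

# Keep a pattern only if its odds ratio exceeds 2.
exposed = [True, True, True, False, False, False]
died    = [True, True, False, True, False, False]
significant = odds_ratio(exposed, died) > 2   # (2*2)/(1*1) = 4 -> kept
```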