2 − 1 Study area
Tokyo is located in the Kanto region in the eastern part of Japan. It has a population of 13,995,469 as of April 1, 2022, which is the largest in Japan[25]. The east side of Tokyo is plain and some parts of it are adjacent to the ocean. So, it is affected by the moist easterly winds from the ocean. As there are mountains on the west side, it is affected by the dry warm winds over the mountains from the west. Summers are very hot and humid, and data from the past 30 years (1991–2020) shows that the mean relative humidity from June to September is 75%, and the mean maximum temperature in August is 31.3°C[26].
2–2 Meteorological data
It has been reported that meteorological conditions other than temperature, such as humidity, wind speed, and precipitation, have a much smaller impact on the risk of heatstroke than temperature[27]. This is supported by the fact that most elderly people who die from heatstroke do so indoors[28]. Therefore we did not include variables other than temperature. We used the daily maximum temperature observed at the meteorological observatory (Tokyo) of the Japan Meteorological Agency from June 1 to September 30, 2012–2020[26].
2–3 Heatstroke patients
The daily number of patients transported by ambulance in Tokyo Metropolitan Area provided by the Fire and Disaster Management Agency of the Ministry of Internal Affairs and Communications[29] was used for the heatstroke patient data in our model. Data from June 1 to September 30, 2012–2020, were used.
2–4 Modeling methods
Three methods were employed: multiple regression analysis (MR), generalized additive model (GAM), and time-stratified case crossover analysis (TC).
2-4-1. Multiple regression model
Previous studies have shown that the number of heatstroke patients increases exponentially with temperature[11]. Therefore, we log-transformed the objective variables and attempted a regression. However, because the objective variable contained \(0\), it was set to \((y+1)\). Because TC described in subsection 2-4-3 controls for each day of the week of each month, we adjusted the conditions by adding the month and day of the week to the predictor variables as well (Eq. [1]).
$$\text{Log}\left(y+1\right)={{\beta }_{0}+\beta }_{1}{T}_{\text{m}\text{a}\text{x}}+{\beta }_{2}Jun+{\beta }_{3}Jul+{\beta }_{4}Aug+{\beta }_{5}Mon+{\beta }_{6}Tue+{\beta }_{7}Wed+{\beta }_{8}Thu+{\beta }_{9}Fri+{\beta }_{10}Sat \left[1\right]$$
where \(y\) is the number of heatstroke patients, \(\beta\) is coefficients, \({T}_{\text{m}\text{a}\text{x}}\) is the daily maximum temperature, and each month and day of the week is entered as a predictor variable with 0 or 1. By adding each month to the predictor dummy variable, we also expected to consider the susceptibility to heatstroke, or heat acclimatization, in each month. The predictor variables were adopted using a stepwise method with the smallest Akaike information criterion.
2-4-2. Generalized additive model
A generalized additive model (GAM) is a model in which the linear variables of the generalized linear model can assume not only normal distribution but also other probability distributions as objective variables[15], and are sums of nonlinear functions, as expressed in Eq. [2]. It has the advantage of being able to assume the shapes of the objective and prediction variables without any assumptions[30].
$$y={\beta }_{0}+\sum _{i=1}^{M}{f}_{i}\left({x}_{i}\right) \left[2\right]$$
where \(y\) in Eq. [2] is the predicted number of heatstroke patients, \({\beta }_{0}\) is a coefficient, and fi(xi) is a third-order smoothing spline curve per explanatory variable. x is the predictor variable and i is the number of predictor variables. The daily maximum temperature was used as a spline function, and each month and day of the week was added to the variables as in MR. The predictor variables were adopted in the same way as described in the subsection 2-4-1. Therefore, M in Eq. [2] was determined using the stepwise method that minimizes the Akaike information criterion. A semiparametric model was used with the gam function of the mgcv package in R, version 1.3.1093.
2-4-3. Time-stratified case-crossover analysis
Case-crossover is a study design originated in case-control studies[31] and is intended only for disease-affected individuals[32]. It is useful for rare diseases, diseases that develop acutely and have a short incubation period, and deaths and cases where the exposure fluctuates in a short period of time and the induction period of the disease because of the short exposure[33]. It uses only the information of the cases in which the event occurs, and it compares the same person at different times instead of comparing different people at the same time[34]. As such, there is no confounding by any characteristics such as sex, age, and prior medical history[35]. It is possible to obtain controls with matching genetic and background information[32]. In the area of environmental epidemiology, case-crossover design is used in the air pollution epidemiology[36], and it has been mainly studied for respiratory diseases. Voorhees et al. used it in heat illness to demonstrate the relationship between heat-related mortality and temperature[21].
In this study, we assumed that the onset of heatstroke occurred only once per day. The maximum temperature on the day when heatstroke occurred (case period) was compared to the maximum temperature on another day when heatstroke did not occur (reference period). The reference period was selected on the same day of the week, in the same year and month as the case period, using the time stratification method, referring to examples from previous studies[23, 37]. The analysis was performed using R version 1.3.1093. Continuous data are required to use the casecross function in the season package[38], but since the heatstroke data are limited to the hot season, there were data gaps. We converted the data set for using the survival package to put it into the coxph function, which performed the same process as for conditional logistic regression provided by the casecross function of season package[39]. Then the resulting value was used into the health impact function in Eq. [3] in our analysis[21].
$$y=r\times \left({e}^{\beta \varDelta x}-1\right)\times Pop+{y}_{0} \left[3\right]$$
where \(y\) is the predicted number of heatstroke patients, \(r\) is the average transport rate per 100,000 people per day during the training data period, \(Pop\) is the population in Tokyo on January 1st of the test year, 13,951,636 in 2020, 13,857,443 in 2019, and 13,754,059 in 2018[25].25 \(\beta\) is the log odds ratio obtained from relative risk associated with a change in exposure expressed as a temperature-response function, and \(x\) is the daily maximum temperature. \(\varDelta x\) is calculated from the increase/decrease based on 27.7°C which is the mode of the daily maximum temperature of the training data. This value was well aligned with previous studies reporting that the number of heatstroke patients started to increase around 27°C in Tokyo[40]. \({y}_{0}\) is the mean value for patients at 27.7°C in each training data, which is used as the baseline value.
2–5 Training and test data selection
We used data from 2018, 2019, and 2020 as test data. To find a robust model for training data, we used training data from 2012 to the year prior to the test data and tried all combinations shown in Table 1. To check all possible combinations as training data, we tested 255 patterns for 2020, 127 patterns for 2019, and 63 patterns for 2018 as training data. For example, when testing the year 2020, we have eight years for training data from 2012 to 2019, so we substituted eight for n in Eq. [4] and determined the number of combinations as \(Com=\)255.
$$Com=\sum _{k=1}^{n}{}_{n}{C}_{k} \left[4\right]$$
Table 1
Training and test data selection
Test data | 2018 | 2019 | 2020 |
Training data | 2012–2017 | 2012–2018 | 2012–2019 |
# of used years | # of combinations | # of used years | # of combinations | # of used years | # of combinations |
6 | 1 | 7 | 1 | 8 | 1 |
5 | 6 | 6 | 7 | 7 | 8 |
4 | 15 | 5 | 21 | 6 | 28 |
3 | 20 | 4 | 35 | 5 | 56 |
2 | 15 | 3 | 35 | 4 | 70 |
1 | 6 | 2 | 21 | 3 | 56 |
| | 1 | 7 | 2 | 28 |
| | | | 1 | 8 |
Sum | 63 | Sum | 127 | Sum | 255 |
2–6 Comparison of three methods
We evaluated the accuracy of the predicted number of heatstroke patients using the root mean square error (RMSE) and mean absolute error (MAE), as expressed by Eqs. [5] and [6]. Smaller value of the indicator indicates higher accuracy. The observed value is \({y}_{i}\) and the predicted value is \({y}_{i}^{{\prime }}\). \(n\) is the number of predicted days (\(n=122\)). The focus was on the RMSE to create a model with no major deviations between the observed and predicted values.
$$RMSE=\sqrt{\frac{1}{n}{\sum }_{i=0}^{n}{({y}_{i}-{y}_{i}^{{\prime }})}^{2}} \left[5\right]$$
$$MAE=\frac{1}{n}{\sum }_{i=0}^{n}|{y}_{i}-{y}_{i}^{{\prime }}| \left[6\right]$$
2–7 Analyses of influential factors
By comparing the errors of each method, we identified factors that influence the errors in the training data. As shown in the previous studies, the end of the rainy season, midsummer days (daily maximum temperature of 30°C or higher), extremely hot days (35°C or higher), and tropical nights (daily minimum temperature of 25°C or higher) were considered influential factors for heatstroke patients[20, 40]. We attempted a linear regression analysis using these factors as explanatory variables and RMSE as the objective variable to see if these factors are also determinants of model error. In addition, we examined the trade-off between the quantity of the training data and quality of the constructed models when adding old data to the training data to consider changes in heatstroke diagnostic criteria[41, 42], and social conditions. For example, the percentage of elderly people in the total population is increasing every year, affecting the validity of older data.