Effectiveness of cascading time series models based on meteorological factors in improving health risk prediction

Meteorological factors, which are periodic and regular in a long run, have an unignorable impact on human health. Accurate health risk prediction based on meteorological factors is essential for optimal allocation of resource in healthcare units. However, due to the non-stationary and non-linear nature of the original hospitalization sequence, traditional methods are less robust in predicting it. This study aims to investigate hospital admission prediction models using time series pre-processing algorithms and deep learning approach based on meteorological factors. Using the electronic medical record data from Panyu Central Hospital and meteorological data of Panyu district from 2003 to 2019, 46,089 eligible patients with lower respiratory tract infections (LRTIs) and four meteorological factors were identified to build and evaluate the prediction models. A novel hybrid model, Cascade GAM-CEEMDAN-LSTM Model (CGCLM), was established in combination with generalized additive model (GAM), complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), and long-short term memory (LSTM) networks for predicting daily admissions of patients with LRTIs. The experimental results show that CGCLM multistep method proposed in this paper outperforms single LSTM model in the prediction of health risk time series at different time window sizes. Moreover, our results also indicate that CGCLM has the best prediction performance when the time window is set to 61 days (RMSE = 1.12, MAE = 0.87, R2 = 0.93). Adequate extraction of exposure-response relationships between meteorological factors and diseases and suitable handling of sequence pre-processing have an important role in time series prediction. This hybrid climate-based model for predicting LRTIs disease can also be extended to time series prediction of other epidemic disease.


Introduction
Climate change causes or aggravates a wide range of exposures with multiple impacts on health, both direct and indirect (Linares et al. 2020). The 2020 report of The Lancet Countdown on health and climate change reveals that no country, rich or poor, is exempt from the effects of climate change on health (Watts et al. 2020). Lower respiratory tract infections (LRTIs) are a generalized description of a group of respiratory diseases that comprise acute bronchitis, pneumonia, and exacerbations of chronic lung disease (Woodhead et al. 2011), and it has become the fourth most common cause of death according to the World Health Organization (WHO) Global Health Estimates 2019 (WHO 2020a). In the context of continuing global climate change, identifying factors influencing the occurrence of LRTIs could contribute to the prediction of future outbreaks and facilitate the establishment of transmission prevention measures.
Traditional time series forecasting models built on mathematical methods such as the autoregressive integrated moving average (ARIMA) and seasonal ARIMA models are widely adopted for predicting the number of outpatients volumes (Luo et al. 2017), emergency department visits (Kadri et al. 2014), and new admission inpatients (Zhou et al. 2018). Although ARIMA models are quite flexible, a major limitation is that the non-linear effects of various complex factors cannot be captured (Zhang et al. 2015). Recently, machine learning algorithms that can resolve non-linear relationships between variables have proven effective in time series prediction and have been successfully applied in various healthcare scenarios. Gu et al. (2019) used the optimized three-layer long short-term memory (LSTM) model to predict the incidence of Hand-foot-mouth disease (HFMD) in 14 administrative regions of Guangxi, China, and confirmed that LSTM has a favorable performance. Yousefi et al. (2016) provided an artificial neural network (ANN)-based prediction tool that provides a reasonable prediction of emergency department visits in a 1-week ahead time frame.
However, only a very limited number of studies has been trying to incorporate characteristics related to environmental exposures to predict demand for healthcare services. Qiu et al. (2020) used six machine learning models to forecast peak demand for cardiovascular diseases admissions and demonstrated that meteorological conditions and air pollutants make a crucial contribution to the accuracy of the predictions. Navares and Aznarte (2020) proposed a model based on environmental indicators using LSTM and convolutional neural network (CNN) networks to predict the number of daily hospital admissions for respiratory and circulatory diseases. In the case of LRTIs, meteorological factors not only affect the activity of influenza viruses and make their activities seasonal (Price et al. 2019), but also affect the immune system and alter the susceptibility of the body (Mäkinen et al. 2009). As a result, the incidence of LRTIs also follows a certain rule. A study in Shaanxi Province, China, using stepwise multiple linear regression found that LRTIs hospitalization rate was significantly correlated with air temperature, relative humidity, and atmospheric pressure (Liu et al. 2016). Besides, a study conducted in Finland using a generalized additive model (GAM) found that low temperature and low relative humidity are associated with increased incidence of LRTIs (Mäkinen et al. 2009). It remains uncertain how well LRTIs work in predicting admissions related to environmental exposures, which open up a new window for developing more accurate predictive models.
In addition, hospital admission time series are noisy and nonstationary (Huang and Wu 2017); utilizing the primary series directly to establish prediction models is prone to substantial errors. To overcome the inherent non-stationarity, noise, and confusion of time series data, series decomposition methods such as empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) have been extensively adopted in various fields such as signal processing (Das et al. 2020), finance (Lin et al. 2021), energy (Gao et al. 2020), and air quality (Song and Fu 2020). In medical health fields, Khaldi et al. (2019) investigated ANNs combined with signal decomposition techniques to predict weekly hospital emergency department arrivals. Huang and Wu (2017) used a hybrid method of EMD and back-propagation ANN optimized with particle swarm optimization to predict outpatients. The above research proved that data decomposition is a powerful tool for data pre-processing and enhances the generalization of the model while reducing over-fitting problems.
In this paper, we collected 46,089 cases of LRTIs and meteorological factors in Panyu, Guangzhou, from January 1, 2003, to July 15, 2019. A multi-step cascade prediction method (CGCLM) combined with GAM, CEEMADAN and LSTM was established to forecast short-term LRTI admissions by incorporating the local exposure-response relationship between weather and disease.

Data collection
Data on the daily number of hospitalized patients with lower respiratory tract infection were obtained from Panyu Central Hospital. This data includes the number of inpatients with the primary diagnosis of LRTIs (International Statistical Classification of Diseases and Health Related Problems-10: J12.0-J22.0) in Panyu Central Hospital from January 1, 2003 to July 15, 2019. Daily meteorological data including mean temperature (°C), relative humidity (%), atmospheric pressure (hPa), and wind velocity (m/s) in those days were collected from Guangzhou Meteorological Bureau. A total of 6040 days and 46,089 cases were included in the study. All patients' private information was desensitized.

Data processing
During the study period, the proportion of missing values of each meteorological variable was 0.05% (3/6040) for temperature, 0.05% (3/6040) for relative humidity, and 0.116% (7/6040) for atmospheric pressure, respectively. Average value method was used to fill the missing values of meteorological cases. The time series features were extracted from date, including year, month (month of year), half of month, onethird of month, day (day of month), and DOW (day of week).

Generalized additive model
GAM was first proposed by Hastie and Tibshirani (1987) as an extension of the generalized linear model (GLM), which has significant advantages in dealing with complex non-linear relationships between variables in time series analysis. GAM can be used to capture the exposure-response relationship between meteorological factors and LRTIs in Guangzhou and to obtain the lagged characteristics of meteorological factors. Considering that the number of hospital admissions for LRTIs obeys a Poisson distribution, a univariate model using a logarithmic function (log) was constructed. The model is expressed as follows: where t is the day of the observation; Y t is the number of hospital admissions observed on day t; log is the logarithmic transformation of μ t ; X t is the main variables, X i is the covariates, and β is their coefficients; DOW is the day of week; and α is the intercept, and time was introduced together as confounders. The regression coefficients (β), standard error, and interquartile range (IQR) for each meteorological factor were estimated according to the model, and the relative risk (RR) and 95% confidence intervals (95% CI) were calculated as follows: Complete ensemble empirical mode decomposition with adaptive noise EMD is a signal processing method based on instantaneous frequency proposed by Norden E. Huang (Huang et al. 1998). This method can decompose complex signals into intrinsic model function (IMF) according to the time-scale characteristics of the data itself, for which IMF needs to satisfy two conditions: (I) the number of local extreme points and the number of cross-zero points are equal or differ by 1 over the whole time frame; (II) for any one of the zero points, the mean envelope of the local maximum and local minimum values are 0. With the hypothesis of decomposition and the definition of the IMF above in mind, the EMD process of a raw data series n(t)(t = 1, 2… T) can be formulated as follows: where s t is the original signal, imf i is the ith IMF component, and r n (t) is the trend term.
The EMD process can decompose non-stationary and non-linear signals into a numbers of components, but it still suffers from "mode mixing" problem (Wang et al. 2012) in practical applications. Wu and Huang (2009) proposed EEMD method by introducing a uniform frequency distribution of auxiliary noise into the EMD decomposition thus significantly alleviating mode mixing. However, in practice, the white noise added by EEMD is averaged over time, but it is not offset completely. By adding adaptive white noise at each decomposition stage, CEEMDAN method achieves a reconstruction error of almost zero with fewer averaging times (Torres et al. 2011). As a result, CEEMDAN method eliminates mode mixing of EMD and, meanwhile, solves the incompleteness of the EEMD decomposition and the computational inefficiency caused by the need to reduce the reconstruction error by increasing the averaging times. The calculation process of EMD and CEEMDAN is set forth in the supplementary material.

Sample entropy
Sample entropy (SE) is a metric for determining the complexity of time series proposed in 2000 by Richman and Moorman (2000). SE is more relatively consistent than approximately entropy (AE) and has been used successfully in the analysis of the complexity of biological signal sequences.
The estimated value of sample entropy is calculated by according to the following formula: where the range of values for each IMF component sequence is {x(i)| 1 < i < N}, m is the mode dimension, and r is the similarity tolerance. In this study, the maximum epoch length (m) was set to 2 and the default tolerance (r) was given as 0.2.
Long short-term memory networks LSTM network was proposed by Hochreiter and Schmidhuber (1997), which a modified version of recurrent neural networks (RNNs). It incorporates sequence correlation, where the current state during the learning process will contain all historical information from previous time series, thus having the ability to learn long-term dependencies and solves the problem of RNN gradient explosion (Gers et al. 2000). LSTM neural network structure consists of a series of loops of connected subnetworks (i.e., memory modules), each containing one or more self-linked cells (cells), and control information. Detailed steps for LSTM used to process sequences can be found in the supplementary material. The LSTM network models involved in this research were optimized with common values, and the parameters are shown in Table 1.

Data set splitting: time sliding window
In implementing the prediction process, a time sliding window was designed regarding Jacinta et al. (2016) to segmentate the data. Time sliding windows allow dynamic updating of data in real time by continuously deleting and adding data, thus addressing the problem that fixed sliding windows cannot obtain the best data window and data dimensions (Dong et al. 2020). The process consists of three parts, as shown in Figure 1: a training part, which was designed to train the model and update the model parameters, with a data set length of 365 × 3 days; a validation part, which was adapted to adjust the hyperparameters to obtain the best settings of the model parameters; and a testing section, which was used to predict the data using the best model. The validation and testing were set to the same size N, and four windows (N = 14, 31, 61, 91) were tested respectively to obtain the best prediction dimension.

Assessment criteria
The predictions were compared to the actual data and the errors were calculated for LSTM model and CGCLM hybrid model, respectively. Three metrics were used: root mean squared error (RMSE), which has traditionally been widely used in forecasting evaluation (Armstrong and Collopy 1992); mean absolute error (MAE), which is less sensitive to outliers in predicted values than RMSE (Hyndman and Koehler 2006) and provides a better reflection of forecast error; and R-square (R 2 ), as a measure of fit or model variability in health-forecasting-related studies, is very common in the literature (Soyiri and Reidpath 2012).
RMSE, MAE, and R 2 can be expressed as follows:

Methodology
An overview diagram of the experimental design is shown in Figure 2, with the following implementation steps. 1. The daily number of patients admitted for LRTIs in Panyu Central Hospital from January 1, 2003, to July 15, 2019 and the local meteorological parameters in the same period were obtained as modeling data. 2. GAM was used to assess the association between exposure to four meteorological variables and admissions to LRTIs from 2003 to 2019. This step is sufficient to capture lagged effects to estimate the maximum RR of meteorological factors on LRTIs across the six study days. 3. In order to build an effective predictive model, the characteristics of the original series must be fully analyzed and considered; in view of this, CEEMDAN was applied to decompose the original admission time series to obtain 9 IMF components and 1 trend component. This step enables data stabilization by adding positive and negative white noises with the same amplitude but opposite phase to the original signal. Figure 2 An overview of the research framework, which can be divided into three sections for data pre-processing and feature extraction, decomposition of time series, and forecasting.

4.
The SE values of each IMF component were calculated respectively, while the subsequences with similar SE values were merged and recombined to obtain five new subsequences differed significantly in complexity. The incorporation of the SE method overcomes the problem of ignoring relationships between decomposition models, thus effectively capturing valuable information and reducing the computational cost of the model. 5. The new subseries and meteorological parameters were normalized before the data were fed into the model to eliminate the interference of different orders of magnitude on the prediction model. The normalization was applied to a range between [0,1] to achieve equal scaling of the original data and was calculated as in equation (9).
where X is the original data, and X′ is the normalized data.
6. LSTM neural network prediction models of corresponding parameter space were established for each new subsequence by adding meteorological factors, and the predicted values of each model were output. 7. The arithmetic sum of the predicted values after reverse normalization was calculated to obtain the predicted daily admissions for LRTIs.
where X 0 p is the output value of the prediction model and X p is the value after inverse normalization.
The process of Cascade GAM-CEEMDAN-LSTM is described by Algorithm 1.
The GAM model was built using the "mgcv" package from R software version 3.6.2. The CEEEMDAN algorithm,SE algorithm, and all deep learning models are implemented on python version 3.6.

Data description
A total of 46,089 patients with LRTIs were incorporated into the study from January 2003 to July 2019. In terms of age distribution, the majority of patients were less than 18 years old (26,261 cases, 57.0%), while male patients (28,177 cases, 61.1%) were more than female patients (17,912 cases, 38.9%) in terms of gender ( Table 2).
The daily number of LRTIs cases and the distribution of four meteorological variables in Panyu are shown in Table 3. The average daily number of hospital admissions in Panyu was 7.64. The average daily temperature (T), relative humidity, atmospheric pressure, and wind velocity were 23.11°C, 74.81%, 1011.33 hPa, and 1.90 m/s, respectively.
The explained variance for each GAM model is shown in Table S1 of the supplementary materials. The R 2 and deviation explained values ranged from 0.451 to 0.462 and 43.9 to 44.7%, respectively.

CEEMDAN decomposition results of the original admission sequence
According to the CEEMDAN decomposition process described above, the daily admission sequence of LRTIs was decomposed. The results are shown in Figure 4; 9 IMFs and 1 residual term are listed from top to bottom, in which the horizontal axis represents time and the vertical axis represents the frequency of each IMF. The frequency from IMF1-IMF9 to the residual term gradually decreases, and the change mode is simpler than the original sequence. The bar chart depicts the Pearson correlation coefficients of the nine IMF series and one residual term with the original series.

Reconstruction of decomposed sub-sequences via SE
The complexity of the IMF components obtained from the CEEMDAN decomposition was evaluated using SE, and the results are shown in the left panel in Figure 5. The IMFs were   Figure S1. Table 4 shows a comparison of the three evaluation metrics for prediction precision of the single LSTM and CGCLM methods. It can be clearly seen that the CGCLM method, which includes CEEMDAN module, helps to better extract time series features to make more accurate predictions, with the 14-day predicted MSE value decreasing from 1.92 to 1.25, the MAE value decreasing from 1.53 to 0.98, and the coefficient of determination R 2 improving from 0.79 to 0.91. The 31-day predicted RMSE decreased from 1.86 to 1.15, the MAE decreased from 1.47 to 1.90, and the R 2 increased from 0.80 to 0.92. The 61-day predicted RMSE value dropped from 1.73 to 1.12, the MAE value reduced from 1.35 to 0.87, and the R 2 rose from 0.83 to 0.93. The 91-day predicted RMSE value fell from 1.60 to 1.30, the MAE value declined from 1.23 to 0.99, and the R 2 improved from 0.85 to 0.90. Overall, model prediction performance is best when the prediction window size is 61 days using the CGCLM method, and the trend plot of predicted versus true values is shown in Figure 6. In addition, we calculated the RMSE, MAE, and R 2 Figure 3 Forest plots of the effects of temperature, relative humidity, atmospheric pressure, and wind velocity on admission RR and 95% CI of LRTIs respectively for each year, also showing the excellent performance of the CGCLM method as shown in Table S2.
To further evaluate the performance of the two models, the predicted value distribution of CGCLM was compared with the single LSTM model. Figure S2 shows the distribution of prediction results of the two different models and different predicted time size. Obviously, the two models have different prediction results for Panyu Center Hospital LRTI daily admission time series. In order to compare the results, three calibration lines were set at the maximum, average, and minimum, respectively. It can be clearly seen that the predictions of CGCLM for maximum, mean, and minimum are closer to the true values. However, the single LSTM model generally predicts a large value of maximum value, a small value of minimum value, and slightly larger deviation of the average value. Therefore, from the perspective of predicted value distribution, CGCLM method also has better prediction performance.

Discussion
The purpose of this study is to propose a time series prediction model to achieve health risk prediction through meteorological factors in advance. We constructed a multistep cascade model CGCLM using a dataset totaling 46,089 patients with LRTIs and explored the predictive effects of different lengths of time windows. The results showed that the CGCLM approach integrating GAM, CEEMDAN, and LSTM works better than the single LSTM model. CGCLM shows the best performance when the time window is set to 61 days.
Discovering potential relationships and patterns from healthcare-related data to help hospitals and public health departments to make decisions has garnered a lot of attention. For example, in 2009, Google predicted the outbreak of Influenza A (H1N1) with the help of big data technology from users' search behaviors and issued an early warning ahead of the US Centers for Disease Control and Prevention (Davidson  . A study constructing a predictive model of influenza-like case rates based on users' posts on Twitter has yielded relatively accurate results (Lampos et al. 2010). In recent years, a large number of studies have focused on the impact of environmental variables on diseases. Some studies have paid attention to the use of predictive models, such as machine learning and neural network for admission prediction (Kassomenos et al. 2008). Other studies have concentrated on exploring the relationship between air quality and meteorological data and the occurrence of pneumonia through the use of statistical methods such as GLM (Glick et al. 2018), GAM (Li et al. 2018), or autoregressive integrated moving averages (Ruchiraset and Tantrakarnapa 2018). This difference in approach ustifies the diversity between predictive and explanatory models. The former usually provides high precision but low interpretability, while the latter approach often provides high interpretability but low precision (Sainani and Kristin 2014).
In this study, we presented a multi-step health risk prediction method (CGCLM) based on meteorological factors and cascading time series models, which fully considered the explanatory and predictive properties of the model and confirmed its excellent performance. Explanatory model (GAM) focus on identifying and explaining risk factors associated with LRTIs, while predictive model (LSTM) aims to accurately estimate the probability of occurrence of future health risk  events. To reduce the effect of noise on prediction, CEEMDAN was used in conjunction with LSTM to predict the time series of LRTIs admissions. Since CEEMDAN is a Fourier transform-based signal decomposition method, it can process any non-linear and non-stationary signal adaptively. The original time series is decomposed into multiple subseries at different frequencies by CEEMDAN processing, and these subseries are predicted respectively as input data for LSTM model. Finally, all predictions are reconstructed to obtain the final results. After analyzing the results, we found that the CGCLM method achieved good prediction performance, but there was some variation in the performance of the LSTM model leading to fluctuations in the prediction accuracy from year to year. This may be due to the fact that in certain populations, socioeconomic factors may be the primary driver of disease admissions instead of meteorological ones (Sahni et al. 2017;Trachtenberg et al. 2014). The lag effect of meteorological conditions on the impact of respiratory diseases has been demonstrated in our team's research (She et al. 2021). However, to date, few machine learning-based studies have analyzed the lag effects of environmental exposures when predicting health risks. Based on previous research, Krishan et al. adopted the representative lags as predictors to predict the peak demand day of respiratory disease visits (Khatri and Tamil 2017). Ramadona, A.L., et al. set the meteorological variables with a time lag of 4 months for the exposureresponse relationship study and predicted the number of dengue cases and potential outbreak risk based on the time lag results (Ramadona et al. 2016). In our study, GAM model was used to determine the risk relationships between several meteorological conditions and LRTIs admissions. Compared with the traditional regression model, GAM provides a flexible and effective technique for modeling non-linear time series in studies of health effects of environmental factors.
In this work, we observed that the effect of relative humidity on the admission of LRTIs was at a high level, suggesting that it may be a key factor influencing LRTIs. This finding is consistent with previous studies. A meta-analysis demonstrated that environmental relative humidity generally plays an important role in the incidence and prevalence of climate-sensitive diseases, such as respiratory diseases (Jinghong et al. 2014). Existing studies explain the mechanism of the association between relative humidity and LRTIs; one of the study reports that relative humidity is positively correlated with RSV activity, with more RSV infection as relative humidity increases, ultimately accompanying the risk of LRTIs (Meerhoff et al. 2009). Enveloped viruses such as influenza viruses require 20-30% relative humidity to grow, and they can only survive in environments of high relative humidity (50-70%) (Lowen et al. 2007). What is more, temperature and wind velocity also showed some correlation. Temperature changes may impair sensor cells' function and antiviral responses, thereby affecting the body's immune function (Iwasaki et al. 2017). Moreover, temperature also affects people's direct exposure to the spread of the virus, e.g., people's physical activity is increasing during the warmer months (Suminski et al. 2008). Wind velocity has been found to increase the activity of respiratory viruses (Darniot et al. 2018), possibly explained by the fact that it promotes the circulation and distribution of air pollutants, such as virus-carrying particles, to speed up their transport. Besides, we also found that atmospheric pressure was negatively associated with the risk of hospital admission for LRTIs. Shaman and Kohn stated that atmospheric pressure affects the spread and survival of viruses (Shaman and Kohn 2009). However, Liu et al. found a positive correlation between atmospheric pressure and the number of hospitalizations for LRTIs (Liu et al. 2016).
WHO reports that effective health service delivery requires a number of key resources, of which comprehensive health information is a critical component (WHO 2020b). Given the increasing public demand for coverage and quality of health care services, health care providers are often under increased pressure to respond to the demands associated with high disease prevalence (Sharifi and Saberi 2014). Capacity planning decisions are particularly important for the healthcare industry as frontline health care services and providers do not have sufficient information or resources to meet the "above normal" demand for health care services. This is because they not only involve the management of highly specialized and expensive resources (i.e., nurses, doctors and advanced medical equipment) but also can be life-ordeath in critical situations (Hans 2012). Health forecasting allows both individuals and service providers to anticipate situations so that necessary steps can be taken to manage peaks or extreme events (Soyiri and Reidpath 2013). Therefore, our research model promises to provide health care organizations with accurate forecasts of health care demand, thereby optimizing resource allocation and improving the service quality.
Although the experimental results of CGCLM are satisfactory, some limitations of the present undertaking could be explored in further studies. Firstly, there is a noticeable seasonal trend in admissions of patients with LRTIs, and the four different prediction window sizes for the sliding window designed in this study may incompletely filter out seasonal variations. Secondly, the restrictions of variable availability makes pollutants such as AQI, PM 2.5 , CO 2 , and NO 2 excluded from this study, which has been proved to have a direct impact on the respiratory system. The relatively low explained variance of the GAM expressed by R 2 also suggests, to some extent, that the inclusion of pollution factors may be beneficial for further improvements in model performance. Finally, potential factors, such as the varying sensitivity of patients of different ages and genders to weather conditions, and the underlying illness of the patient group itself, may have an impact on the results. Nevertheless, CGCLM remains a promising model that is based on a data adaptive decomposition approach that can easily add, remove, and change input features with high compatibility and scalability.

Conclusions
In this study, a multi-step health risk time series prediction method (CGCLM) was established by combining with GAM model, CEEMDAN signal decomposition algorithm, and LSTM. This method was applied to forecast the daily hospital admission of LRTIs. The main steps were the following: according to the actual meteorological conditions of the study site, the exposure-response relationship was established by using GAM model; then, the CEEMDAN method was used to decompose the unsteady admission time series into several smooth IMF components; finally, LSTM neural network was used to predict the IMF component. With error analysis (RMSE, MAE, and R 2 ), CGCLM method shows a lower prediction error than single LSTM model. The realization of accurate admission prediction not only provides support for the establishment of a prediction model for the number of medical patients and the decision-making of public health departments, but also provides a reference for people to choose the time of medical treatment, which is of great research significance.