Prediction and early warning model of mixed exposure to air pollution and meteorological factors on death of respiratory diseases based on machine learning

In recent years, with the repeated occurrence of extreme weather and the continuous increase of air pollution, the incidence of weather-related diseases has increased yearly. Air pollution and extreme temperature threaten the lives of sensitive groups, and air pollution is most closely related to respiratory diseases. Timely intervention therefore requires better prediction and early warning of death from respiratory diseases. In this paper, building on existing research and a large body of environmental monitoring data, regression models are established with XGBoost, support vector machine (SVM), and the generalized additive model (GAM). The distributed lag nonlinear model (DLNM) is used to set the warning threshold, transform the data, and establish the warning model. The DLNM is also used to explore the cumulative lag effect of meteorological factors. Air temperature and PM2.5 both show a cumulative lag effect, which reaches its maximum at lags of 3 days and 5 days, respectively. If low temperature and high concentrations of environmental pollutants (PM2.5) persist for a long time, the death risk of respiratory diseases continues to rise, and the early warning model based on DLNM has better performance.


Introduction
In recent years, with the frequent occurrence of extreme weather and the continuous increase of air pollution, weather-related diseases have increased yearly (Weichenthal et al. 2016). Air pollution and extreme temperature pose serious threats to the lives of sensitive people (Wang and Shi 2019). Extreme temperature, as a type of unsuitable air temperature (the temperature corresponding to the minimum risk of death is called the suitable temperature, and all others are unsuitable temperatures), is the main environmental risk factor for circulatory and respiratory diseases. An international study published in the British journal "The Lancet Planetary Health" (2021, as cited in Reference News 2021) reveals that more than 5 million deaths worldwide each year are linked to unusually cold or hot weather caused by climate change. Globally, 9.4% of annual deaths can be attributed to "unsuitable" temperatures. According to the Global Burden of Disease 2019, unsuitable temperatures rank 8th among the risk factors for death among Chinese residents. Air pollution (gases and particulate matter) is also a high-risk factor for death. The Global Burden of Disease (GBD) study estimates that in 2019, 4.6 million people died prematurely worldwide due to outdoor air pollution, with PM2.5 being the main dangerous component of air pollution. Huang et al. (2019) collected PM2.5 and PM1.0 samples and analyzed their characteristics and effects on lung injury in mice. The experiments found that PM2.5 concentration had a significant positive correlation with increased morbidity and mortality of respiratory diseases (Huang et al. 2019). Moreover, some experiments show that PM2.5 can promote the activation of inflammatory reactions in mice (Xing et al. 2019) and can also reduce human lung compliance (Yang et al. 2017), which is related to the risk of respiratory diseases, decreased lung function, and acute exacerbation (Anenberg et al. 2018).
Previous studies have also found that weather and its changes affect human health; in particular, temperatures that are too high or too low are harmful, and the elderly are more vulnerable to cold (Anenberg et al. 2018). In fact, exposure to air pollutants and meteorological factors increases the risk of death from various diseases (Markandya and Chiabai 2009; Lu et al. 2015; Shah et al. 2017; Anwar et al. 2019; Silva et al. 2020), and the two have a synergistic effect on human health (Analitis et al. 2018; Li et al. 2017). Therefore, timely intervention is essential to better predict and warn of death from respiratory diseases. Zhong (2020), Gao (2021), and other scholars have predicted and warned of infectious diseases closely related to meteorological factors and air pollution, using Gradient Boosting Decision Tree (GBDT), support vector regression (SVR), random forest (RF), and other machine learning algorithms. However, the data in this research are mostly binary or multi-class labels for a specific disease. In addition, some studies use the Spearman correlation matrix and generalized additive models (GAM) to establish prediction models, but it is challenging to characterize the specific intensity of influencing factors (Gasparrini and Antonio 2014; Zhou 2005; Curriero 2002; Gan 2020; Costa and de Hollander 2020).
Therefore, building on existing research and a large amount of environmental monitoring data, this study stratifies the research subjects by age. Regression models are established with the machine learning methods XGBoost (eXtreme Gradient Boosting) and SVM (support vector machine) together with GAM, and the early warning threshold is set with the distributed lag nonlinear model (DLNM) to transform the data and establish the early warning model. This study explores the effects of temperature and PM2.5 concentration in different seasons and age groups, which can guide the formulation of adaptation strategies for respiratory diseases under meteorological change and provide support and reference for clinical intervention.
The rest of the article is arranged as follows. The second part describes the sources of the data and their statistical characteristics. The third part introduces the research methods and model settings. The fourth part reports the empirical findings. The fifth part further analyzes the influence of meteorological factors and air pollution on respiratory disease deaths and the results of the early warning models. Finally, the theoretical and empirical analyses are combined to draw conclusions, and corresponding practical suggestions are put forward.

Data
The data in this paper come from the National Population Health Science Data Center dataset "Death Causes and Environmental Monitoring Data of a Certain Area in China from 2014 to 2018" (Chinese Center for Disease Control and Prevention 2021). The data cover the number of deaths from respiratory diseases on each of the 1826 days from 2014 to 2018 in this area, and the indicators include population age distribution, meteorological factors, and air pollution data. Specifically, there are 30 variables, including death date; age; sex; accompanying diseases; daily average temperature (°C); maximum temperature (°C); minimum temperature (°C); average relative humidity (%); nitrogen dioxide, NO2 (μg/m³); particulate matter with median aerodynamic diameter ≤ 10 μm, PM10 (μg/m³); fine particulate matter with median aerodynamic diameter ≤ 2.5 μm, PM2.5 (μg/m³); sulfur dioxide, SO2 (μg/m³); carbon monoxide, CO (mg/m³); and ozone, O3 (μg/m³). Because pollutant concentrations and the influence of air temperature differ considerably between seasons in this area, the study period is further divided into cold and warm seasons (cold season: October to March of the following year; warm season: April to September) in order to explore the influence of air pollutant (PM2.5) concentration and temperature change on the death risk of respiratory diseases in each season.
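The season split described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `season_of` is hypothetical, and the study itself does not state how the split was implemented.

```python
from datetime import date

# Months treated as the cold season (October to March), as defined in the text.
COLD_MONTHS = {10, 11, 12, 1, 2, 3}


def season_of(day: date) -> str:
    """Return 'cold' for October-March and 'warm' for April-September."""
    return "cold" if day.month in COLD_MONTHS else "warm"


# Example: a January day falls in the cold season, a July day in the warm season.
print(season_of(date(2016, 1, 15)), season_of(date(2016, 7, 1)))
```

Labeling by calendar month means a cold season spans two calendar years, which matches the "October to March of the following year" definition.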

Statistical analysis method
The data are analyzed by SPSS 25.0, and the distribution of respiratory deaths, meteorological factors, and air pollution data are obtained.
In this paper, the Spearman correlation analysis is used to analyze the correlation between meteorological factors, air pollution, and respiratory deaths; the test level is 0.05. Additionally, the study period is divided into cold and warm seasons, and whether there are significant differences in death among different age groups is also discussed.
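For readers who want to see what the Spearman coefficient computes, the following is a minimal pure-Python sketch: it ranks both series (tied values share the mean rank) and takes the Pearson correlation of the ranks. The analysis in this paper was run in SPSS; the function names and the toy numbers below are illustrative assumptions, and the significance test at the 0.05 level is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy series (not study data): PM2.5 levels and daily deaths.
pm25 = [35, 80, 52, 120, 64]
deaths = [11, 16, 12, 21, 15]
print(round(spearman(pm25, deaths), 3))
```

Because only ranks enter the calculation, the coefficient is insensitive to monotone transformations of either variable, which suits skewed pollutant concentrations.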

Hysteresis effect based on DLNM
The influence of meteorological factors and air pollutants on human health is delayed, and their relationship with death is nonlinear (Lee et al. 2019; Zhang et al. 2019). In this study, the distributed lag nonlinear model (DLNM) is used to fit the relationship between the daily number of deaths from respiratory diseases and meteorological factors and air pollution. In the DLNM, a lag dimension is added to the exposure-response relationship by constructing a cross-basis. The cross-basis function is used to analyze both the nonlinear relationship between meteorological factors, air pollution, and respiratory disease death and the lag effects of meteorological factors and air pollution (Fu et al. 2021). The association of PM2.5 and air temperature with the risk of death from respiratory diseases is analyzed quantitatively. The control variables include long-term time trend, humidity, day-of-week effect, and air pressure.
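The "lag dimension" can be made concrete with a small sketch. A cross-basis starts from a matrix whose row for day t holds the exposure on day t and on each of the preceding days; basis functions are then applied along both the exposure and the lag axes. The Python below builds only that lagged matrix and an unweighted moving average over it. The function names are hypothetical, and this is not the dlnm package's implementation.

```python
def lag_matrix(series, max_lag):
    """Row t holds [x_t, x_(t-1), ..., x_(t-max_lag)]; None where history is missing."""
    rows = []
    for t in range(len(series)):
        rows.append([series[t - lag] if t - lag >= 0 else None
                     for lag in range(max_lag + 1)])
    return rows


def moving_average_exposure(series, max_lag):
    """Unweighted mean over lags 0..max_lag, a crude stand-in for a fitted lag curve."""
    averaged = []
    for row in lag_matrix(series, max_lag):
        averaged.append(None if None in row else sum(row) / len(row))
    return averaged


# Hypothetical daily mean temperatures (not study data).
temps = [10, 20, 30, 40]
print(lag_matrix(temps, 1))
print(moving_average_exposure(temps, 2))
```

A DLNM replaces the equal weights used here with a smooth lag-response curve estimated from the data, which is what allows the peak lag to be identified.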
Because death from respiratory diseases is a rare event, the daily number of deaths approximately follows a Poisson distribution, and the DLNM in this paper is specified on that basis. The variable names and their meanings in the formula are as follows. NS is a natural spline function. In the DLNM, the basis function must be chosen according to the relationship between the independent and dependent variables and the distribution of the lag effect; the natural spline is a common choice. df is the degrees of freedom. E[Yt] is the expected number of deaths from respiratory diseases on day t, α is the intercept, and β is the parameter vector. The terms considered in modeling are average relative humidity (rh), average temperature (tmean), average air pressure (press), long-term time trend (days), PM2.5 concentration (PM2.5), and the day-of-week variable (dow). Using the "dlnm" and "ggplot2" packages in R, the relative risk is obtained and the lag effect is analyzed visually.
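A generic DLNM consistent with the variables listed above can be sketched as follows. This is an illustrative form only and not necessarily the exact equation used in the study: the cross-basis term cb(·), the symbol x_t (standing for the exposure of interest, either tmean or PM2.5), and the unspecified df values are assumptions.

```latex
\log\bigl(E[Y_t]\bigr) = \alpha + \mathrm{cb}(x_t)^{\top}\beta
  + NS(\mathrm{rh}_t,\, df) + NS(\mathrm{press}_t,\, df)
  + NS(\mathrm{days}_t,\, df) + \mathrm{dow}_t
```

In this form the log link reflects the Poisson assumption, and the relative risk at a given exposure and lag is obtained by exponentiating the corresponding cross-basis contribution relative to a reference exposure.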

Construction of prediction and early warning model
Building on the analysis above, this study constructs an early warning model based on the generalized additive model (GAM) and two machine learning methods, XGBoost and SVM.

GAM model
Firstly, the GAM model is constructed using the "mgcv" (generalized additive models with integrated smoothness estimation) package in R to detect whether there is a nonlinear relationship between the explanatory and response variables. The effects of relative humidity, mean air pressure, long-term time trend, and day of the week are controlled, and the effects are compared across age groups. Secondly, this paper introduces the tensor product smoothing function ti (Ma et al. 2021), establishes the interaction term between air temperature and PM2.5, constructs a bivariate response surface model, and obtains a three-dimensional plot of the synergistic effect of average air temperature and PM2.5 on the number of respiratory deaths. The variable names and their meanings in model (2) are as follows: Yt is the number of respiratory system deaths on day t, which follows a Poisson distribution; tmean is the average temperature on day t; pm2.5 is the concentration of PM2.5; and S is the smoothing function. According to the AIC (Akaike information criterion), the df of the time trend is 7, the df of relative humidity is 3, and the df of average air pressure is 3.
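A response surface model of the kind described can be sketched as below. This is an assumed general form built from the variables and df values named in the text, not necessarily the study's exact model (2); in particular, whether separate main-effect smooths of tmean and pm2.5 accompany the ti interaction term is not stated and is left out here.

```latex
\log\bigl(E[Y_t]\bigr) = \alpha + ti(\mathrm{tmean}_t,\, \mathrm{pm2.5}_t)
  + S(\mathrm{days}_t,\, df = 7) + S(\mathrm{rh}_t,\, df = 3)
  + S(\mathrm{press}_t,\, df = 3) + \mathrm{dow}_t
```

The ti term captures how the effect of temperature changes with the PM2.5 level, which is what the three-dimensional surface visualizes.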

Machine learning algorithms
(a) XGBoost model
XGBoost, a member of the GBDT (Gradient Boosting Decision Tree) family of boosting algorithms, is used to build a supervised regression model. For a regression problem, the objective function consists of two parts, the loss function l and the regularization term Ω:
$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
where $l(y_i, \hat{y}_i)$ is the loss function, n is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value; $\Omega(f_k)$ is the regular term, K is the number of iterations, and $f_k$ is the fit to the error at the k-th iteration. The regular term is expressed as follows:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{i=1}^{T} w_i^2$$
where λ and γ are hyperparameters, T is the number of leaf nodes, and $w_i$ is the value of leaf node i. XGBoost can avoid over-fitting and reduce computation because of its regularization, parameter tuning, 5-fold cross-validation, and feature sub-sampling. Therefore, it has great advantages in dealing with multivariate nonlinear regression problems.
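To show how the two parts of the objective combine, the sketch below evaluates the regular term γT + ½λΣw² for a tree given its leaf values and adds it to a loss. The squared-error loss is an assumption (the text does not name the loss), the function names are hypothetical, and the leaf values are made up for illustration.

```python
def regularization(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w^2) for one tree with T leaves."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees, gamma, lam):
    """Squared-error loss over samples plus the penalty summed over all trees."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(regularization(tree, gamma, lam) for tree in trees)


# One hypothetical tree with two leaves; gamma and lambda chosen for illustration.
print(regularization([1.0, -2.0], gamma=0.04, lam=1.0))
```

A larger γ penalizes each additional leaf, and a larger λ shrinks leaf values toward zero, so both push the fitted trees toward simpler shapes.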
After model fitting, this paper uses RMSE (root mean square error) and MAE (mean absolute error) to test the fitting effect of the model. The closer RMSE and MAE are to 0, the better the fit. The specific steps of model construction are as follows. Firstly, the data are preprocessed and separated into a test set and a training set, and the hyperparameters of the XGBoost model are tuned by the grid search algorithm. Secondly, the number of trees is determined; after optimization, the most suitable number is 230. Then, the minimum loss reduction γ is adjusted. The larger the γ value, the more conservative the model, and an appropriate γ value can reduce the risk of over-fitting. Finally, the learning rate of XGBoost is adjusted; the learning rate controls the efficiency and adaptability of the model (Ji et al. 2022). Each step is checked by 5-fold cross-validation. After the above steps, the optimal model hyperparameters are obtained. In this paper, 230 decision trees are constructed, and the grid search algorithm calculates the optimal parameters of the XGBoost model.
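The tuning loop described above can be sketched as a grid search wrapped around k-fold cross-validation. The sketch assumes the cross-validation is k-fold with k = 5 and uses contiguous folds; `score_fn` is a hypothetical callback that would train a model on the training indices and return a validation error such as RMSE. No XGBoost call is made here.

```python
from itertools import product


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds; yield (train_idx, valid_idx) pairs."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size


def grid_search(param_grid, score_fn, n, k=5):
    """Return (best_params, best_score), minimizing the mean validation score."""
    best = None
    names = sorted(param_grid)
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [score_fn(params, train, valid)
                  for train, valid in k_fold_indices(n, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score < best[1]:
            best = (params, mean_score)
    return best
```

In practice `score_fn` would fit the regressor with `params` on the training rows and report its error on the validation rows; the grid would list candidate numbers of trees, γ values, and learning rates.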
(b) SVM model
At present, SVM is widely used in bioinformatics, portrait recognition, and air pollution early warning. SVM is a generalized linear classifier with both sparsity and robustness (Kryshev et al. 2022). It is a supervised learning method whose learning strategy is to maximize the margin between categories, where the margin is the distance between the separating decision hyperplane and the nearest training sample. For nonlinear classification, the basic idea of SVM is to raise the dimension of the training samples through a kernel function, that is, to map samples from a low-dimensional to a high-dimensional feature space, construct an optimal classification hyperplane in the high-dimensional space, and thereby classify the samples effectively.
Its optimization function is
$$\max L = \frac{2}{\|\vec{w}\|}$$
where L is the distance between the positive and negative decision hyperplanes and $\vec{w}$ is the normal vector perpendicular to the decision hyperplane. At present, the optimization function is a concave function. To solve the model more conveniently, the optimization function is transformed into a convex function:
$$\min \frac{1}{2}\|\vec{w}\|^2$$
To prevent over-fitting, a penalty coefficient and a tolerance error are introduced to balance accuracy and precision, that is,
$$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i}\xi_i$$
where C is the penalty coefficient and $\xi_i$ is the tolerance error. When moving to the high-dimensional space, a kernel function is introduced for the transformation, so that the inner product between samples is replaced by $K(x_i, x_j)$, where K() is the kernel function.
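As a small numerical illustration of the penalized objective and of a kernel, the sketch below evaluates ½‖w‖² + CΣξ for given values and computes a radial basis function (RBF) kernel. The choice of the RBF kernel is an assumption, since the text does not name the kernel; the function names and input values are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def soft_margin_objective(w, slacks, penalty):
    """0.5 * ||w||^2 + C * sum(xi): margin term plus penalized tolerance errors."""
    return 0.5 * sum(v * v for v in w) + penalty * sum(slacks)


# Identical points have kernel value 1; the value decays as points move apart.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.01))
print(soft_margin_objective([3.0, 4.0], [0.0, 0.5], penalty=9.9))
```

A larger penalty coefficient C makes tolerance errors more expensive and so fits the training data more tightly, while a smaller C accepts more errors in exchange for a wider margin.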
Evaluation of prediction and early warning model
This paper compares the prediction performance of the different models using the root mean square error (RMSE), the mean absolute error (MAE), and fitted curves. The formulas are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - y^{(i)}\right)^2}$$
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|x^{(i)} - y^{(i)}\right|$$
where m is the sample size, $x^{(i)}$ is the true value, and $y^{(i)}$ is the predicted value. By comparing the prediction results of the three algorithms, the prediction model to be combined with the DLNM results is selected for early warning of the influence of meteorological factors and air pollution on death (Fig. 1).
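The two error measures are straightforward to compute. The following minimal Python sketch follows the definitions above, with the true values as the first argument and the predicted values as the second; the function names and sample numbers are illustrative.

```python
import math


def rmse(true_values, predicted_values):
    """Root mean square error between true and predicted values."""
    m = len(true_values)
    squared = sum((x - y) ** 2 for x, y in zip(true_values, predicted_values))
    return math.sqrt(squared / m)


def mae(true_values, predicted_values):
    """Mean absolute error between true and predicted values."""
    m = len(true_values)
    return sum(abs(x - y) for x, y in zip(true_values, predicted_values)) / m


# Hypothetical daily death counts and predictions (not study data).
actual = [14, 12, 18, 9]
predicted = [13, 12, 15, 11]
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE weights large errors more heavily than MAE because the differences are squared before averaging, so a model with occasional large misses shows a wider gap between the two.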

Descriptive analysis
From 2014 to 2018, the average daily number of deaths in this area is 102.98 ± 22.08, including 58.00 ± 12.77 for men and 44.97 ± 11.39 for women. There is a significant difference between men and women (p < 0.05), and men are more sensitive to meteorological factors and air pollution than women. The total number of deaths from respiratory diseases is 25,971, and the average daily number of deaths from respiratory diseases is 14.22 ± 6.196. The average daily number of deaths is 58.01 ± 12.77 for men and 44.97 ± 11.388 for women; again, the difference between men and women is significant (p < 0.05) (Table 1).
Descriptive analysis of the meteorological factors and air quality data shows that both exhibit seasonal changes; the specific indicators are shown in Table 3.
At the same time, the number of respiratory deaths also changes significantly with the seasons, being low in the warm season and high in the cold season. Moreover, the number of deaths is higher in men than in women, and a comparison among age groups shows that mortality is highest among people over 65 years old, as shown in Table 4.
In the warm season, the correlation coefficient between maximum temperature and death is 0.098, which is significant at the level of 1%, as shown in Table 5.
Among the environmental monitoring factors, the correlation coefficient between PM2.5 and the death toll of respiratory diseases is 0.123, which is significant at the level of 1%, as shown in Table 6.
In the cold season, the temperature is converted to a positive indicator (by taking the reciprocal) to keep the index direction consistent. The statistical results show that, among the meteorological factors, the correlation coefficient between the minimum temperature and the death toll is 0.145, which is significant at the level of 1%, and the correlation coefficient of mean pressure is 0.163, which is also significant at the level of 1%, as shown in Table 7.
Among the air pollution factors, the correlation coefficient between PM2.5 and the death toll of respiratory diseases is 0.153, which is significant at the level of 1%. The correlation coefficient of CO is 0.168, which is significant at the level of 1%, as shown in Table 8.
Combined with the results of descriptive analysis and correlation analysis, it can be seen that compared with other factors, temperature and PM2.5 have the most significant impact on the number of deaths. Therefore, in the subsequent model construction, the design will be based on air temperature and PM2.5.

Accumulation and hysteresis effects
Based on the statistical description, this study sets the lag to 7, 14, and 30 days in the DLNM, analyzes the cumulative and lag effects, and draws contour plots of relative risk against lag time and air temperature or PM2.5 (Fig. 2). In the scatter plots and linear analysis of air temperature, PM2.5, and respiratory death, the residuals range from −5.7921 to 3.6198 and from −5.7560 to 5.4676. Under the long-term lag effect, low temperature, high temperature, and high concentrations of PM2.5 all increase deaths from respiratory diseases. The relative risk is high under both high and low temperatures. Under high temperatures it decreases rapidly with lag time, so the impact is short-lived; under low temperatures it decreases more slowly and affects people for longer (Fig. 3). The relative risk associated with PM2.5 increases rapidly with lag time.

Constructing a prediction model
The data from 2014 to 2017 are selected as the training set and the data from 2018 as the test set. The training set is used to estimate the model parameters, and the test set is used to evaluate the model performance. The fitting effect of the GAM model is shown in Fig. 4. Using grid search, the optimal hyperparameters of the SVM model are obtained, and an SVM model with a penalty coefficient of 9.9 and a kernel coefficient of 0.01 is constructed. Its prediction effect is shown in Fig. 5; the fit is good, but for some data the predicted value is lower than the actual value. Through grid search and manual tuning, the XGBoost model is set with 240 trees, a depth of 3, a learning rate of 0.8, and a minimum leaf-node loss of 0.04. Its prediction effect is shown in Fig. 6, and it performs better than SVM; its RMSE is 4.79 and its MAE is 1.96.
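A chronological split of this kind can be sketched as follows. The sketch assumes the series runs from 1 January 2014 to 31 December 2018, which is consistent with the 1826 days reported for the dataset; the function names are hypothetical.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """All calendar days from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def split_by_year(days, test_year=2018):
    """Days before test_year form the training set; days in test_year form the test set."""
    train = [d for d in days if d.year < test_year]
    test = [d for d in days if d.year == test_year]
    return train, test


days = daily_dates(date(2014, 1, 1), date(2018, 12, 31))
train_days, test_days = split_by_year(days)
print(len(days), len(train_days), len(test_days))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters when the predictors carry lagged information.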

Comparison of prediction results
According to the benchmark regression results, the GAM model produces stratified predictions and visualizations for the cold and warm periods for people aged 15-65 and over 65, as shown in Fig. 4. In the cold period, the death toll of people aged 15-65 is highest when the PM2.5 concentration is 50 μg/m³ and the average temperature is 20 °C, while for people over 65 years old it is highest when the PM2.5 concentration is 150 μg/m³ and the average temperature is 0 °C. Compared with the cold period, the number of deaths in the warm period is lower, but it increases with PM2.5 and average temperature. In addition, the number of deaths shows a downward trend when the average temperature is 15-20 °C, which indicates that 15-20 °C is the average temperature range suitable for people with respiratory diseases.
The prediction results for respiratory deaths show that the XGBoost ensemble learning framework has the best prediction effect (Table 9). In the comparison of the two machine learning models, blue is the actual death toll and yellow is the death toll predicted by each model. The predictions of the SVM model are lower than the actual values, while the XGBoost model agrees most closely with the actual values and outperforms SVM.

Constructing an early warning model
According to previous studies, meteorological factors and air pollution have a lag effect on the incidence and death of respiratory diseases, and the cumulative lag effect on respiratory deaths is largest at lags of 3 to 14 days. Based on the DLNM results, the dates on which the relative risk of respiratory death exceeds 1 under the 7-day cumulative lag effect are screened, together with the corresponding numbers of deaths. For respiratory diseases, the average temperature is higher than −4.5 °C and lower than 32.9 °C, and PM2.5 is greater than 4 μg/m³, on 1823 days. When the average temperature is 31.5 °C, the cumulative 7-day relative risks are 1.14, 1.03, 0.97, 0.97, 1.00, 1.00, and 0.98, respectively, showing a downward trend. When the concentration of PM2.5 is 4 μg/m³, the cumulative 7-day relative risks are 0.93, 0.92, 0.95, 0.99, 1.01, 0.95, and 0.82, respectively, showing an upward trend. These results are substituted into the model: the numbers of deaths on the screened dates are selected, and the continuous data are transformed into binary data. The resulting ROC curve, together with the sensitivity, specificity, and area under the curve, is used to describe the performance of the early warning model. Compared with binary data obtained without the lag-day conversion, the threshold of this model performs better.
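The screening and evaluation steps can be illustrated with a short sketch. It flags relative risks above 1 as warnings and computes sensitivity, specificity, and the area under the ROC curve from their definitions. The relative risks are the seven values reported above for 31.5 °C; the function names are hypothetical, and the exact binarization rule used in the study may differ from this simple threshold.

```python
def flag_risk(relative_risks, threshold=1.0):
    """Turn continuous relative risks into binary flags (1 = RR above threshold)."""
    return [1 if rr > threshold else 0 for rr in relative_risks]


def sensitivity_specificity(labels, flags):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, f in zip(labels, flags) if y == 1 and f == 1)
    fn = sum(1 for y, f in zip(labels, flags) if y == 1 and f == 0)
    tn = sum(1 for y, f in zip(labels, flags) if y == 0 and f == 0)
    fp = sum(1 for y, f in zip(labels, flags) if y == 0 and f == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Cumulative 7-day relative risks at 31.5 degrees C, as reported in the text.
rr_at_31_5 = [1.14, 1.03, 0.97, 0.97, 1.00, 1.00, 0.98]
flags = flag_risk(rr_at_31_5)
print(flags)
```

With a strict "greater than 1" rule, a relative risk of exactly 1.00 is not flagged; whether such boundary days count as warnings is a design choice that shifts sensitivity and specificity in opposite directions.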
The results show that the classification early warning model built on the cumulative lag effect of maximum air temperature is better than the one built on binary data without cumulative lag transformation, and its area under the curve, sensitivity, and specificity are all greater than 0.9 (Table 10). For the binary data of PM2.5 and air temperature without a lag threshold, the effect is also very good, with the area under the curve, sensitivity, and specificity all greater than 0.9.
To analyze which model can better train the data used in this article, XGBoost and SVM models are selected to train the data. From the prediction results of the prediction model (Table 9), the prediction accuracy of XGBoost is higher than that of SVM, and the predicted respiratory deaths are more accurate. From the construction of the two models, SVM is a single model, and XGBoost is an integrated model that obtains a strong learner by

Discussion
This study discusses the influence of meteorological factors and air pollution on respiratory disease death, and the results show that PM2.5 and air temperature are nonlinearly related to respiratory disease death. Many epidemiological studies have confirmed that PM2.5 pollution is significantly related to respiratory morbidity and mortality, especially for people suffering from lung diseases. This is consistent with the results of this study, that is, patients with respiratory diseases are vulnerable to air pollutants and have a higher risk of death.

Influence of meteorological factors and air pollution on death from respiratory diseases
This study finds that the effects of PM2.5 and temperature on the death of respiratory diseases reach the maximum when the lag is 5 days and 3 days, respectively. The possible reason is that air pollutants adhere to respiratory mucosa or deposit at the bottom of the lung after entering the human body, which is time-consuming and delayed due to inflammatory reaction and oxidative stress reactions. Compared with the warm season, the mortality rate in the cold season is higher, which may be because the intensification of cold stimulation can cause bronchoconstriction or changes in the lung immune system, which makes the human body more susceptible to virus and bacterial infection, thus causing irreversible organic changes to the lung. As for the difference between different subgroups, it can be explained by the age structure. Because of poor physiological adaptability, the elderly are more vulnerable to this change than the young. In addition, this study finds that male mortality is higher.

Prediction and early warning model
For respiratory diseases, it is necessary to predict the trends of incidence and death and to intervene early. Based on meteorological data, this paper uses the GAM model, the SVM model, and the XGBoost ensemble learning framework to predict and warn of death from respiratory diseases. Comparing the RMSE on the training set and the test set shows that XGBoost predicts and fits well. The XGBoost ensemble learning framework integrates a variety of algorithms, takes less time, is efficient, and can establish a practical and effective prediction model. For the early warning model, the variables whose relative risk calculated from the cumulative lag effect is greater than 1 are screened and then transformed into binary variables according to the screening results. The sensitivity, specificity, and area under the curve of the transformed results are calculated, and it is found that the early warning model of circulatory system diseases has better early warning results for variables screened by the lag effect. For air temperature, the AUC comparison shows that the threshold based on the cumulative lag effect is better than the threshold without lag screening. In cold seasons, with low relative humidity and high concentrations of air pollution particles, attention should be paid in advance to people over 65 years old with respiratory diseases to reduce the harm caused by meteorological changes.
Fig. 6 Prediction effect of the XGBoost model

Conclusion and prospect
Based on the number of deaths combined with statistical analysis, it is found that the changes in temperature and PM2.5 significantly impact the death of respiratory diseases.
According to the DLNM model, the cumulative lag effect of meteorological factors is explored. There is a cumulative lag effect between air temperature and PM2.5, which reaches the maximum when the lag is 3 days and 5 days, respectively. If the low temperature and high environmental pollutants (PM2.5) continue to influence for a long time, the death risk of respiratory diseases will continue to rise, and the early warning model based on DLNM will have better performance.
In future research, if the sample size is large, more complex multidimensional samples can be constructed, and a convolution neural network (CNN) or deep learning LSTM model can be used to achieve high-accuracy early warning and forecasting.

Declarations
Ethical approval
All co-authors give consent that there is no unethical experiment conducted in this research.

Consent to participate
All co-authors give consent to participate in this manuscript.

Consent for publication
All co-authors consent to publish this article upon acceptance.

Competing interests
The authors declare no competing interests.