Study of the relationship between occurrence of Kawasaki disease and air pollution in Chengdu by parametric and semi-parametric models

Kawasaki disease (KD) is a pediatric vasculitis of unknown etiology which is mainly associated with the development of coronary artery aneurysms. The etiology of KD seems to be multifactorial, but there is rare research on the association between KD and potential environmental risk factors. So, we would like to examine the correlation between KD and potential environmental risk factors in West China. We included KD patients in Chengdu from 2015 to 2021 and analyzed the correlation between air pollution indexes and climate condition indexes. The autocorrelation of the data was eliminated by first-order difference, the risk factors were screened by stepwise regression with AIC criterion, and the multiple regression model was established. Random forest and Winsorize were used to test the robustness of the screening results, and it was found that particulate matter with a diameter less than or equal to 2.5 μm (PM2.5) had a significant positive effect on the incidence of KD. In addition, several variables were positively correlated with KD incidence, but not statistically significant. The GAM model was used to explore the nonlinear correlation between PM2.5 and KD incidence. The results showed that PM2.5 concentration was positively correlated with KD incidence, and the effects varied among different concentration levels of PM2.5. Fisher’s exact test was used to explore the influence of PM2.5 on the incidence of coronary tumors. It is found that PM2.5 may be a risk factor for it. This study suggested that exposure to high concentrations of PM2.5 may significantly increase the risk of KD. The evidence for the association between other environmental factors and KD incidence, as well as the association between PM2.5 and coronary tumors, was limited and needed further verification.


Introduction
Kawasaki disease (KD) is an acute systemic medium and small vasculitis syndrome of unknown etiology that commonly occurs in infants and children under 5 years of age.The disease was first proposed by Tomisaku Kawasaki in Japan in 1967 (Kawasaki et al. 1974), and the first case outside Japan was found in Hawaii in 1976 (Melish et al. 1976).KD is initially characterized by high fever, rash, waxberry tongue, hard swelling of hands and feet, membranous peeling of finger ends, and non-suppurative lymph node enlargement.In the absence of specific tests, diagnosis continues to rely on identifying the main clinical manifestations and excluding other clinically similar entities with known etiologies (McCrindle et al. 2017).KD has become the leading cause of acquired heart disease in children in developed countries and is becoming a major factor of acquired heart disease in developing countries (Lin et al. 2017).The critical complication of KD are coronary arteries.About 20% of untreated children with KD may develop coronary artery aneurysms (de La Harpe et al. 2019;Newburger et al. 2016), and even thrombosis may occur.This can lead to myocardial ischemia or even myocardial infarction, which is associated with sudden death in young adults today.Therefore, KD is also known as children's "coronary heart disease." It has been more than 50 years since the discovery of KD, but the etiology is still unclear.The etiology widely mentioned at present mainly involves genetics, infectious diseases, and autoimmune pathogenesis.In terms of genetics, studies have shown that people who have siblings with KD are ten times more likely to develop KD than the general population.And the prevalence of coronary abnormalities in KD patients whose parents have a history of KD is twice than that of all KD patients 1 month after onset (Fujita et al. 1989;Uehara et al. 2003).With the development of gene technology, multiple susceptibility genes have been identified through gene-wide association studies (GWAS) and linkage studies (GWLS), including enhanced T cell activation (ITPKC, ORAI1, STIM1), dysregulated B cell signaling (CD40, BLK, FCGR2A), reduced apoptosis (CASP3), and altered transforming growth factor-β signaling (TGFB2, TGFBR2, MMP, SMAD) (Kumrah et al. 2020).In addition, KD overlaps with other infectious diseases in clinical manifestations and has similar time aggregation with other viral diseases in autumn and winter (Burns et al. 2013;Greco et al. 2015), which supports that the pathogenesis of KD is related to infection.Current researches on KD infection mainly focus on superantigens, bacterial toxins, or a viral etiology characterized by intracytoplasmic inclusion bodies in KD tissues (Rowley 2011;Saeki and Ishihara 2014).Furthermore, some analyses of KD data have shown that abnormal immune response to infectious agents plays a key role in the occurrence of the disease (Greco et al. 2015), and its pathogenesis is similar to that of autoimmune diseases (Guo et al. 2015).However, there is no consistent and significant conclusion in the above studies with different data and experiments, which makes the research on the etiology and inducement of KD still ongoing.
With the increasing attention to environmental issues, the correlation between the environment and diseases has also attracted people's attention.Studies have shown that 70-90% of the disease risk is caused by environmental differences (Chen et al. 2021;Chowdhury et al. 2022;Hindorff et al. 2009;Lichtenstein et al. 2000;Willett 2002).Recently, the relationship between air pollution and KD has attracted more and more attention.Some researchers contend that exposure to air pollution is associated with an increased risk of KD hospital admission (Jung et al. 2017;Kwon et al. 2022;Yorifuji et al. 2018).In order to analyze the association between KD and prenatal exposure to ambient and industrial air pollution, a population-based cohort study has been conducted, and a positive correlation has been found (Buteau et al. 2020).Besides, a small number of studies examined the correlation between short-term pollution exposure in childhood and the incidence of KD, but no significant environmental indexes have been found (Lin et al. 2017;Oh et al. 2021;Yorifuji et al. 2018;Zeft et al. 2016).It is worth noting that the odds ratio (OR) of children exposed to suspended particulate matter density greater than 25 μg/ m 3 to a density lower than 20 μg/m 3 getting KD is 1.41, but the confidence interval is too wide to be significant (Yorifuji et al. 2018).Therefore, our study evaluated the correlation between the incidence of KD and ambient air pollution and climate conditions in Chengdu during the 7 years from 2015 to 2021 and hoped to find out the pollution factors related to the incidence of KD.This paper contributes to the KD research in at least three aspects.First, we collect data from a typical megacity in West China, where people and the government have great concern about air pollution, and the number of KD diseases gradually increases year by year.Second, we use appropriate techniques to detrend the time series data for both air pollution and the number of KD diseases, in order to avoid spurious correlations between them.Third, we build not only a linear model but also a nonlinear model to depict the relationship between the number of KD diseases and air pollution, consistent results have been shown by both models and the nonlinear model gives much more details about their relationship.

Data source
This study used the inpatient data of Chengdu Women's and Children's Central Hospital, University of Electronic Science and Technology of China, from January 2015 to September 2021.The data contain variables for the patient's gender, age, admission date, discharge date, and discharge diagnosis, where the discharge diagnosis is ranked from 1 to 5 by the importance of the different symptoms at admission.The types of KD in the data include conventional cutaneous mucocutaneous lymph node syndrome and specific intravenous immunoglobulin (IVIG) nonresponsive KD, and the discharge diagnosis of cases in the data includes at least one of the two types.In this study, we focused on patients whose rank-one discharge diagnosis is one of the above two KD types.We used the date of admission as a counting standard and then built a regression model on it with indexes related to air pollution and climate conditions as explanatory variables.
The climate condition index data used in this study were obtained from NASA POWER (https:// power.larc.nasa.gov/).According to the geographical location of Chengdu, the longitude was set as 104.0266, and the latitude was set as 30.7066.At the same time, the observation span was selected consistent with that of KD cases, from September 2015 to September 2021.Climate condition indexes include relative humidity, 2 m above the ground (RH 2M ), corrected precipitation, 2 m above the ground (PCOR), and temperature, 2 m above the ground (TEMP).
The air pollution index data used in this study were downloaded from the Environment Big Data (https:// www.ebd120.com/), which were obtained from the China Environmental Monitoring Station and the real-time release platform of ambient air quality in various provinces and cities.The data contained the daily mean values of six common air pollutants, including particulate matter with a diameter less than or equal to 2.5 μm (PM 2.5 ), particulate matter with a diameter less than or equal to 10 μm (PM 10 ), sulfur dioxide (SO 2 ), nitrogen dioxide (NO 2 ), carbon monoxide (CO), and ozone (O 3 ).The data of these pollutants were obtained by averaging the measured values of 12 fixed stations in the city (Junping Street, Dashi West Road, Jinquan Lianghe, Shahepu, Sanwayao, Shilidian, Longquanyi District Government, Checheng East 7 Road, Qingbaijiang Technician Branch, Science City, Huayang, Jinbo Road).

Statistical analysis
In order to explore the exact and detailed relationship between the incidence of KD, air pollution, and climate conditions, three methods including a parametric regression model, semi-parametric generalized additive model (GAM) with semi-parameters, and random forest were used for analysis.In the data, since the number of KD cases was relatively sparse in the daily data and some of them were 0, we used the monthly KD cases for analysis.At the same time, such treatment was also beneficial to eliminate the influence caused by time gap between illness and admission.
Since the KD cases, climate condition index data, and air pollution index data used were all time series data, it was necessary to proceed autocorrelation test on variables before establishing an appropriate regression model.Therefore, the D-W test was applied to test the autocorrelation of index variables, and the results were found to be significant (p < 0.001).Therefore, before the regression analysis, we adopted the first-order difference method to eliminate the autocorrelation existing in the index variables, and the results of the regression analysis also showed the linear correlation between the growth rate of KD incidence and the growth rate of climate and pollution indexes.In addition, since there may be a correlation between different index variables, which means multicollinearity and KD incidence obviously could not be correlated with all index variables, the selection of index variables was also an important task of regression analysis.Therefore, we selected the index variables by minimizing the Akaike information criterion (AIC), and the specific process was realized by stepwise regression.In addition, some necessary but non-significant index variables remained for regression due to some environment research background, and the baseline regression model was obtained by the final variable selection result.
Meanwhile, in order to test the robustness of the model, we applied Winsorize to all the continuous index variables in the original data.Winsorize selected the two-sided 10% processing, which meant the first 5% and last 5% quantile of the data were all replaced by the values of 5% and 95% of the original data.A regression model was established with the processed data to inspect whether the regression coefficient and its significance were obviously different compared with the results of the baseline regression.This treatment can eliminate the influence of extreme outliers on the model, and by comparing the models before and after the treatment, we can test whether the model is affected by outliers, thus testing the robustness of the model.
In order to verify the accuracy of the regression model and better interpret the influence of index variables on KD, random forest and Generalized Additive Model (GAM) were adopted.First, the random forest model was established, and its variable importance was calculated to verify whether the index variables selected by the regression model were consistent with other methods.The number of decision trees for each fitting random forest was 500, and the maximum depth was 5. A total of 1000 repetitions of fitting were carried out to calculate the importance of variables according to the increase in Mean squared error (MSE) in each repetition.Then, based on the structure of the baseline regression model, a GAM model was further built to reveal the nonlinear relationship between index variables and the incidence of KD.More specifically, the nonlinear index variable was PM 2.5 , which was the specific one in the baseline model.The fitting spline bases were selected as thin plate regression spline, and the dimension of the bases was set to 5. According to Liang et al. (2017) and Liu et al. (2012), we chose the cross-validation method to evaluate the performance of GAM.The data was randomly allocated into five equal periods, namely CV1, CV2, CV3, CV4, and CV5.For each test, one period was removed, and GAM was established based on the remaining periods.Then the removed period was treated as a test set to prove the accuracy of the model, and a total of five rounds proceeded.
In addition, coronary artery aneurysms, which often accompany with KD, were also worth being investigated.Based on the conclusion of the above analysis, we also tried to test the correlation between the PM 2.5 concentration and its morbidity.Cases with any discharge diagnosis including coronary aneurysms in the data were taken as the research objects, which were counted by four quarters.With the median as the boundary, patients were divided into two groups, which named the quarter with high incidence and the quarter with low incidence.Similarly, PM 2.5 concentration was represented by the mean of each quarter and divided into a high PM 2.5 concentration group and a low PM 2.5 concentration group, using the median as the threshold.A 2 × 2 contingency table was established for the partitioned data.Since the sample size is 27, Fisher's exact test was selected to test the correlation between the PM 2.5 concentration and the morbidity of coronary artery aneurysms.
The statistical tests were two-sided, and p-values < 0.05 were considered as statistically significant.All models were performed using R software (Version 4.2.0,R Foundation for Statistical Computing, Vienna, Austria) with the GAM model using the "mgcv" package and the Winsorize approach using the "DescTools" package.Besides, the random forest is implemented by using the "randomForest" package, and Fisher's exact test can be carried out by "stats" package.

Descriptive analysis
Table 1 summarizes the monthly descriptive statistics of KD cases, air pollution indexes, and climate condition indexes.There is no missing data in KD cases and climate condition indexes, and a small amount of missing data in air pollution indexes.The mean value of each month is used for interpolation.The study period is from January 2015 to September 2021 (2464 days); a total of 3036 KD cases have been recorded, age from 38 days to 13 years (median: 2 years), and boys account for 66.27%.The average monthly temperature, humidity, and precipitation are 17.08 °C, 75.16%, and 111.72 mm, respectively, reflecting the typical subtropical monsoon climate of Chengdu.The average monthly concentrations of PM 2.5 , PM 10 , SO 2 , NO 2 , CO, and O 3 are 49.09μg/ m 3 , 80.44 μg/m 3 , 9.90 μg/m 3 , 42.79 μg/m 3 , 0.87 μg/m 3 , and 66.12 μg/m 3 .The T-test, which divides the research period into two sections equally, shows that except for O 3 , the other five air pollutants have a trend of decreasing year by year, but still seriously exceed the standard compared with the World Health Organization Air Quality Guideline (Hoffmann et al. 2021) newly issued by WTO in 2021.index variables may be unreliable, and their significance may not also reflect the truth.

Regression analysis
In order to reduce the influence of multicollinearity, stepwise regression was applied to fit the model.Table 2 shows the results of stepwise regression with some necessary but non-significant index variables (RH 2M , PORC, TEMP) remained, in which multicollinearity has been greatly reduced.It can be found that the increment of PM 2.5 has a statistically significant positive effect on the increment of KD.In other words, for each 1 μg/m 3 increase in PM 2.5 concentration, the monthly increment of KD increased by 0.22 cases on average.Meanwhile, the increment of precipitation, relative humidity, and temperature also have a positive effect on KD cases increment to some extent, although these were not statistically significant.
Furthermore, in order to test the robustness of the model, Table 3 shows the regression results after bilateral 10%Winsorize treatment for all index variables.The results show that the increment of PM 2.5 is still significantly and positively correlated to the increment of KD patients after Winsorize, and the corrected precipitation and temperature also maintain a positive influence on KD increment.While the effect of relative humidity increment turns from positive to negative, it is still statistically insignificant with the correction of precipitation and temperature.This also means that the established regression model has a certain robustness.

Results of random forest and GAM
Table 4 shows the counts of different index variables ranked first in the variable importance table among 1000 random forest results, where the variable importance is calculated by an increase in MSE.The significant index variable PM 2.5 in baseline regression model appeared for a total of 1000 times.In other words, PM 2.5 is the index variable with the highest importance in the prediction of the increment of KD, while the occurrence times of other index variables were 0. Fig. 2 shows the average importance of different index variables in 1000 random forests Fig. 3 shows the nonlinear relationship between the increment of PM 2.5 and the increment of KD by fitting a GAM model.It can be found that the increment of PM 2.5 and the increment of KD are positively correlated on the whole, which is consistent with the result of baseline linear regression.When the increment of PM 2.5 is below 0, the KD increment is negative, but the confidence interval is too wide, including a few positive values.When the increment of PM 2.5 is between 0 and 10, the KD increment gradually increases from 0, and the confidence interval still contains  0. The KD increment is more than 0 and is statistically significant when PM 2.5 increment is between 10 and 20.When the increment of PM 2.5 is greater than 20, the increment of KD starts to decrease slightly, but the confidence interval starts to expand.Fig. 4 shows the five GAMs fitted by removing one period.Table 5 shows the cross-validation performance of GAM, whose prediction is done on the removed period.It can be found that every CV model's result in Fig. 4 is almost consistent, similar to the model in Fig. 3, and the scatter points in Fig. 4 are distributed around the edge of the line, which indicates the accuracy of the GAM.The deviance explained in Table 5 is in the range from 8.05 to 13.88%, which means the increment of PM 2.5 can influence the increment of KD, though the effect may not be major.Due to the RMSE between 7.77 and 11.59, it needs more collected data and further research to improve the result.

Results of Fisher's exact test
Table 6 shows the number of cases of coronary aneurysms from the first quarter of 2015 to the third quarter of 2021, the mean concentration of PM 2.5 , and their median.A total of 27 quarters of data were included.The median concentration of PM 2.5 is 49.60 μg/m 3 , and the median of quarterly recorded coronary artery aneurysm cases is four.
Table 7 shows the contingency table of the number of patients with coronary aneurysms and levels of PM 2.5 concentration counted quarterly using median as the threshold, as well as the odds ratio and 95% confidence interval of Fisher's exact test.It can be found the odds ratio is greater than 1, indicating that PM 2.5 is a risk factor for coronary aneurysms.However, the confidence interval is too wide, which means the result is not statistically significant.Therefore, more data and follow-up studies may be needed to confirm this result.

Discussion
Our study demonstrates that (1) the incidence of KD increased significantly in the high-increasing season of PM 2.5 in Chengdu.(2) In addition, PM 2.5 may be a risk factor for coronary artery aneurysm.(3) Although increments of precipitation, relative humidity, temperature, SO 2 , and CO have a positive effect on KD, there is no statistical significance.
KD is most common in boys under 5 years old, and the results of this study are consistent with previous results (Burns et al. 2000).This may be related to various phylogenetic immaturity in children under 5 years of age.
As mentioned in the "Introduction" section, although the etiology of KD is still unknown, it is increasingly believed that it is a clinical syndrome resulting from abnormal activation of the autoimmune system caused by one or more widespread infectious agents on the basis of a certain genetic predisposition.Exposure to pollution stimulates the body's inflammatory response and may even lead to cardiovascular disease (Bayram et al. 2001).Exposure to particulate matter (PM) air pollution is associated with cardiovascular morbidity and mortality, according to a 2010 statement from the American Heart Association (Brook et al. 2010).PM 2.5 is the main component and the main toxic component of atmospheric particulate matter.Through fitting the model by stepwise regression and testing the robustness of the model, this study found that the prevalence of KD in children in Chengdu is positively correlated with PM 2.5 (Tables 2,  3), suggesting that the incidence of KD may be related to air pollution.This conclusion is further supported by the variable importance of the random forest model (Table 4, Fig. 2), and the nonlinear relationship between KD increment and PM 2.5 increment is further investigated through the GAM model (Fig. 3).We hypothesize the reason may be that (1) children's respiratory system and immune system is not mature.(2) Long-term exposure to PM 2.5 may stimulate a series of inflammatory reactions, cause vascular endothelial cell dysfunction, trigger a storm of inflammatory factors and coronary artery endothelial injury, and then lead to KD. Coronary artery aneurysms develop in 20% of untreated children (de La Harpe et al. 2019;Newburger et al. 2016), but there are scarcely any researches have reported the relationship between coronary aneurysms and air pollution.We  14 (0.195, 6.71) first found that PM 2.5 may be a risk factor for coronary aneurysms due to the odds ratio is 1.14 (Table 6), which further indicates that KD may be related to air pollution.Although the cause of KD is unclear, recent studies have found that air pollution, extreme temperature, and other factors may be the risk factors of KD (Jung et al. 2017).Our study suggested that the increment of precipitation, relative humidity, and temperature have a positive effect on KD cases increment, although these were not statistically significant (Table 2).To some extent, this result supports previous studies.Rowley et al. have found that high temperature would promote vascular endothelial cells to release inflammatory mediators, and vascular endothelial dysfunction might aggravate the risk of vascular diseases (Rowley et al. 2000).Previous studies have shown that KD may be related to various respiratory virus infections (Wang et al. 2005).Large humidity or extreme temperature may improve the spread of infectious agents or its activity that could increase the risk of KD in terms of the infection hypothesis.The high incidence season of KD in mainland China is spring and summer, but this study found that the incidence season of KD in Chengdu is mainly winter and spring, which is consistent with the increasing season of PM 2.5 and similar to the epidemic season of KD in the USA.
Previous studies in Japan found that SO 2 and CO may have a potential impact on KD, but no statistical significance was found (Rodo et al. 2014), and this study also confirmed this result.SO 2 and CO may be important for cardiovascular disease, but they are not explanatory risk factors for Kawasaki disease at present.
There are some limitations to this study.First, we use air pollution and temperature data from fixed-site monitoring stations as proxies for individual exposure, so measurement errors are inevitable.In addition, the time of KD onset used in the study is the time of admission, which may lead to a mismatch between exposure time and data.Third, our analysis also finds some factors that have a positive correlation to KD incidence, but this result lacks robust evidence.Finally, the association between air pollution, climate conditions, and KD occurrence, as well as its potential pathological mechanism, requires further study with a large sample size and data accuracy.

Conclusion
In summary, as the first study to evaluate the association between PM 2.5 and KD incidence in Mainland China, this investigation showed that exposure to PM 2.5 may be a risk factor for KD in Chengdu, China.The relationship between coronary aneurysms and the environment also requires further study with a large sample size and data accuracy.

Fig. 1
Fig. 1 shows a diagram of the correlation coefficient between each variable, where darker colors indicate a stronger correlation, and it can be found that there is a strong correlation among the pollution indicators.Due to the high correlation among air pollution index variables, which means multicollinearity, the coefficient estimation of the corresponding

Fig. 1
Fig. 1 Diagram of the correlation coefficient between each variable

Fig. 2
Fig. 2 Average variable importance of different indicator variables

Fig. 3
Fig. 3 GAM result of PM 2.5 and KD

Table 2
The regression results of some index variables and the monthly increment of KD, in which the indexes were selected and some necessary but non-significant, were retained

Table 3
The regression results of some index variables and monthly increments of KD, where the indexes are Winsorized by 10% *p-value ≤ 0.05

Table 4
The number of times that different index variables ranked first in the variable importance table among 1000 random forest results

Table 5
The cross-validation (CV) performance of GAM